[lttng-dev] Relayd trace drops
Jonathan Rajotte Julien
Jonathan.rajotte-julien at efficios.com
Fri Dec 11 13:03:22 EST 2015
Hi Aravind,
On 2015-12-11 12:16 PM, Aravind HT wrote:
> Hi,
>
> Did you get a chance to reproduce the problem with the scripts ? Let
> me know if you need any help running it.
Time is a scarce resource, unfortunately. Until I have the time to
look at it, you might want to investigate whether you can reproduce it
on 2.6.1/2.7.0/master with two regular (x86_64, Ubuntu/Debian) VMs;
this would really help.
Cheers!
>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht at gmail.com> wrote:
>
> Sorry about that, not sure how it got missed.
> Here it is.
>
>
> # This needs a two-node setup (1. local current node, 2. remote node)
> # relayd runs on the current node, where traces are captured from
> the remote node
> # the remote node runs test applications which generate traces.
> # launch_script_RN is executed on the current node and uses
> ssh to execute commands on the remote node, so this part may not
> work in every case and may prompt for a password.
> # if experiencing problems with ssh, kindly check
> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>
> # ==================== To Run =============================
> launch_script_RN.py self_profile -c /tmp/configFile.txt
>
>
>
> # configFile.txt is the file holding the configuration params that
> the launch script
> # needs to configure lttng sessions. Below is an explanation of
> the different options.
> # =================== configFile.txt =============================
>
> [section1]
> # final output file path
> OutputFile = /tmp/Final_report.txt
> # the remote node hostname on which the test applications run and
> the test sessions are created; this should be a name usable with
> ssh. Traces are transported from that node to the lttng_relayd
> running on the current node.
> Node = MY_REMOTE_NODE
> # sub-buffer size to start with
> SubBufSize = 16k
> # sub-buffer count
> SubBufCount = 4
> # per-uid buffer scheme
> BufferScheme = --buffers-uid
> # enable tracing (yes/no)
> EnableTracing = yes
> # Bit rate of the test applications. Comma-separated; for example
> "1, 3, 3, 50" means 4 test applications producing 1, 3, 3, and
> 50 kb/s of traces.
> # So with the below, we just start with 1 test application
> producing 10 kb/s
> TestApps = 10
> # session lifetime in seconds
> TestTime = 10
> # Max number of successive sessions to configure; if n, then n-1
> sessions are run, e.g. MaxRun = 2 will run 1 session.
> MaxRun = 100
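>
> # For reference, a session built from these values should correspond
> # to lttng commands roughly like the following (the session and
> # channel names here are illustrative, not taken from the scripts):
> lttng create profiling-session --set-url net://CURRENT_NODE_IP
> lttng enable-channel --userspace --session profiling-session --subbuf-size 16k --num-subbuf 4 --buffers-uid chan0
> lttng enable-event --userspace --session profiling-session --channel chan0 'LOG_TEST_APP_PROFILING:*'
> lttng start profiling-session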
>
>
> # ==================== Place the following files under ===============
>
> # /tmp on the remote node
> clean_RemNode_apps.sh
> report_lttng_script.sh
>
> # rest of the scripts go under /usr/sbin on the current local node
> on which lttng_relayd runs
> # Define a tracepoint MY_TRACE taking a single string arg, with
> LOG_TEST_APP_PROFILING as the provider; compile the test app
> lttng_testApp and place it under /usr/sbin of the remote host,
> as sketched below.
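>
> # A minimal sketch of what that tracepoint provider header could look
> # like (the file name my_trace_tp.h is an assumption; the provider
> # and event names are the ones given above):
>
> #undef TRACEPOINT_PROVIDER
> #define TRACEPOINT_PROVIDER LOG_TEST_APP_PROFILING
>
> #undef TRACEPOINT_INCLUDE
> #define TRACEPOINT_INCLUDE "./my_trace_tp.h"
>
> #if !defined(_MY_TRACE_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
> #define _MY_TRACE_TP_H
>
> #include <lttng/tracepoint.h>
>
> /* MY_TRACE carries a single string payload, as described above. */
> TRACEPOINT_EVENT(
>     LOG_TEST_APP_PROFILING,
>     MY_TRACE,
>     TP_ARGS(const char *, msg),
>     TP_FIELDS(
>         ctf_string(msg, msg)
>     )
> )
>
> #endif /* _MY_TRACE_TP_H */
>
> #include <lttng/tracepoint-event.h>
>
> # The test app would then emit one trace with something like
> # (trace_msg_buffer being a hypothetical formatted string):
> # tracepoint(LOG_TEST_APP_PROFILING, MY_TRACE, trace_msg_buffer);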
>
> # in launch_script_RN.py change currentNodeIP to the IP address on
> which relayd is receiving, default ports are used.
>
> # lttng_relayd is started as
> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>
> # lttng_sessiond is started as
> /usr/bin/lttng-sessiond --consumerd32-path
> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
> /usr/lib/ --consumerd64-path
> /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
> /usr/lib64/ -b --no-kernel
>
>
>
>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
> <Jonathan.rajotte-julien at efficios.com> wrote:
>
> Hi Aravind,
>
> There is no README in the archive you sent.
>
> Cheers
>
> On 2015-12-08 07:51 AM, Aravind HT wrote:
>
> Hi,
>
> I am trying to upgrade in parallel, but this issue may
> still be present after I upgrade, or may be temporarily
> masked. So I need to find the root cause for this and then
> see whether a fix is available in the latest before
> committing to an upgrade.
>
> There is another issue I'm hitting: the lttng list command
> hangs after lttng destroy when running the profiling.
>
> I found that the 64-bit consumerd goes into an infinite loop
> waiting to flush metadata in
> lttng_ustconsumer_recv_metadata() :: while
> (consumer_metadata_cache_flushed(channel, offset + len,
> timer)).
> In consumer_metadata_cache,
> channel->metadata_stream->endpoint_status is
> CONSUMER_ENDPOINT_ACTIVE and
> metadata_stream->ust_metadata_pushed is 0, with offset
> having some value. This call always returns 1 from the
> last else{} block, resulting in an infinite loop. Upon
> searching the forum, I found the same issue reported
> here:
> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
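>
> A paraphrased sketch of the loop described above (not verbatim
> lttng-tools source; function and field names are as reported in
> this thread):
>
> /* in lttng_ustconsumer_recv_metadata(): wait for the cache flush */
> while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>         /*
>          * Keeps returning 1 from the last else{} block because
>          * metadata_stream->ust_metadata_pushed stays 0 (never
>          * catches up to offset) while endpoint_status is still
>          * CONSUMER_ENDPOINT_ACTIVE, so the wait never terminates.
>          */
> }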
>
> Regards,
> Aravind.
>
>
> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
> <jonathan.r.julien at gmail.com> wrote:
>
> Hi Aravind,
>
> Did you have the chance to upgrade to 2.6.1? If so,
> were you able to reproduce?
>
> Cheers
>
> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
> <aravind.ht at gmail.com> wrote:
>
> Hi,
>
> I have attached the complete profiling scripts
> here; they're a bit
> shabby, I'm new to Python.
>
> There is a README which has the details on how to
> execute them.
> I'm using Yocto 1.6 on x86_64 platforms on both
> nodes.
>
>
> Running this script when there are other sessions running seems
> to reproduce this problem easily.
> Please try it and let me know if you have any issues reproducing
> the problem.
>
> Regards,
> Aravind.
>
> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
> <jeremie.galarneau at efficios.com> wrote:
>
> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
> <aravind.ht at gmail.com> wrote:
> > I am using 2.6.0. I will try to share the code that I'm using
> > here in some time. If there are any specific fixes that are
> > relevant to this issue, see if you can provide a link to them.
> > I would ideally like to try them out before trying a full
> > upgrade to the latest versions.
>
> Hi,
>
> Can you provide more information on the system? Which
> distribution, architecture, kernel version?
>
> The verbose sessiond logs might help pinpoint any unexpected
> behaviour here (are all applications registering as expected?).
>
> Jérémie
>
> >
> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
> > <jeremie.galarneau at efficios.com> wrote:
> >>
> >> Hi Aravind,
> >>
> >> Can't say I have looked at everything you sent yet, but as a
> >> preemptive question, which version are we talking about here?
> >> 2.6.0 or 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
> >>
> >> Thanks,
> >> Jérémie
> >>
> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
> >> <aravind.ht at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I am trying to obtain the performance characteristics of lttng
> >> > with the use of test applications. Traces are produced on a
> >> > local node and delivered to relayd, which runs on a separate
> >> > node for storage.
> >> >
> >> > An lttng session with the test applications producing an
> >> > initial bit rate of 10 kb/s is started and run for about 30
> >> > seconds. The starting sub-buffer size is kept at 128 kb and the
> >> > sub-buffer count at 4. The session is then stopped and
> >> > destroyed, and the traces are analyzed to see if there are any
> >> > drops. This is done in a loop, with every subsequent session
> >> > incrementing the rate by 2 kb/s as long as there are no drops.
> >> > If there are drops, I double the buffer size without
> >> > incrementing the bit rate; see the sketch just below.
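> >> >
> >> > A rough sketch of that ramp-up loop (run_session() and
> >> > traces_dropped() are hypothetical helpers standing in for the
> >> > profiling scripts, not part of lttng):
> >> >
> >> > /* ramp the trace load until drops appear; grow buffers on drops */
> >> > int rate_kbps = 10;                 /* initial test-app bit rate  */
> >> > int subbuf_kb = 128, n_subbuf = 4;  /* starting sub-buffers       */
> >> > int run = 0, max_runs = 100;        /* MaxRun in configFile.txt   */
> >> > while (run++ < max_runs) {
> >> >     run_session(rate_kbps, subbuf_kb, n_subbuf, 30 /* seconds */);
> >> >     if (traces_dropped())
> >> >         subbuf_kb *= 2;             /* double buffers, keep rate  */
> >> >     else
> >> >         rate_kbps += 2;             /* no drops: raise bit rate   */
> >> > }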
> >> >
> >> > I see trace drops happening consistently with test apps
> >> > producing traces at less than 40 kb/s; it doesn't seem to help
> >> > even if I start with 1 MB x 4 sub-buffers.
> >> >
> >> > Analysis:
> >> >
> >> > I have attached the lttng_relayd and lttng_consumerd_64 logs
> >> > and the entire trace directory; hope you will be able to view
> >> > it.
> >> > I have modified the lttng_relayd code to dump the traces being
> >> > captured in the lttng_relayd logs along with debug info.
> >> >
> >> > Each test app produces logs of the form:
> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s
> >> > appRate - 2079 b/s threadTraceNum - 9 appTraceNum - 18
> >> > sleepTime - 192120"
> >> >
> >> > The test application PID, test application thread id, thread
> >> > bit rate, test app bit rate, thread trace number and
> >> > application trace number are part of the trace. So in the above
> >> > trace, the thread is producing at about 1 kb/s and the whole
> >> > test app at about 2 kb/s.
> >> >
> >> > If we look at the babeltrace output, we see that the trace with
> >> > TraceApp PID - 31940 appTraceNum 2 is missing, with 1, 3, 4, 5
> >> > and so on being successfully captured.
> >> > I looked at the lttng_relayd logs and found that the trace with
> >> > "appTraceNum 2" is not delivered by the consumerd to the relayd
> >> > in sequence with the other traces. To rule out a test
> >> > application problem, you can look at lttng_relayd log line
> >> > 12778 and see that traces from appTraceNum - 1 to
> >> > appTraceNum - 18, including appTraceNum 2, are "re-delivered"
> >> > by the consumerd to the relayd.
> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being
> >> > delivered twice: once individually, where appTraceNum 2 is
> >> > missing, and once as a group at line 12778, where it is
> >> > present.
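> >> >
> >> > (To check for such gaps, something like the following can be
> >> > used to list the captured appTraceNum values; the trace path is
> >> > an example, assuming the default relayd output directory:
> >> > babeltrace /var/log/lttng-traces/<session-dir> | grep -o 'appTraceNum - [0-9]*'
> >> > )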
> >> >
> >> >
> >> > Request help with:
> >> > 1. why are traces delivered twice; is it by design or a genuine
> >> > problem?
> >> > 2. how can traces be kept from being dropped even though the
> >> > buffers are sufficiently large?
> >> >
> >> >
> >> > Regards,
> >> > Aravind.
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jérémie Galarneau
> >> EfficiOS Inc.
> >> http://www.efficios.com
> >
> >
>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>
>
>
>
>
>
>
> -- Jonathan Rajotte Julien
>
>
>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
> --
> Jonathan R. Julien
> Efficios
>
>
>
--
Jonathan R. Julien
Efficios