[lttng-dev] Relayd trace drops

Aravind HT aravind.ht at gmail.com
Fri Dec 11 13:42:36 EST 2015


No problem, I understand it; I will continue to investigate this. I got a bit
sidetracked with the sessiond hang issue that I mentioned earlier.
I am making some progress on that and will update as soon as I find the root
cause. It looks like a mutex deadlock between the following threads in the
consumerd daemon:

#1  0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26, buf=buf@entry=0x7f16acca57f0, len=len@entry=4476) at unix.c:175
#2  0x000000000041dc96 in lttng_ustconsumer_request_metadata (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0, wait=wait@entry=0) at ust-consumer.c:2331
#3  0x000000000041ff3c in lttng_ustconsumer_sync_metadata (ctx=ctx@entry=0x640d10, metadata=metadata@entry=0x7f169801aa40) at ust-consumer.c:1829
#4  0x000000000040eabf in do_sync_metadata (ctx=<optimized out>, metadata=<optimized out>) at consumer-stream.c:468
#5  consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8) at consumer-stream.c:545
#6  0x000000000041d316 in lttng_ustconsumer_read_subbuffer (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at ust-consumer.c:2063
#7  0x000000000040b613 in lttng_consumer_read_subbuffer (stream=0x7f1698018fa0, ctx=0x640d10) at consumer.c:3203
#8  0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at consumer.c:2557
#9  0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at pthread_create.c:312
#10 0x00007f16af3dcafd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

and

consumer_thread_metadata_poll(), which calls lttng_consumer_read_subbuffer() /
on_buffer_ready(); there it calls pthread_mutex_lock(stream->lock) and then hangs.
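
For illustration, the two backtraces reduce to the following shape: one
thread blocks in a recv()/read() while still holding the stream lock, and
the other thread waits forever on that same lock. A minimal standalone
sketch of that pattern (the names and the pipe stand-in for the sessiond
command socket are my assumptions, not the actual consumerd code):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER;
static int pipefd[2];   /* stands in for the sessiond command socket */

/* Models consumer_thread_data_poll(): the read_subbuffer path takes the
 * stream lock, then blocks waiting for a metadata reply that never
 * arrives (frame #1 above). */
static void *data_poll(void *arg)
{
    char buf[16];
    pthread_mutex_lock(&stream_lock);
    read(pipefd[0], buf, sizeof(buf));  /* blocks forever, lock held */
    pthread_mutex_unlock(&stream_lock);
    return NULL;
}

/* Models consumer_thread_metadata_poll() / on_buffer_ready(): it tries
 * to take the same stream lock and hangs behind the blocked reader. */
static void *metadata_poll(void *arg)
{
    pthread_mutex_lock(&stream_lock);   /* hangs here */
    pthread_mutex_unlock(&stream_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pipe(pipefd);                       /* nobody ever writes to it */
    pthread_create(&a, NULL, data_poll, NULL);
    pthread_create(&b, NULL, metadata_poll, NULL);
    pthread_join(a, NULL);              /* never returns */
    pthread_join(b, NULL);
    return 0;
}

If that is what is happening, the reply the first thread is waiting for
may itself be stalled behind the second thread, which would make the hang
permanent rather than just a slow response.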


This again can be reproduced with the same script, just by running consecutive
sessions of short duration (<10 seconds) with traces getting captured on a
remote node.
I hit this every time I run the script.

Regards,
Aravind.



On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
Jonathan.rajotte-julien at efficios.com> wrote:

> Hi Aravind,
>
> On 2015-12-11 12:16 PM, Aravind HT wrote:
>
>> Hi,
>>
>> Did you get a chance to reproduce the problem with the scripts ? Let me
>> know if you need any help running it.
>>
>
> Time is a scarce resource, unfortunately. But until I have the time to look
> at it, you might want to investigate whether you are able to reproduce it on
> 2.6.1/2.7.0/master on 2 regular (x86_64, Ubuntu/Debian) VMs; this would
> really help.
>
> Cheers!
>
>
>> Regards,
>> Aravind.
>>
>>
>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht at gmail.com> wrote:
>>
>>     Sorry about that, not sure how it got missed.
>>     Here it is.
>>
>>
>>     # This needs a two-node setup (1. local current node 2. remote node)
>>     # relayd runs on the current node where traces are captured from
>>     the remote node
>>     # remote node runs test applications which generate traces.
>>     # launch_script_RN.py is executed on the current node and uses
>>     ssh to execute commands on the remote node, so this part may not
>>     work in every case and may prompt for a password.
>>     # if experiencing problems with ssh, kindly check
>>
>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>>
>>     # ====================       To Run  =============================
>>     launch_script_RN.py self_profile -c /tmp/configFile.txt
>>
>>
>>
>>     # configFile.txt is the file with the configuration params that the
>>     # launch script needs to configure lttng sessions. Below is an
>>     # explanation of the different options.
>>     # ===================     configFile.txt =============================
>>
>>     [section1]
>>     # final output file path
>>     OutputFile = /tmp/Final_report.txt
>>     # the remote node hostname on which test applications run and the
>>     test sessions will be created; this should be something that could
>>     be used with ssh. Traces will be transported from this node to the
>>     lttng_relayd running on the current node.
>>     Node = MY_REMOTE_NODE
>>     # Sub buffer size to start this with
>>     SubBufSize = 16k
>>     # Sub buffer count
>>     SubBufCount = 4
>>     # per uid buffer
>>     BufferScheme = --buffers-uid
>>     # yes
>>     EnableTracing = yes
>>     # Bit rate of the test applications. Comma-separated; example "1,
>>     3, 3, 50" says 4 test applications producing 1, 3, 3, and 50 kb/s
>>     of traces.
>>     # So with the below, we just start with 1 test application
>>     producing 10 kb/s
>>     TestApps = 10
>>     # session lifetime in seconds
>>     TestTime = 10
>>     # Max number of successive sessions to configure. If n, then n-1
>>     sessions are run; e.g. MaxRun = 2 will run 1 session.
>>     MaxRun = 100
>>
>>
>>     # ====================  Place the following files under
>>     ===============
>>
>>     # /tmp on the remote node
>>     clean_RemNode_apps.sh
>>     report_lttng_script.sh
>>
>>     # rest of the scripts go under /usr/sbin on the current local node on
>>     which lttng-relayd runs
>>     # Define a tracepoint MY_TRACE that takes a single string arg, with
>>     LOG_TEST_APP_PROFILING as the provider; compile the test app
>>     lttng_testApp and place it under /usr/sbin of the remote host
>>     (a sketch of such a provider follows below)
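>>
>>     # For reference, a minimal sketch of what such a tracepoint
>>     # provider header could look like; this is an assumption for
>>     # illustration (the actual lttng_testApp source was not posted),
>>     # using the standard lttng-ust provider boilerplate:
>>
>>     /* tp.h */
>>     #undef TRACEPOINT_PROVIDER
>>     #define TRACEPOINT_PROVIDER LOG_TEST_APP_PROFILING
>>
>>     #undef TRACEPOINT_INCLUDE
>>     #define TRACEPOINT_INCLUDE "./tp.h"
>>
>>     #if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
>>     #define _TP_H
>>
>>     #include <lttng/tracepoint.h>
>>
>>     /* One event, MY_TRACE, carrying a single string payload. */
>>     TRACEPOINT_EVENT(
>>         LOG_TEST_APP_PROFILING,
>>         MY_TRACE,
>>         TP_ARGS(const char *, msg),
>>         TP_FIELDS(
>>             ctf_string(message, msg)
>>         )
>>     )
>>
>>     #endif /* _TP_H */
>>
>>     #include <lttng/tracepoint-event.h>
>>
>>     # The app would then emit each trace with
>>     # tracepoint(LOG_TEST_APP_PROFILING, MY_TRACE, "TraceApp PID - ...")
>>     # and be linked with -llttng-ust -ldl.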
>>
>>     # in launch_script_RN.py change currentNodeIP to the IP address on
>>     which relayd is receiving, default ports are used.
>>
>>     # lttng_relayd is started as
>>     /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>>
>>     # lttng_sessiond is started as
>>     /usr/bin/lttng-sessiond --consumerd32-path
>>     /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>     /usr/lib/ --consumerd64-path
>>     /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>     /usr/lib64/ -b --no-kernel
>>
>>
>>
>>
>>     Regards,
>>     Aravind.
>>
>>     On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>     <Jonathan.rajotte-julien at efficios.com> wrote:
>>
>>         Hi Aravind,
>>
>>         There is no README in the archive you sent.
>>
>>         Cheers
>>
>>         On 2015-12-08 07:51 AM, Aravind HT wrote:
>>
>>             Hi,
>>
>>             I am trying to upgrade in parallel, but this issue may
>>             still be present after I upgrade, or may be temporarily
>>             masked. So I need to find the root cause for this and
>>             then see if a fix is available on the latest before
>>             committing to the upgrade.
>>
>>             There is another issue I'm hitting: the lttng list command
>>             hangs after lttng destroy session when running the profiling.
>>
>>             I found that the 64-bit consumerd goes into an infinite loop
>>             waiting to flush metadata in
>>             lttng_ustconsumer_recv_metadata() :: while
>>             (consumer_metadata_cache_flushed(channel, offset + len,
>>             timer)).
>>             In consumer_metadata_cache,
>>             channel->metadata_stream->endpoint_status is
>>             CONSUMER_ENDPOINT_ACTIVE, and
>>             metadata_stream->ust_metadata_pushed is 0 with offset
>>             having some value. The call always returns 1 from the
>>             last else {} block, resulting in an infinite loop. Upon
>>             searching the forum, I found the same issue reported
>>             here:
>>
>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
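>>
>>             The stuck condition reduces to roughly the following
>>             (a paraphrased sketch of the 2.6.0 behaviour described
>>             above, with simplified struct and field names, not the
>>             exact source):
>>
>>             #include <stdint.h>
>>
>>             enum endpoint_status {
>>                 CONSUMER_ENDPOINT_ACTIVE,
>>                 CONSUMER_ENDPOINT_INACTIVE,
>>             };
>>
>>             struct stream {
>>                 enum endpoint_status endpoint_status;
>>                 uint64_t ust_metadata_pushed;
>>             };
>>
>>             struct channel {
>>                 struct stream *metadata_stream;
>>             };
>>
>>             /* Returns nonzero while metadata remains to flush. */
>>             static int metadata_cache_flushed(struct channel *chan,
>>                                               uint64_t offset)
>>             {
>>                 struct stream *md = chan->metadata_stream;
>>
>>                 if (md->endpoint_status != CONSUMER_ENDPOINT_ACTIVE)
>>                     return 0;  /* stream hung up: stop waiting */
>>                 if (md->ust_metadata_pushed >= offset)
>>                     return 0;  /* flushed up to 'offset' */
>>                 /* The last else {} block: with ust_metadata_pushed
>>                  * stuck at 0 and the endpoint still ACTIVE, this
>>                  * returns 1 forever, so the caller's while () in
>>                  * lttng_ustconsumer_recv_metadata() never exits. */
>>                 return 1;
>>             }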
>>
>>             Regards,
>>             Aravind.
>>
>>
>>             On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>             <jonathan.r.julien at gmail.com> wrote:
>>
>>                 Hi Aravind,
>>
>>                 Did you have the chance to upgrade to 2.6.1? If so,
>>                 were you able to reproduce?
>>
>>                 Cheers
>>
>>                 On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>                 <aravind.ht at gmail.com> wrote:
>>
>>                     Hi,
>>
>>                     I have attached the complete profiling scripts
>>                     here; they're a bit shabby, I'm new to Python.
>>
>>                     There is a README which has the details on how to
>>                     execute it. I'm using Yocto 1.6 on x86_64 platforms
>>                     on both nodes.
>>
>>
>>                     Running this script when there are other sessions
>>                     running seems to reproduce this problem easily.
>>                     Please try it and let me know if you have any issues
>>                     reproducing the problem.
>>
>>                     Regards,
>>                     Aravind.
>>
>>                     On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>                     <jeremie.galarneau at efficios.com> wrote:
>>
>>                         On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>                         <aravind.ht at gmail.com> wrote:
>>                         > I am using 2.6.0. I will try to share the
>>                         > code that I'm using here in some time. If
>>                         > there are any specific fixes that are
>>                         > relevant to this issue, see if you can
>>                         > provide a link to them. I would ideally like
>>                         > to try them out before trying a full upgrade
>>                         > to the latest versions.
>>
>>                         Hi,
>>
>>                         Can you provide more information on the
>>                         system? Which distribution, architecture,
>>                         kernel version?
>>
>>                         The verbose sessiond logs might help pinpoint
>>                         any unexpected behaviour here (are all
>>                         applications registering as expected?).
>>
>>                         Jérémie
>>
>>                         >
>>                         > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>                         > <jeremie.galarneau at efficios.com> wrote:
>>                         >>
>>                         >> Hi Aravind,
>>                         >>
>>                         >> Can't say I have looked at everything you
>>                         >> sent yet, but as a preemptive question,
>>                         >> which version are we talking about here?
>>                         >> 2.6.0 or 2.6.1? 2.6.1 contains a lot of
>>                         >> relay daemon fixes.
>>                         >>
>>                         >> Thanks,
>>                         >> Jérémie
>>                         >>
>>                         >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>                         >> <aravind.ht at gmail.com> wrote:
>>                         >> > Hi,
>>                         >> >
>>                         >> > I am trying to obtain the performance
>>                         >> > characteristics of lttng with the use of
>>                         >> > test applications. Traces are being
>>                         >> > produced on a local node and delivered to
>>                         >> > relayd that is running on a separate node
>>                         >> > for storage.
>>                         >> >
>>                         >> > An lttng session with the test
>>                         >> > applications producing an initial bit rate
>>                         >> > of 10 kb/s is started and run for about 30
>>                         >> > seconds. The starting sub-buffer size is
>>                         >> > kept at 128 kb and sub-buf count at 4. The
>>                         >> > session is then stopped and destroyed and
>>                         >> > traces are analyzed to see if there are
>>                         >> > any drops.
>>                         >> > This is being done in a loop, with every
>>                         >> > subsequent session having an increment of
>>                         >> > 2 kb/s as long as there are no drops. If
>>                         >> > there are drops, I increase the buffer size
>>                         >> > by a factor of x2 without incrementing the
>>                         >> > bit rate.
>>                         >> >
>>                         >> > I see trace drops happening consistently
>>                         >> > with test apps producing traces at less
>>                         >> > than 40 kb/s; it doesn't seem to help even
>>                         >> > if I start with 1 MB x 4 sub-buffers.
>>                         >> >
>>                         >> > Analysis :
>>                         >> >
>>                         >> > I have attached the lttng_relayd and
>>                         >> > lttng_consumerd_64 logs and the entire
>>                         >> > trace directory; hope you will be able to
>>                         >> > view it.
>>                         >> > I have modified lttng_relayd code to dump
>>                         >> > the traces being captured in the
>>                         >> > lttng_relayd logs along with debug info.
>>                         >> >
>>                         >> > Each test app is producing logs in the form of:
>>                         >> > "TraceApp PID - 31940 THID - 31970
>>                         >> > @threadRate - 1032 b/s appRate - 2079 b/s
>>                         >> > threadTraceNum - 9 appTraceNum - 18
>>                         >> > sleepTime - 192120"
>>                         >> >
>>                         >> > The test application PID, test application
>>                         >> > thread id, thread bit rate, test app bit
>>                         >> > rate, thread trace number and application
>>                         >> > trace numbers are part of the trace. So in
>>                         >> > the above trace, the thread is producing at
>>                         >> > 1 kb/s and the whole test app is producing
>>                         >> > at 2 kb/s.
>>                         >> >
>>                         >> > If we look at the babeltrace output, we
>>                         >> > see that the trace with TraceApp PID -
>>                         >> > 31940 appTraceNum 2 is missing, with 1, 3,
>>                         >> > 4, 5 and so on being successfully captured.
>>                         >> > I looked at the lttng_relayd logs and
>>                         >> > found that the trace of "appTraceNum 2" is
>>                         >> > not delivered/generated by the consumerd to
>>                         >> > the relayd in sequence with other traces.
>>                         >> > To rule out a test application problem, you
>>                         >> > can look at lttng_relayd log line 12778 and
>>                         >> > see traces from appTraceNum - 1 to
>>                         >> > appTraceNum - 18, including appTraceNum 2,
>>                         >> > being "re-delivered" by the consumerd to
>>                         >> > the relayd.
>>                         >> > Essentially, I see appTraceNum 1 through
>>                         >> > appTraceNum 18 being delivered twice: once
>>                         >> > individually, where appTraceNum 2 is
>>                         >> > missing, and once as a group at line 12778,
>>                         >> > where it's present.
>>                         >> >
>>                         >> >
>>                         >> > Request help with
>>                         >> > 1. why traces are delivered twice: is it
>>                         >> > by design or a genuine problem?
>>                         >> > 2. how to avoid traces being dropped even
>>                         >> > though the buffers are sufficiently large?
>>                         >> >
>>                         >> >
>>                         >> > Regards,
>>                         >> > Aravind.
>>                         >> >
>>                         >> >
>>             _______________________________________________
>>                         >> > lttng-dev mailing list
>>                         >> > lttng-dev at lists.lttng.org
>>                         >> >
>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>                         >> >
>>                         >>
>>                         >>
>>                         >>
>>                         >> --
>>                         >> Jérémie Galarneau
>>                         >> EfficiOS Inc.
>>                         >> http://www.efficios.com
>>                         >
>>                         >
>>
>>
>>
>>                         --
>>                         Jérémie Galarneau
>>                         EfficiOS Inc.
>>             http://www.efficios.com
>>
>>
>>
>>             _______________________________________________
>>                     lttng-dev mailing list
>>             lttng-dev at lists.lttng.org
>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>>
>>
>>                 --
>>                 Jonathan Rajotte Julien
>>
>>
>>
>>
>>             _______________________________________________
>>             lttng-dev mailing list
>>             lttng-dev at lists.lttng.org
>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>>         --
>>         Jonathan R. Julien
>>         Efficios
>>
>>
>>
>>
> --
> Jonathan R. Julien
> Efficios
>
>

