[lttng-dev] Relayd trace drops

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Fri Dec 11 17:43:07 EST 2015


Hi, 

Please upgrade to 2.6.1 or 2.7.0; those are old issues. 

Thanks, 

Mathieu 

----- On Dec 11, 2015, at 1:42 PM, Aravind HT <aravind.ht at gmail.com> wrote: 

> No problem, understood. I will continue to investigate this; I got a bit
> side-tracked with the sessiond hang issue that I mentioned earlier.
> I am making some progress on that and will update as soon as I find the root
> cause. It looks like a mutex deadlock between the following threads in the
> consumerd daemon:

> #1  0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26, buf=buf@entry=0x7f16acca57f0, len=len@entry=4476) at unix.c:175
> #2  0x000000000041dc96 in lttng_ustconsumer_request_metadata (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0, wait=wait@entry=0) at ust-consumer.c:2331
> #3  0x000000000041ff3c in lttng_ustconsumer_sync_metadata (ctx=ctx@entry=0x640d10, metadata=metadata@entry=0x7f169801aa40) at ust-consumer.c:1829
> #4  0x000000000040eabf in do_sync_metadata (ctx=<optimized out>, metadata=<optimized out>) at consumer-stream.c:468
> #5  consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8) at consumer-stream.c:545
> #6  0x000000000041d316 in lttng_ustconsumer_read_subbuffer (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at ust-consumer.c:2063
> #7  0x000000000040b613 in lttng_consumer_read_subbuffer (stream=0x7f1698018fa0, ctx=0x640d10) at consumer.c:3203
> #8  0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at consumer.c:2557
> #9  0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at pthread_create.c:312
> #10 0x00007f16af3dcafd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

> and

> consumer_thread_metadata_poll(), which calls lttng_consumer_read_subbuffer() /
> on_buffer_ready(); on_buffer_ready() calls pthread_mutex_lock(stream->lock) and then hangs.
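
> A minimal standalone sketch of the suspected hold-and-wait pattern (this is
> not LTTng source; the mutex and the pipe below merely stand in for
> stream->lock and the sessiond metadata socket):

> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER;
> static int fds[2];   /* stand-in for the metadata socket */
>
> static void *data_poll_thread(void *arg)
> {
>     char buf[16];
>     (void) arg;
>     pthread_mutex_lock(&stream_lock);     /* takes the stream lock... */
>     read(fds[0], buf, sizeof(buf));       /* ...then blocks waiting for metadata */
>     pthread_mutex_unlock(&stream_lock);
>     return NULL;
> }
>
> static void *metadata_poll_thread(void *arg)
> {
>     (void) arg;
>     pthread_mutex_lock(&stream_lock);     /* blocks: the lock is already held */
>     write(fds[1], "meta", 4);             /* so the data that would unblock the
>                                            * other thread is never written */
>     pthread_mutex_unlock(&stream_lock);
>     return NULL;
> }
>
> int main(void)
> {
>     pthread_t t1, t2;
>     pipe(fds);
>     pthread_create(&t1, NULL, data_poll_thread, NULL);
>     sleep(1);                             /* let the data thread take the lock first */
>     pthread_create(&t2, NULL, metadata_poll_thread, NULL);
>     pthread_join(t1, NULL);               /* never returns: both threads are stuck */
>     pthread_join(t2, NULL);
>     puts("not reached");
>     return 0;
> }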

> This again can be reproduced by the same script, just running consecutive
> sessions of short duration (<10 seconds) with traces getting captured on a
> remote node.
> I hit this every time I run the script.

> Regards,
> Aravind.

> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
> Jonathan.rajotte-julien at efficios.com > wrote:

>> Hi Aravind,

>> On 2015-12-11 12:16 PM, Aravind HT wrote:

>>> Hi,

>>> Did you get a chance to reproduce the problem with the scripts ? Let me know if
>>> you need any help running it.

>> Time is a scarce resource, unfortunately. But until I have the time to look
>> at it, you might want to investigate whether you can reproduce it on
>> 2.6.1/2.7.0/master on two regular (x86_64, Ubuntu/Debian) VMs; this would
>> really help.

>> Cheers!

>>> Regards,
>>> Aravind.

>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT < aravind.ht at gmail.com > wrote:

>>> Sorry about that, not sure how it got missed.
>>> Here it is.

>>> # This needs a two-node setup (1. the local, current node; 2. a remote node).
>>> # relayd runs on the current node, where traces are captured from the remote node.
>>> # The remote node runs the test applications which generate traces.
>>> # launch_script_RN is executed on the current node and uses ssh to execute
>>> # commands on the remote node, so this part may not work in every case and
>>> # may prompt for a password.
>>> # If experiencing problems with ssh, kindly check
>>> # http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password

>>> # ==================== To Run =============================
>>> launch_script_RN.py self_profile -c /tmp/configFile.txt

>>> # configFile.txt is the file which has the configuration params that the
>>> # launch script needs to configure lttng sessions. Below is an explanation
>>> # of the different options.
>>> # =================== configFile.txt =============================

>>> [section1]
>>> # final output file path
>>> OutputFile = /tmp/Final_report.txt
>>> # The remote node hostname on which the test applications run and the test
>>> # sessions will be created; this should be something that can be used with
>>> # ssh. Traces will be transported from this node to the lttng_relayd running
>>> # on the current node.
>>> Node = MY_REMOTE_NODE
>>> # Sub-buffer size to start with
>>> SubBufSize = 16k
>>> # Sub-buffer count
>>> SubBufCount = 4
>>> # per-uid buffers
>>> BufferScheme = --buffers-uid
>>> # yes
>>> EnableTracing = yes
>>> # Bit rates of the test applications. Comma-separated; for example "1, 3, 3, 50"
>>> # says 4 test applications producing 1, 3, 3 and 50 kb/s of traces.
>>> # So with the below, we just start with 1 test application producing 10 kb/s.
>>> TestApps = 10
>>> # session lifetime in seconds
>>> TestTime = 10
>>> # Max number of successive sessions to configure. If n, then n-1 sessions
>>> # are run, e.g. MaxRun = 2 will run 1 session.
>>> MaxRun = 100

>>> # ==================== Place the following files under ===============

>>> # /tmp on the remote node
>>> clean_RemNode_apps.sh
>>> report_lttng_script.sh

>>> # The rest of the scripts go under /usr/sbin on the current, local node on
>>> # which lttng_relayd runs.
>>> # Define a tracepoint MY_TRACE taking a single string arg with
>>> # LOG_TEST_APP_PROFILING as the provider (a sketch of such a provider header
>>> # follows below), compile the test app lttng_testApp, and place it under
>>> # /usr/sbin of the remote host.
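
>>> # A minimal LTTng-UST provider header for such a tracepoint might look as
>>> # follows. This is only a sketch; the file name and include guard are
>>> # placeholders, and the attached lttng_testApp may define it differently.
>>>
>>> #undef TRACEPOINT_PROVIDER
>>> #define TRACEPOINT_PROVIDER LOG_TEST_APP_PROFILING
>>>
>>> #undef TRACEPOINT_INCLUDE
>>> #define TRACEPOINT_INCLUDE "./testapp_tp.h"
>>>
>>> #if !defined(_TESTAPP_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
>>> #define _TESTAPP_TP_H
>>>
>>> #include <lttng/tracepoint.h>
>>>
>>> TRACEPOINT_EVENT(
>>>     LOG_TEST_APP_PROFILING,
>>>     MY_TRACE,
>>>     TP_ARGS(const char *, msg),
>>>     TP_FIELDS(
>>>         ctf_string(msg, msg)   /* single string payload, as described above */
>>>     )
>>> )
>>>
>>> #endif /* _TESTAPP_TP_H */
>>>
>>> #include <lttng/tracepoint-event.h>
>>>
>>> # In the test app: tracepoint(LOG_TEST_APP_PROFILING, MY_TRACE, "...");
>>> # compile the provider with -I. and link with -llttng-ust -ldl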

>>> # In launch_script_RN.py, change currentNodeIP to the IP address on which
>>> # relayd is receiving; default ports are used.

>>> # lttng_relayd is started as
>>> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d

>>> # lttng_sessiond is started as
>>> /usr/bin/lttng-sessiond --consumerd32-path /usr/lib/lttng/libexec/lttng-consumerd \
>>>     --consumerd32-libdir /usr/lib/ \
>>>     --consumerd64-path /usr/lib64/lttng/libexec/lttng-consumerd \
>>>     --consumerd64-libdir /usr/lib64/ -b --no-kernel
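
>>> # For reference, the per-session command sequence is assumed to be the
>>> # usual lttng streaming workflow (not copied from launch_script_RN.py; the
>>> # session and channel names are placeholders):
>>> lttng create profiling --set-url=net://CURRENT_NODE_IP
>>> lttng enable-channel --userspace --subbuf-size=16k --num-subbuf=4 --buffers-uid ch0 --session=profiling
>>> lttng enable-event --userspace --channel=ch0 --session=profiling 'LOG_TEST_APP_PROFILING:*'
>>> lttng start profiling
>>> sleep 10   # TestTime from configFile.txt
>>> lttng stop profiling
>>> lttng destroy profiling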

>>> Regards,
>>> Aravind.

>>> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>> < Jonathan.rajotte-julien at efficios.com > wrote:

>>> Hi Aravind,

>>> There is no README in the archive you sent.

>>> Cheers

>>> On 2015-12-08 07:51 AM, Aravind HT wrote:

>>> Hi,

>>> I am trying to upgrade in parallel, but this issue may still be present
>>> after I upgrade, or may be temporarily masked. So I need to find the root
>>> cause for this and then see whether a fix is available in the latest
>>> versions before committing to an upgrade.

>>> There is another issue I'm hitting: the lttng list command hangs after an
>>> lttng destroy of the session when running the profiling.

>>> I found that the 64-bit consumerd goes into an infinite loop waiting to
>>> flush metadata in lttng_ustconsumer_recv_metadata() ::
>>> while (consumer_metadata_cache_flushed(channel, offset + len, timer)).
>>> In consumer_metadata_cache,
>>> channel->metadata_stream->endpoint_status is CONSUMER_ENDPOINT_ACTIVE and
>>> metadata_stream->ust_metadata_pushed is 0, with offset having some value.
>>> The call always returns 1 from the last else {} block, resulting in an
>>> infinite loop. Upon searching the forum, I found the same issue being
>>> reported here:
>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
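
>>> The shape of the loop, paraphrased from the description above (not verbatim
>>> LTTng source), is essentially:
>>>
>>> /* in lttng_ustconsumer_recv_metadata(), as observed: */
>>> while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>>>     /* Reported state: endpoint_status == CONSUMER_ENDPOINT_ACTIVE and
>>>      * ust_metadata_pushed == 0 while offset + len > 0, so the final
>>>      * else block keeps returning 1 and the loop never exits. */
>>> }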

>>> Regards,
>>> Aravind.

>>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>> < jonathan.r.julien at gmail.com > wrote:

>>> Hi Aravind,

>>> Did you have the chance to upgrade to 2.6.1? If so,
>>> were you able to reproduce?

>>> Cheers

>>> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>> < aravind.ht at gmail.com > wrote:

>>> Hi,

>>> I have attached the complete profiling scripts here; they are a bit shabby,
>>> I'm new to Python.

>>> There is a README which has the details on how to execute them.
>>> I'm using Yocto 1.6 on x86_64 platforms on both nodes.

>>> Running this script when there are other sessions running seems to
>>> reproduce this problem easily.
>>> Please try it and let me know if you have any issues reproducing the problem.

>>> Regards,
>>> Aravind.

>>> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>> < jeremie.galarneau at efficios.com > wrote:

>>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>> < aravind.ht at gmail.com > wrote:
>>> > I am using 2.6.0. I will try to share the code that I'm using here in
>>> > some time. If there are any specific fixes that are relevant to this
>>> > issue, see if you can provide a link to them. I would ideally like to
>>> > try them out before trying a full upgrade to the latest versions.

>>> Hi,

>>> Can you provide more information on the system? Which distribution,
>>> architecture, kernel version?

>>> The verbose sessiond logs might help pinpoint any unexpected behaviour
>>> here (are all applications registering as expected?).

>>> Jérémie


>>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>> > < jeremie.galarneau at efficios.com > wrote:

>>> >> Hi Aravind,

>>> >> Can't say I have looked at everything you sent yet, but as a
>>> >> preemptive question, which version are we talking about here? 2.6.0 or
>>> >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.

>>> >> Thanks,
>>> >> Jérémie

>>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>> >> < aravind.ht at gmail.com > wrote:
>>> >> > Hi,

>>> >> > I am trying to obtain the performance characteristics of lttng with
>>> >> > the use of test applications. Traces are being produced on a local
>>> >> > node and delivered to relayd that is running on a separate node for
>>> >> > storage.

>>> >> > An lttng session with the test applications producing an initial bit
>>> >> > rate of 10 kb/s is started and run for about 30 seconds. The starting
>>> >> > sub-buffer size is kept at 128 kb and the sub-buffer count at 4. The
>>> >> > session is then stopped and destroyed, and the traces are analyzed to
>>> >> > see if there are any drops. This is being done in a loop, with every
>>> >> > subsequent session having an increment of 2 kb/s as long as there are
>>> >> > no drops. If there are drops, I increase the buffer size by a factor
>>> >> > of x2 without incrementing the bit rate.
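
>>> >> > A self-contained sketch of this ramp-up loop (the helpers are
>>> >> > hypothetical stand-ins, not the attached launch_script_RN.py):
>>> >> >
>>> >> > #include <stdio.h>
>>> >> >
>>> >> > /* stand-ins for "run one 30 s session" and "analyze the traces for drops" */
>>> >> > static void run_session(int rate_kbps, int subbuf_kb, int nr_subbuf) { }
>>> >> > static int has_drops(void) { return 0; }
>>> >> >
>>> >> > int main(void)
>>> >> > {
>>> >> >     int rate_kbps = 10, subbuf_kb = 128, nr_subbuf = 4;
>>> >> >     for (int run = 0; run < 100; run++) {
>>> >> >         run_session(rate_kbps, subbuf_kb, nr_subbuf);
>>> >> >         if (has_drops())
>>> >> >             subbuf_kb *= 2;   /* drops: double the sub-buffer size, keep the rate */
>>> >> >         else
>>> >> >             rate_kbps += 2;   /* no drops: raise the rate by 2 kb/s */
>>> >> >     }
>>> >> >     printf("final rate %d kb/s, sub-buffer %d kb\n", rate_kbps, subbuf_kb);
>>> >> >     return 0;
>>> >> > }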

>>> >> > I see trace drops happening consistently with test apps producing
>>> >> > traces at less than 40 kb/s; it doesn't seem to help even if I start
>>> >> > with 1 MB x 4 sub-buffers.

>>> >> > Analysis:

>>> >> > I have attached the lttng_relayd and lttng_consumerd_64 logs and the
>>> >> > entire trace directory; I hope you will be able to view them.
>>> >> > I have modified the lttng_relayd code to dump the traces being
>>> >> > captured into the lttng_relayd logs along with the debug info.

>>> >> > Each test app is producing logs of the form:
>>> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - 2079 b/s
>>> >> > threadTraceNum - 9 appTraceNum - 18 sleepTime - 192120"

>>> >> > The test application PID, test application thread id, thread bit
>>> >> > rate, test app bit rate, thread trace number and application trace
>>> >> > number are part of the trace. So in the above trace, the thread is
>>> >> > producing at 1 kb/s and the whole test app is producing at 2 kb/s.

>>> >> > If we look at the babeltrace output, we see that the trace with
>>> >> > TraceApp PID - 31940 appTraceNum 2 is missing, with 1, 3, 4, 5 and so
>>> >> > on being successfully captured.
>>> >> > I looked at the lttng_relayd logs and found that the trace for
>>> >> > "appTraceNum 2" is not delivered/generated by the consumerd to the
>>> >> > relayd in sequence with the other traces. To rule out a test
>>> >> > application problem, you can look at lttng_relayd log line 12778 and
>>> >> > see that traces from appTraceNum - 1 to appTraceNum - 18, including
>>> >> > appTraceNum 2, are "re-delivered" by the consumerd to the relayd.
>>> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being
>>> >> > delivered twice: once individually, where appTraceNum 2 is missing,
>>> >> > and once as a group at line 12778, where it is present.


>>> >> > Request help with:
>>> >> > 1. Why are traces delivered twice? Is it by design or a genuine
>>> >> > problem?
>>> >> > 2. How can traces be kept from being dropped even though the buffers
>>> >> > are sufficiently large?


>>> >> > Regards,
>>> >> > Aravind.






>>> >> --
>>> >> Jérémie Galarneau
>>> >> EfficiOS Inc.
>>> >> http://www.efficios.com



>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com


>>> -- Jonathan Rajotte Julien


>>> -- Jonathan R. Julien
>>> Efficios

>> --
>> Jonathan R. Julien
>> Efficios

> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 

