[lttng-dev] Relayd trace drops

Aravind HT aravind.ht at gmail.com
Fri Dec 11 12:16:54 EST 2015


Hi,

Did you get a chance to reproduce the problem with the scripts? Let me
know if you need any help running them.

Regards,
Aravind.

On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht at gmail.com> wrote:

> Sorry about that, not sure how it got missed.
> Here it is.
>
>
> # This needs a two-node setup (1. the local/current node, 2. the remote node).
> # relayd runs on the current node, where traces are captured from the
> # remote node.
> # The remote node runs the test applications which generate the traces.
> # launch_script_RN.py is executed on the current node and uses ssh to
> # execute commands on the remote node, so this part may not work in every
> # case and may prompt for a password.
> # If you experience problems with ssh, see
> # http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>
> # ====================       To Run        =============================
> launch_script_RN.py self_profile -c /tmp/configFile.txt
>
>
>
> # configFile.txt is the file that holds the configuration parameters the
> # launch script needs to configure lttng sessions. Below is an explanation
> # of the different options.
> # ===================     configFile.txt   =============================
>
> [section1]
> # final output file path
> OutputFile = /tmp/Final_report.txt
> # The remote node hostname on which the test applications run and the test
> # sessions will be created; this should be something that can be used with
> # ssh. Traces are transported from this node to the lttng_relayd running
> # on the current node.
> Node = MY_REMOTE_NODE
> # Sub-buffer size to start with
> SubBufSize = 16k
> # Sub-buffer count
> SubBufCount = 4
> # Per-UID buffers
> BufferScheme = --buffers-uid
> # Enable tracing
> EnableTracing = yes
> # Bit rates of the test applications. A comma-separated example "1, 3, 3,
> # 50" says 4 test applications producing 1, 3, 3, and 50 kb/s of traces.
> # So with the value below, we just start with 1 test application producing
> # 10 kb/s.
> TestApps = 10
> # Session lifetime in seconds
> TestTime = 10
> # Max number of successive sessions to configure. If n, then n-1 sessions
> # are run, e.g. MaxRun = 2 will run 1 session.
> MaxRun = 100
>
>
> # ====================  Place the following files under ===============
>
> # /tmp on the remote node
> clean_RemNode_apps.sh
> report_lttng_script.sh
>
> # Place the rest of the scripts under /usr/sbin on the current local node
> # on which lttng_relayd runs.
> # Define a tracepoint MY_TRACE that takes a single string argument, with
> # LOG_TEST_APP_PROFILING as the provider; compile the test app lttng_testApp
> # and place it under /usr/sbin of the remote host (see the sketch below).
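>
> # For reference, a minimal provider header along these lines should do; only
> # the provider name (LOG_TEST_APP_PROFILING), the event name (MY_TRACE) and
> # the single string argument come from these instructions, while the file
> # name, field name and surrounding boilerplate are assumptions.
>
> /* tp_my_trace.h */
> #undef TRACEPOINT_PROVIDER
> #define TRACEPOINT_PROVIDER LOG_TEST_APP_PROFILING
>
> #undef TRACEPOINT_INCLUDE
> #define TRACEPOINT_INCLUDE "./tp_my_trace.h"
>
> #if !defined(_TP_MY_TRACE_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
> #define _TP_MY_TRACE_H
>
> #include <lttng/tracepoint.h>
>
> /* MY_TRACE carries a single string payload. */
> TRACEPOINT_EVENT(
>     LOG_TEST_APP_PROFILING,
>     MY_TRACE,
>     TP_ARGS(const char *, msg),
>     TP_FIELDS(
>         ctf_string(msg, msg)
>     )
> )
>
> #endif /* _TP_MY_TRACE_H */
>
> #include <lttng/tracepoint-event.h>
>
> # The test app defines TRACEPOINT_DEFINE in one .c file before including this
> # header and links against -llttng-ust -ldl.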
>
> # In launch_script_RN.py, change currentNodeIP to the IP address on which
> # relayd is receiving; the default ports are used.
>
> # lttng_relayd is started as
> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>
> # lttng_sessiond is started as
> /usr/bin/lttng-sessiond --consumerd32-path
> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir /usr/lib/
> --consumerd64-path /usr/lib64/lttng/libexec/lttng-consumerd
> --consumerd64-libdir /usr/lib64/ -b --no-kernel
>
>
>
>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien <
> Jonathan.rajotte-julien at efficios.com> wrote:
>
>> Hi Aravind,
>>
>> There is no README in the archive you sent.
>>
>> Cheers
>>
>> On 2015-12-08 07:51 AM, Aravind HT wrote:
>>
>>> Hi,
>>>
>>> I am trying to upgrade in parallel, but this issue may still be present
>>> after I upgrade, or may be temporarily masked. So I need to find the root
>>> cause for this and then see whether a fix is available in the latest
>>> version before committing to the upgrade.
>>>
>>> There is another issue I'm hitting: the lttng list command hangs after
>>> lttng destroy session when running the profiling.
>>>
>>> I found that the 64-bit consumerd goes into an infinite loop waiting to
>>> flush metadata in lttng_ustconsumer_recv_metadata(), in
>>> while (consumer_metadata_cache_flushed(channel, offset + len, timer)).
>>> In consumer_metadata_cache, channel->metadata_stream->endpoint_status is
>>> CONSUMER_ENDPOINT_ACTIVE and metadata_stream->ust_metadata_pushed is 0
>>> while offset has some value. The call always returns 1 from the last
>>> else{} block, resulting in an infinite loop. Upon searching the forum, I
>>> found the same issue reported here:
>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
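>>>
>>> To make the failure mode concrete, here is a small, self-contained toy
>>> model of that wait; it is an assumption based purely on the description
>>> above, not the actual lttng-tools source, and the names are illustrative:
>>>
>>> #include <stdio.h>
>>>
>>> /* Toy stand-ins for the fields mentioned above. */
>>> struct toy_metadata_channel {
>>>         unsigned long ust_metadata_pushed; /* stays 0 during the hang */
>>>         int endpoint_active;               /* CONSUMER_ENDPOINT_ACTIVE */
>>> };
>>>
>>> /* Returns 1 while the consumer should keep waiting for the flush. */
>>> static int toy_cache_flush_pending(struct toy_metadata_channel *chan,
>>>                                    unsigned long target /* offset + len */)
>>> {
>>>         if (!chan->endpoint_active)
>>>                 return 0;  /* stream hung up: stop waiting */
>>>         /* Equivalent of the last else{} block: data still to push. */
>>>         return chan->ust_metadata_pushed < target;
>>> }
>>>
>>> int main(void)
>>> {
>>>         struct toy_metadata_channel chan = { 0, 1 };
>>>         unsigned long spins = 0;
>>>
>>>         /* Nothing ever advances ust_metadata_pushed, so without the
>>>          * artificial bound this loop would never terminate, matching
>>>          * the consumerd hang. */
>>>         while (toy_cache_flush_pending(&chan, 4096) && spins < 1000000)
>>>                 spins++;
>>>         printf("still pending after %lu spins\n", spins);
>>>         return 0;
>>> }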
>>>
>>> Regards,
>>> Aravind.
>>>
>>>
>>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte <
>>> jonathan.r.julien at gmail.com> wrote:
>>>
>>>     Hi Aravind,
>>>
>>>     Did you have the chance to upgrade to 2.6.1? If so, were you able
>>>     to reproduce?
>>>
>>>     Cheers
>>>
>>>     On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht at gmail.com>
>>>     wrote:
>>>
>>>         Hi,
>>>
>>>         I have attached the complete profiling scripts here; they're a
>>>         bit shabby, I'm new to Python.
>>>
>>>         There is a README which has the details on how to execute it.
>>>         I'm using Yocto 1.6 on x86_64 platforms on both nodes.
>>>
>>>
>>>         Running this script when there are other sessions running
>>>         seems to reproduce this problem easily.
>>>         Please try it and let me know if you have any issues
>>>         reproducing the problem.
>>>
>>>         Regards,
>>>         Aravind.
>>>
>>>         On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>         <jeremie.galarneau at efficios.com> wrote:
>>>
>>>             On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>             <aravind.ht at gmail.com> wrote:
>>>             > I am using 2.6.0. I will try to share the code that I'm
>>>             > using here in some time. If there are any specific fixes
>>>             > that are relevant to this issue, see if you can provide a
>>>             > link to them. I would ideally like to try them out before
>>>             > trying a full upgrade to the latest versions.
>>>
>>>             Hi,
>>>
>>>             Can you provide more information on the system? Which
>>>             distribution,
>>>             architecture, kernel version?
>>>
>>>             The verbose sessiond logs might help pinpoint any
>>>             unexpected behaviour
>>>             here (are all applications registering as expected?).
>>>
>>>             Jérémie
>>>
>>>             >
>>>             > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>>             > <jeremie.galarneau at efficios.com> wrote:
>>>             >>
>>>             >> Hi Aravind,
>>>             >>
>>>             >> Can't say I have looked at everything you sent yet, but
>>>             >> as a preemptive question, which version are we talking
>>>             >> about here? 2.6.0 or 2.6.1? 2.6.1 contains a lot of relay
>>>             >> daemon fixes.
>>>             >>
>>>             >> Thanks,
>>>             >> Jérémie
>>>             >>
>>>             >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>             <aravind.ht at gmail.com> wrote:
>>>             >> > Hi,
>>>             >> >
>>>             >> > I am trying to obtain the performance characteristics
>>>             >> > of lttng with the use of test applications. Traces are
>>>             >> > being produced on a local node and delivered to relayd,
>>>             >> > which is running on a separate node, for storage.
>>>             >> >
>>>             >> > An lttng session with the test applications producing an
>>>             >> > initial bit rate of 10 kb/s is started and run for about
>>>             >> > 30 seconds. The starting sub-buffer size is kept at 128
>>>             >> > kb and the sub-buffer count at 4. The session is then
>>>             >> > stopped and destroyed, and the traces are analyzed to
>>>             >> > see if there are any drops. This is done in a loop, with
>>>             >> > every subsequent session getting an increment of 2 kb/s
>>>             >> > as long as there are no drops. If there are drops, I
>>>             >> > increase the buffer size by a factor of 2 without
>>>             >> > incrementing the bit rate.
>>>             >> >
>>>             >> > I see trace drops happening consistently with test
>>>             >> > apps producing traces at less than 40 kb/s; it doesn't
>>>             >> > seem to help even if I start with 1 MB x 4 sub-buffers.
>>>             >> >
>>>             >> > Analysis:
>>>             >> >
>>>             >> > I have attached the lttng_relayd and lttng_consumerd_64
>>>             >> > logs and the entire trace directory; I hope you will be
>>>             >> > able to view them.
>>>             >> > I have modified the lttng_relayd code to dump the traces
>>>             >> > being captured in the lttng_relayd logs along with debug
>>>             >> > info.
>>>             >> >
>>>             >> > Each test app is producing logs in the form of:
>>>             >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032
>>>             >> > b/s appRate - 2079 b/s threadTraceNum - 9 appTraceNum -
>>>             >> > 18  sleepTime - 192120"
>>>             >> >
>>>             >> > The test application PID, test application thread id,
>>>             >> > thread bit rate, test app bit rate, thread trace number
>>>             >> > and application trace number are part of the trace. So
>>>             >> > in the above trace, the thread is producing at 1 kb/s
>>>             >> > and the whole test app is producing at 2 kb/s.
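>>>             >> >
>>>             >> > A sketch of how the test app can emit that line (the
>>>             >> > helper, variable names and buffer size are assumptions;
>>>             >> > the payload layout and the provider/event names are the
>>>             >> > ones described in this thread):
>>>             >> >
>>>             >> > #include <stdio.h>
>>>             >> > #include <unistd.h>
>>>             >> >
>>>             >> > #define TRACEPOINT_DEFINE
>>>             >> > #include "tp_my_trace.h" /* provider header sketched earlier */
>>>             >> >
>>>             >> > static void emit_profiling_trace(int thid,
>>>             >> >                 unsigned long thread_rate, unsigned long app_rate,
>>>             >> >                 unsigned long thread_trace_num,
>>>             >> >                 unsigned long app_trace_num,
>>>             >> >                 unsigned long sleep_time_us)
>>>             >> > {
>>>             >> >         char buf[256];
>>>             >> >
>>>             >> >         snprintf(buf, sizeof(buf),
>>>             >> >                  "TraceApp PID - %d THID - %d @threadRate - %lu b/s "
>>>             >> >                  "appRate - %lu b/s threadTraceNum - %lu "
>>>             >> >                  "appTraceNum - %lu  sleepTime - %lu",
>>>             >> >                  (int) getpid(), thid, thread_rate, app_rate,
>>>             >> >                  thread_trace_num, app_trace_num, sleep_time_us);
>>>             >> >         tracepoint(LOG_TEST_APP_PROFILING, MY_TRACE, buf);
>>>             >> > }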
>>>             >> >
>>>             >> > If we look at the babeltrace output, we see that the
>>>             >> > trace with TraceApp PID - 31940 appTraceNum 2 is
>>>             >> > missing, with 1, 3, 4, 5 and so on being successfully
>>>             >> > captured.
>>>             >> > I looked at the lttng_relayd logs and found that the
>>>             >> > trace with "appTraceNum 2" is not delivered/generated by
>>>             >> > the consumerd to the relayd in sequence with the other
>>>             >> > traces. To rule out a test application problem, you can
>>>             >> > look at line 12778 of the lttng_relayd log and see that
>>>             >> > traces from appTraceNum - 1 to appTraceNum - 18,
>>>             >> > including appTraceNum 2, are "re-delivered" by the
>>>             >> > consumerd to the relayd.
>>>             >> > Essentially, I see appTraceNum 1 through appTraceNum 18
>>>             >> > being delivered twice, once individually where
>>>             >> > appTraceNum 2 is missing and once as a group at line
>>>             >> > 12778 where it is present.
>>>             >> >
>>>             >> >
>>>             >> > Request help with:
>>>             >> > 1. Why are traces delivered twice? Is it by design or a
>>>             >> > genuine problem?
>>>             >> > 2. How to avoid traces being dropped even though the
>>>             >> > buffers are sufficiently large?
>>>             >> >
>>>             >> >
>>>             >> > Regards,
>>>             >> > Aravind.
>>>             >> >
>>>             >> > _______________________________________________
>>>             >> > lttng-dev mailing list
>>>             >> > lttng-dev at lists.lttng.org
>>>             >> >
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>             >> >
>>>             >>
>>>             >>
>>>             >>
>>>             >> --
>>>             >> Jérémie Galarneau
>>>             >> EfficiOS Inc.
>>>             >> http://www.efficios.com
>>>             >
>>>             >
>>>
>>>
>>>
>>>             --
>>>             Jérémie Galarneau
>>>             EfficiOS Inc.
>>>             http://www.efficios.com
>>>
>>>
>>>
>>>         _______________________________________________
>>>         lttng-dev mailing list
>>>         lttng-dev at lists.lttng.org
>>>         http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>>
>>>
>>>
>>>     --
>>>     Jonathan Rajotte Julien
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev at lists.lttng.org
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>
>> --
>> Jonathan R. Julien
>> Efficios
>>
>>
>