[lttng-dev] Relayd trace drops

Aravind HT aravind.ht at gmail.com
Tue Dec 8 07:51:07 EST 2015


Hi,

I am trying to upgrade in parallel, but this issue may still be present
after I upgrade, or it may just be temporarily masked. So I need to find the
root cause first and then check whether a fix is available in the latest
version before committing to the upgrade.

There is another issue I'm hitting: the lttng list command hangs after an
lttng destroy session while the profiling script is running.

I found that the 64-bit consumerd goes into an infinite loop waiting to flush
metadata in lttng_ustconsumer_recv_metadata(), in the
while (consumer_metadata_cache_flushed(channel, offset + len, timer)) loop.
Inside consumer_metadata_cache_flushed(),
channel->metadata_stream->endpoint_status is CONSUMER_ENDPOINT_ACTIVE and
metadata_stream->ust_metadata_pushed is 0 while offset has some non-zero
value, so the call always returns 1 from the last else {} block, resulting in
an infinite loop. Upon searching the archives, I found the same issue
reported here:
 https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
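
Paraphrasing that check as I read it (made-up standalone types below, not the
actual lttng-tools code), the state we are stuck in looks like this:

    #include <stdio.h>

    enum endpoint_status {
            CONSUMER_ENDPOINT_ACTIVE,
            CONSUMER_ENDPOINT_INACTIVE,
    };

    struct metadata_stream {
            enum endpoint_status endpoint_status;
            unsigned long ust_metadata_pushed;
    };

    /* Returns 0 when the cache is flushed up to 'offset', 1 to keep waiting
     * (roughly the decision consumer_metadata_cache_flushed() makes). */
    static int cache_flushed(struct metadata_stream *stream, unsigned long offset)
    {
            if (!stream)
                    return 0;       /* stream already gone, nothing to wait for */
            if (stream->endpoint_status != CONSUMER_ENDPOINT_ACTIVE)
                    return 0;       /* endpoint hung up, stop waiting */
            if (stream->ust_metadata_pushed >= offset)
                    return 0;       /* requested offset already pushed */
            return 1;               /* the last else{} branch: keep waiting */
    }

    int main(void)
    {
            /* The state observed in the hang: endpoint still ACTIVE, nothing
             * pushed yet, non-zero requested offset. Nothing ever changes this
             * state, so the caller's while (...) loop spins forever. */
            struct metadata_stream s = { CONSUMER_ENDPOINT_ACTIVE, 0 };

            printf("keep waiting? %d\n", cache_flushed(&s, 4096));
            return 0;
    }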

Regards,
Aravind.


On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte <
jonathan.r.julien at gmail.com> wrote:

> Hi Aravind,
>
> Did you have the chance to upgrade to 2.6.1? If so, were you able to
> reproduce?
>
> Cheers
>
> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht at gmail.com> wrote:
>
>> Hi,
>>
>> I have attached the complete profiling scripts here; they're a bit shabby,
>> I'm new to Python.
>>
>> There is a README which has the details on how to execute them.
>> I'm using Yocto 1.6 on x86_64 platforms on both nodes.
>>
>>
>> Running this script when there are other sessions running seems to
>> reproduce this problem easily.
>> Please try it and let me know if you have any issues reproducing the
>> problem.
>>
>> Regards,
>> Aravind.
>>
>> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau <
>> jeremie.galarneau at efficios.com> wrote:
>>
>>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT <aravind.ht at gmail.com>
>>> wrote:
>>> > I am using 2.6.0. I will try to share the code that I'm using here in
>>> > some time. If there are any specific fixes that are relevant to this
>>> > issue, see if you can provide a link to them. I would ideally like to
>>> > try them out before trying a full upgrade to the latest versions.
>>>
>>> Hi,
>>>
>>> Can you provide more information on the system? Which distribution,
>>> architecture, kernel version?
>>>
>>> The verbose sessiond logs might help pinpoint any unexpected behaviour
>>> here (are all applications registering as expected?).
>>>
>>> Jérémie
>>>
>>> >
>>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>> > <jeremie.galarneau at efficios.com> wrote:
>>> >>
>>> >> Hi Aravind,
>>> >>
>>> >> Can't say I have looked at everything you sent yet, but as a
>>> >> preemptive question, which version are we talking about here? 2.6.0 or
>>> >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>>> >>
>>> >> Thanks,
>>> >> Jérémie
>>> >>
>>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht at gmail.com>
>>> >> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I am trying to obtain the performance characteristics of lttng with
>>> >> > the use of test applications. Traces are being produced on a local
>>> >> > node and delivered to relayd that is running on a separate node for
>>> >> > storage.
>>> >> >
>>> >> > An lttng session with the test applications producing an initial bit
>>> >> > rate of 10 kb/s is started and run for about 30 seconds. The starting
>>> >> > sub-buffer size is kept at 128 KB with a sub-buffer count of 4. The
>>> >> > session is then stopped and destroyed, and the traces are analyzed to
>>> >> > see if there are any drops. This is done in a loop, with every
>>> >> > subsequent session incrementing the bit rate by 2 kb/s as long as
>>> >> > there are no drops. If there are drops, I double the sub-buffer size
>>> >> > without incrementing the bit rate.
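>>> >> >
>>> >> > Roughly, each iteration amounts to the following (sketched here with
>>> >> > liblttng-ctl rather than the attached Python script; the session and
>>> >> > channel names are made up and error checking is omitted):
>>> >> >
>>> >> >     #include <lttng/lttng.h>
>>> >> >     #include <string.h>
>>> >> >     #include <unistd.h>
>>> >> >
>>> >> >     int main(void)
>>> >> >     {
>>> >> >             struct lttng_domain dom = { .type = LTTNG_DOMAIN_UST };
>>> >> >             struct lttng_channel chan;
>>> >> >             struct lttng_event ev;
>>> >> >             struct lttng_handle *handle;
>>> >> >
>>> >> >             /* Stream to the relayd running on the other node. */
>>> >> >             lttng_create_session("perf_iter", "net://<relayd-node>");
>>> >> >             handle = lttng_create_handle("perf_iter", &dom);
>>> >> >
>>> >> >             /* 128 KB x 4 sub-buffers; doubled on every run that drops. */
>>> >> >             memset(&chan, 0, sizeof(chan));
>>> >> >             lttng_channel_set_default_attr(&dom, &chan.attr);
>>> >> >             strcpy(chan.name, "perf_chan");
>>> >> >             chan.attr.subbuf_size = 128 * 1024;
>>> >> >             chan.attr.num_subbuf = 4;
>>> >> >             lttng_enable_channel(handle, &chan);
>>> >> >
>>> >> >             /* All UST events emitted by the test applications. */
>>> >> >             memset(&ev, 0, sizeof(ev));
>>> >> >             ev.type = LTTNG_EVENT_TRACEPOINT;
>>> >> >             strcpy(ev.name, "*");
>>> >> >             lttng_enable_event(handle, &ev, "perf_chan");
>>> >> >
>>> >> >             lttng_start_tracing("perf_iter");
>>> >> >             sleep(30);   /* test apps emit at the current bit rate */
>>> >> >             lttng_stop_tracing("perf_iter");
>>> >> >             lttng_destroy_session("perf_iter");
>>> >> >             lttng_destroy_handle(handle);
>>> >> >             return 0;
>>> >> >     }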
>>> >> >
>>> >> > I see trace drops happening consistently with the test apps producing
>>> >> > traces at less than 40 kb/s; it doesn't seem to help even if I start
>>> >> > with 1 MB x 4 sub-buffers.
>>> >> >
>>> >> > Analysis:
>>> >> >
>>> >> > I have attached the lttng_relayd and lttng_consumerd_64 logs and the
>>> >> > entire trace directory; hope you will be able to view them.
>>> >> > I have modified the lttng_relayd code to dump the traces being captured
>>> >> > into the lttng_relayd logs along with debug info.
>>> >> >
>>> >> > Each test app is producing logs of the form:
>>> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - 2079
>>> >> > b/s threadTraceNum - 9 appTraceNum - 18 sleepTime - 192120"
>>> >> >
>>> >> > The test application PID, test application thread id, thread bit rate,
>>> >> > test app bit rate, thread trace number and application trace number are
>>> >> > all part of the trace. So in the above trace, the thread is producing at
>>> >> > 1 kb/s and the whole test app is producing at 2 kb/s.
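>>> >> >
>>> >> > As a hypothetical helper (not part of the attached script), these
>>> >> > counters can be pulled back out of the babeltrace text output like
>>> >> > this, so gaps in appTraceNum per PID can be spotted:
>>> >> >
>>> >> >     #include <stdio.h>
>>> >> >     #include <string.h>
>>> >> >
>>> >> >     /* Extract PID and appTraceNum from one babeltrace text line. */
>>> >> >     static int parse_trace_line(const char *line, int *pid, int *app_num)
>>> >> >     {
>>> >> >             const char *p = strstr(line, "TraceApp PID - ");
>>> >> >             int thid, thread_rate, app_rate, thread_num, sleep_time;
>>> >> >
>>> >> >             if (!p)
>>> >> >                     return -1;
>>> >> >             if (sscanf(p, "TraceApp PID - %d THID - %d "
>>> >> >                        "@threadRate - %d b/s appRate - %d b/s "
>>> >> >                        "threadTraceNum - %d appTraceNum - %d "
>>> >> >                        "sleepTime - %d",
>>> >> >                        pid, &thid, &thread_rate, &app_rate,
>>> >> >                        &thread_num, app_num, &sleep_time) != 7)
>>> >> >                     return -1;
>>> >> >             return 0;
>>> >> >     }
>>> >> >
>>> >> >     int main(void)
>>> >> >     {
>>> >> >             const char *line = "TraceApp PID - 31940 THID - 31970 "
>>> >> >                     "@threadRate - 1032 b/s appRate - 2079 b/s "
>>> >> >                     "threadTraceNum - 9 appTraceNum - 18 sleepTime - 192120";
>>> >> >             int pid, app_num;
>>> >> >
>>> >> >             if (parse_trace_line(line, &pid, &app_num) == 0)
>>> >> >                     printf("PID %d appTraceNum %d\n", pid, app_num);
>>> >> >             return 0;
>>> >> >     }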
>>> >> >
>>> >> > If we look at the babeltrace output, we see that the trace with TraceApp
>>> >> > PID - 31940 appTraceNum 2 is missing, with 1, 3, 4, 5 and so on being
>>> >> > successfully captured.
>>> >> > I looked at the lttng_relayd logs and found that the trace with
>>> >> > "appTraceNum 2" is not delivered/generated by the consumerd to the relayd
>>> >> > in sequence with the other traces. To rule out a test application
>>> >> > problem, you can look at line 12778 of the lttng_relayd log and see that
>>> >> > traces from appTraceNum - 1 to appTraceNum - 18, including appTraceNum 2,
>>> >> > are "re-delivered" by the consumerd to the relayd.
>>> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being delivered
>>> >> > twice: once individually, where appTraceNum 2 is missing, and once as a
>>> >> > group at line 12778, where it is present.
>>> >> >
>>> >> >
>>> >> > Request help with:
>>> >> > 1. why traces are delivered twice, is it by design or a genuine
>>> >> >    problem?
>>> >> > 2. how to avoid traces being dropped even though the buffers are
>>> >> >    sufficiently large?
>>> >> >
>>> >> >
>>> >> > Regards,
>>> >> > Aravind.
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jérémie Galarneau
>>> >> EfficiOS Inc.
>>> >> http://www.efficios.com
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>>
>>
>
>
> --
> Jonathan Rajotte Julien
>
>