[lttng-dev] Relayd trace drops
Jonathan Rajotte Julien
Jonathan.rajotte-julien at efficios.com
Fri Dec 11 13:03:22 EST 2015
Hi Aravind,
On 2015-12-11 12:16 PM, Aravind HT wrote:
> Hi,
>
> Did you get a chance to reproduce the problem with the scripts ? Let
> me know if you need any help running it.
Time is a scarce resource, unfortunately. Until I have the time to
look at it, you might want to investigate whether you can reproduce it
on 2.6.1/2.7.0/master with two regular (x86_64, Ubuntu/Debian) VMs;
this would really help.
Cheers!
>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht at gmail.com> wrote:
>
> Sorry about that, not sure how it got missed.
> Here it is.
>
>
> # This needs a two-node setup (1. local current node, 2. remote node)
> # relayd runs on the current node, where traces are captured from
> the remote node
> # the remote node runs test applications which generate traces.
> # launch_script_RN is executed on the current node and uses
> ssh to execute commands on the remote node, so this part may not
> work in every case and may prompt for a password.
> # if experiencing problems with ssh, kindly check
> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>
> # ==================== To Run =============================
> launch_script_RN.py self_profile -c /tmp/configFile.txt
>
>
>
> # configFile.txt is the file holding the configuration params that
> the launch script
> # needs to configure lttng sessions. Below is an explanation of
> the different options.
> # =================== configFile.txt =============================
>
> [section1]
> # final output file path
> OutputFile = /tmp/Final_report.txt
> # the remote node hostname on which the test applications run and
> the test sessions are created; this should be a name usable with
> ssh. Traces are transported from that node to the lttng_relayd
> running on the current node.
> Node = MY_REMOTE_NODE
> # sub-buffer size to start with
> SubBufSize = 16k
> # sub-buffer count
> SubBufCount = 4
> # per-uid buffer scheme
> BufferScheme = --buffers-uid
> # enable tracing (yes/no)
> EnableTracing = yes
> # Bit rate of the test applications. Comma-separated; for example
> "1, 3, 3, 50" means 4 test applications producing 1, 3, 3, and
> 50 kb/s of traces.
> # So with the below, we just start with 1 test application
> producing 10 kb/s
> TestApps = 10
> # session lifetime in seconds
> TestTime = 10
> # Max number of successive sessions to configure; if n, then n-1
> sessions are run, e.g. MaxRun = 2 will run 1 session.
> MaxRun = 100
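>
> # For reference, a session built from these values should correspond
> # to lttng commands roughly like the following (the session and
> # channel names here are illustrative, not taken from the scripts):
> lttng create profiling-session --set-url net://CURRENT_NODE_IP
> lttng enable-channel --userspace --session profiling-session --subbuf-size 16k --num-subbuf 4 --buffers-uid chan0
> lttng enable-event --userspace --session profiling-session --channel chan0 'LOG_TEST_APP_PROFILING:*'
> lttng start profiling-session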
>
>
> # ==================== Place the following files under ===============
>
> # /tmp on the remote node
> clean_RemNode_apps.sh
> report_lttng_script.sh
>
> # rest of the scripts go under /usr/sbin on the current local node
> on which lttng_relayd runs
> # Define a tracepoint MY_TRACE taking a single string arg, with
> LOG_TEST_APP_PROFILING as the provider; compile the test app
> lttng_testApp and place it under /usr/sbin of the remote host,
> as sketched below.
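>
> # A minimal sketch of what that tracepoint provider header could look
> # like (the file name my_trace_tp.h is an assumption; the provider
> # and event names are the ones given above):
>
> #undef TRACEPOINT_PROVIDER
> #define TRACEPOINT_PROVIDER LOG_TEST_APP_PROFILING
>
> #undef TRACEPOINT_INCLUDE
> #define TRACEPOINT_INCLUDE "./my_trace_tp.h"
>
> #if !defined(_MY_TRACE_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
> #define _MY_TRACE_TP_H
>
> #include <lttng/tracepoint.h>
>
> /* MY_TRACE carries a single string payload, as described above. */
> TRACEPOINT_EVENT(
>     LOG_TEST_APP_PROFILING,
>     MY_TRACE,
>     TP_ARGS(const char *, msg),
>     TP_FIELDS(
>         ctf_string(msg, msg)
>     )
> )
>
> #endif /* _MY_TRACE_TP_H */
>
> #include <lttng/tracepoint-event.h>
>
> # The test app would then emit one trace with something like
> # (trace_msg_buffer being a hypothetical formatted string):
> # tracepoint(LOG_TEST_APP_PROFILING, MY_TRACE, trace_msg_buffer);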
>
> # in launch_script_RN.py change currentNodeIP to the IP address on
> which relayd is receiving, default ports are used.
>
> # lttng_relayd is started as
> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>
> # lttng_sessiond is started as
> /usr/bin/lttng-sessiond --consumerd32-path
> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
> /usr/lib/ --consumerd64-path
> /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
> /usr/lib64/ -b --no-kernel
>
>
>
>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
> <Jonathan.rajotte-julien at efficios.com> wrote:
>
> Hi Aravind,
>
> There is no README in the archive you sent.
>
> Cheers
>
> On 2015-12-08 07:51 AM, Aravind HT wrote:
>
> Hi,
>
> I am trying to upgrade in parallel, but this issue may
> still be present after I upgrade, or may be temporarily
> masked. So I need to find the root cause for this and then
> see whether a fix is available in the latest before
> committing to an upgrade.
>
> There is another issue I'm hitting: the lttng list command
> hangs after lttng destroy when running the profiling.
>
> I found that the 64-bit consumerd goes into an infinite loop
> waiting to flush metadata in
> lttng_ustconsumer_recv_metadata() :: while
> (consumer_metadata_cache_flushed(channel, offset + len,
> timer)).
> In consumer_metadata_cache,
> channel->metadata_stream->endpoint_status is
> CONSUMER_ENDPOINT_ACTIVE and
> metadata_stream->ust_metadata_pushed is 0, with offset
> having some value. This call always returns 1 from the
> last else{} block, resulting in an infinite loop. Upon
> searching the forum, I found the same issue reported
> here:
> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
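>
> A paraphrased sketch of the loop described above (not verbatim
> lttng-tools source; function and field names are as reported in
> this thread):
>
> /* in lttng_ustconsumer_recv_metadata(): wait for the cache flush */
> while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>         /*
>          * Keeps returning 1 from the last else{} block because
>          * metadata_stream->ust_metadata_pushed stays 0 (never
>          * catches up to offset) while endpoint_status is still
>          * CONSUMER_ENDPOINT_ACTIVE, so the wait never terminates.
>          */
> }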
>
> Regards,
> Aravind.
>
>
> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
> <jonathan.r.julien at gmail.com> wrote:
>
> Hi Aravind,
>
> Did you have the chance to upgrade to 2.6.1? If so,
> were you able to reproduce?
>
> Cheers
>
> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
> <aravind.ht at gmail.com> wrote:
>
> Hi,
>
> I have attached the complete profiling scripts
> here; they're a bit
> shabby, I'm new to Python.
>
> There is a README which has the details on how to
> execute them.
> I'm using Yocto 1.6 on x86_64 platforms on both
> nodes.
>
>
> Running this script when there are other sessions running seems
> to reproduce this problem easily.
> Please try it and let me know if you have any issues reproducing
> the problem.
>
> Regards,
> Aravind.
>
> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
> <jeremie.galarneau at efficios.com> wrote:
>
> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
> <aravind.ht at gmail.com> wrote:
> > I am using 2.6.0. I will try to share the code that I'm using
> > here in some time. If there are any specific fixes that are
> > relevant to this issue, see if you can provide a link to them.
> > I would ideally like to try them out before trying a full
> > upgrade to the latest versions.
>
> Hi,
>
> Can you provide more information on the system? Which
> distribution, architecture, kernel version?
>
> The verbose sessiond logs might help pinpoint any unexpected
> behaviour here (are all applications registering as expected?).
>
> Jérémie
>
> >
> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
> > <jeremie.galarneau at efficios.com> wrote:
> >>
> >> Hi Aravind,
> >>
> >> Can't say I have looked at everything you sent yet, but as a
> >> preemptive question, which version are we talking about here?
> >> 2.6.0 or 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
> >>
> >> Thanks,
> >> Jérémie
> >>
> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
> >> <aravind.ht at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I am trying to obtain the performance characteristics of lttng
> >> > with the use of test applications. Traces are produced on a
> >> > local node and delivered to relayd, which runs on a separate
> >> > node for storage.
> >> >
> >> > An lttng session with the test applications producing an
> >> > initial bit rate of 10 kb/s is started and run for about 30
> >> > seconds. The starting sub-buffer size is kept at 128 kb and the
> >> > sub-buffer count at 4. The session is then stopped and
> >> > destroyed, and the traces are analyzed to see if there are any
> >> > drops. This is done in a loop, with every subsequent session
> >> > incrementing the rate by 2 kb/s as long as there are no drops.
> >> > If there are drops, I double the buffer size without
> >> > incrementing the bit rate; see the sketch just below.
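> >> >
> >> > A rough sketch of that ramp-up loop (run_session() and
> >> > traces_dropped() are hypothetical helpers standing in for the
> >> > profiling scripts, not part of lttng):
> >> >
> >> > /* ramp the trace load until drops appear; grow buffers on drops */
> >> > int rate_kbps = 10;                 /* initial test-app bit rate  */
> >> > int subbuf_kb = 128, n_subbuf = 4;  /* starting sub-buffers       */
> >> > int run = 0, max_runs = 100;        /* MaxRun in configFile.txt   */
> >> > while (run++ < max_runs) {
> >> >     run_session(rate_kbps, subbuf_kb, n_subbuf, 30 /* seconds */);
> >> >     if (traces_dropped())
> >> >         subbuf_kb *= 2;             /* double buffers, keep rate  */
> >> >     else
> >> >         rate_kbps += 2;             /* no drops: raise bit rate   */
> >> > }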
> >> >
> >> > I see trace drops happening consistently with test apps
> >> > producing traces at less than 40 kb/s; it doesn't seem to help
> >> > even if I start with 1 MB x 4 sub-buffers.
> >> >
> >> > Analysis:
> >> >
> >> > I have attached the lttng_relayd and lttng_consumerd_64 logs
> >> > and the entire trace directory; hope you will be able to view
> >> > it.
> >> > I have modified the lttng_relayd code to dump the traces being
> >> > captured in the lttng_relayd logs along with debug info.
> >> >
> >> > Each test app produces logs of the form:
> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s
> >> > appRate - 2079 b/s threadTraceNum - 9 appTraceNum - 18
> >> > sleepTime - 192120"
> >> >
> >> > The test application PID, test application thread id, thread
> >> > bit rate, test app bit rate, thread trace number and
> >> > application trace number are part of the trace. So in the above
> >> > trace, the thread is producing at about 1 kb/s and the whole
> >> > test app at about 2 kb/s.
> >> >
> >> > If we look at the babeltrace output, we see that the trace with
> >> > TraceApp PID - 31940 appTraceNum 2 is missing, with 1, 3, 4, 5
> >> > and so on being successfully captured.
> >> > I looked at the lttng_relayd logs and found that the trace with
> >> > "appTraceNum 2" is not delivered by the consumerd to the relayd
> >> > in sequence with the other traces. To rule out a test
> >> > application problem, you can look at lttng_relayd log line
> >> > 12778 and see that traces from appTraceNum - 1 to
> >> > appTraceNum - 18, including appTraceNum 2, are "re-delivered"
> >> > by the consumerd to the relayd.
> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being
> >> > delivered twice: once individually, where appTraceNum 2 is
> >> > missing, and once as a group at line 12778, where it is
> >> > present.
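> >> >
> >> > (To check for such gaps, something like the following can be
> >> > used to list the captured appTraceNum values; the trace path is
> >> > an example, assuming the default relayd output directory:
> >> > babeltrace /var/log/lttng-traces/<session-dir> | grep -o 'appTraceNum - [0-9]*'
> >> > )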
> >> >
> >> >
> >> > Request help with:
> >> > 1. why are traces delivered twice; is it by design or a genuine
> >> > problem?
> >> > 2. how can traces be kept from being dropped even though the
> >> > buffers are sufficiently large?
> >> >
> >> >
> >> > Regards,
> >> > Aravind.
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jérémie Galarneau
> >> EfficiOS Inc.
> >> http://www.efficios.com
> >
> >
>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>
>
>
>
>
>
>
> -- Jonathan Rajotte Julien
>
>
>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
> --
> Jonathan R. Julien
> Efficios
>
>
>
--
Jonathan R. Julien
Efficios