[lttng-dev] [lttng-ust] Missing events just before a process exits

THEUNISSEN Rolf rolf.theunissen at altran.com
Thu Mar 15 04:37:06 EDT 2018



> -----Original Message-----
> From: Mathieu Desnoyers [mailto:mathieu.desnoyers at efficios.com]
> Sent: Wednesday, March 14, 2018 6:25 PM
> To: THEUNISSEN Rolf <rolf.theunissen at altran.com>
> Cc: lttng-dev <lttng-dev at lists.lttng.org>
> Subject: Re: [lttng-dev] [lttng-ust] Missing events just before a process
> exits
> 
> ----- On Mar 14, 2018, at 12:37 PM, THEUNISSEN Rolf
> rolf.theunissen at altran.com wrote:
> 
> >> -----Original Message-----
> >> From: Jonathan Rajotte-Julien
> >> [mailto:jonathan.rajotte-julien at efficios.com]
> >> Sent: Tuesday, March 13, 2018 4:35 PM
> >> To: THEUNISSEN Rolf <rolf.theunissen at altran.com>
> >> Cc: lttng-dev at lists.lttng.org
> >> Subject: Re: [lttng-dev] [lttng-ust] Missing events just before
> >> a process exits
> >>
> >> Hi Rolf,
> >>
> >> On Tue, Mar 13, 2018 at 10:48:17AM +0000, THEUNISSEN Rolf wrote:
> >> > Hi,
> >> >
> >> > I am currently tracing many processes with LTTng-UST on a system
> >> > under heavy load, with 2 CPUs.
> >> >
> >> > The traces seem to be missing ust-events near the end of a trace.
> >> > After doing some analysis, it seems that events are missing from
> >> > the CPU other than the one the process exits on. More concretely:
> >> >
> >> > During the lifetime of a process:
> >> > - Process X executes on CPU 0,1: all events are in the trace
> >> > Just before the process is about to exit:
> >> > - Process X executes on CPU 0: missing events from the trace
> >>
> >> How are you validating this?
> >
> > I am also using kernel tracing to trace the scheduler. In
> > TraceCompass I can view the states of the processes and CPUs.
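> >
> > For reference, a kernel session along these lines (session name
> > illustrative; the sched_* tracepoints are the standard scheduler
> > events) records the scheduler state that TraceCompass displays:
> >
> >     $ lttng create sched-session
> >     $ lttng enable-event --kernel sched_switch,sched_process_fork,sched_process_exit
> >     $ lttng start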
> >
> >>
> >> > A few milliseconds later:
> >> > - Process X executes on CPU1 and exits: The last events are in the
> >> > trace
> >>
> >> This is not expected. Do you have a reproducer so we can check what
> >> might be happening?
> >
> > I am not allowed to provide details about my current traces, and it
> > is hard to make a small example, but I got some more clues about what
> > is happening; see below.
> >
> >>
> >> We will need the version for lttng-ust, lttng-tools, babeltrace.
> >
> > I am using version 2.10; for analyzing the traces I am using
> > TraceCompass. Kernel version: 3.10.62
> >
> >>
> >> Note that you can run your application with lttng-ust in debug mode
> >> using LTTNG_UST_DEBUG=y. Debug statements will be printed on stderr.
> >>
> >> e.g:
> >>     $ LTTNG_UST_DEBUG=y ./my_sample_app
> >>
> >
> > Enabling this debugging flag gave me a clue about what is happening.
> > For all processes, the registration is going OK. But for some
> > processes, the un-registration is not logged. On closer analysis (now
> > with tracepoints from SystemTap), I see that the processes for which
> > the events are missing and the un-registration is missing are killed
> > by a SIGKILL signal. This probably explains why some data can be
> > lost. But it is perhaps strange that only the messages traced on a
> > different CPU get lost.
> >
> > I remain puzzled why the SIGKILL arrives at the process, as a SIGTERM
> > is sent to the process. Somehow the kernel decides to deliver a
> > SIGKILL instead of a SIGTERM.
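> >
> > To confirm who sends the SIGKILL, a SystemTap one-liner along these
> > lines (a sketch using the signal.send tapset probe) can log each
> > SIGKILL together with the sending process:
> >
> >     $ stap -e 'probe signal.send {
> >           if (sig_name == "SIGKILL")
> >               printf("SIGKILL to %s (pid %d) from %s (pid %d)\n",
> >                      pid_name, sig_pid, execname(), pid())
> >       }'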
> 
> When using lttng-ust with per-uid buffers, there is a known design
> limitation regarding processes terminated by non-handled kill signals:
> if this occurs while the buffer is being written to (between reserve
> and commit), it will cause the rest of the per-cpu buffer to be
> unreadable. This applies to all events recorded by all applications of
> the same user id for that particular CPU. Recovering from this involves
> destroying and re-creating the tracing session.
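> 
> For example, assuming the session is named my-session (name
> illustrative), recovery amounts to:
> 
>     $ lttng destroy my-session
>     $ lttng create my-session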
> 
> If this kind of scenario is likely for you, you might want to consider
> using per-pid buffers (lttng enable-channel -u --buffers-pid), which do
> not suffer from this design limitation. The downside of per-pid buffers
> is that they allocate more memory on your system for buffering, and add
> extra overhead when used with processes that have a short life-time.
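> 
> For example, something along these lines (session, channel, and
> provider names are illustrative) sets up a session with per-pid
> buffers:
> 
>     $ lttng create my-session
>     $ lttng enable-channel --userspace --buffers-pid my-channel
>     $ lttng enable-event --userspace --channel my-channel 'my_provider:*'
>     $ lttng start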
> 
> Thanks,
> 
> Mathieu
> 

Thanks for your reply. Indeed, I was using --buffers-uid when I first observed this issue. I then switched back to --buffers-pid, as I suspected it to be more robust; however, the issue was not resolved (though it might occur less often). Anyway, it is expected that data can be lost when a process is SIGKILLed.
It might be worth mentioning the above design limitation in the lttng man pages.

Thanks for your time.

Kind Regards,
Rolf


> 
> >
> > Rolf
> >
> >> Cheers
> >>
> >> --
> >> Jonathan Rajotte-Julien
> >> EfficiOS
> > _______________________________________________
> > lttng-dev mailing list
> > lttng-dev at lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

