[lttng-dev] TMF: inconsistencies in the context vtid reporting

Fri Mar 22 17:24:59 EDT 2013

Hi Sébastien,

On 13-03-22 06:02 AM, Sébastien Barthélémy wrote:
> Hello,
>
> I'm using simultaneous kernel and userspace tracing in order to
> understand how a multi threaded application (mis)behaves at shutdown.
>
> The idea is to use userspace tracepoints in the application, augmented
> with the vtid context, in order to know which thread is calling what
> and to relate that with the threads lifespan as reconstructed from the
> kernel scheduler traces.
>
> First, is the approach ok? Am I right to expect the vtid attached to
> ust event to match the tid shown in the reconstructed kernel scheduler
> state?

It should work, yes. However note that the concept of "vtid" (virtual
thread ID) is used to differentiate the tid assigned to a process when
used within a container (LXC) vs. its tid on the host. You could use
context.tid too, but if you don't use containers those two should be the
same.

> Should I expect "time shifts" between the kernel and userspace traces
> (which would make the correlation harder).

Ideally no, but depending on your kernel/libc versions, UST might not
have access to the same time sources as the kernel tracer, so you could
get small offsets. Maybe one of the tracer devs can give a better
explanation.

>
> I was hoping to see the unix signals in the kernel trace too but could
> not find them so far.
>
> So now, the problem is that I get inconsistencies: some UST events
> report a vtid which is shown as waiting by the kernel trace (both in
> the Control Flow and the Resources views). How could a waiting thread
> trigger an UST event?
>
> I believe my kernel trace is complete, since the following command
> shows no output.
>
>     $ babeltrace lttng-traces/auto-20130321-172308/kernel/ > /dev/null
>
> My userspace trace is not complete. But I believe it does not matter,
> because the missing events are seconds before the time window I'm
> interested in, and AFAIK, there is no state reconstruction involved
> (but may be there is?).

It could happen ; maybe the tracer missed an important sched_switch
event that set a process in a particular state. I wouldn't recommend
running analyses on traces with missing events, some information could
be all wrong.

Unless you are on an embedded platform with very limited memory, it's
usually safe to increase the buffer size to a couple MBs. See
--enable-channel and --subbuf-size in the LTTng man page. I have to warn
you about https://bugs.lttng.org/issues/228 in advance though...

I still think the default value should be higher (and have the embedded
guys use --subbuf-size to reduce it), but I don't call those shots ;)

> I eventually came to a more blatant inconsistency: the displayed
> context vtid of an event is not always the same.
>
> [...]
>
> I'm using TMF 2.0.0.201303211612 (nightly build from yesterday).

Are your traces available somewhere (if they can be made public)? I
would like to take a look at it, it seems the CTF parser gets confused
between vtid and vpid for some weird reason...

> And by the way, thank your for the support I already got on IRC
> yesterday, you're a dream project to work with.

Thanks! That's encouraging to hear ;)

Cheers,
Alex