[lttng-dev] Allocation failures with babeltrace and TraceCompass - corrupt trace?

Wed Jun 14 16:39:33 UTC 2017

----- On Jun 14, 2017, at 11:55 AM, Thomas McGuire thomas.mcguire at kdab.com wrote:

> Hi,
> 
> On 14.06.2017 17:12, Mathieu Desnoyers wrote:
>> Can you provide a copy of the metadata file ? And ideally the data
>> streams too ? This would give us a better idea of what is happening.
>> 
>> Do you perform kernel or user-space tracing ? Do you trace huge
>> sequences of bytes within your own tracepoints ?
> 
> I perform kernel tracing only, in this case limited to syscalls,
> sched*, block* and irq*. No user-space tracepoints.
> 
> I didn't know the metadata file was plain text, I had a quick look into
> it and noticed corruption already, with random garbage data inserted all
> over the place. I'm surprised babeltrace didn't choke on the metadata
> already.

The lttng metadata is "packetized plain-text". What you see is plain-text in
a transport layer which is binary. This explains the "garbage" you see:
those are binary headers for packets. Use babeltrace -o ctf-metadata
to extract the text-only metadata (which is also valid metadata under CTF).
Both packetized and pure text metadata are allowed.
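
To illustrate what "packetized plain-text" means, here is a minimal Python
sketch that strips the binary packet headers and returns only the text. The
37-byte header layout and the 0x75D11D57 magic are my reading of the CTF 1.8
specification, and the function name is hypothetical. Treat it as a sketch
under those assumptions; babeltrace remains the reference tool for this.

```python
import struct

# Assumed layout of a packed CTF 1.8 metadata packet header:
# magic, uuid[16], checksum, content_size (bits), packet_size (bits),
# compression, encryption, checksum scheme, major, minor.
METADATA_MAGIC = 0x75D11D57
HEADER_FMT = "I16sIIIBBBBB"
HEADER_SIZE = struct.calcsize("<" + HEADER_FMT)  # 37 bytes, no padding


def extract_metadata_text(raw: bytes) -> str:
    """Return the text of a metadata file, packetized or plain (hypothetical helper)."""
    order = None
    if len(raw) >= 4:
        # The magic number also tells us the byte order of the header fields.
        for candidate in ("<", ">"):
            if struct.unpack_from(candidate + "I", raw, 0)[0] == METADATA_MAGIC:
                order = candidate
                break
    if order is None:
        # No magic number: assume the file is already text-only metadata.
        return raw.decode("utf-8", errors="replace")

    chunks = []
    offset = 0
    while offset + HEADER_SIZE <= len(raw):
        fields = struct.unpack_from(order + HEADER_FMT, raw, offset)
        magic, _uuid, _checksum, content_bits, packet_bits = fields[:5]
        if magic != METADATA_MAGIC:
            raise ValueError(f"bad packet magic at offset {offset}")
        content_bytes = content_bits // 8  # header + text
        packet_bytes = packet_bits // 8    # header + text + padding
        if not HEADER_SIZE <= content_bytes <= packet_bytes:
            raise ValueError(f"inconsistent packet sizes at offset {offset}")
        chunks.append(raw[offset + HEADER_SIZE:offset + content_bytes])
        offset += packet_bytes  # skip the padding at the end of the packet
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling extract_metadata_text(open("metadata", "rb").read()) on a packetized
file should give the same kind of text that the babeltrace command produces,
assuming the header layout above is right.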

> I cannot provide the data file as it has confidential data. Looking at
> it with a hex editor, I see the same kind of garbage as in the metadata
> file, so both files are affected by the same problem.

The CTF data files are binary, so the garbage you see can be either
headers or padding.
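
A rough sanity check on a stream file is to look for the CTF packet magic
number, 0xC1FC1FC1. The Python sketch below assumes the packet header starts
with a 32-bit magic field, which is what the metadata declares for LTTng
traces as far as I know. The function name is hypothetical.

```python
import struct

CTF_STREAM_MAGIC = 0xC1FC1FC1


def stream_magic_byte_order(first_bytes: bytes):
    """Return "little" or "big" if the data starts with the CTF magic, else None."""
    if len(first_bytes) < 4:
        return None
    for label, fmt in (("little", "<I"), ("big", ">I")):
        if struct.unpack_from(fmt, first_bytes, 0)[0] == CTF_STREAM_MAGIC:
            return label
    return None
```

A stream file whose first four bytes do not match in either byte order is
worth a closer look. A match only shows that the first packet header starts
correctly, not that the rest of the file is intact.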

> 
> I've uploaded the metadata file to
> http://www.kdab.com/~thomas/stuff/metadata.
> 
> To double-check that it isn't file system corruption, I ran "yes >
> test.data" - that file is OK, so it's probably a different problem.
> 
> Any idea what can cause the corrupted trace?

Based on your babeltrace backtrace, the possible culprits would be the
events that have a sequence (variable-sized array):

syscalls: select, poll, ppoll, pselect6, epoll_wait, epoll_pwait

block_rq_issue, block_rq_insert, block_rq_complete, block_rq_requeue, block_rq_abort.

There are a few approaches to cornering the issue. You can try reproducing
on your workload/config by only enabling one of these events at a time.
Just knowing which event(s) is/are the culprit would be a good start.
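
The one-event-at-a-time approach could be scripted roughly as follows. This
Python sketch only builds and prints the command lines, it does not run
them. The session names and the output directory are hypothetical, the
workload step is left to you, and some of these events may not exist on
every kernel version.

```python
import shlex

# Events with a sequence field, as listed above.
SYSCALLS = ["select", "poll", "ppoll", "pselect6", "epoll_wait", "epoll_pwait"]
BLOCK_EVENTS = ["block_rq_issue", "block_rq_insert", "block_rq_complete",
                "block_rq_requeue", "block_rq_abort"]


def isolation_plan(output_root="/tmp/seq-bisect"):
    """Map each suspect event to the commands for a session tracing only it."""
    plan = {}
    for name in SYSCALLS + BLOCK_EVENTS:
        session = f"bisect-{name}"        # hypothetical session name
        out = f"{output_root}/{name}"     # hypothetical output directory
        enable = ["lttng", "enable-event", "--kernel"]
        if name in SYSCALLS:
            enable.append("--syscall")
        enable.append(name)
        plan[name] = [
            ["lttng", "create", session, f"--output={out}"],
            enable,
            ["lttng", "start"],
            # Run the workload that triggers the problem between start and stop.
            ["lttng", "stop"],
            ["lttng", "destroy", session],
            # Reading the trace back shows whether this event reproduces it.
            ["babeltrace", out],
        ]
    return plan


if __name__ == "__main__":
    for event, commands in isolation_plan().items():
        print(f"# {event}")
        for command in commands:
            print(shlex.join(command))
```

Any session where babeltrace fails on the resulting trace points at the
event enabled in that session.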

Another possibility would be to send us a trace reproducing the issue
with only those events enabled, which should not contain confidential
info about your system.

Thanks,

Mathieu

> 
> Regards,
> Thomas
> --
> Thomas McGuire | thomas.mcguire at kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com