[lttng-dev] Allocation failures with babeltrace and TraceCompass - corrupt trace?

Fri Jun 23 15:48:55 UTC 2017

Hi,

On 14.06.2017 18:39, Mathieu Desnoyers wrote:
>> On 14.06.2017 17:12, Mathieu Desnoyers wrote:
>>> Can you provide a copy of the metadata file ? And ideally the data
>>> streams too ? This would give us a better idea of what is happening.
>>>
>>> Do you perform kernel or user-space tracing ? Do you trace huge
>>> sequences of bytes within your own tracepoints ?
>> I perform kernel traceing only, in this case limited to syscalls,
>> sched*, block* and irq*. No user-space tracepoints.
>>
>> I didn't know the metadata file was plain text, I had a quick look into
>> it and noticed corruption already, with random garbage data inserted all
>> over the place. I'm surprised babeltrace didn't choke on the metadata
>> already.
> The lttng metadata is "packetized plain-text". What you see is plain-text in
> a transport layer which is binary. This explains the "garbage" you see:
> those are binary headers for packets. Use babeltrace -o ctf-metadata
> to extract the text-only metadata (which is also valid metadata under CTF).
> Both packetized and pure text metadata are allowed.

Upps, right, red herring then. I've read the overview of the CTF format
now (great docs) and understand things a bit better.

>> Any idea what can cause the corrupted trace?
> Based on your babeltrace backtrace, the possible culprits would be the
> events that have a sequence (variable-sized array):
> 
> syscalls: select, poll, ppoll, pselect6, epoll_wait, epoll_pwait
> 
> block_rq_issue, block_rq_insert, block_rq_complete, block_rq_requeue, block_rq_abort.
> 
> There are a few approaches to cornering the issue. You can try reproducing
> on your workload/config by only enabling one of these events at a time.
> Just knowing which event(s) is/are the culprit would be a good start.
> 
> Another possibility would be to send us a trace reproducing the issue
> with only those events enabled, which should not contain confidential
> info about your system.

I've added some debug statements to babeltrace now. The culprit in this
particular case is the first block_rq_complete event, the __cmd_length
field contains a large value (3040877592). __cmd_length is used as the
length for the _cmd sequence, and then of course allocating space for
that sequence fails.

Any idea what can cause __cmd_length to be bogus?

For reference, this is the event declaration:
event {
	name = "block_rq_complete";
	id = 4;
	stream_
	fields := struct {
		integer { size = 32; align = 32; signed = 0; encoding = none; base =
10; } _dev;
		integer { size = 64; align = 64; signed = 0; encoding = none; base =
10; } _sector;
		integer { size = 32; align = 32; signed = 0; encoding = none; base =
10; } _nr_sector;
		integer { size = 32; align = 32; signed = 1; encoding = none; base =
10; } _errors;
		integer { size = 32; align = 32; signed = 0; encoding = none; base =
10; } _rwbs;
		integer { size = 32; align = 32; signed = 0; encoding = none; base =
10; } __cmd_length;
		integer { size = 8; align = 8; signed = 0; encoding = none; base = 16;
} _cmd[ __cmd_length ];
	};
};

Also, does anyone know which block_rq_ events are actually needed to
display the IO load in TraceCompass?

Regards,
Thomas
-- 
Thomas McGuire | thomas.mcguire at kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4000 bytes
Desc: S/MIME Cryptographic Signature
URL: <https://lists.lttng.org/pipermail/lttng-dev/attachments/20170623/cd91ae1d/attachment.bin>