[lttng-dev] snapshot data corrupted after the output buffer is filled

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Sat Nov 15 12:59:54 EST 2014


----- Original Message -----

> From: "Anders Wallin" <wallinux at gmail.com>
> To: lttng-dev at lists.lttng.org
> Sent: Friday, November 14, 2014 8:28:18 AM
> Subject: [lttng-dev] snapshot data corrupted after the output buffer is
> filled

> Hi,

> it looks like snapshot events became faulty after the output max size is
> reached.
> The next snapshot get mixed events from different buffers. I have tested it
> on 2.3, 2.4 and 2.5
> and it's in all of them, but it hits earlier in 2.5 due to the problem
> reported in
> http://lists.lttng.org/pipermail/lttng-dev/2014-November/023773.html

> There should never be more then 2 HUNKS in a diff between 2 consecutive
> snapshots, here I got
> 11 as the worst case.

Hi Anders, 

This is not a bug, rather just per-cpu buffering at work.It does have some 
rather non-obvious consequences in terms of snapshot event flow. 

To understand this, we need to keep in mind that LTTng does everything 
in per-cpu buffers, and that overwrite of oldest events in buffer full conditions 
happens per-cpu. 

The use-case you show below really makes sense when you start to consider 
on which CPU (and thus per-cpu buffer) the application is running. Initially, 
the first app runs on CPU 6. As long as the following applications run on 
other cpus, we keep the same first event. Then when enough events are 
written in cpu 6 buffer to overwrite the first event, we see a first event change. 
Then, depending on the scheduling of applications, older parts of per-cpu 
buffers are overwritten, and others are not (if there are less events in those 
buffer). This means some per-cpu buffers will have events going further back 
in time compared to other higher-throughput CPUs. 

This is why you see more than 2 hunks in your diff. Babeltrace merges those 
per-cpu buffers back into a single text output, and the fact that some events 
have been overwritten for a time-range for high-throughput buffers, but not 
for low-throughput buffers, make it appear as if events are missing in the 
stream if your model is a single system-wide buffer, but the missing events 
make sense if you see it as a merge of per-cpu buffers. 

Our current behavior is to gather all available data when taking a snapshot, 
and babeltrace behavior is to show all events available in the streams. We 
could perhaps consider a different babeltrace behavior (a flag) that would 
allow it to only show the events for the intersection of time range of all 
streams, based on the packet header timestamp begin/end. This would 
take care of hiding events that happen in low-throughput streams when 
there is no intersection with snapshotted high-throughput streams. 

And perhaps this should be the default ? 

Thanks! 

Mathieu 

> TESTCASE;

> tracetest code:
> ---------------------------------------------------------------
> #include <stdio.h>
> #include "tracetest-tp.h"

> int main(int argc, char* argv[])
> {
> int i;
> for (i = 0; i < 1000; i++) {
> tracepoint(tracetest, first_tp, i, "test");
> }
> return 0;
> }
> ---------------------------------------------------------------

> test script:
> ---------------------------------------------------------------
> #!/bin/bash
> : ${NO_OF_SUBBUF:=6}
> : ${SUBBUF_SIZE:=4096}
> : ${KERNEL_TRACE:=NO}
> : ${LOOPS:=10}

> SESSION=test
> CHANNEL1=ch1
> CHANNEL2=ch2

> CH2_SUBBUF_SIZE=$(($SUBBUF_SIZE*$NO_OF_SUBBUF))

> LTTNG='lttng -n'
> killall lttng-consumerd > /dev/null 2>&1
> killall lttng-sessiond > /dev/null 2>&1
> if [ -z $DEBUG ]; then
> lttng-sessiond --no-kernel -d
> else
> lttng-sessiond -v --no-kernel --verbose-consumer -b
> fi
> sleep 1

> if [ ! -z $DEBUG ]; then
> export LTTNG_UST_DEBUG=1
> fi

> rm -rf $HOME/lttng-traces
> rm snapshot/*.bt

> $LTTNG --version

> $LTTNG create $SESSION --snapshot
> $LTTNG enable-channel -u --subbuf-size $SUBBUF_SIZE --num-subbuf
> $NO_OF_SUBBUF --overwrite $CHANNEL1
> $LTTNG enable-channel -u --subbuf-size $CH2_SUBBUF_SIZE --num-subbuf 2
> --overwrite $CHANNEL2
> $LTTNG enable-event -u -c $CHANNEL1 "*"
> $LTTNG start
> $LTTNG list $SESSION

> mkdir -p snapshot/

> h=1
> for i in $(seq 1 $LOOPS); do
> $APP
> $LTTNG snapshot record
> # print no of events and first and last event
> last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
> babeltrace $last > snapshot/$ i.bt
> cat snapshot/$ i.bt | wc -l
> cat snapshot/$ i.bt | head -1
> cat snapshot/$ i.bt | tail -1
> diff -u snapshot/$ i.bt snapshot/$ h.bt > snapshot/$i_$h.diff
> cat snapshot/$i_$h.diff | grep "@@" | wc -l | xargs echo "NUMBER OF DIFF
> HUNKS = "
> h=$i
> done

> $LTTNG stop
> $LTTNG destroy $SESSION

> ---------------------------------------------------------------

> result from run (removed some output lines for readability)
> ---------------------------------------------------------------
> lttng (LTTng Trace Control) 2.4.2 - Époque Opaque
> Snapshot mode set. Every channel enabled for that session will be set in
> overwrite mode and mmap output.
> Warning: The number of subbuffers (6) is rounded to the next power of 2 (8)
> UST channel ch1 enabled for session test
> Warning: The subbuf size (24576) is rounded to the next power of 2 (32768)
> UST channel ch2 enabled for session test
> UST event * created in channel ch1
> Tracing started for session test
> Tracing session test: [active snapshot]
> Trace path:

> === Domain: UST global ===

> Buffer type: per UID

> Channels:
> -------------
> - ch1: [enabled]

> Attributes:
> overwrite mode: 1
> subbufers size: 4096
> number of subbufers: 8
> switch timer interval: 0
> read timer interval: 0
> output: mmap()

> Events:
> * (type: tracepoint) [enabled]
> - ch2: [enabled]

> Attributes:
> overwrite mode: 1
> subbufers size: 32768
> number of subbufers: 2
> switch timer interval: 0
> read timer interval: 0
> output: mmap()

> Events:
> None

> Snapshot recorded successfully for session test
> 1011
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 6 }, { baddr = 0x7FFF645EE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.279975100] (+0.000000975) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 5 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 0
> Snapshot recorded successfully for session test
> 2022
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 6 }, { baddr = 0x7FFF645EE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.337703508] (+0.000001238) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 4 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 1
> Snapshot recorded successfully for session test
> 3033
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 6 }, { baddr = 0x7FFF645EE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.432827051] (+0.000000308) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 1
> Snapshot recorded successfully for session test
> 4044
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 6 }, { baddr = 0x7FFF645EE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.475784825] (+0.000000501) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 5 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 1
> Snapshot recorded successfully for session test
> 4735
> [07:09:43.279031320] (+?.?????????) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 5 }, { my_string_field = "test", my_integer_field = 0 }
> [07:09:43.549945106] (+0.000000517) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 12
> Snapshot recorded successfully for session test
> 4437
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.654229030] (+0.000000436) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 5 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 3
> Snapshot recorded successfully for session test
> 4746
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.747509349] (+0.000000502) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 11
> Snapshot recorded successfully for session test
> 5757
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.814938128] (+0.000001371) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 4 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 1
> Snapshot recorded successfully for session test
> 6066
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:43.919112586] (+0.000000990) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 11
> Snapshot recorded successfully for session test
> 6077
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f ust_baddr_statedump:soinfo:
> { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000, sopath = "[vdso]", size = 0, mtime
> = -1 }
> [07:09:44.004234933] (+0.000001213) 07f8cf84a38f tracetest:first_tp: { cpu_id
> = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS = 2
> Waiting for data availability
> Tracing stopped for session test
> Session test destroyed
> ---------------------------------------------------------------

> Regards

> Anders Wallin

> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20141115/3de3ae79/attachment-0001.html>


More information about the lttng-dev mailing list