[lttng-dev] snapshot data corrupted after the output buffer is filled

Anders Wallin wallinux at gmail.com
Mon Nov 17 04:45:37 EST 2014


Thx Mathieu,

if I'm running my test on one cpu (taskset -c 1 ./test.run) it works as
expected and described by you.
The babeltrace flag would be a good solution! Do you want me to file a
bug/enhancement on this?

Regards



Anders Wallin

On Sat, Nov 15, 2014 at 6:59 PM, Mathieu Desnoyers <
mathieu.desnoyers at efficios.com> wrote:

> ------------------------------
>
> *From: *"Anders Wallin" <wallinux at gmail.com>
> *To: *lttng-dev at lists.lttng.org
> *Sent: *Friday, November 14, 2014 8:28:18 AM
> *Subject: *[lttng-dev] snapshot data corrupted after the output buffer
> is        filled
>
> Hi,
>
> it looks like snapshot events became faulty after the output max size is
> reached.
> The next snapshot get mixed events from different buffers. I have tested
> it on 2.3, 2.4 and 2.5
> and it's in all of them, but it hits earlier in 2.5 due to the  problem
> reported in
> http://lists.lttng.org/pipermail/lttng-dev/2014-November/023773.html
>
> There should never be more then 2 HUNKS in a diff between 2 consecutive
> snapshots, here I got
> 11 as the worst case.
>
> Hi Anders,
>
> This is not a bug, rather just per-cpu buffering at work.It does have some
> rather non-obvious consequences in terms of snapshot event flow.
>
> To understand this, we need to keep in mind that LTTng does everything
> in per-cpu buffers, and that overwrite of oldest events in buffer full
> conditions
> happens per-cpu.
>
> The use-case you show below really makes sense when you start to consider
> on which CPU (and thus per-cpu buffer) the application is running.
> Initially,
> the first app runs on CPU 6. As long as the following applications run on
> other cpus, we keep the same first event. Then when enough events are
> written in cpu 6 buffer to overwrite the first event, we see a first event
> change.
> Then, depending on the scheduling of applications, older parts of per-cpu
> buffers are overwritten, and others are not (if there are less events in
> those
> buffer). This means some per-cpu buffers will have events going further
> back
> in time compared to other higher-throughput CPUs.
>
> This is why you see more than 2 hunks in your diff. Babeltrace merges those
> per-cpu buffers back into a single text output, and the fact that some
> events
> have been overwritten for a time-range for high-throughput buffers, but not
> for low-throughput buffers, make it appear as if events are missing in the
> stream if your model is a single system-wide buffer, but the missing events
> make sense if you see it as a merge of per-cpu buffers.
>
> Our current behavior is to gather all available data when taking a
> snapshot,
> and babeltrace behavior is to show all events available in the streams. We
> could perhaps consider a different babeltrace behavior (a flag) that would
> allow it to only show the events for the intersection of time range of all
> streams, based on the packet header timestamp begin/end. This would
> take care of hiding events that happen in low-throughput streams when
> there is no intersection with snapshotted high-throughput streams.
>
> And perhaps this should be the default ?
>
> Thanks!
>
> Mathieu
>
>
> TESTCASE;
>
> tracetest code:
> ---------------------------------------------------------------
> #include <stdio.h>
> #include "tracetest-tp.h"
>
> int main(int argc, char* argv[])
> {
>         int i;
>         for (i = 0; i < 1000; i++) {
>                 tracepoint(tracetest, first_tp, i, "test");
>         }
>         return 0;
> }
> ---------------------------------------------------------------
>
> test script:
> ---------------------------------------------------------------
> #!/bin/bash
> : ${NO_OF_SUBBUF:=6}
> : ${SUBBUF_SIZE:=4096}
> : ${KERNEL_TRACE:=NO}
> : ${LOOPS:=10}
>
> SESSION=test
> CHANNEL1=ch1
> CHANNEL2=ch2
>
> CH2_SUBBUF_SIZE=$(($SUBBUF_SIZE*$NO_OF_SUBBUF))
>
>
> LTTNG='lttng -n'
> killall lttng-consumerd > /dev/null 2>&1
> killall lttng-sessiond > /dev/null 2>&1
> if [ -z $DEBUG ]; then
>     lttng-sessiond --no-kernel -d
> else
>     lttng-sessiond -v --no-kernel --verbose-consumer -b
> fi
> sleep 1
>
> if [ ! -z $DEBUG ]; then
>     export LTTNG_UST_DEBUG=1
> fi
>
> rm -rf $HOME/lttng-traces
> rm snapshot/*.bt
>
> $LTTNG --version
>
> $LTTNG create $SESSION --snapshot
> $LTTNG enable-channel -u --subbuf-size $SUBBUF_SIZE --num-subbuf
> $NO_OF_SUBBUF --overwrite $CHANNEL1
> $LTTNG enable-channel -u --subbuf-size $CH2_SUBBUF_SIZE --num-subbuf 2
> --overwrite $CHANNEL2
> $LTTNG enable-event -u -c $CHANNEL1 "*"
> $LTTNG start
> $LTTNG list $SESSION
>
> mkdir -p snapshot/
>
> h=1
> for i in $(seq 1 $LOOPS); do
>     $APP
>     $LTTNG snapshot record
>     # print no of events and first and last event
>
>
>     last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>     babeltrace $last > snapshot/$i.bt
>     cat snapshot/$i.bt | wc -l
>     cat snapshot/$i.bt | head -1
>     cat snapshot/$i.bt | tail -1
>     diff -u snapshot/$i.bt snapshot/$h.bt > snapshot/$i_$h.diff
>     cat snapshot/$i_$h.diff | grep "@@" | wc -l | xargs echo "NUMBER OF
> DIFF HUNKS = "
>     h=$i
> done
>
> $LTTNG stop
> $LTTNG destroy $SESSION
>
> ---------------------------------------------------------------
>
> result from run (removed some output lines for readability)
> ---------------------------------------------------------------
> lttng (LTTng Trace Control) 2.4.2 - Époque Opaque
> Snapshot mode set. Every channel enabled for that session will be set in
> overwrite mode and mmap output.
> Warning: The number of subbuffers (6) is rounded to the next power of 2 (8)
> UST channel ch1 enabled for session test
> Warning: The subbuf size (24576) is rounded to the next power of 2 (32768)
> UST channel ch2 enabled for session test
> UST event * created in channel ch1
> Tracing started for session test
> Tracing session test: [active snapshot]
>     Trace path:
>
> === Domain: UST global ===
>
> Buffer type: per UID
>
> Channels:
> -------------
> - ch1: [enabled]
>
>     Attributes:
>       overwrite mode: 1
>       subbufers size: 4096
>       number of subbufers: 8
>       switch timer interval: 0
>       read timer interval: 0
>       output: mmap()
>
>     Events:
>       * (type: tracepoint) [enabled]
> - ch2: [enabled]
>
>     Attributes:
>       overwrite mode: 1
>       subbufers size: 32768
>       number of subbufers: 2
>       switch timer interval: 0
>       read timer interval: 0
>       output: mmap()
>
>     Events:
>       None
>
> Snapshot recorded successfully for session test
> 1011
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.279975100] (+0.000000975) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  0
> Snapshot recorded successfully for session test
> 2022
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.337703508] (+0.000001238) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  1
> Snapshot recorded successfully for session test
> 3033
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.432827051] (+0.000000308) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  1
> Snapshot recorded successfully for session test
> 4044
> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.475784825] (+0.000000501) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  1
> Snapshot recorded successfully for session test
> 4735
> [07:09:43.279031320] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
> [07:09:43.549945106] (+0.000000517) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  12
> Snapshot recorded successfully for session test
> 4437
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.654229030] (+0.000000436) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  3
> Snapshot recorded successfully for session test
> 4746
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.747509349] (+0.000000502) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  11
> Snapshot recorded successfully for session test
> 5757
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.814938128] (+0.000001371) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  1
> Snapshot recorded successfully for session test
> 6066
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:43.919112586] (+0.000000990) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  11
> Snapshot recorded successfully for session test
> 6077
> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
> sopath = "[vdso]", size = 0, mtime = -1 }
> [07:09:44.004234933] (+0.000001213) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
> NUMBER OF DIFF HUNKS =  2
> Waiting for data availability
> Tracing stopped for session test
> Session test destroyed
> ---------------------------------------------------------------
>
> Regards
>
> Anders Wallin
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20141117/aacb6333/attachment-0001.html>


More information about the lttng-dev mailing list