[lttng-dev] snapshot data corrupted after the output buffer is filled

Anders Wallin wallinux at gmail.com
Wed Nov 19 05:31:56 EST 2014


Filed as https://bugs.lttng.org/issues/862



Anders Wallin

On Mon, Nov 17, 2014 at 12:45 PM, Mathieu Desnoyers <
mathieu.desnoyers at efficios.com> wrote:

> Hi Anders,
>
> Yes, please file a feature request against Babeltrace on
> the bug tracker pointing the this email thread and
> describing the feature.
>
> Thanks!
>
> Mathieu
>
>
> ------------------------------
>
> *From: *"Anders Wallin" <wallinux at gmail.com>
> *To: *"Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> *Cc: *lttng-dev at lists.lttng.org
> *Sent: *Monday, November 17, 2014 10:45:37 AM
> *Subject: *Re: [lttng-dev] snapshot data corrupted after the output
> buffer is filled
>
>
> Thx Mathieu,
>
> if I'm running my test on one cpu (taskset -c 1 ./test.run) it works as
> expected and described by you.
> The babeltrace flag would be a good solution! Do you want me to file a
> bug/enhancement on this?
>
> Regards
>
>
>
> Anders Wallin
>
> On Sat, Nov 15, 2014 at 6:59 PM, Mathieu Desnoyers <
> mathieu.desnoyers at efficios.com> wrote:
>
>> ------------------------------
>>
>> *From: *"Anders Wallin" <wallinux at gmail.com>
>> *To: *lttng-dev at lists.lttng.org
>> *Sent: *Friday, November 14, 2014 8:28:18 AM
>> *Subject: *[lttng-dev] snapshot data corrupted after the output buffer
>> is        filled
>>
>> Hi,
>>
>> it looks like snapshot events became faulty after the output max size is
>> reached.
>> The next snapshot get mixed events from different buffers. I have tested
>> it on 2.3, 2.4 and 2.5
>> and it's in all of them, but it hits earlier in 2.5 due to the  problem
>> reported in
>> http://lists.lttng.org/pipermail/lttng-dev/2014-November/023773.html
>>
>> There should never be more then 2 HUNKS in a diff between 2 consecutive
>> snapshots, here I got
>> 11 as the worst case.
>>
>> Hi Anders,
>>
>> This is not a bug, rather just per-cpu buffering at work.It does have some
>> rather non-obvious consequences in terms of snapshot event flow.
>>
>> To understand this, we need to keep in mind that LTTng does everything
>> in per-cpu buffers, and that overwrite of oldest events in buffer full
>> conditions
>> happens per-cpu.
>>
>> The use-case you show below really makes sense when you start to consider
>> on which CPU (and thus per-cpu buffer) the application is running.
>> Initially,
>> the first app runs on CPU 6. As long as the following applications run on
>> other cpus, we keep the same first event. Then when enough events are
>> written in cpu 6 buffer to overwrite the first event, we see a first
>> event change.
>> Then, depending on the scheduling of applications, older parts of per-cpu
>> buffers are overwritten, and others are not (if there are less events in
>> those
>> buffer). This means some per-cpu buffers will have events going further
>> back
>> in time compared to other higher-throughput CPUs.
>>
>> This is why you see more than 2 hunks in your diff. Babeltrace merges
>> those
>> per-cpu buffers back into a single text output, and the fact that some
>> events
>> have been overwritten for a time-range for high-throughput buffers, but
>> not
>> for low-throughput buffers, make it appear as if events are missing in the
>> stream if your model is a single system-wide buffer, but the missing
>> events
>> make sense if you see it as a merge of per-cpu buffers.
>>
>> Our current behavior is to gather all available data when taking a
>> snapshot,
>> and babeltrace behavior is to show all events available in the streams. We
>> could perhaps consider a different babeltrace behavior (a flag) that would
>> allow it to only show the events for the intersection of time range of all
>> streams, based on the packet header timestamp begin/end. This would
>> take care of hiding events that happen in low-throughput streams when
>> there is no intersection with snapshotted high-throughput streams.
>>
>> And perhaps this should be the default ?
>>
>> Thanks!
>>
>> Mathieu
>>
>>
>> TESTCASE;
>>
>> tracetest code:
>> ---------------------------------------------------------------
>> #include <stdio.h>
>> #include "tracetest-tp.h"
>>
>> int main(int argc, char* argv[])
>> {
>>         int i;
>>         for (i = 0; i < 1000; i++) {
>>                 tracepoint(tracetest, first_tp, i, "test");
>>         }
>>         return 0;
>> }
>> ---------------------------------------------------------------
>>
>> test script:
>> ---------------------------------------------------------------
>> #!/bin/bash
>> : ${NO_OF_SUBBUF:=6}
>> : ${SUBBUF_SIZE:=4096}
>> : ${KERNEL_TRACE:=NO}
>> : ${LOOPS:=10}
>>
>> SESSION=test
>> CHANNEL1=ch1
>> CHANNEL2=ch2
>>
>> CH2_SUBBUF_SIZE=$(($SUBBUF_SIZE*$NO_OF_SUBBUF))
>>
>>
>> LTTNG='lttng -n'
>> killall lttng-consumerd > /dev/null 2>&1
>> killall lttng-sessiond > /dev/null 2>&1
>> if [ -z $DEBUG ]; then
>>     lttng-sessiond --no-kernel -d
>> else
>>     lttng-sessiond -v --no-kernel --verbose-consumer -b
>> fi
>> sleep 1
>>
>> if [ ! -z $DEBUG ]; then
>>     export LTTNG_UST_DEBUG=1
>> fi
>>
>> rm -rf $HOME/lttng-traces
>> rm snapshot/*.bt
>>
>> $LTTNG --version
>>
>> $LTTNG create $SESSION --snapshot
>> $LTTNG enable-channel -u --subbuf-size $SUBBUF_SIZE --num-subbuf
>> $NO_OF_SUBBUF --overwrite $CHANNEL1
>> $LTTNG enable-channel -u --subbuf-size $CH2_SUBBUF_SIZE --num-subbuf 2
>> --overwrite $CHANNEL2
>> $LTTNG enable-event -u -c $CHANNEL1 "*"
>> $LTTNG start
>> $LTTNG list $SESSION
>>
>> mkdir -p snapshot/
>>
>> h=1
>> for i in $(seq 1 $LOOPS); do
>>     $APP
>>     $LTTNG snapshot record
>>     # print no of events and first and last event
>>
>>
>>     last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>>     babeltrace $last > snapshot/$i.bt
>>     cat snapshot/$i.bt | wc -l
>>     cat snapshot/$i.bt | head -1
>>     cat snapshot/$i.bt | tail -1
>>     diff -u snapshot/$i.bt snapshot/$h.bt > snapshot/$i_$h.diff
>>     cat snapshot/$i_$h.diff | grep "@@" | wc -l | xargs echo "NUMBER OF
>> DIFF HUNKS = "
>>     h=$i
>> done
>>
>> $LTTNG stop
>> $LTTNG destroy $SESSION
>>
>> ---------------------------------------------------------------
>>
>> result from run (removed some output lines for readability)
>> ---------------------------------------------------------------
>> lttng (LTTng Trace Control) 2.4.2 - Époque Opaque
>> Snapshot mode set. Every channel enabled for that session will be set in
>> overwrite mode and mmap output.
>> Warning: The number of subbuffers (6) is rounded to the next power of 2
>> (8)
>> UST channel ch1 enabled for session test
>> Warning: The subbuf size (24576) is rounded to the next power of 2 (32768)
>> UST channel ch2 enabled for session test
>> UST event * created in channel ch1
>> Tracing started for session test
>> Tracing session test: [active snapshot]
>>     Trace path:
>>
>> === Domain: UST global ===
>>
>> Buffer type: per UID
>>
>> Channels:
>> -------------
>> - ch1: [enabled]
>>
>>     Attributes:
>>       overwrite mode: 1
>>       subbufers size: 4096
>>       number of subbufers: 8
>>       switch timer interval: 0
>>       read timer interval: 0
>>       output: mmap()
>>
>>     Events:
>>       * (type: tracepoint) [enabled]
>> - ch2: [enabled]
>>
>>     Attributes:
>>       overwrite mode: 1
>>       subbufers size: 32768
>>       number of subbufers: 2
>>       switch timer interval: 0
>>       read timer interval: 0
>>       output: mmap()
>>
>>     Events:
>>       None
>>
>> Snapshot recorded successfully for session test
>> 1011
>> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.279975100] (+0.000000975) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  0
>> Snapshot recorded successfully for session test
>> 2022
>> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.337703508] (+0.000001238) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  1
>> Snapshot recorded successfully for session test
>> 3033
>> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.432827051] (+0.000000308) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  1
>> Snapshot recorded successfully for session test
>> 4044
>> [07:09:43.278933559] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF645EE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.475784825] (+0.000000501) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  1
>> Snapshot recorded successfully for session test
>> 4735
>> [07:09:43.279031320] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
>> [07:09:43.549945106] (+0.000000517) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  12
>> Snapshot recorded successfully for session test
>> 4437
>> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.654229030] (+0.000000436) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  3
>> Snapshot recorded successfully for session test
>> 4746
>> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.747509349] (+0.000000502) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  11
>> Snapshot recorded successfully for session test
>> 5757
>> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.814938128] (+0.000001371) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  1
>> Snapshot recorded successfully for session test
>> 6066
>> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:43.919112586] (+0.000000990) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  11
>> Snapshot recorded successfully for session test
>> 6077
>> [07:09:43.335916076] (+?.?????????) 07f8cf84a38f
>> ust_baddr_statedump:soinfo: { cpu_id = 2 }, { baddr = 0x7FFFAD1FE000,
>> sopath = "[vdso]", size = 0, mtime = -1 }
>> [07:09:44.004234933] (+0.000001213) 07f8cf84a38f tracetest:first_tp: {
>> cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> NUMBER OF DIFF HUNKS =  2
>> Waiting for data availability
>> Tracing stopped for session test
>> Session test destroyed
>> ---------------------------------------------------------------
>>
>> Regards
>>
>> Anders Wallin
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev at lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>>
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20141119/7dd024ba/attachment-0001.html>


More information about the lttng-dev mailing list