[lttng-dev] snapshot issue with 2.5

Julien Desfossez jdesfossez at efficios.com
Wed Nov 12 18:11:00 EST 2014


Hi,

On 14-11-12 05:30 PM, Jérémie Galarneau wrote:
> On Wed, Nov 12, 2014 at 7:39 AM, Anders Wallin <wallinux at gmail.com> wrote:
>> Hi,
>>
>> using userspace snapshot mode in 2.5 seems to be broken.
>>
>> I have a small, trivial application (tracetest) generating 1000 events
>>
>> Running the script (below is a simplified version of the script)
>> ----------------------------------------------------------------------------------
>> #!/bin/bash
>> SESSION=test
>> LOOPS=${LOOPS:-10}   # iteration count; placeholder, original value not shown
>> LTTNG='lttng -n'
>> lttng-sessiond --no-kernel &
> 
> You should use the "-d" flag to launch the sessiond as a daemon. Using
> "&" might cause the "create" command to launch a second session daemon
> since the first one would not have had the time to create its command
> sockets.
> 
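(A side note on that: the corrected launch would be along these lines,
assuming the -d/--daemonize option of your lttng-sessiond:

    lttng-sessiond -d --no-kernel

As far as I know, with -d the command only returns once the session daemon
is ready to accept commands, so the following "create" cannot race with it.)
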
>>
>> $LTTNG create test --snapshot
>> $LTTNG enable-channel -u --subbuf-size 64k --num-subbuf 16 --overwrite ch1
>> $LTTNG enable-event -u -c ch1 "*"
>> $LTTNG start
>> $LTTNG list ch1
>>
>> mkdir -p snapshot/
>>
>> for i in $(seq 1 $LOOPS); do
>>     ./tracetest
>>     $LTTNG snapshot record -m 1G
>>
>>     # print no of events and first and last event
>>     last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>>     babeltrace $last > snapshot/$i.bt
>>     cat snapshot/$i.bt | wc -l
>>     cat snapshot/$i.bt | head -1
>>     cat snapshot/$i.bt | tail -1
>> done
>>
>> $LTTNG stop
>> $LTTNG destroy $SESSION
>> ----------------------------------------------------------------------------------
>>
>> The result of the run looks like this;
>> ./test.run
>> Error: consumer err socket second poll error
>> Error: Health error occurred in thread_manage_consumer
>> PERROR - 13:36:59.685167 [23772/23781]: bind inet: Address already in use
>> (in lttcomm_bind_inet_sock() at inet.c:109)
>> Warning: Another session daemon is using this JUL port. JUL support will be
>> deactivated to prevent interfering with the tracing.
>>
> 
> Is there a reason why you run two session daemons? I'm guessing this
> is related to my "& vs -d" comment above.
> 
>> lttng (LTTng Trace Control) 2.5.2 - Fumisterie
>> BabelTrace Trace Viewer and Converter 1.2.4
>>         linux-vdso.so.1 =>  (0x00007fffba966000)
>>         liblttng-ctl.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ctl.so.0
>> (0x00007fbfe4b1d000)
>>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbfe4915000)
>>         libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2
>> (0x00007fbfe45ae000)
>>         libpopt.so.0 => /lib/x86_64-linux-gnu/libpopt.so.0
>> (0x00007fbfe43a2000)
>>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
>> (0x00007fbfe4184000)
>>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbfe3dbd000)
>>         /lib64/ld-linux-x86-64.so.2 (0x00007fbfe4fac000)
>>         liburcu.so.2 => /usr/lib/x86_64-linux-gnu/liburcu.so.2
>> (0x00007fbfe3bb6000)
>>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbfe39b2000)
>>         libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbfe3798000)
>>         liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5
>> (0x00007fbfe3576000)
>>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbfe3270000)
>>
>> Session test created.
>> Default snapshot output set to:
>> /home/awallin/lttng-traces/test-20141112-133700
>> Snapshot mode set. Every channel enabled for that session will be set in
>> overwrite mode and mmap output.
>> UST channel ch1 enabled for session test
>> UST event * created in channel ch1
>> Tracing started for session test
>> Tracing session test: [active snapshot]
>>     Trace path:
>>     Live timer interval (usec): 4294967295
>>
>> === Domain: UST global ===
>>
>> Channels:
>> -------------
>> - ch1: [enabled]
>>
>>     Attributes:
>>       overwrite mode: 1
>>       subbufers size: 4096
>>       number of subbufers: 16
>>       switch timer interval: 0
>>       read timer interval: 0
>>       trace file count: 0
>>       trace file size (bytes): 0
>>       output: mmap()
>>
>>     Events:
>>       * (type: tracepoint) [enabled]
>>
>> Snapshot recorded successfully for session test
>> 84
>> [13:37:00.852615950] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFFC6BC7000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:00.853798949] (+0.000000805) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 157
>> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>> [13:37:00.894839116] (+0.000000299) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>> [13:37:00.929447330] (+0.000000305) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test

Something is wrong from here:
>> 168
>> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:00.972035546] (+0.000000597) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:01.014060940] (+0.000000341) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:01.052555877] (+0.000000872) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
... to here.

The beginning timestamp is the same, the number of events is the same,
but the end timestamp moves. We also see this pattern below.
I would be really interested to see the whole snapshots when this happens.
It seems like we are not overwriting the beginning of the ring buffer
under some conditions. Could you send us your tracetest program and the
script so we can try to reproduce the problem?
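
In the meantime, here is roughly what I assume tracetest looks like, guessed
only from the event and field names in your babeltrace output
(tracetest:first_tp, my_string_field, my_integer_field). The file names and
loop below are made up; we still need your real program, but this is the
kind of reproducer we would start from:

    /* tracetest_tp.h -- hypothetical tracepoint provider */
    #undef TRACEPOINT_PROVIDER
    #define TRACEPOINT_PROVIDER tracetest

    #undef TRACEPOINT_INCLUDE
    #define TRACEPOINT_INCLUDE "./tracetest_tp.h"

    #if !defined(_TRACETEST_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
    #define _TRACETEST_TP_H

    #include <lttng/tracepoint.h>

    TRACEPOINT_EVENT(
        tracetest,
        first_tp,
        TP_ARGS(const char *, text, int, value),
        TP_FIELDS(
            ctf_string(my_string_field, text)
            ctf_integer(int, my_integer_field, value)
        )
    )

    #endif /* _TRACETEST_TP_H */

    #include <lttng/tracepoint-event.h>

    /* tracetest.c -- emits 1000 events per run */
    #define TRACEPOINT_CREATE_PROBES
    #define TRACEPOINT_DEFINE
    #include "tracetest_tp.h"

    int main(void)
    {
        int i;

        /* my_integer_field counts up to 999, matching your last events */
        for (i = 0; i < 1000; i++)
            tracepoint(tracetest, first_tp, "test", i);
        return 0;
    }

built with something like:

    gcc -o tracetest tracetest.c -llttng-ust -ldl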

>> Snapshot recorded successfully for session test
>> 168
>> [13:37:01.014035993] (+?.?????????) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 927 }
>> [13:37:01.098620808] (+0.000001404) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:01.172829105] (+0.000000575) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> Snapshot recorded successfully for session test
>> 168
>> [13:37:01.172021557] (+?.?????????) arn-awallin-mint-l3
>> ust_baddr_statedump:soinfo: { cpu_id = 7 }, { baddr = 0x7FFF653FE000, sopath
>> = "[vdso]", size = 0, mtime = -1 }
>> [13:37:01.220074165] (+0.000000553) arn-awallin-mint-l3 tracetest:first_tp:
>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> Waiting for data availability
>> Tracing stopped for session test
>> Session test destroyed
>>
>> the result of the same run on older versions of lttng is that the event
>> traces get bigger and bigger until they reach the max size.
We see this pattern here too, and the max size seems to be 168 events; was
it similar in your previous tests?

> 
> Unless I'm mistaken, the behavior of the snapshot mode has changed
> between 2.4 and 2.5.
> 
> In the 2.4 series, consecutive snapshots could overlap; that is,
> events could be present in multiple snapshots. As of 2.5, an event is
> guaranteed to only appear in one snapshot as the events are considered
> as having been "consumed" at that point.

Hmm, no, the behavior is still the same: consecutive snapshots can overlap.

Could you tell us on which architecture you are experiencing the problem,
and with how many CPUs?
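(For example, the output of "uname -m" and "nproc", or of "lscpu", would be
enough.)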

Thanks,

Julien


