[lttng-dev] snapshot issue with 2.5

Jérémie Galarneau jeremie.galarneau at efficios.com
Wed Nov 12 18:15:10 EST 2014


On Wed, Nov 12, 2014 at 6:11 PM, Julien Desfossez
<jdesfossez at efficios.com> wrote:
> Hi,
>
> On 14-11-12 05:30 PM, Jérémie Galarneau wrote:
>> On Wed, Nov 12, 2014 at 7:39 AM, Anders Wallin <wallinux at gmail.com> wrote:
>>> Hi,
>>>
>>> using userspace snapshot mode in 2.5 seems to be broken.
>>>
>>> I have a small, simple application (tracetest) generating 1000 events.
>>>
>>> Running the script (below is a simplified version of it)
>>> ----------------------------------------------------------------------------------
>>> #!/bin/bash
>>> LTTNG='lttng -n'
>>> lttng-sessiond --no-kernel &
>>
>> You should use the "-d" flag to launch the sessiond as a daemon. Using
>> "&" might cause the "create" command to launch a second session daemon,
>> since the first one would not have had time to create its command
>> sockets yet.
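>>
>> For example (a minimal sketch of the intended startup; only the
>> daemonization step changes compared to your script):
>>
>>     lttng-sessiond --no-kernel -d    # daemonize instead of backgrounding with '&'
>>     lttng create test --snapshot     # should now reach the already-running daemon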
>>
>>>
>>> $LTTNG create test --snapshot
>>> $LTTNG enable-channel -u --subbuf-size 64k --num-subbuf 16 --overwrite ch1
>>> $LTTNG enable-event -u -c ch1 "*"
>>> $LTTNG start
>>> $LTTNG list ch1
>>>
>>> mkdir -p snapshot/
>>>
>>> for i in $(seq 1 $LOOPS); do
>>>     ./tracetest
>>>     $LTTNG snapshot record -m 1G
>>>
>>>     # print the number of events and the first and last event
>>>     last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>>>     babeltrace $last > snapshot/$i.bt
>>>     wc -l < snapshot/$i.bt
>>>     head -1 snapshot/$i.bt
>>>     tail -1 snapshot/$i.bt
>>> done
>>>
>>> $LTTNG stop
>>> $LTTNG destroy $SESSION
>>> ----------------------------------------------------------------------------------
>>>
>>> The result of the run looks like this;
>>> ./test.run
>>> Error: consumer err socket second poll error
>>> Error: Health error occurred in thread_manage_consumer
>>> PERROR - 13:36:59.685167 [23772/23781]: bind inet: Address already in use
>>> (in lttcomm_bind_inet_sock() at inet.c:109)
>>> Warning: Another session daemon is using this JUL port. JUL support will be
>>> deactivated to prevent interfering with the tracing.
>>>
>>
>> Is there a reason why you run two session daemons? I'm guessing this
>> is related to my "& vs -d" comment above.
>>
>>> lttng (LTTng Trace Control) 2.5.2 - Fumisterie
>>> BabelTrace Trace Viewer and Converter 1.2.4
>>>         linux-vdso.so.1 =>  (0x00007fffba966000)
>>>         liblttng-ctl.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ctl.so.0
>>> (0x00007fbfe4b1d000)
>>>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbfe4915000)
>>>         libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2
>>> (0x00007fbfe45ae000)
>>>         libpopt.so.0 => /lib/x86_64-linux-gnu/libpopt.so.0
>>> (0x00007fbfe43a2000)
>>>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
>>> (0x00007fbfe4184000)
>>>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbfe3dbd000)
>>>         /lib64/ld-linux-x86-64.so.2 (0x00007fbfe4fac000)
>>>         liburcu.so.2 => /usr/lib/x86_64-linux-gnu/liburcu.so.2
>>> (0x00007fbfe3bb6000)
>>>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbfe39b2000)
>>>         libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fbfe3798000)
>>>         liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5
>>> (0x00007fbfe3576000)
>>>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbfe3270000)
>>>
>>> Session test created.
>>> Default snapshot output set to:
>>> /home/awallin/lttng-traces/test-20141112-133700
>>> Snapshot mode set. Every channel enabled for that session will be set in
>>> overwrite mode and mmap output.
>>> UST channel ch1 enabled for session test
>>> UST event * created in channel ch1
>>> Tracing started for session test
>>> Tracing session test: [active snapshot]
>>>     Trace path:
>>>     Live timer interval (usec): 4294967295
>>>
>>> === Domain: UST global ===
>>>
>>> Channels:
>>> -------------
>>> - ch1: [enabled]
>>>
>>>     Attributes:
>>>       overwrite mode: 1
>>>       subbufers size: 4096
>>>       number of subbufers: 16
>>>       switch timer interval: 0
>>>       read timer interval: 0
>>>       trace file count: 0
>>>       trace file size (bytes): 0
>>>       output: mmap()
>>>
>>>     Events:
>>>       * (type: tracepoint) [enabled]
>>>
>>> Snapshot recorded successfully for session test
>>> 84
>>> [13:37:00.852615950] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFFC6BC7000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:00.853798949] (+0.000000805) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 157
>>> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>>> [13:37:00.894839116] (+0.000000299) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>>> [13:37:00.929447330] (+0.000000305) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>
> Something is wrong from here:
>>> 168
>>> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:00.972035546] (+0.000000597) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:01.014060940] (+0.000000341) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:01.052555877] (+0.000000872) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
> ... to here.
>
> The beginning timestamp is the same, the number of events is the same,
> but the end timestamp moves. We also see this pattern below.
> I would be really interested to see the whole snapshots when this happens.
> It seems like we are not overwriting the beginning of the ring buffer
> under some conditions. Could you send us your tracetest program and the
> script so we can try to reproduce the problem?
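>
> In the meantime, something along these lines (a sketch that reuses the
> same path pattern as your script and assumes the default snapshot output
> under $HOME/lttng-traces) would keep the raw CTF data of every snapshot,
> so we could look at the whole snapshots rather than only the babeltrace
> text dumps:
>
>     for i in $(seq 1 $LOOPS); do
>         ./tracetest
>         $LTTNG snapshot record -m 1G
>         # copy the most recent snapshot directory as-is for later inspection
>         last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>         cp -a "$last" snapshot/raw-$i
>     done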
>
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:01.014035993] (+?.?????????) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 927 }
>>> [13:37:01.098620808] (+0.000001404) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:01.172829105] (+0.000000575) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>>> Snapshot recorded successfully for session test
>>> 168
>>> [13:37:01.172021557] (+?.?????????) arn-awallin-mint-l3
>>> ust_baddr_statedump:soinfo: { cpu_id = 7 }, { baddr = 0x7FFF653FE000, sopath
>>> = "[vdso]", size = 0, mtime = -1 }
>>> [13:37:01.220074165] (+0.000000553) arn-awallin-mint-l3 tracetest:first_tp:
>>> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> Waiting for data availability
>>> Tracing stopped for session test
>>> Session test destroyed
>>>
>>> Running this on older versions of lttng creates bigger and bigger event
>>> traces until reaching the max size.
> We see this pattern here, and the max size seems to be 168 events; was it
> similar in your previous tests?
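>
> As a rough capacity check (a sketch based on the channel attributes
> listed above, assuming they apply to each per-CPU buffer and ignoring
> per-event header overhead):
>
>     # 16 sub-buffers x 4096 bytes each = 64 KiB of ring buffer per CPU
>     echo $((16 * 4096))    # 65536
>
> In overwrite mode a snapshot can never contain more than what currently
> fits in the ring buffer, so a hard ceiling on the event count is expected
> once the buffers are full.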
>
>>
>> Unless I'm mistaken, the behavior of the snapshot mode has changed
>> between 2.4 and 2.5.
>>
>> In the 2.4 series, consecutive snapshots could overlap; that is,
>> events could be present in multiple snapshots. As of 2.5, an event is
>> guaranteed to appear in only one snapshot, as the events are considered
>> to have been "consumed" at that point.
>
> Hmm, no, the behavior is still the same: consecutive snapshots can overlap.
>

Ah, alright then. Thanks for the correction!

Jérémie

> Could you tell us on which architecture you are experiencing the problem,
> and how many CPUs it has?
>
> Thanks,
>
> Julien



-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com


