[lttng-dev] snapshot issue with 2.5

Anders Wallin wallinux at gmail.com
Thu Nov 13 10:33:41 EST 2014


The bug was introduced in commit 68808f4e0bbb3adccff72ff9dab6ec9f3a9e6866.
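
For reference, a bisect sketch of how a commit like this can be pinpointed.
This assumes an lttng-tools git tree and a hypothetical wrapper script,
bisect-test.sh, that builds the tree and exits non-zero when the reproducer
below shows the capped snapshots; the good/bad tags are illustrative:

git bisect start
git bisect bad v2.5.0
git bisect good v2.4.2
git bisect run ./bisect-test.sh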

Regards

Anders

Anders Wallin

On Thu, Nov 13, 2014 at 10:39 AM, Anders Wallin <wallinux at gmail.com> wrote:

>
>
> Anders Wallin
>
> On Thu, Nov 13, 2014 at 8:43 AM, Anders Wallin <wallinux at gmail.com> wrote:
>
>> On Thu, Nov 13, 2014 at 12:11 AM, Julien Desfossez <jdesfossez at efficios.com> wrote:
>>
>>> Hi,
>>>
>>> On 14-11-12 05:30 PM, Jérémie Galarneau wrote:
>>> > On Wed, Nov 12, 2014 at 7:39 AM, Anders Wallin <wallinux at gmail.com> wrote:
>>> >> Hi,
>>> >>
>>> >> using userspace snapshot mode in 2.5 seems to be broken.
>>> >>
>>> >> I have a small test application (tracetest) generating 1000 events.
>>> >>
>>> >> Running the script (below is a simplified version of it):
>>> >>
>>> ----------------------------------------------------------------------------------
>>> >> #!/bin/bash
>>> >> LTTNG='lttng -n'
>>> >> SESSION=test        # assumed: defined in the full script (matches "create test" below)
>>> >> LOOPS=${LOOPS:-10}  # assumed default: defined in the full script
>>> >> lttng-sessiond --no-kernel &
>>> >
>>> > You should use the "-d" flag to launch the sessiond as a daemon. Using
>>> > "&" might cause the "create" command to launch a second session daemon,
>>> > since the first one would not have had the time to create its
>>> > command sockets.
>>> >
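>>> > A minimal sketch of the suggested startup (assuming lttng-tools 2.5,
>>> > where -d should let the daemon finish creating its command sockets
>>> > before the shell moves on, avoiding the race described above):
>>> >
>>> > lttng-sessiond --no-kernel -d
>>> > lttng create test --snapshot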
>>> >>
>>> >> $LTTNG create test --snapshot
>>> >> $LTTNG enable-channel -u --subbuf-size 64k --num-subbuf 16 --overwrite ch1
>>> >> $LTTNG enable-event -u -c ch1 "*"
>>> >> $LTTNG start
>>> >> $LTTNG list ch1
>>> >>
>>> >> mkdir -p snapshot/
>>> >>
>>> >> for i in $(seq 1 $LOOPS); do
>>> >>     ./tracetest
>>> >>     $LTTNG snapshot record -m 1G
>>> >>
>>> >>     # print no of events and first and last event
>>> >>     last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>>> >>     babeltrace $last > snapshot/$i.bt
>>> >>     cat snapshot/$i.bt | wc -l
>>> >>     cat snapshot/$i.bt | head -1
>>> >>     cat snapshot/$i.bt | tail -1
>>> >> done
>>> >>
>>> >> $LTTNG stop
>>> >> $LTTNG destroy $SESSION
>>> >>
>>> ----------------------------------------------------------------------------------
>>> >>
>>> >> The result of the run looks like this;
>>> >> ./test.run
>>> >> Error: consumer err socket second poll error
>>> >> Error: Health error occurred in thread_manage_consumer
>>> >> PERROR - 13:36:59.685167 [23772/23781]: bind inet: Address already in
>>> use
>>> >> (in lttcomm_bind_inet_sock() at inet.c:109)
>>> >> Warning: Another session daemon is using this JUL port. JUL support
>>> will be
>>> >> deactivated to prevent interfering with the tracing.
>>> >>
>>> >
>>> > Is there a reason why you run two session daemons? I'm guessing this
>>> > is related to my "& vs -d" comment above.
>>> >
>>> >> lttng (LTTng Trace Control) 2.5.2 - Fumisterie
>>> >> BabelTrace Trace Viewer and Converter 1.2.4
>>> >>         linux-vdso.so.1 =>  (0x00007fffba966000)
>>> >>         liblttng-ctl.so.0 =>
>>> /usr/lib/x86_64-linux-gnu/liblttng-ctl.so.0
>>> >> (0x00007fbfe4b1d000)
>>> >>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1
>>> (0x00007fbfe4915000)
>>> >>         libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2
>>> >> (0x00007fbfe45ae000)
>>> >>         libpopt.so.0 => /lib/x86_64-linux-gnu/libpopt.so.0
>>> >> (0x00007fbfe43a2000)
>>> >>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
>>> >> (0x00007fbfe4184000)
>>> >>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
>>> (0x00007fbfe3dbd000)
>>> >>         /lib64/ld-linux-x86-64.so.2 (0x00007fbfe4fac000)
>>> >>         liburcu.so.2 => /usr/lib/x86_64-linux-gnu/liburcu.so.2
>>> >> (0x00007fbfe3bb6000)
>>> >>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2
>>> (0x00007fbfe39b2000)
>>> >>         libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1
>>> (0x00007fbfe3798000)
>>> >>         liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5
>>> >> (0x00007fbfe3576000)
>>> >>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
>>> (0x00007fbfe3270000)
>>> >>
>>> >> Session test created.
>>> >> Default snapshot output set to:
>>> >> /home/awallin/lttng-traces/test-20141112-133700
>>> >> Snapshot mode set. Every channel enabled for that session will be set
>>> in
>>> >> overwrite mode and mmap output.
>>> >> UST channel ch1 enabled for session test
>>> >> UST event * created in channel ch1
>>> >> Tracing started for session test
>>> >> Tracing session test: [active snapshot]
>>> >>     Trace path:
>>> >>     Live timer interval (usec): 4294967295
>>> >>
>>> >> === Domain: UST global ===
>>> >>
>>> >> Channels:
>>> >> -------------
>>> >> - ch1: [enabled]
>>> >>
>>> >>     Attributes:
>>> >>       overwrite mode: 1
>>> >>       subbufers size: 4096
>>> >>       number of subbufers: 16
>>> >>       switch timer interval: 0
>>> >>       read timer interval: 0
>>> >>       trace file count: 0
>>> >>       trace file size (bytes): 0
>>> >>       output: mmap()
>>> >>
>>> >>     Events:
>>> >>       * (type: tracepoint) [enabled]
>>> >>
>>> >> Snapshot recorded successfully for session test
>>> >> 84
>>> >> [13:37:00.852615950] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFFC6BC7000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:00.853798949] (+0.000000805) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 157
>>> >> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>>> >> [13:37:00.894839116] (+0.000000299) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>>> >> [13:37:00.929447330] (+0.000000305) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>>
>>> Something is wrong from here:
>>> >> 168
>>> >> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:00.972035546] (+0.000000597) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:01.014060940] (+0.000000341) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:01.052555877] (+0.000000872) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> ... to here.
>>>
>>> The beginning timestamp is the same, the number of events is the same,
>>> but the end timestamp moves. We also see this pattern below.
>>> I would be really interested to see the whole snapshots when it happens.
>>> It seems like we are not overwriting the beginning of the ring buffer
>>> under some conditions. Could you send us your tracetest program and the
>>> script so we can try to reproduce the problem?
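>>>
>>> As a sketch, a quick way to summarize consecutive snapshots (it assumes
>>> the snapshot/*.bt text dumps produced by the script above):
>>>
>>> for f in snapshot/*.bt; do
>>>     printf '%s: %s events, first %s, last %s\n' "$f" \
>>>         "$(wc -l < "$f")" \
>>>         "$(head -1 "$f" | cut -d']' -f1)" \
>>>         "$(tail -1 "$f" | cut -d']' -f1)"
>>> done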
>>>
>>
>> I will append it, including a run with debug turned on and all the
>> snapshot files created.
>>
>>
>>>
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:01.014035993] (+?.?????????) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 927 }
>>> >> [13:37:01.098620808] (+0.000001404) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:01.172829105] (+0.000000575) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Snapshot recorded successfully for session test
>>> >> 168
>>> >> [13:37:01.172021557] (+?.?????????) arn-awallin-mint-l3
>>> >> ust_baddr_statedump:soinfo: { cpu_id = 7 }, { baddr = 0x7FFF653FE000,
>>> sopath
>>> >> = "[vdso]", size = 0, mtime = -1 }
>>> >> [13:37:01.220074165] (+0.000000553) arn-awallin-mint-l3
>>> tracetest:first_tp:
>>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>>> >> Waiting for data availability
>>> >> Tracing stopped for session test
>>> >> Session test destroyed
>>> >>
>>> >> the result from the run on older versions of lttng creates bigger and
>>> >> bigger event traces until reaching the max size.
>>> We see this pattern here, and the max size seems to be 168 events. Was it
>>> similar in your previous tests?
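>>> (As a rough capacity check, assuming the geometry reported by "lttng
>>> list" above: 16 subbuffers x 4096 bytes = 65536 bytes of ring buffer
>>> per stream, which is a hard upper bound on what one snapshot can hold
>>> per CPU.)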
>>>
>> If I run the same test on 2.4 or 2.3, the max number of events is much
>> bigger, ~10000. (I don't have the exact number right now; I will rebuild,
>> run the test again, and get back with traces from that run.)
>>
> For 2.4.2 the log looks like this:
> ./test.run ./tracetest
>
> lttng (LTTng Trace Control) 2.4.2 - Époque Opaque
> BabelTrace Trace Viewer and Converter 1.2.1
> linux-vdso.so.1 =>  (0x00007fff775fe000)
> liblttng-ctl.so.0 => /usr/local/lib/liblttng-ctl.so.0 (0x00002b2b016e4000)
> libpopt.so.0 => /lib/x86_64-linux-gnu/libpopt.so.0 (0x00002b2b0190f000)
> libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> (0x00002b2b01b1b000)
> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002b2b01d39000)
> librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00002b2b020ff000)
> liburcu.so.3 => /usr/local/lib/liburcu.so.3 (0x00002b2b02307000)
> /lib64/ld-linux-x86-64.so.2 (0x00002b2b014bf000)
>
> Session test created.
> Default snapshot output set to: //lttng-traces/test-20141113-093611
> Snapshot mode set. Every channel enabled for that session will be set in
> overwrite mode and mmap output.
> UST channel ch1 enabled for session test
> UST event * created in channel ch1
> Tracing started for session test
> Tracing session test: [active snapshot]
>     Trace path:
>
> === Domain: UST global ===
>
> Buffer type: per UID
>
> Channels:
> -------------
> - ch1: [enabled]
>
>     Attributes:
>       overwrite mode: 1
>       subbufers size: 4096
>       number of subbufers: 16
>       switch timer interval: 0
>       read timer interval: 0
>       output: mmap()
>
>     Events:
>       * (type: tracepoint) [enabled]
>
> Snapshot recorded successfully for session test
> 1000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:11.985857254] (+0.000000501) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 2000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.033066130] (+0.000001663) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 3000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.086888121] (+0.000001273) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 4000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.147378251] (+0.000001257) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 5000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.225584696] (+0.000000528) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 6000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.288311172] (+0.000000571) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 7000
> [09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.375789307] (+0.000001238) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 7000
> [09:36:12.032314084] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.460756779] (+0.000001248) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 8000
> [09:36:12.032314084] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.540464090] (+0.000001292) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
> Snapshot recorded successfully for session test
> 9000
> [09:36:12.032314084] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
> [09:36:12.621992931] (+0.000001298) 07f8cf84a38f tracetest:first_tp: {
> cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
> Waiting for data availability
> Tracing stopped for session test
> Session test destroyed
>
>
>>
>>> >
>>> > Unless I'm mistaken, the behavior of the snapshot mode has changed
>>> > between 2.4 and 2.5.
>>> >
>>> > In the 2.4 series, consecutive snapshots could overlap; that is,
>>> > events could be present in multiple snapshots. As of 2.5, an event is
>>> > guaranteed to appear in only one snapshot, as the events are considered
>>> > to have been "consumed" at that point.
>>>
>>> Hum, no, the behavior is still the same: consecutive snapshots can
>>> overlap.
>>>
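>>> As a sketch, one way to quantify the overlap between two consecutive
>>> snapshots (it assumes the snapshot/*.bt dumps from the script; lines
>>> present in both files are events that were captured twice):
>>>
>>> comm -12 <(sort snapshot/1.bt) <(sort snapshot/2.bt) | wc -l
>>>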
>>> Could you tell us which architecture you are experiencing the problem on,
>>> and how many CPUs it has?
>>>
>> The appended traces are from my laptop running Ubuntu 14.04 with 8 CPUs:
>> model name      : Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
>> I have also run this on a 16-core ARM and on a PowerPC.
>>
>>>
>>> Thanks,
>>>
>>> Julien
>>>
>>
>>
>

