[lttng-dev] snapshot issue with 2.5

Anders Wallin wallinux at gmail.com
Thu Nov 13 04:39:33 EST 2014



On Thu, Nov 13, 2014 at 8:43 AM, Anders Wallin <wallinux at gmail.com> wrote:

> On Thu, Nov 13, 2014 at 12:11 AM, Julien Desfossez <
> jdesfossez at efficios.com> wrote:
>
>> Hi,
>>
>> On 14-11-12 05:30 PM, Jérémie Galarneau wrote:
>> > On Wed, Nov 12, 2014 at 7:39 AM, Anders Wallin <wallinux at gmail.com>
>> wrote:
>> >> Hi,
>> >>
>> >> using userspace snapshot mode in 2.5 seems to be broken.
>> >>
>> >> I have a small, trivial application (tracetest) that generates 1000 events.
>> >>
>> >> Running the script (below is a simplified version of the script):
>> >>
>> ----------------------------------------------------------------------------------
>> >> #!/bin/bash
>> >> LTTNG='lttng -n'
>> >> SESSION=test
>> >> LOOPS=10   # assumed value; the full script sets LOOPS elsewhere
>> >> lttng-sessiond --no-kernel &
>> >
>> > You should use the "-d" flag to launch the sessiond as a daemon. Using
>> > "&" might cause the "create" command to launch a second session daemon,
>> > since the first one would not have had the time to create its
>> > command sockets.
>> >
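A minimal sketch of the suggested change to the script (assuming the same
session name as above); with "-d" the session daemon detaches itself, so the
following "create" should not end up spawning a second daemon:

    # daemonize instead of backgrounding with "&", as suggested above
    lttng-sessiond -d --no-kernel
    lttng create test --snapshot
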
>> >>
>> >> $LTTNG create test --snapshot
>> >> $LTTNG enable-channel -u --subbuf-size 64k --num-subbuf 16 --overwrite ch1
>> >> $LTTNG enable-event -u -c ch1 "*"
>> >> $LTTNG start
>> >> $LTTNG list ch1
>> >>
>> >> mkdir -p snapshot/
>> >>
>> >> for i in $(seq 1 $LOOPS); do
>> >>     ./tracetest
>> >>     $LTTNG snapshot record -m 1G
>> >>
>> >>     # print no of events and first and last event
>> >>     last=$(ls -1drt $HOME/lttng-traces/$SESSION*/* | tail -1)
>> >>     babeltrace $last > snapshot/$i.bt
>> >>     cat snapshot/$i.bt | wc -l
>> >>     cat snapshot/$i.bt | head -1
>> >>     cat snapshot/$i.bt | tail -1
>> >> done
>> >>
>> >> $LTTNG stop
>> >> $LTTNG destroy $SESSION
>> >>
>> ----------------------------------------------------------------------------------
>> >>
>> >> The result of the run looks like this;
>> >> ./test.run
>> >> Error: consumer err socket second poll error
>> >> Error: Health error occurred in thread_manage_consumer
>> >> PERROR - 13:36:59.685167 [23772/23781]: bind inet: Address already in
>> use
>> >> (in lttcomm_bind_inet_sock() at inet.c:109)
>> >> Warning: Another session daemon is using this JUL port. JUL support
>> will be
>> >> deactivated to prevent interfering with the tracing.
>> >>
>> >
>> > Is there a reason why you run two session daemons? I'm guessing this
>> > is related to my "& vs -d" comment above.
>> >
>> >> lttng (LTTng Trace Control) 2.5.2 - Fumisterie
>> >> BabelTrace Trace Viewer and Converter 1.2.4
>> >>         linux-vdso.so.1 =>  (0x00007fffba966000)
>> >>         liblttng-ctl.so.0 =>
>> /usr/lib/x86_64-linux-gnu/liblttng-ctl.so.0
>> >> (0x00007fbfe4b1d000)
>> >>         librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1
>> (0x00007fbfe4915000)
>> >>         libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2
>> >> (0x00007fbfe45ae000)
>> >>         libpopt.so.0 => /lib/x86_64-linux-gnu/libpopt.so.0
>> >> (0x00007fbfe43a2000)
>> >>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
>> >> (0x00007fbfe4184000)
>> >>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
>> (0x00007fbfe3dbd000)
>> >>         /lib64/ld-linux-x86-64.so.2 (0x00007fbfe4fac000)
>> >>         liburcu.so.2 => /usr/lib/x86_64-linux-gnu/liburcu.so.2
>> >> (0x00007fbfe3bb6000)
>> >>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2
>> (0x00007fbfe39b2000)
>> >>         libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1
>> (0x00007fbfe3798000)
>> >>         liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5
>> >> (0x00007fbfe3576000)
>> >>         libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
>> (0x00007fbfe3270000)
>> >>
>> >> Session test created.
>> >> Default snapshot output set to:
>> >> /home/awallin/lttng-traces/test-20141112-133700
>> >> Snapshot mode set. Every channel enabled for that session will be set
>> in
>> >> overwrite mode and mmap output.
>> >> UST channel ch1 enabled for session test
>> >> UST event * created in channel ch1
>> >> Tracing started for session test
>> >> Tracing session test: [active snapshot]
>> >>     Trace path:
>> >>     Live timer interval (usec): 4294967295
>> >>
>> >> === Domain: UST global ===
>> >>
>> >> Channels:
>> >> -------------
>> >> - ch1: [enabled]
>> >>
>> >>     Attributes:
>> >>       overwrite mode: 1
>> >>       subbufers size: 4096
>> >>       number of subbufers: 16
>> >>       switch timer interval: 0
>> >>       read timer interval: 0
>> >>       trace file count: 0
>> >>       trace file size (bytes): 0
>> >>       output: mmap()
>> >>
>> >>     Events:
>> >>       * (type: tracepoint) [enabled]
>> >>
>> >> Snapshot recorded successfully for session test
>> >> 84
>> >> [13:37:00.852615950] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFFC6BC7000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:00.853798949] (+0.000000805) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 157
>> >> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>> >> [13:37:00.894839116] (+0.000000299) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:00.853735195] (+?.?????????) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 927 }
>> >> [13:37:00.929447330] (+0.000000305) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>>
>> Something is wrong from here:
>> >> 168
>> >> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:00.972035546] (+0.000000597) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:01.014060940] (+0.000000341) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:00.894411117] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 4 }, { baddr = 0x7FFF6BF81000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:01.052555877] (+0.000000872) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> ... to here.
>>
>> The beginning timestamp is the same, the number of events is the same,
>> but the end timestamp moves. We also see this pattern below.
>> I would be really interested to see the whole snapshots when this happens.
>> It seems like we are not overwriting the beginning of the ring buffer
>> under some conditions. Could you send us your tracetest program and the
>> script so we can try to reproduce the problem?
>>
>
> I will append it, including a run with debug turned on and all the snapshot
> files created.
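
For what it's worth, a small sketch (reusing the snapshot/<i>.bt dumps the
script above already writes) that counts how many identical event lines two
consecutive snapshots share; a never-overwritten beginning of the ring buffer
would show up as a large, constant overlap:

    # compare consecutive babeltrace dumps; comm -12 keeps lines common to both
    for i in $(seq 2 $LOOPS); do
        prev=$((i - 1))
        common=$(comm -12 <(sort "snapshot/$prev.bt") <(sort "snapshot/$i.bt") | wc -l)
        echo "snapshot $prev -> $i: $common events in common"
    done
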
>
>
>>
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:01.014035993] (+?.?????????) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 927 }
>> >> [13:37:01.098620808] (+0.000001404) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:01.129839830] (+0.000000533) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:01.097087388] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 6 }, { baddr = 0x7FFF4BFFE000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:01.172829105] (+0.000000575) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Snapshot recorded successfully for session test
>> >> 168
>> >> [13:37:01.172021557] (+?.?????????) arn-awallin-mint-l3
>> >> ust_baddr_statedump:soinfo: { cpu_id = 7 }, { baddr = 0x7FFF653FE000,
>> sopath
>> >> = "[vdso]", size = 0, mtime = -1 }
>> >> [13:37:01.220074165] (+0.000000553) arn-awallin-mint-l3
>> tracetest:first_tp:
>> >> { cpu_id = 6 }, { my_string_field = "test", my_integer_field = 999 }
>> >> Waiting for data availability
>> >> Tracing stopped for session test
>> >> Session test destroyed
>> >>
>> >> the run on older versions of lttng creates bigger and bigger
>> >> event traces until reaching the max size.
>> We see this pattern here, and the max size seems to be 168 events; was it
>> similar in your previous tests?
>>
> If I run the same test on 2.4 or 2.3, the max number of events is much
> bigger, around ~10000.
> (I don't have the exact number right now; I will rebuild, run the test
> again, and get back with traces from that run.)
>
For 2.4.2 the log looks like this:
./test.run ./tracetest

lttng (LTTng Trace Control) 2.4.2 - Époque Opaque
BabelTrace Trace Viewer and Converter 1.2.1
linux-vdso.so.1 =>  (0x00007fff775fe000)
liblttng-ctl.so.0 => /usr/local/lib/liblttng-ctl.so.0 (0x00002b2b016e4000)
libpopt.so.0 => /lib/x86_64-linux-gnu/libpopt.so.0 (0x00002b2b0190f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
(0x00002b2b01b1b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002b2b01d39000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00002b2b020ff000)
liburcu.so.3 => /usr/local/lib/liburcu.so.3 (0x00002b2b02307000)
/lib64/ld-linux-x86-64.so.2 (0x00002b2b014bf000)

Session test created.
Default snapshot output set to: //lttng-traces/test-20141113-093611
Snapshot mode set. Every channel enabled for that session will be set in
overwrite mode and mmap output.
UST channel ch1 enabled for session test
UST event * created in channel ch1
Tracing started for session test
Tracing session test: [active snapshot]
    Trace path:

=== Domain: UST global ===

Buffer type: per UID

Channels:
-------------
- ch1: [enabled]

    Attributes:
      overwrite mode: 1
      subbufers size: 4096
      number of subbufers: 16
      switch timer interval: 0
      read timer interval: 0
      output: mmap()

    Events:
      * (type: tracepoint) [enabled]

Snapshot recorded successfully for session test
1000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:11.985857254] (+0.000000501) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
2000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.033066130] (+0.000001663) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
3000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.086888121] (+0.000001273) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
4000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.147378251] (+0.000001257) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
5000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.225584696] (+0.000000528) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
6000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.288311172] (+0.000000571) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
7000
[09:36:11.985319272] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.375789307] (+0.000001238) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
7000
[09:36:12.032314084] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.460756779] (+0.000001248) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 4 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
8000
[09:36:12.032314084] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.540464090] (+0.000001292) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 7 }, { my_string_field = "test", my_integer_field = 999 }
Snapshot recorded successfully for session test
9000
[09:36:12.032314084] (+?.?????????) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 5 }, { my_string_field = "test", my_integer_field = 0 }
[09:36:12.621992931] (+0.000001298) 07f8cf84a38f tracetest:first_tp: {
cpu_id = 5 }, { my_string_field = "test", my_integer_field = 999 }
Waiting for data availability
Tracing stopped for session test
Session test destroyed


>
>> >
>> > Unless I'm mistaken, the behavior of the snapshot mode has changed
>> > between 2.4 and 2.5.
>> >
>> > In the 2.4 series, consecutive snapshots could overlap; that is,
>> > events could be present in multiple snapshots. As of 2.5, an event is
>> > guaranteed to appear in only one snapshot, as the events are considered
>> > to have been "consumed" at that point.
>>
>> Hmm, no, the behavior is still the same: consecutive snapshots can overlap.
>>
>> Could you tell us on which architecture you are experiencing the problem
>> and the number of CPUs?
>>
> The appended traces are from my laptop running Ubuntu 14.04 with 8 CPUs
> (model name: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz).
> I have also run this on an ARM with 16 cores and on a PowerPC.
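
In case it helps others report the same thing, a quick sketch of shell
commands that gather the details asked for above (the /proc/cpuinfo field
name varies by architecture):

    uname -m                               # architecture
    nproc                                  # number of online CPUs
    grep -m1 'model name' /proc/cpuinfo    # CPU model (x86 field name)
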
>
>>
>> Thanks,
>>
>> Julien
>>
>
>