[lttng-dev] large liblttng-ust startup overhead (ust_lock)

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Wed Sep 6 19:57:11 UTC 2017


----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff at kdab.com wrote:

> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff at kdab.com wrote:
>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff at kdab.com
> wrote:
>> >> > Hey all,
>> >> > 
>> >> > I have noticed a very large overhead when linking against liblttng-ust:
>> >> > 
>> >> > ~~~~~
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ cat lttng-test.c
>> >> > int main()
>> >> > {
>> >> >     return 0;
>> >> > }
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl lttng-test.c
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> > 
>> >> >       0.209587  task-clock (msec)  #  0.596 CPUs utilized     ( +-  8.76% )
>> >> >              0  context-switches   #  0.000 K/sec
>> >> >              0  cpu-migrations     #  0.000 K/sec
>> >> >             49  page-faults        #  0.235 M/sec             ( +-  1.19% )
>> >> >        706,854  cycles             #  3.373 GHz               ( +-  8.82% )
>> >> >        773,603  instructions       #  1.09  insn per cycle    ( +-  0.75% )
>> >> >        147,128  branches           #  701.987 M/sec           ( +-  0.70% )
>> >> >          4,096  branch-misses      #  2.78% of all branches   ( +-  5.27% )
>> >> > 
>> >> >    0.000351422 seconds time elapsed                           ( +- 11.85% )
>> >> > 
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> > 
>> >> >       2.063040  task-clock (msec)  #  0.009 CPUs utilized     ( +-  1.37% )
>> >> >             44  context-switches   #  0.021 M/sec             ( +-  1.95% )
>> >> >              2  cpu-migrations     #  0.776 K/sec             ( +- 25.00% )
>> >> >            209  page-faults        #  0.101 M/sec             ( +-  0.34% )
>> >> >      7,053,686  cycles             #  3.419 GHz               ( +-  2.03% )
>> >> >      6,893,783  instructions       #  0.98  insn per cycle    ( +-  0.25% )
>> >> >      1,342,492  branches           #  650.735 M/sec           ( +-  0.20% )
>> >> >         29,390  branch-misses      #  2.19% of all branches   ( +-  0.61% )
>> >> > 
>> >> >    0.225597302 seconds time elapsed                           ( +-  6.68% )
>> >> > ~~~~~
>> >> > 
>> >> > This is without any LTTng session configured. If I enable LTTng kernel
>> >> > and
>> >> > userspace events, this becomes even worse:
>> >> > 
>> >> > ~~~~~
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ cat $(which run_lttng_trace.sh)
>> >> > #!/bin/sh
>> >> > 
>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> >> >     sudo lttng-sessiond --daemonize
>> >> > fi
>> >> > 
>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c kernel -k -a
>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>> >> > sudo lttng start
>> >> > "$@"
>> >> > sudo lttng stop
>> >> > 
>> >> > sudo chmod a+rx -R ~/lttng-traces
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> >> > Session auto-20170905-162818 created.
>> >> > Traces will be written in
>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
>> >> > kernel
>> >> > enabled for session auto-20170905-162818
>> >> > All Kernel events are enabled in channel kernel
>> >> > UST channel ust enabled for session auto-20170905-162818
>> >> > UST event lttng_ust_tracef:* created in channel ust
>> >> > Tracing started for session auto-20170905-162818
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> > 
>> >> >       4.937820  task-clock (msec)  #  0.005 CPUs utilized     ( +-  1.28% )
>> >> >            117  context-switches   #  0.024 M/sec             ( +-  0.70% )
>> >> >              3  cpu-migrations     #  0.608 K/sec             ( +- 23.57% )
>> >> >            245  page-faults        #  0.050 M/sec             ( +-  0.20% )
>> >> >     16,541,355  cycles             #  3.350 GHz               ( +-  0.94% )
>> >> >     20,611,637  instructions       #  1.25  insn per cycle    ( +-  0.23% )
>> >> >      3,580,525  branches           #  725.123 M/sec           ( +-  0.19% )
>> >> >         52,093  branch-misses      #  1.45% of all branches   ( +-  1.73% )
>> >> > 
>> >> >    0.965545882 seconds time elapsed                           ( +- 16.96% )
>> >> > 
>> >> > Waiting for data availability...
>> >> > Tracing stopped for session auto-20170905-162818
>> >> > ~~~~~
>> >> > 
>> >> > I see this ~1s delay during startup in nearly all test applications where
>> >> > I wanted to use lttng-ust. Both in Trace Compass and according to perf
>> >> > sleep time profiling, this seems to be mostly due to extensive sleeping
>> >> > on mutex locks (from ust_lock in liblttng-ust). With perf trace, I also
>> >> > see membarrier taking a lot of time:
>> >> > 
>> >> > ~~~~~
>> >> > # without a running lttng session:
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ perf trace --duration 1 ./a.out
>> >> > 
>> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
>> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|PRIV, val: 2) = 0
>> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >> > 
>> >> > # with a running lttng session:
>> >> > ┌milian at milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> >> > 
>> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a0b0) = 612
>> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
>> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
>> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a070) = 1
>> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> >> >   610.020 (6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
>> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
>> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
>> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
>> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> >> > ~~~~~
>> >> > 
>> >> > This makes lttng-ust pretty much unusable for any short-lived
>> >> > application. Is this a known limitation?
>> >> 
>> >> This kind of delay is not completely unexpected, but it is a bit higher
>> >> than I would expect. I'm interested in figuring out whether most of this
>> >> delay is caused by sys_membarrier, either directly or indirectly (through
>> >> a lock).
>> > 
>> > See my other email which adds some harder numbers from a perf sleep
>> > record.
>> > 
>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
>> >> > also seen this problem on Ubuntu 17.04, though I didn't do any of the
>> >> > above measurements there. So far, I had suspected the ust statedump of
>> >> > taking too much time, but I disabled it for the measurements above.
>> >> > 
>> >> > Any help would be appreciated.
>> >> 
>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>> >> share the performance results?
>> > 
>> > I'd have to recompile the kernel, which is something I don't like to do.
>> > Why is this required, or put differently - why would the kernel option
>> > influence the userspace LTTng library?
>> 
>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) use the membarrier
>> system call to speed up tracing, at the expense of slower execution of
>> tracing state updates.
>> 
>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
>> (single-process) use cases, which should execute much faster than the
>> non-expedited SHARED command currently implemented.
>> 
>> liburcu 0.9 and 0.10 detect that this system call is available and use it
>> when they can.
> 
> May I ask whether you have such a configuration available? The test case I
> have submitted should be easy to reproduce for everyone. Compiling my own
> kernel takes a huge amount of time, which I'm quite frankly not willing to
> invest at this point.

We have reproduced the situation, and it points towards our use of the
sys_membarrier system call. We'll try to narrow down whether it comes from
its use by lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).

You can try patching liburcu so that it does not use sys_membarrier at all
(see the attached patch).
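
For reference, here is a minimal standalone sketch (a hypothetical helper,
not part of liburcu and not the attached patch) that times a single
non-expedited membarrier call. The "membarrier(cmd: 1)" lines in your perf
trace output correspond to MEMBARRIER_CMD_SHARED, which waits for a
scheduler (RCU-sched) grace period across all CPUs, so per-call latencies
in the tens of milliseconds are plausible:

~~~~~
/* membarrier-timing.c: build with `gcc -O2 -o membarrier-timing membarrier-timing.c` */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

int main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* cmd 1 in the perf trace output above: MEMBARRIER_CMD_SHARED. */
	if (syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0) < 0) {
		perror("membarrier");	/* e.g. ENOSYS if the kernel lacks it */
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("one MEMBARRIER_CMD_SHARED call: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}
~~~~~

Comparing its output on your machine with a kernel built with
CONFIG_MEMBARRIER=n should show how much of the startup delay the per-call
cost alone can account for.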

> 
> I simply wonder whether anyone else is actually using lttng-ust, or, if
> they are, whether they only use it for long-lived applications where this
> overhead is not noticeable?

It appears that this application startup slowdown mainly occurs with LTTng
linked against liburcu 0.9 and 0.10 running on rather recent kernels. This
would explain why you are the first across our user base to report it.
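
If you want to check whether a given machine is in that configuration, a
small query sketch (again just an illustration, nothing shipped with
liburcu) shows whether the kernel advertises the command at all. On kernels
older than 4.3, or built with CONFIG_MEMBARRIER=n, the query fails with
ENOSYS, and liburcu then simply does not use the system call:

~~~~~
/* membarrier-query.c: build with `gcc -O2 -o membarrier-query membarrier-query.c` */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

int main(void)
{
	/* MEMBARRIER_CMD_QUERY returns a bitmask of the supported commands. */
	long mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0) {
		/* ENOSYS: kernel < 4.3 or CONFIG_MEMBARRIER=n. */
		perror("membarrier");
		return 1;
	}
	printf("MEMBARRIER_CMD_SHARED supported: %s\n",
	       (mask & MEMBARRIER_CMD_SHARED) ? "yes" : "no");
	return 0;
}
~~~~~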

Thanks,

Mathieu


> 
> Thanks
> 
> --
> Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: liburcu-no-sys-mb.patch
Type: text/x-patch
Size: 799 bytes
Desc: not available
URL: <https://lists.lttng.org/pipermail/lttng-dev/attachments/20170906/42733c0b/attachment.bin>

