[lttng-dev] large liblttng-ust startup overhead (ust_lock)
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Wed Sep 6 21:22:36 UTC 2017
BTW, your expedited commit hit mainline earlier this week. Here is
hoping! ;-)
Thanx, Paul
On Wed, Sep 06, 2017 at 08:23:40PM +0000, Mathieu Desnoyers wrote:
> ----- On Sep 6, 2017, at 3:57 PM, Mathieu Desnoyers mathieu.desnoyers at efficios.com wrote:
>
> > ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wolff at kdab.com wrote:
> >
> >> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wolff at kdab.com wrote:
> >>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wolff at kdab.com wrote:
> >>> >> > Hey all,
> >>> >> >
> >>> >> > I have noticed a very large overhead when linking against liblttng-ust:
> >>> >> >
> >>> >> > ~~~~~
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ cat lttng-test.c
> >>> >> > int main()
> >>> >> > {
> >>> >> >
> >>> >> > return 0;
> >>> >> >
> >>> >> > }
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> >
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> >
> >>> >> >       0.209587  task-clock (msec)  #   0.596 CPUs utilized    ( +-  8.76% )
> >>> >> >              0  context-switches   #   0.000 K/sec
> >>> >> >              0  cpu-migrations     #   0.000 K/sec
> >>> >> >             49  page-faults        #   0.235 M/sec            ( +-  1.19% )
> >>> >> >        706,854  cycles             #   3.373 GHz              ( +-  8.82% )
> >>> >> >        773,603  instructions       #   1.09  insn per cycle   ( +-  0.75% )
> >>> >> >        147,128  branches           # 701.987 M/sec            ( +-  0.70% )
> >>> >> >          4,096  branch-misses      #   2.78% of all branches  ( +-  5.27% )
> >>> >> >
> >>> >> >    0.000351422 seconds time elapsed                           ( +- 11.85% )
> >>> >> >
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> >
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> >
> >>> >> >       2.063040  task-clock (msec)  #   0.009 CPUs utilized    ( +-  1.37% )
> >>> >> >             44  context-switches   #   0.021 M/sec            ( +-  1.95% )
> >>> >> >              2  cpu-migrations     #   0.776 K/sec            ( +- 25.00% )
> >>> >> >            209  page-faults        #   0.101 M/sec            ( +-  0.34% )
> >>> >> >      7,053,686  cycles             #   3.419 GHz              ( +-  2.03% )
> >>> >> >      6,893,783  instructions       #   0.98  insn per cycle   ( +-  0.25% )
> >>> >> >      1,342,492  branches           # 650.735 M/sec            ( +-  0.20% )
> >>> >> >         29,390  branch-misses      #   2.19% of all branches  ( +-  0.61% )
> >>> >> >
> >>> >> >    0.225597302 seconds time elapsed                           ( +-  6.68% )
> >>> >> > ~~~~~
> >>> >> >
> >>> >> > This is without any LTTng session configured. If I enable LTTng kernel
> >>> >> > and
> >>> >> > userspace events, this becomes even worse:
> >>> >> >
> >>> >> > ~~~~~
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ cat $(which run_lttng_trace.sh)
> >>> >> > #!/bin/sh
> >>> >> >
> >>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >>> >> >     sudo lttng-sessiond --daemonize
> >>> >> > fi
> >>> >> >
> >>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c kernel -k -a
> >>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >>> >> > sudo lttng start
> >>> >> > $@
> >>> >> > sudo lttng stop
> >>> >> > sudo chmod a+rx -R ~/lttng-traces
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >>> >> > Session auto-20170905-162818 created.
> >>> >> > Traces will be written in /home/milian/lttng-traces/2017-09-05T16:28:18+02:00
> >>> >> > Kernel channel kernel enabled for session auto-20170905-162818
> >>> >> > All Kernel events are enabled in channel kernel
> >>> >> > UST channel ust enabled for session auto-20170905-162818
> >>> >> > UST event lttng_ust_tracef:* created in channel ust
> >>> >> > Tracing started for session auto-20170905-162818
> >>> >> >
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> >
> >>> >> >       4.937820  task-clock (msec)  #   0.005 CPUs utilized    ( +-  1.28% )
> >>> >> >            117  context-switches   #   0.024 M/sec            ( +-  0.70% )
> >>> >> >              3  cpu-migrations     #   0.608 K/sec            ( +- 23.57% )
> >>> >> >            245  page-faults        #   0.050 M/sec            ( +-  0.20% )
> >>> >> >     16,541,355  cycles             #   3.350 GHz              ( +-  0.94% )
> >>> >> >     20,611,637  instructions       #   1.25  insn per cycle   ( +-  0.23% )
> >>> >> >      3,580,525  branches           # 725.123 M/sec            ( +-  0.19% )
> >>> >> >         52,093  branch-misses      #   1.45% of all branches  ( +-  1.73% )
> >>> >> >
> >>> >> >    0.965545882 seconds time elapsed                           ( +- 16.96% )
> >>> >> >
> >>> >> > Waiting for data availability...
> >>> >> > Tracing stopped for session auto-20170905-162818
> >>> >> > ~~~~~
> >>> >> >
> >>> >> > I see this ~1s delay during startup in nearly all test applications where
> >>> >> > I wanted to use lttng-ust. In tracecompass, and according to perf sleep
> >>> >> > time profiling, this seems to be mostly due to extensive sleeping on mutex
> >>> >> > locks (from ust_lock in liblttng-ust). With perf trace, I also see
> >>> >> > membarrier taking a lot of time:
> >>> >> >
> >>> >> > ~~~~~
> >>> >> > # without a running lttng session:
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ perf trace --duration 1 ./a.out
> >>> >> >
> >>> >> >     6.492 ( 52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
> >>> >> >     5.077 ( 54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >>> >> >    59.598 ( 79.379 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >   138.984 ( 39.945 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|PRIV, val: 2) = 0
> >>> >> >   179.045 ( 79.918 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >   258.971 ( 39.997 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >   299.052 ( 79.883 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >   378.942 ( 59.988 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >   439.022 ( 69.908 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >   508.937 ( 80.027 ms): a.out/23671 membarrier(cmd: 1) = 0
> >>> >> >
> >>> >> > # with a running lttng session:
> >>> >> > ┌milian at milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >>> >> >
> >>> >> >     6.666 ( 392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a0b0) = 612
> >>> >> >   400.238 (  68.452 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
> >>> >> >   468.698 (  39.983 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
> >>> >> >   510.567 (  98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a070) = 1
> >>> >> >     6.467 ( 603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >>> >> >   610.020 (   6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
> >>> >> >   616.446 (  62.265 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >   678.721 (  59.916 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >   616.332 ( 122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
> >>> >> >   738.976 (  49.658 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >   788.642 (  49.987 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >   739.014 (  99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
> >>> >> >   838.651 (  49.977 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >   888.636 (  69.983 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >   838.662 ( 119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
> >>> >> >   958.739 (  59.873 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1018.621 (  80.058 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1098.782 (  49.844 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1148.634 (  59.998 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1208.731 (  69.940 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1278.680 (  79.947 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1358.727 (  39.881 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >  1398.616 (  39.980 ms): a.out/23789 membarrier(cmd: 1) = 0
> >>> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
> >>> >> > ~~~~~
> >>> >> >
> >>> >> > This makes lttng-ust pretty much unusable for any short-lived
> >>> >> > application.
> >>> >> > Is this a known limitation?
> >>> >>
> >>> >> This kind of delay is not completely unexpected, but a bit higher than I
> >>> >> would expect. I'm interested in figuring out if most of this delay
> >>> >> happens
> >>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
> >>> >> lock).
> >>> >
> >>> > See my other email which adds some harder numbers from a perf sleep
> >>> > record.
> >>> >
> >>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >>> >> > I've
> >>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >>> >> > above measurements there. So far, I suspected the ust statedump to take
> >>> >> > too much time, but I disabled that one for the measurements above.
> >>> >> >
> >>> >> > Any help would be appreciated
> >>> >>
> >>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >>> >> share the performance results ?
> >>> >
> >>> > I'd have to recompile the kernel, which is something I don't like to do.
> >>> > Why is this required, or put differently - why would the kernel option
> >>> > influence the userspace LTTng library?
> >>>
> >>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> >>> system call to speed up tracing, at the expense of slower execution of
> >>> tracing state update.
> >>>
> >>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
> >>> (single-process) use cases which should execute much faster than the
> >>> non-expedited SHARED command currently implemented.
> >>>
> >>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> >>> it when it can.
> >>
> >> May I ask whether you have such a configuration available? The test case I
> >> have submitted should be easy to reproduce for everyone. Compiling my own
> >> kernel version takes a huge amount of time which I'm quite frankly not willing
> >> to invest at this point.
> >
> > We have reproduced the situation, and it's pointing towards our use of the
> > sys_membarrier system call. We'll try narrowing it down between its use by
> > lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
> >
> > You can try to patch liburcu so it won't try using sys_membarrier at all
> > (see attached patch)
> >
> >>
> >> I simply wonder whether anyone else is actually using lttng-ust, or if they
> >> are, whether they are only doing that for long-lived applications where this
> >> overhead is not noticeable?
> >
> > It appears that this application startup slowdown mainly appears with
> > LTTng linked against liburcu 0.9 and 0.10 running with rather recent kernels.
> > This would explain why you are the first one to report this across our user
> > base.
>
> We have confirmed that it's the use of sys_membarrier SHARED by liburcu-bp,
> used by lttng-ust, that causes those delays.
>
> lttng-ust's usage pattern is to call synchronize_rcu() directly. Having
> membarrier system calls each time adds up to significant delays, because
> membarrier SHARED blocks until a scheduler switch happens on each core.
> We invoke synchronize_rcu() directly rather than using the call_rcu
> batching facilities because we want to minimize the number of threads we
> have to insert into the application.
>
> I'm currently proposing a PRIVATE_EXPEDITED command for the membarrier
> system call in Linux. It will actively send IPIs rather than blocking.
>
> I'm going to push a "Fix" in liburcu master, 0.10, and 0.9 that will make
> liburcu-bp stop using sys_membarrier. Then, after the new PRIVATE_EXPEDITED
> command makes its way into the upstream Linux kernel, a future liburcu
> (0.11 or later) will start using that new command, given that it has
> a much, much smaller impact than the SHARED command.
>
> The reason we want sys_membarrier there is to speed up tracing: we can
> trade memory barriers for simple compiler barriers on the tracing
> fast-path by using sys_membarrier on the update side.
>
> Thanks for reporting your finding!
>
> Mathieu
>
>
> >
> > Thanks,
> >
> > Mathieu
> >
> >
> >>
> >> Thanks
> >>
> >> --
> >> Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
> >> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> >> Tel: +49-30-521325470
> >> KDAB - The Qt Experts
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>