[lttng-dev] large liblttng-ust startup overhead (ust_lock)

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Tue Sep 5 18:20:10 UTC 2017


----- On Sep 5, 2017, at 11:10 AM, Milian Wolff milian.wolff at kdab.com wrote:

> On Tuesday, September 5, 2017 4:52:42 PM CEST Milian Wolff wrote:
>> On Tuesday, September 5, 2017 4:34:54 PM CEST Milian Wolff wrote:
>> > Hey all,
>> > 
>> > I have noticed a very large overhead when linking against liblttng-ust:
>> > 
>> > ~~~~~
>> > ┌milian at milian-kdab2:/tmp
>> > └$ cat lttng-test.c
>> > int main()
>> > {
>> >   return 0;
>> > }
>> > ┌milian at milian-kdab2:/tmp
>> > └$ gcc -O2 -g -ldl lttng-test.c
>> > ┌milian at milian-kdab2:/tmp
>> > └$ perf stat -r 5 ./a.out
>> > 
>> >  Performance counter stats for './a.out' (5 runs):
>> >
>> >           0.209587      task-clock (msec)         #    0.596 CPUs utilized            ( +-  8.76% )
>> >                  0      context-switches          #    0.000 K/sec
>> >                  0      cpu-migrations            #    0.000 K/sec
>> >                 49      page-faults               #    0.235 M/sec                    ( +-  1.19% )
>> >            706,854      cycles                    #    3.373 GHz                      ( +-  8.82% )
>> >            773,603      instructions              #    1.09  insn per cycle           ( +-  0.75% )
>> >            147,128      branches                  #  701.987 M/sec                    ( +-  0.70% )
>> >              4,096      branch-misses             #    2.78% of all branches          ( +-  5.27% )
>> >
>> >        0.000351422 seconds time elapsed                                          ( +- 11.85% )
>> > 
>> > ┌milian at milian-kdab2:/tmp
>> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> > ┌milian at milian-kdab2:/tmp
>> > └$ perf stat -r 5 ./a.out
>> > 
>> >  Performance counter stats for './a.out' (5 runs):
>> >
>> >           2.063040      task-clock (msec)         #    0.009 CPUs utilized            ( +-  1.37% )
>> >                 44      context-switches          #    0.021 M/sec                    ( +-  1.95% )
>> >                  2      cpu-migrations            #    0.776 K/sec                    ( +- 25.00% )
>> >                209      page-faults               #    0.101 M/sec                    ( +-  0.34% )
>> >          7,053,686      cycles                    #    3.419 GHz                      ( +-  2.03% )
>> >          6,893,783      instructions              #    0.98  insn per cycle           ( +-  0.25% )
>> >          1,342,492      branches                  #  650.735 M/sec                    ( +-  0.20% )
>> >             29,390      branch-misses             #    2.19% of all branches          ( +-  0.61% )
>> >
>> >        0.225597302 seconds time elapsed                                          ( +-  6.68% )
>> > ~~~~~
>> > 
>> > This is without any LTTng session configured. If I enable LTTng kernel and
>> > userspace events, this becomes even worse:
>> > 
>> > ~~~~~
>> > ┌milian at milian-kdab2:/tmp
>> > └$ cat $(which run_lttng_trace.sh)
>> > #!/bin/sh
>> >
>> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> >     sudo lttng-sessiond --daemonize
>> > fi
>> >
>> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> > sudo lttng enable-event -c kernel -k -a
>> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> > sudo lttng enable-event -c ust -u 'lttng_ust_tracef:*'
>> > sudo lttng start
>> > "$@"
>> > sudo lttng stop
>> >
>> > sudo chmod a+rx -R ~/lttng-traces
>> > ┌milian at milian-kdab2:/tmp
>> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> > Session auto-20170905-162818 created.
>> > Traces will be written in
>> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel kernel
>> > enabled for session auto-20170905-162818
>> > All Kernel events are enabled in channel kernel
>> > UST channel ust enabled for session auto-20170905-162818
>> > UST event lttng_ust_tracef:* created in channel ust
>> > Tracing started for session auto-20170905-162818
>> > 
>> >  Performance counter stats for './a.out' (5 runs):
>> >
>> >           4.937820      task-clock (msec)         #    0.005 CPUs utilized            ( +-  1.28% )
>> >                117      context-switches          #    0.024 M/sec                    ( +-  0.70% )
>> >                  3      cpu-migrations            #    0.608 K/sec                    ( +- 23.57% )
>> >                245      page-faults               #    0.050 M/sec                    ( +-  0.20% )
>> >         16,541,355      cycles                    #    3.350 GHz                      ( +-  0.94% )
>> >         20,611,637      instructions              #    1.25  insn per cycle           ( +-  0.23% )
>> >          3,580,525      branches                  #  725.123 M/sec                    ( +-  0.19% )
>> >             52,093      branch-misses             #    1.45% of all branches          ( +-  1.73% )
>> >
>> >        0.965545882 seconds time elapsed                                          ( +- 16.96% )
>> > 
>> > Waiting for data availability...
>> > Tracing stopped for session auto-20170905-162818
>> > ~~~~~
>> > 
>> > I see this ~1s delay during startup in nearly all test applications where I
>> > wanted to use lttng-ust. In tracecompass, and according to perf sleep time
>> > profiling, this seems to be mostly due to extensive sleeping on mutex locks
>> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>> > taking a lot of time:
>> > 
>> > ~~~~~
>> > # without a running lttng session:
>> > ┌milian at milian-kdab2:/tmp
>> > └$ perf trace --duration 1 ./a.out
>> > 
>> >      6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
>> >      5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> >     59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >    138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >     59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: WAIT|PRIV, val: 2) = 0
>> >    179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >    258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >    299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >    378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >    439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1) = 0
>> >    508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1) = 0
>> > 
>> > # with a running lttng session:
>> > ┌milian at milian-kdab2:/tmp
>> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> > 
>> >      6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a0b0) = 612
>> >    400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
>> >    468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1) = 0
>> >    510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>, msg: 0x7fcd81d8a070) = 1
>> >      6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> >    610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
>> >    616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >    678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >    616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, op: WAIT|PRIV, val: 2) = 0
>> >    738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >    788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >    739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
>> >    838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >    888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >    838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op: WAIT|PRIV, val: 2) = 0
>> >    958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >   1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1) = 0
>> >      6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> > ~~~~~
>> > 
>> > This makes lttng-ust pretty much unusable for any short-lived application.
>> > Is this a known limitation?
>> > 
>> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. I've
>> > seen this problem also on Ubuntu 17.04 though I didn't do any of the above
>> > measurements there. So far, I suspected the ust statedump to take too much
>> > time, but I disabled that one for the measurements above.
>> > 
>> > Any help would be appreciated.
>> 
>> Here's the perf sleep record data when no session is configured:
>> 
>> ~~~~~
>> 100% = 848.063336375 ms
>> 
>> # Overhead      Pid:Command    Shared Object     Symbol          Source:Line
>> # ........  .................  ................  ..............  ...........
>> #
>>     76.50%    10336:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>>                futex_wait_queue_me futex.c:0
>>                futex_wait futex.c:0
>>                do_futex
>>                sys_futex
>>                do_syscall_64
>>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                syscall
>>                ust_listener_thread futex.h:65
>>                /usr/include/urcu/futex.h:65 (inline)
>>                /usr/include/urcu/futex.h:97 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1311 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1373 (inline)
>> 
>>     15.79%    10335:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>> 
>>                 --15.70%--futex_wait_queue_me futex.c:0
>>                           futex_wait futex.c:0
>>                           do_futex
>>                           sys_futex
>>                           do_syscall_64
>>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                           __lll_lock_wait
>>                           pthread_mutex_lock
>> 
>>                            --15.69%--0x2338
>>                                      0x3b29
>>                                      __nptl_deallocate_tsd.part.4 pthread_create.c:0
>>                                      start_thread pthread_create.c:0
>>                                      __clone
>> 
>>      7.72%    10334:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>>                futex_wait_queue_me futex.c:0
>>                futex_wait futex.c:0
>>                do_futex
>>                sys_futex
>>                do_syscall_64
>>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                do_futex_wait sem_timedwait.c:0
>>                __new_sem_wait_slow sem_timedwait.c:0
>>                lttng_ust_init.part.3 lttng-ust-comm.c:1757
>>                call_init.part.0 dl-init.c:0
>>                _dl_init
>>                _dl_start_user
>> 
>> ~~~~~
>> 
>> Here's the perf sleep record data for an existing session:
>> 
>> ~~~~~
>> 100% = 3.0793293265 s
>> 
>> # Overhead      Pid:Command    Shared Object     Symbol          Source:Line
>> # ........  .................  ................  ..............  ...........
>> #
>>     52.10%    10762:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>>                futex_wait_queue_me futex.c:0
>>                futex_wait futex.c:0
>>                do_futex
>>                sys_futex
>>                do_syscall_64
>>                return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                syscall
>>                ust_listener_thread futex.h:65
>>                /usr/include/urcu/futex.h:65 (inline)
>>                /usr/include/urcu/futex.h:97 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1311 (inline)
>>                /home/milian/projects/src/lttng/lttng-ust/liblttng-ust/lttng-ust-comm.c:1373 (inline)
>> 
>>     27.09%    10761:a.out-ust  [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>> 
>>                |--13.69%--schedule_timeout
>>                |
>>                |          unix_stream_read_generic af_unix.c:0
>>                |          unix_stream_recvmsg af_unix.c:0
>>                |          sock_recvmsg
>>                |          ___sys_recvmsg socket.c:0
>>                |          __sys_recvmsg
>>                |          sys_recvmsg
>>                |          do_syscall_64
>>                |          return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                |          __libc_recvmsg
>>                |          ustcomm_recv_unix_sock lttng-ust-comm.c:307
>> 
>>                 --13.40%--futex_wait_queue_me futex.c:0
>>                           futex_wait futex.c:0
>>                           do_futex
>>                           sys_futex
>>                           do_syscall_64
>>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>>                           __lll_lock_wait
>>                           pthread_mutex_lock
>> 
>>                           |--8.41%--0x2338
>>                           |
>>                           |          0x3b29
>>                           |          __nptl_deallocate_tsd.part.4 pthread_create.c:0
>>                           |          start_thread pthread_create.c:0
>>                           |          __clone
>> 
>>                            --4.99%--ust_lock_nocheck lttng-ust-comm.c:178
>> 
>>     20.81%    10760:a.out      [kernel.vmlinux]  [k] __schedule  core.c:0
>> 
>>             ---__schedule core.c:0
>>                schedule
>> 
>>                 --20.81%--futex_wait_queue_me futex.c:0
>>                           futex_wait futex.c:0
>>                           do_futex
>>                           sys_futex
>>                           do_syscall_64
>>                           return_from_SYSCALL_64 .tmp_entry_64.o:0
>> 
>>                            --20.77%--do_futex_wait sem_timedwait.c:0
>>                                      __new_sem_wait_slow sem_timedwait.c:0
>>                                      lttng_ust_init.part.3 lttng-ust-comm.c:1757
>>                                      call_init.part.0 dl-init.c:0
>>                                      _dl_init
>>                                      _dl_start_user
>> 
>> ~~~~~
>> 
>> Hope that helps
> 
> One thing that is apparent from the above, and can also be confirmed by
> adding some output to the application before it quits: it's not the main
> thread that's blocked initially. Rather, the main thread spins up the
> secondary -ust threads, then runs its code and finishes before the -ust
> threads manage to initialize. Before the main thread quits, it waits for
> the -ust threads to finish.
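> 
> A minimal way to confirm this (a sketch; the file and binary names here are
> made up):
> 
> ~~~~~
> # have the app print when main() returns, then compare against wall time
> cat > lttng-test2.c <<'EOF'
> #include <stdio.h>
> int main(void)
> {
>     puts("main done"); /* printed well before the process exits */
>     return 0;
> }
> EOF
> gcc -O2 -g -ldl -llttng-ust lttng-test2.c -o lttng-test2
> time ./lttng-test2  # "main done" appears quickly; the remaining wall time
>                     # is spent waiting on the -ust threads at exit
> ~~~~~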

Those threads appear to be talking to the session daemon (on recvmsg). Please
note that there are two cases:

- If you use per-pid buffers (chosen explicitly at channel creation), the
  consumer daemon needs to allocate buffers on behalf of each and every new
  process that appears while tracing, which takes a significant amount of time.
- If you use the default buffers (per-uid), only the first process that
  registers for a specific UID requires the consumer daemon to allocate
  buffers. After that, the same buffers are sent to other applications
  registering for the same user ID.

Clearly, per-pid buffers incur a significant slowdown at application start. This
is one reason why they are not used by default.
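
For reference, a minimal sketch of how the two schemes are selected at channel
creation (session setup omitted; the channel names are arbitrary, and
--buffers-uid/--buffers-pid are the relevant lttng enable-channel flags for
UST channels):

~~~~~
# per-uid buffers (the default): shared by all processes of the same UID,
# so only the first registration pays the allocation cost
lttng enable-channel -u --buffers-uid ust-uid

# per-pid buffers (opt-in): a fresh set of buffers is allocated for every
# traced process, slowing down each application start
lttng enable-channel -u --buffers-pid ust-pid
~~~~~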

If you benchmark per-uid buffers, you will want to measure these two scenarios:

1) startup time of the first process that registers for a UID,

2) startup time of subsequent process registrations.

Technically, the buffer allocation delay only impacts the first process that
registers for a UID, not all of them.
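
A rough sketch of such a benchmark, reusing the a.out test program from above
(session and channel names are arbitrary; this assumes a running root session
daemon):

~~~~~
lttng create bench-uid
lttng enable-channel -u --buffers-uid ust
lttng enable-event -u -c ust 'lttng_ust_tracef:*'
lttng start

# scenario 1: the first registration for this UID allocates the buffers
perf stat ./a.out

# scenario 2: later registrations reuse the existing per-uid buffers
perf stat -r 5 ./a.out

lttng destroy bench-uid
~~~~~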

Thanks,

Mathieu



> 
> --
> Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

