[lttng-dev] liblttng-ust overhead, take 2

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Thu Sep 7 17:55:22 UTC 2017


----- On Sep 7, 2017, at 1:06 PM, Milian Wolff milian.wolff at kdab.com wrote:

> On Thursday, September 7, 2017 6:29:58 PM CEST Jérémie Galarneau wrote:
>> On 7 September 2017 at 04:47, Milian Wolff <milian.wolff at kdab.com> wrote:
>> > Hey all,
>> > 
>> > after my latest report on liblttng-ust overhead yielded such quick
>> > results, I would like to follow up with a more realistic workload
>> > that I'm looking at.
>> > 
>> > We have two applications: one sets up an SQLite database and then
>> > launches multiple instances of a client application, which try to read
>> > from and write to that database concurrently. Of course, this will show
>> > file-level lock contention. I'd like to use LTTng to visualize the effect.
>> > 
>> > On my machine I get the following numbers.
>> > 
>> > 1) Without LTTng and without UST:
>> > ~~~~~
>> > 
>> >  Performance counter stats for './launcher/lab_latency_launcher':
>> > 
>> >         495.075367      task-clock (msec)         #    0.150 CPUs utilized
>> >              2,340      context-switches          #    0.005 M/sec
>> >                291      cpu-migrations            #    0.588 K/sec
>> >             52,265      page-faults               #    0.106 M/sec
>> >      3,098,760,038      cycles                    #    6.259 GHz
>> >      1,894,074,743      instructions              #    0.61  insn per cycle
>> >        354,869,636      branches                  #  716.799 M/sec
>> >         10,317,266      branch-misses             #    2.91% of all branches
>> > 
>> >        3.298337574 seconds time elapsed
>> > 
>> > ~~~~~~
>> > 
>> > 2) With LTTng (all kernel events enabled) and without UST:
>> > ~~~~~
>> > 
>> >  Performance counter stats for 'run_lttng_trace.sh ./launcher/lab_latency_launcher':
>> > 
>> >         296.329694      task-clock (msec)         #    0.068 CPUs utilized
>> >              1,808      context-switches          #    0.006 M/sec
>> >                398      cpu-migrations            #    0.001 M/sec
>> >             53,493      page-faults               #    0.181 M/sec
>> >      2,411,871,165      cycles                    #    8.139 GHz
>> >      2,370,005,487      instructions              #    0.98  insn per cycle
>> >        425,284,747      branches                  # 1435.174 M/sec
>> >         10,442,662      branch-misses             #    2.46% of all branches
>> > 
>> >        4.338938821 seconds time elapsed
>> > 
>> > ~~~~~~
>> > 
>> > 3) With LTTng (all kernel events enabled) and with UST `tracef()` events
>> > enabled (~400 UST events in total):
>> 
>> We need more data to make sense of those numbers.
>> 
>> What does the payload of those events look like? Using tracef(),
>> string formatting will most likely dominate over the time taken to
>> write to the buffers.
> 
> I attached the complete source code for you to reproduce this yourself. Also,
> string formatting _does not_ dominate the time, otherwise I'd see more CPU
> utilization (cf. perf stat output's first line).
> 
>> How many clients are you launching? What is their expected lifetime?
> 
> 100 clients, 10 in parallel, each taking a couple of milliseconds. See the
> source code for yourself.

I've tried your program here, but I cannot reproduce the overhead you describe.
What am I doing wrong? (FYI, I use the master branches of lttng-ust, lttng-modules,
lttng-tools, and liburcu.) This is on an Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz,
Linux 4.9.0-3-amd64 4.9.30-2+deb9u2 (Debian).

* build-no-ust (baseline, no tracing):

time ./launcher/launcher > /dev/null
real	0m1.752s
user	0m0.264s
sys	0m0.172s

* build-no-ust (with kernel syscalls traced):
[ using lttng enable-event -k --syscall -a ]
time ./launcher/launcher > /dev/null
real	0m1.700s
user	0m0.212s
sys	0m0.236s

* build-no-ust (with kernel tracepoints traced):
[ using lttng enable-event -k --tracepoint -a ]
time ./launcher/launcher > /dev/null
real	0m2.031s
user	0m0.248s
sys	0m0.332s

-> this overhead is expected given that we are tracing lots of
   kernel events. Selecting only the specific events of interest
   would help make this overhead smaller, for instance:
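   [ e.g.  lttng enable-event -k --syscall openat,read,write,close,fcntl
     and   lttng enable-event -k sched_switch,sched_wakeup ]
   (a hypothetical selection; adapt the event list to what you actually
    want to visualize, e.g. the file-level lock contention)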

* build-no-ust (with all kernel events (tracepoints+syscalls)):
[ using lttng enable-event -k -a ]
time ./launcher/launcher > /dev/null
real	0m2.031s
user	0m0.256s
sys	0m0.332s


* build-with-ust (with tracef UST event)
[ using lttng enable-event -u lttng_ust_tracef:event ]

time ./launcher/launcher > /dev/null
real	0m1.525s
user	0m0.432s
sys	0m0.360s

lttng view | wc -l
400

-> 400 events collected.
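
For reference, on the application side each of those events comes from a
tracef() call. A minimal sketch of such a call site (not your actual code;
the function name and payload below are made up) would look like:

    #include <lttng/tracef.h>   /* from lttng-ust; link with -llttng-ust */

    /* Hypothetical instrumentation point: one UST event per client
     * transaction. tracef() does the printf-style formatting in user
     * space and records the resulting string as the payload of the
     * lttng_ust_tracef:event tracepoint. */
    static void trace_transaction(int client_id, const char *op, double elapsed_ms)
    {
            tracef("client %d: %s took %f ms", client_id, op, elapsed_ms);
    }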

* build-with-ust (with tracef UST event and all kernel events)
[ using: lttng enable-event -u lttng_ust_tracef:event 
    and  lttng enable-event -k -a ]

time ./launcher/launcher > /dev/null
real	0m2.428s
user	0m0.460s
sys	0m0.892s

The main difference I see between your setup and mine is that I use
"time" instead of perf to run it. Also, I run lttng-sessiond as root;
is that the case for you as well?

Also, please note that lttng-modules exposes a /proc/lttng-logger
file, into which you can write from user-space; whatever you write
ends up in the kernel trace. This could do the trick if for some
reason lttng-ust does not work for your scenario. However, the
lttng-logger proc file is just there as a convenience for scripts,
and is by no means a "fast" tracing interface.
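
For example, something along these lines (an untested sketch) on the
application side, with the matching kernel event enabled through
"lttng enable-event -k lttng_logger":

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Write one message into the kernel trace through the lttng-logger
     * proc file exposed by lttng-modules. */
    static void lttng_log(const char *msg)
    {
            int fd = open("/proc/lttng-logger", O_WRONLY);

            if (fd < 0)
                    return; /* lttng-modules not loaded */
            (void) write(fd, msg, strlen(msg));
            close(fd);
    }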

Thanks,

Mathieu

> 
> Thanks
> 
> --
> Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

