[lttng-dev] liblttng-ust overhead, take 2

Thu Sep 7 08:47:46 UTC 2017

Hey all,

after my latest report over liblttng-ust overhead yielded such quick results, 
I would like to expand on this with a more realistic workload that I'm looking 
at.

We have two applications, one that sets up an SQLite database and then 
launches multiple client applications that try to read and write concurrently 
from that database. Of course, this will show file-level lock contention. I'd 
like to use LTTng to visualize the effect.

On my machine I get the following numbers.

1) Without LTTng and without UST:
~~~~~
 Performance counter stats for './launcher/lab_latency_launcher':

        495.075367      task-clock (msec)         #    0.150 CPUs utilized          
             2,340      context-switches          #    0.005 M/sec                  
               291      cpu-migrations            #    0.588 K/sec                  
            52,265      page-faults               #    0.106 M/sec                  
     3,098,760,038      cycles                    #    6.259 GHz                    
     1,894,074,743      instructions              #    0.61  insn per cycle         
       354,869,636      branches                  #  716.799 M/sec                  
        10,317,266      branch-misses             #    2.91% of all branches        

       3.298337574 seconds time elapsed
~~~~~~

2) With LTTng (all kernel events enabled) and without UST:
~~~~~
 Performance counter stats for 'run_lttng_trace.sh ./launcher/
lab_latency_launcher':

        296.329694      task-clock (msec)         #    0.068 CPUs utilized          
             1,808      context-switches          #    0.006 M/sec                  
               398      cpu-migrations            #    0.001 M/sec                  
            53,493      page-faults               #    0.181 M/sec                  
     2,411,871,165      cycles                    #    8.139 GHz                    
     2,370,005,487      instructions              #    0.98  insn per cycle         
       425,284,747      branches                  # 1435.174 M/sec                  
        10,442,662      branch-misses             #    2.46% of all branches        

       4.338938821 seconds time elapsed
~~~~~~

3) With LTTng (all kernel events enabled) and with UST `tracef()` events 
enabled (~400 UST events in total):

~~~~~~
 Performance counter stats for './launcher/lab_latency_launcher':

        796.595231      task-clock (msec)         #    0.069 CPUs utilized          
            17,728      context-switches          #    0.022 M/sec                  
               429      cpu-migrations            #    0.539 K/sec                  
            85,737      page-faults               #    0.108 M/sec                  
     4,483,999,258      cycles                    #    5.629 GHz                    
     5,112,633,073      instructions              #    1.14  insn per cycle         
       985,315,773      branches                  # 1236.909 M/sec                  
        14,993,555      branch-misses             #    1.52% of all branches        

      11.505288968 seconds time elapsed
~~~~~~

4) This is running against the patched userspace-urcu. Without the performance 
fix there, the performance is even worse:

~~~~~~
 Performance counter stats for 'run_lttng_trace.sh ./launcher/
lab_latency_launcher':

        952.959433      task-clock (msec)         #    0.019 CPUs utilized          
             6,021      context-switches          #    0.006 M/sec                  
               954      cpu-migrations            #    0.001 M/sec                  
            69,310      page-faults               #    0.073 M/sec                  
     5,335,988,506      cycles                    #    5.599 GHz                    
     6,031,685,882      instructions              #    1.13  insn per cycle         
     1,093,132,635      branches                  # 1147.093 M/sec                  
        16,090,093      branch-misses             #    1.47% of all branches        

      50.786464678 seconds time elapsed
~~~~~

So, I don't think we need to talk about 4) anymore. But I do think that 
comparing 3) to 2) begs for clarifications. Just linking against lttng-ust 
grinds the workload to a halt. The ~400 UST events cannot explain this at all. 
Additionally, the timeline visualization of the processes becomes much harder, 
as the UST overhead seems to dominate. This becomes apparent when comparing 
the 2) and 3) data in tracecompass visually...

Could anyone look into this please? As-is, I cannot recommend using lttng-ust. 
Rather, I'd opt for "faking" the custom trace points by e.g. doing something 
like open("/UST-FAKE/$payload-here"). This also allows me to "annotate" the 
kernel trace, a bit clumsily, but at least it has extremely low overhead when 
compared to UST...

You can find a toy Qt 5 application to reproduce my measurements above here:

https://swanson.kdab.com/owncloud/index.php/s/FGI2PZNvwzoR4RZ

Unpack it, then build it with:

~~~~~
cd ust-slowness
mkdir build-with-ust build-no-ust
cd build-with-ust
qmake -r CONFIG+=with_lttng_ust ..
make
cd ../build-no-ust
qmake -r ..
make
~~~~~~

Then you can reproduce the issue by comparing runs of the `launcher` binary in 
the two build folders. One will be quick, the other will be dog slow. I hope 
you guys find more optimization opportunities here.

If not, then shouldn't it be possible to get something like tracef() simulated 
via a custom syscall that gets directly handled by the kernel LTTng module? 
This would hopefully be just as fast as using only the kernel tracing, while 
removing the overhead of UST tracing... Just an idea.

Cheers

-- 
Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3826 bytes
Desc: not available
URL: <https://lists.lttng.org/pipermail/lttng-dev/attachments/20170907/fde21ff1/attachment.bin>