[lttng-dev] liblttng-ust overhead, take 2
Milian Wolff
milian.wolff at kdab.com
Thu Sep 7 08:47:46 UTC 2017
Hey all,
after my latest report on liblttng-ust overhead yielded such quick results,
I would like to follow up with a more realistic workload that I'm looking at.
We have two applications: a launcher that sets up an SQLite database and then
starts multiple instances of a client application, which all read from and
write to that database concurrently. Of course, this exhibits file-level lock
contention, and I'd like to use LTTng to visualize the effect.
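For context, each client process does, in spirit, something like the following. This is only a minimal sketch against the sqlite3 C API to illustrate the access pattern; the actual test application is Qt-based, and the file, table and value names here are made up:
~~~~~
/* Sketch of one client: open the shared database file and hammer it with
 * small write transactions, so that several concurrent clients contend on
 * SQLite's file-level lock. Names are illustrative only. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db = NULL;
    if (sqlite3_open("shared.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    sqlite3_busy_timeout(db, 10000); /* wait instead of failing on SQLITE_BUSY */
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS log(value INTEGER)",
                 NULL, NULL, NULL);

    for (int i = 0; i < 1000; ++i) {
        char *err = NULL;
        /* each iteration takes and releases the database's write lock */
        if (sqlite3_exec(db,
                         "BEGIN IMMEDIATE;"
                         "INSERT INTO log(value) VALUES(42);"
                         "COMMIT;",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "exec failed: %s\n", err);
            sqlite3_free(err);
        }
    }
    sqlite3_close(db);
    return 0;
}
~~~~~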
On my machine I get the following numbers.
1) Without LTTng and without UST:
~~~~~
Performance counter stats for './launcher/lab_latency_launcher':
495.075367 task-clock (msec) # 0.150 CPUs utilized
2,340 context-switches # 0.005 M/sec
291 cpu-migrations # 0.588 K/sec
52,265 page-faults # 0.106 M/sec
3,098,760,038 cycles # 6.259 GHz
1,894,074,743 instructions # 0.61 insn per cycle
354,869,636 branches # 716.799 M/sec
10,317,266 branch-misses # 2.91% of all branches
3.298337574 seconds time elapsed
~~~~~~
2) With LTTng (all kernel events enabled) and without UST:
~~~~~
Performance counter stats for 'run_lttng_trace.sh ./launcher/lab_latency_launcher':
296.329694 task-clock (msec) # 0.068 CPUs utilized
1,808 context-switches # 0.006 M/sec
398 cpu-migrations # 0.001 M/sec
53,493 page-faults # 0.181 M/sec
2,411,871,165 cycles # 8.139 GHz
2,370,005,487 instructions # 0.98 insn per cycle
425,284,747 branches # 1435.174 M/sec
10,442,662 branch-misses # 2.46% of all branches
4.338938821 seconds time elapsed
~~~~~~
3) With LTTng (all kernel events enabled) and with UST `tracef()` events
enabled (~400 UST events in total):
~~~~~~
Performance counter stats for './launcher/lab_latency_launcher':
796.595231 task-clock (msec) # 0.069 CPUs utilized
17,728 context-switches # 0.022 M/sec
429 cpu-migrations # 0.539 K/sec
85,737 page-faults # 0.108 M/sec
4,483,999,258 cycles # 5.629 GHz
5,112,633,073 instructions # 1.14 insn per cycle
985,315,773 branches # 1236.909 M/sec
14,993,555 branch-misses # 1.52% of all branches
11.505288968 seconds time elapsed
~~~~~~
4) The above was run against the patched userspace-rcu (liburcu). Without the
performance fix there, the results are even worse:
~~~~~~
Performance counter stats for 'run_lttng_trace.sh ./launcher/lab_latency_launcher':
952.959433 task-clock (msec) # 0.019 CPUs utilized
6,021 context-switches # 0.006 M/sec
954 cpu-migrations # 0.001 M/sec
69,310 page-faults # 0.073 M/sec
5,335,988,506 cycles # 5.599 GHz
6,031,685,882 instructions # 1.13 insn per cycle
1,093,132,635 branches # 1147.093 M/sec
16,090,093 branch-misses # 1.47% of all branches
50.786464678 seconds time elapsed
~~~~~
So I don't think we need to talk about 4) anymore. But comparing 3) to 2) does
call for an explanation: merely linking against lttng-ust grinds the workload
to a halt, and the ~400 UST events cannot explain this at all.
Additionally, the timeline visualization of the processes becomes much harder
to interpret, as the UST overhead seems to dominate. This becomes apparent when
comparing the 2) and 3) data visually in Trace Compass...
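For reference, the UST side is nothing exotic: just plain tracef() calls around the interesting operations, roughly of the following shape. This is only a sketch; the function and message names are invented, the real call sites are in the example project linked below:
~~~~~
/* Sketch of the kind of UST instrumentation used: printf-style tracef()
 * calls around the interesting operations. Names are made up. */
#include <lttng/tracef.h>

void write_record(int client_id, int record)
{
    tracef("client %d: begin write of record %d", client_id, record);
    /* ... actual SQLite write goes here ... */
    tracef("client %d: end write of record %d", client_id, record);
}
~~~~~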
Could anyone look into this, please? As-is, I cannot recommend using lttng-ust.
Instead, I'd rather "fake" the custom trace points, e.g. by doing something
like open("/UST-FAKE/$payload-here"). That also lets me "annotate" the
kernel trace, a bit clumsily, but with extremely low overhead compared to
UST...
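To illustrate, the "fake tracepoint" workaround amounts to something like the sketch below. The path prefix and payload strings are arbitrary; the open() is expected to fail, its only purpose is to make the payload show up in the kernel's syscall trace:
~~~~~
/* Sketch of the open() trick: encode the payload into a path under a
 * non-existent prefix. The call fails with ENOENT, but the path (and thus
 * the payload) is recorded by kernel syscall tracing, so it can be used to
 * annotate the kernel trace without involving lttng-ust at all. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void fake_tracepoint(const char *payload)
{
    char path[256];
    snprintf(path, sizeof(path), "/UST-FAKE/%s", payload);
    int fd = open(path, O_RDONLY); /* fails, but is visible in the trace */
    if (fd >= 0)
        close(fd); /* just in case the path unexpectedly exists */
}

int main(void)
{
    fake_tracepoint("client-1-begin-write");
    /* ... do the work ... */
    fake_tracepoint("client-1-end-write");
    return 0;
}
~~~~~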
You can find a toy Qt 5 application to reproduce my measurements above here:
https://swanson.kdab.com/owncloud/index.php/s/FGI2PZNvwzoR4RZ
Unpack it, then build it with:
~~~~~
cd ust-slowness
mkdir build-with-ust build-no-ust
cd build-with-ust
qmake -r CONFIG+=with_lttng_ust ..
make
cd ../build-no-ust
qmake -r ..
make
~~~~~~
Then you can reproduce the issue by comparing runs of the `launcher` binary in
the two build folders. One will be quick, the other will be dog slow. I hope
you guys find more optimization opportunities here.
If not, then shouldn't it be possible to simulate something like tracef() via a
custom syscall that is handled directly by the LTTng kernel module? That would
hopefully be just as fast as pure kernel tracing, while removing the overhead
of UST tracing... Just an idea.
Cheers
--
Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts