[lttng-dev] liblttng-ust overhead, take 2
Jérémie Galarneau
jeremie.galarneau at efficios.com
Thu Sep 7 16:29:58 UTC 2017
On 7 September 2017 at 04:47, Milian Wolff <milian.wolff at kdab.com> wrote:
> Hey all,
>
> after my latest report about liblttng-ust overhead yielded such quick results,
> I would like to expand on it with a more realistic workload that I'm looking
> at.
>
> We have two applications: one sets up an SQLite database and then launches
> multiple instances of the other, which read from and write to that database
> concurrently. Naturally, this exhibits file-level lock contention, and I'd
> like to use LTTng to visualize that effect.
>
> On my machine I get the following numbers.
>
> 1) Without LTTng and without UST:
> ~~~~~
> Performance counter stats for './launcher/lab_latency_launcher':
>
> 495.075367 task-clock (msec) # 0.150 CPUs utilized
> 2,340 context-switches # 0.005 M/sec
> 291 cpu-migrations # 0.588 K/sec
> 52,265 page-faults # 0.106 M/sec
> 3,098,760,038 cycles # 6.259 GHz
> 1,894,074,743 instructions # 0.61 insn per cycle
> 354,869,636 branches # 716.799 M/sec
> 10,317,266 branch-misses # 2.91% of all branches
>
> 3.298337574 seconds time elapsed
> ~~~~~
>
> 2) With LTTng (all kernel events enabled) and without UST:
> ~~~~~
> Performance counter stats for 'run_lttng_trace.sh ./launcher/lab_latency_launcher':
>
> 296.329694 task-clock (msec) # 0.068 CPUs utilized
> 1,808 context-switches # 0.006 M/sec
> 398 cpu-migrations # 0.001 M/sec
> 53,493 page-faults # 0.181 M/sec
> 2,411,871,165 cycles # 8.139 GHz
> 2,370,005,487 instructions # 0.98 insn per cycle
> 425,284,747 branches # 1435.174 M/sec
> 10,442,662 branch-misses # 2.46% of all branches
>
> 4.338938821 seconds time elapsed
> ~~~~~
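>
> (For context, "all kernel events enabled" means a session set up roughly
> like the following. This is a sketch only; the actual run_lttng_trace.sh
> is in the tarball linked below and may differ in its details:)
>
> ~~~~~
> lttng create overhead-test
> lttng enable-event --kernel --all
> # for run 3 below, additionally:
> # lttng enable-event --userspace 'lttng_ust_tracef:*'
> lttng start
> ./launcher/lab_latency_launcher
> lttng stop
> lttng destroy
> ~~~~~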
>
> 3) With LTTng (all kernel events enabled) and with UST `tracef()` events
> enabled (~400 UST events in total):
>
We need more data to make sense of those numbers.
What does the payload of those events look like? Using tracef(), the
string formatting will most likely dominate over the time taken to
write to the buffers.
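
For what it's worth: whenever the lttng_ust_tracef:* event is enabled,
tracef() runs a printf-style formatting pass at the call site on every hit
before the resulting string is copied into the ring buffer. Roughly, with a
hypothetical payload:

~~~~~
#include <lttng/tracef.h>

static void report_query(int client_id, int elapsed_ms)
{
	/* The formatting happens here, in the traced process,
	 * on every hit while the event is enabled. */
	tracef("client %d: query took %d ms", client_id, elapsed_ms);
}
~~~~~

If the payload is structured, a static tracepoint records the fields in
binary form and skips the formatting entirely. A minimal provider sketch,
with hypothetical provider and field names:

~~~~~
/* tp.h */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER lab_latency

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./tp.h"

#if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _TP_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
	lab_latency,
	query_done,
	TP_ARGS(int, client_id, int, elapsed_ms),
	TP_FIELDS(
		ctf_integer(int, client_id, client_id)
		ctf_integer(int, elapsed_ms, elapsed_ms)
	)
)

#endif /* _TP_H */

#include <lttng/tracepoint-event.h>
~~~~~

The call site then becomes tracepoint(lab_latency, query_done, client_id,
elapsed_ms), which boils down to a check on the enable flag plus a buffer
write.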
How many clients are you launching? What is their expected lifetime?
Thanks,
Jérémie
> ~~~~~
> Performance counter stats for './launcher/lab_latency_launcher':
>
> 796.595231 task-clock (msec) # 0.069 CPUs utilized
> 17,728 context-switches # 0.022 M/sec
> 429 cpu-migrations # 0.539 K/sec
> 85,737 page-faults # 0.108 M/sec
> 4,483,999,258 cycles # 5.629 GHz
> 5,112,633,073 instructions # 1.14 insn per cycle
> 985,315,773 branches # 1236.909 M/sec
> 14,993,555 branch-misses # 1.52% of all branches
>
> 11.505288968 seconds time elapsed
> ~~~~~
>
> 4) This is running against the patched userspace-urcu. Without the performance
> fix there, the performance is even worse:
>
> ~~~~~
> Performance counter stats for 'run_lttng_trace.sh ./launcher/lab_latency_launcher':
>
> 952.959433 task-clock (msec) # 0.019 CPUs utilized
> 6,021 context-switches # 0.006 M/sec
> 954 cpu-migrations # 0.001 M/sec
> 69,310 page-faults # 0.073 M/sec
> 5,335,988,506 cycles # 5.599 GHz
> 6,031,685,882 instructions # 1.13 insn per cycle
> 1,093,132,635 branches # 1147.093 M/sec
> 16,090,093 branch-misses # 1.47% of all branches
>
> 50.786464678 seconds time elapsed
> ~~~~~
>
> So, I don't think we need to talk about 4) anymore. But I do think that
> comparing 3) to 2) calls for clarification. Just linking against lttng-ust
> grinds the workload to a halt; the ~400 UST events cannot explain this at all.
> Additionally, the timeline visualization of the processes becomes much harder
> to read, as the UST overhead seems to dominate. This becomes apparent when
> comparing the 2) and 3) data visually in Trace Compass...
>
> Could anyone look into this, please? As-is, I cannot recommend using
> lttng-ust. Rather, I'd opt for "faking" the custom tracepoints by doing
> something like open("/UST-FAKE/$payload-here"), as sketched below. This also
> lets me "annotate" the kernel trace, a bit clumsily, but with extremely low
> overhead compared to UST...
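>
> (What I have in mind is roughly the following; a hypothetical sketch. The
> open() is expected to fail, but the kernel tracer still records the path
> argument of the openat syscall:)
>
> ~~~~~
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
>
> /* Encode the payload in the path of an open() that fails with ENOENT.
>  * The syscall event in the kernel trace carries the path, so this acts
>  * as a crude but very cheap trace annotation. */
> static void fake_ust_event(const char *payload)
> {
>         char path[256];
>         int fd;
>
>         snprintf(path, sizeof(path), "/UST-FAKE/%s", payload);
>         fd = open(path, O_RDONLY);
>         if (fd >= 0)
>                 close(fd); /* only if the path unexpectedly exists */
> }
> ~~~~~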
>
> You can find a toy Qt 5 application to reproduce my measurements above here:
>
> https://swanson.kdab.com/owncloud/index.php/s/FGI2PZNvwzoR4RZ
>
> Unpack it, then build it with:
>
> ~~~~~
> cd ust-slowness
> mkdir build-with-ust build-no-ust
> cd build-with-ust
> qmake -r CONFIG+=with_lttng_ust ..
> make
> cd ../build-no-ust
> qmake -r ..
> make
> ~~~~~
>
> Then you can reproduce the issue by comparing runs of the `launcher` binary
> in the two build folders, as shown below. One will be quick, the other will
> be dog slow. I hope you guys find more optimization opportunities here.
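>
> (i.e. something like this, assuming the binary ends up under launcher/ in
> each build directory, as the perf output above suggests:)
>
> ~~~~~
> cd build-no-ust      && perf stat ./launcher/lab_latency_launcher
> cd ../build-with-ust && perf stat ./launcher/lab_latency_launcher
> ~~~~~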
>
> If not, then shouldn't it be possible to simulate something like tracef()
> via a custom syscall that is handled directly by the LTTng kernel module?
> That would hopefully be about as fast as kernel-only tracing, while removing
> the overhead of UST tracing... Just an idea.
>
> Cheers
>
> --
> Milian Wolff | milian.wolff at kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
--
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com