[lttng-dev] What is the cost of user-space tracepoint() ?

ChitlurSrinivasa, Venkatesh Venkatesh.Babu at netapp.com
Fri Sep 5 18:32:57 EDT 2014


Hi Mathieu,


  Thanks for the quick response.


  First issue is that Debian EGLIBC 2.13​, tries to use vsyscall and in newer kernel (3.x) it is deprecated (see http://lwn.net/Articles/446528/). Though there is a bootparm, 'vsyscall', that can have values "native or emulated or none", the performance is still lower.  Only way to get expected low cost of these calls is to use VDSO. Only libc version 2.14+ supports it. It is not an option for us until another 6-8 months when we upgrade the os. So I was writing wrappers to these clock_gettime() and sched_getcpu() calls to use VDSO calls.


  I see that the cost of clock_gettime(CLOCK_MONOTONIC) with VDSO wrapper is 276nsec. If I have a simple monotonically incrementing counter it costs only 46nsec. So why is the clock_gettime() is so much expensive? As the the code it is just reading the time from global variable 'gtod' and I don't see why is it costing so much.


> I remember having seen benchmarks of more recent lttng-ust 2.x around these numbers (or perhaps more around 275 ns/event)


In my case, with VDSO wrappers and UST 2.4.1, the cost of tracing one integer (on a single thread) was 795 cycles, which is 284nsec on 2.8GHz cpu. This number seems closer to your numbers. But this number is worse than your 1 thread number of  181nsec with UST 0.11. Is this performance degradation attributed to new features? I would love to see if we can get back to the earlier number 181nsec.


> what payload are you tracing exactly ?

> Can you create a small package with a simple benchmark program you use so we can build it ourselves and try it out ?


I am just tweaking the lttng-ust-2.4.1/doc/examples/hello-static-lib/hello.c and running it on our platforms. If you need I can send my changes to it.


> Moreover, I'd be interested to see results of perf profiling of the benchmark.


Yes, I ran the perf profiling and that is how I found the VDSO issue described earlier. Following is the perf output of the latest run with vdso wrappers.


    15.54%    hello  liblttng-ust.so.0.0.0             [.] lttng_event_reserve
    10.19%    hello  liblttng-ust.so.0.0.0             [.] lttng_event_commit
     9.03%    hello  hello                             [.] rdtsc
     7.36%    hello  hello                             [.] __event_probe__ust_tests_hello___tptest
     7.27%    hello  liburcu-bp.so.2.0.0               [.] rcu_read_unlock_bp

VBabu







________________________________
From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Sent: Friday, September 5, 2014 12:19 PM
To: ChitlurSrinivasa, Venkatesh
Cc: lttng-dev at lists.lttng.org
Subject: Re: What is the cost of user-space tracepoint() ?

________________________________
From: "Venkatesh ChitlurSrinivasa" <Venkatesh.Babu at netapp.com>
To: "mathieu desnoyers" <mathieu.desnoyers at efficios.com>
Cc: lttng-dev at lists.lttng.org
Sent: Friday, September 5, 2014 2:53:51 PM
Subject: What is the cost of user-space tracepoint() ?


Mathieu,

I tried to send this email to lttng-dev@ but I didn't get any response. So I am sending this directly to you. I greatly appreciate your response.

Hi Venkatesh,

Sorry, I was on vacation and just recently returned. I must admit I did not
have time to fully deal with my email backlog.


​

We are planning to use LTTng UST as it supports lot of interesting features, but have some performance concerns (as compared with our in-house tracing tool). Please point me to the latest benchmark tests and performance results. On CPU Intel Xeon E5-2680 v2 @ 2.80GHz​, running Linux 3.6.11​ and lttng 2.4.1,  I am getting about 927 cycles (9270692144 cycles for 10000000​ iterations). This seems to be lot higher than the documented results. In the paper https://lttng.org/files/papers/desnoyers.pdf  the average cost of tracepoint() with older ltt-usertrace-fast tracepoint is 297 cycles. Another link http://lttng.org/files/thesis/desnoyers-thesis-defense-2009-12-e1.pdf says cache hot tracepoint() cost is 238 cycles.​

Indeed, this is surprising. I would expect a performance figure in the area of 500 cycles per
UST tracepoints on modern Intel processors with lttng-ust 2.x, using the Linux kernel
monotonic clock.


I noticed that the tracepoint is making clock_gettime() and sched_getcpu() system calls. With Linux kernel v3.6.11 and libc v2.13-38+deb7u3, I see that these system calls are not going through VDSO and hence costing more. I tried to add wrapper functions for these system calls to call __vdso_clock_gettime and __vdso_getcpu as upgrading libc was not an option. With this change the cost of tracepoint() recording one integer dropped to 795 cycles (= 284 nsec on Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz). Still this number seems to be higher than earlier published numbers.

Indeed, going through a syscall for gettime and getcpu can be a large
cause of performance degradation. Upgrading the libc is really recommended
here.

The latest published benchmarks I remember are actually a bit old (UST 0.11):

https://sourceware.org/ml/systemtap/2011-q1/msg00244.html

This gave 211 ns/event on a CPU Intel Xeon E5404 at 2.0GHz, for
422 cycles per event. I remember having seen benchmarks of more
recent lttng-ust 2.x around these numbers (or perhaps more around
275 ns/event).

With LTTng 2.x, we reworked the ring buffer and added features, but
indeed the figure of 795 cycles/event (just below 400ns/event at 2.0GHz)
is higher than expected.

One very important question: what payload are you tracing exactly ? Can
you create a small package with a simple benchmark program you use
so we can build it ourselves and try it out ?

Another thing to consider is that performance is likely limited by the
cache throughput, memory barrier execution and so on. Therefore,
just because your CPU is running at 2.8GHz does not mean we can
trace faster than with a CPU at 2.0GHz. Therefore, it might be better
to measure in ns per event rather than cycles per event.

Moreover, I'd be interested to see results of perf profiling of the
benchmark.

Thanks,

Mathieu



VBabu



--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20140905/c6487289/attachment-0001.html>


More information about the lttng-dev mailing list