[lttng-dev] What is the cost of user-space tracepoint() ?

Fri Sep 5 15:19:01 EDT 2014

----- Original Message -----

> From: "Venkatesh ChitlurSrinivasa" <Venkatesh.Babu at netapp.com>
> To: "mathieu desnoyers" <mathieu.desnoyers at efficios.com>
> Cc: lttng-dev at lists.lttng.org
> Sent: Friday, September 5, 2014 2:53:51 PM
> Subject: What is the cost of user-space tracepoint() ?

> Mathieu,

> I tried to send this email to lttng-dev@ but I didn't get any response. So I
> am sending this directly to you. I greatly appreciate your response.

Hi Venkatesh, 

Sorry, I was on vacation and just recently returned. I must admit I did not 
have time to fully deal with my email backlog. 

> We are planning to use LTTng UST as it supports a lot of interesting features,
> but we have some performance concerns (as compared with our in-house tracing
> tool). Please point me to the latest benchmark tests and performance
> results. On CPU Intel Xeon E5-2680 v2 @ 2.80GHz, running Linux 3.6.11 and
> lttng 2.4.1, I am getting about 927 cycles (9270692144 cycles for 10000000
> iterations). This seems to be a lot higher than the documented results. In the
> paper https://lttng.org/files/papers/desnoyers.pdf the average cost of
> tracepoint() with the older ltt-usertrace-fast tracepoint is 297 cycles. Another
> link http://lttng.org/files/thesis/desnoyers-thesis-defense-2009-12-e1.pdf
> says the cache-hot tracepoint() cost is 238 cycles.

Indeed, this is surprising. I would expect a figure of around 500 cycles
per UST tracepoint on modern Intel processors with lttng-ust 2.x, using
the Linux kernel monotonic clock.

> I noticed that the tracepoint is making clock_gettime() and sched_getcpu()
> system calls. With Linux kernel v3.6.11 and libc v2.13-38+deb7u3, I see that
> these system calls are not going through VDSO and hence costing more. I
> tried to add wrapper functions for these system calls to call
> __vdso_clock_gettime and __vdso_getcpu as upgrading libc was not an option.
> With this change, the cost of tracepoint() recording one integer dropped to
> 795 cycles (= 284 nsec on Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz). Still,
> this number seems to be higher than the earlier published numbers.

Indeed, going through a syscall for gettime and getcpu can be a large 
cause of performance degradation. Upgrading the libc is really recommended 
here. 

The latest published benchmarks I remember are actually a bit old (UST 0.11): 

https://sourceware.org/ml/systemtap/2011-q1/msg00244.html 

This gave 211 ns/event on a CPU Intel Xeon E5404 at 2.0GHz, for 
422 cycles per event. I remember having seen benchmarks of more 
recent lttng-ust 2.x around these numbers (or perhaps more around 
275 ns/event). 

With LTTng 2.x, we reworked the ring buffer and added features, but
indeed the figure of 795 cycles/event (284 ns/event at 2.8GHz, which
would be just below 400 ns/event if the same cycle count were scaled to
2.0GHz) is higher than expected.

One very important question: what payload are you tracing exactly? Can
you create a small package with the simple benchmark program you use
so we can build it ourselves and try it out?

Another thing to consider is that performance is likely limited by
cache throughput, memory barrier execution, and so on. A CPU running at
2.8GHz therefore does not necessarily trace faster than one running at
2.0GHz, so it might be better to measure in ns per event rather than
cycles per event.

Moreover, I'd be interested to see results of perf profiling of the 
benchmark. 

Thanks, 

Mathieu 

> VBabu

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 