[ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks

Tue Feb 15 11:54:29 EST 2011

Hi Frank,

* Frank Ch. Eigler (fche at redhat.com) wrote:
> 
> Julien Desfossez <julien.desfossez at polymtl.ca> writes:
> 
> > LTTng-UST vs SystemTap userspace tracing benchmarks
> 
> Thank you.
> 
> > [...]  For flight recorder tracing, UST is 289 times faster than
> > SystemTap on an 8-core system with a LTTng kernel and 279 times with
> > a vanilla+utrace kernel.
> 
> This is not that surprising, considering how the two tools work.  UST
> does its work in userspace,

This first part of the statement is true,

> and is therefore focused on an individual
> process's activities.

This is incorrect. LTTng and UST gather traces from multiple processes and from
the kernel, and merge them in post-processing. This toolset is therefore focused
on system-wide activity analysis.

> Systemtap does its work in kernelspace, and can
> therefore focus on many different processes and the kernel at the same
> time.  This entails some ring transitions.

The difference between UST and SystemTap is not the target goal, but rather
where the computation is done: UST buffers its trace output, whereas SystemTap
performs a ring transition for each individual event. This is a core design
difference that partly explains the dramatic performance gap we see here.
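
To illustrate the point about buffering, here is a toy model in Python. It is
only a sketch: it does not reflect how UST or SystemTap are implemented, the
class and function names are hypothetical, and a "transition" is just a counter
standing in for a user/kernel ring crossing. It only shows how batching events
reduces the number of crossings.

```python
class PerEventTracer:
    """Toy model: one simulated ring transition for every traced event."""

    def __init__(self):
        self.transitions = 0

    def trace(self, event):
        # Each event crosses into the kernel on its own.
        self.transitions += 1


class BufferedTracer:
    """Toy model: events accumulate in a buffer that is handed off in bulk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.transitions = 0

    def trace(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        # One simulated transition hands off the whole buffer.
        if self.buffer:
            self.transitions += 1
            self.buffer.clear()


def count_transitions(n_events, capacity):
    """Return (per-event transitions, buffered transitions) for n_events."""
    per_event = PerEventTracer()
    buffered = BufferedTracer(capacity)
    for i in range(n_events):
        per_event.trace(i)
        buffered.trace(i)
    buffered.flush()  # hand off whatever is left at the end
    return per_event.transitions, buffered.transitions


print(count_transitions(1000, 100))  # (1000, 10)
```

In this model the number of crossings drops by a factor equal to the buffer
capacity (an arbitrary value here), which is the effect described above.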

> 
> (One may imagine a future version of systemtap where scripts that
> happen to independently probe single processes are executed with a
> pure userspace backend, but this is not in our immediate roadmap.)
> 
> > SystemTap does not scale for multithreaded applications running on
> > multi-core systems.  [...]
> 
> We know of at least one kernel problem in this area,
> <http://sourceware.org/PR5660>, which may be fixable via core or
> utrace or uprobes changes.
> 
> 
> > This study proves that LTTng-UST and SystemTap are two tools with a
> > complementary purpose.  [...]
> 
> Strictly speaking, it shows that their performance differs
> dramatically in this sort of microbenchmark.  

Strictly speaking, you are right. I've done performance testing on LTTng (the
kernel equivalent of UST, using very similar technology) on real workloads
traced at the kernel level, and this kind of microbenchmark actually shows a
lower bound of the tracer performance impact per probe (the upper bound being up
to a factor of 3 higher due to cache misses in the trace buffers). All the details
are presented in http://lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf,
Chapters 5.5, 8.4 and 8.5. Now the overall performance impact must indeed be
weighted by the number of times the tracer is called by the application. If, for
example, we trace standard tests like "dbench" at the kernel-level with LTTng,
we get a 3% performance hit. If we multiply this by 294, this comes to roughly
an 882% performance hit on the system, which is likely to have some noticeable
impact on the end user experience.
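
The arithmetic above can be written out as a short Python sketch. The function
names are hypothetical, and the linear scaling is the same simplifying
assumption made in the paragraph above, not a measured result.

```python
def scaled_overhead_percent(baseline_percent, slowdown_factor):
    """Scale a measured overhead linearly by a per-probe slowdown factor."""
    return baseline_percent * slowdown_factor


def relative_runtime(overhead_percent):
    """Runtime relative to the untraced workload (1.0 means no overhead)."""
    return 1 + overhead_percent / 100


overhead = scaled_overhead_percent(3, 294)
print(overhead)  # 882
print(round(relative_runtime(overhead), 2))  # 9.82
```

In other words, under this assumption the traced workload would take close to
ten times as long as the untraced one.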

> 
> Thank you for your data gathering.

Thanks for your reply. We'll be glad to help out if we can.

Mathieu

> 
> - FChE

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com