[ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks

Julien Desfossez julien.desfossez at polymtl.ca
Tue Feb 15 10:53:08 EST 2011


LTTng-UST vs SystemTap userspace tracing benchmarks

February 15th, 2011

Authors: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
         Julien Desfossez <julien.desfossez at polymtl.ca>

-- Introduction

This benchmark compares the userspace tracing performance of SystemTap
and LTTng-UST. The goal is to show that the two tools are complementary,
since SystemTap does not seem able to handle applications that produce
a high throughput of trace data.

-- Benchmark
Each thread generates 10 million events; the number of threads varies.
Each event records a timestamp and carries a 4-byte integer value.
Synthetic workload: cache-hot test in which the function writing events
is called in a loop. The test machine is an 8-core Intel Xeon E5405
(2x 4-core) at 2.0GHz with 16GB of RAM, running Linux 2.6.37 (custom
build with utrace patches, debuginfo enabled and the LTTng trace clock
available).

UST 0.11, hooking on user-space Tracepoints
* UST tuning : Normal (blocking) mode, 16 buffers, 4k each
* We test UST with the LTTng Trace Clock (w/ TC) and with the standard
clock infrastructure (w/o TC)

SystemTap 1.2-5 (from Debian package), hooking on DTrace user-space
static markup.
* SystemTap probe (stap testutrace.stp -F) :
probe process("./.libs/tracepoint_benchmark").mark("single_trace") {
    printf("%d : %s\n", gettimeofday_ns(), $arg1);
}

-- Results
0) Baseline : running the program without any instrumentation

                            TOTAL CPU TIME
Number of threads           baseline
                1           0:00.33
                2           0:00.33
                4           0:00.33
                8           0:00.33


1) Flight recorder tracing comparison UST vs SystemTap
                            TOTAL CPU TIME
Number of threads       UST w/ TC       UST w/o TC      SystemTap
                1       0:01.81         0:02.25         0:58.36
                2       0:01.86         0:02.13         1:49.94
                4       0:01.86         0:02.22         2:38.49
                8       0:01.97         0:02.14         9:29.58

                            TOTAL CPU TIME (ns/event)
Number of threads       UST w/ TC       UST w/o TC      SystemTap
                1       181             225             5836
                2       186             213             10994
                4       186             222             15849
                8       197             204             56958
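
The ns/event figures follow from the CPU times in the preceding table.
A minimal sketch of the conversion, assuming (as the published numbers
suggest) that each CPU time is divided by the 10 million events a single
thread generates, whatever the thread count. The function name
ns_per_event is illustrative and not part of the benchmark code:

```python
from decimal import Decimal

# From the benchmark description: 10 million events generated per thread.
EVENTS_PER_THREAD = 10_000_000


def ns_per_event(cpu_time: str, events: int = EVENTS_PER_THREAD) -> int:
    """Convert an 'm:ss.cc' CPU time into whole nanoseconds per event."""
    minutes, seconds = cpu_time.split(":")
    # Decimal avoids binary floating-point rounding on values like 58.36.
    total_seconds = int(minutes) * 60 + Decimal(seconds)
    total_ns = total_seconds * 1_000_000_000
    return int(total_ns / events)


# Single-thread flight recorder row, taken from the CPU time table.
for label, cpu_time in [
    ("UST w/ TC", "0:01.81"),
    ("UST w/o TC", "0:02.25"),
    ("SystemTap", "0:58.36"),
]:
    print(f"{label:<12}{ns_per_event(cpu_time)} ns/event")
```

With 10 million events, one hundredth of a second of CPU time
corresponds to exactly 1 ns/event, which is why the two tables carry
the same digits.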

                            UST SPEEDUP
Number of threads       UST w/ TC       UST w/o TC
                1       32x             25x
                2       59x             51x
                4       85x             71x
                8       289x            279x
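
The speedup columns appear to be the SystemTap ns/event figure divided
by the UST figure, truncated to an integer. A short sketch that
reproduces the flight recorder speedups under that assumption (the
helper name speedup and the dictionary layout are illustrative):

```python
# ns/event from the flight recorder table, keyed by number of threads.
FLIGHT_RECORDER_NS = {
    1: {"ust_tc": 181, "ust_no_tc": 225, "systemtap": 5836},
    2: {"ust_tc": 186, "ust_no_tc": 213, "systemtap": 10994},
    4: {"ust_tc": 186, "ust_no_tc": 222, "systemtap": 15849},
    8: {"ust_tc": 197, "ust_no_tc": 204, "systemtap": 56958},
}


def speedup(systemtap_ns: int, ust_ns: int) -> int:
    """How many times cheaper a UST event is, truncated to an integer."""
    return systemtap_ns // ust_ns


for threads, row in FLIGHT_RECORDER_NS.items():
    with_tc = speedup(row["systemtap"], row["ust_tc"])
    without_tc = speedup(row["systemtap"], row["ust_no_tc"])
    print(f"{threads} thread(s): {with_tc}x w/ TC, {without_tc}x w/o TC")
```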


2) Tracing to disk comparison UST vs SystemTap (trace output fits in
page cache)
                            TOTAL CPU TIME
Number of threads       UST w/ TC    UST w/o TC   SystemTap
                1       0:01.82      0:02.11      1:01.12 (128622 lost)
                2       0:01.95      0:02.14      1:44.20 (397859 lost)
                4       0:01.97      0:02.31      2:38.13 (360549 lost)
                8       0:02.28      0:02.68      9:29.36 (158538 lost)

                            TOTAL CPU TIME (ns/event)
Number of threads       UST w/ TC       UST w/o TC      SystemTap
                1       182             211             6112
                2       195             214             10420
                4       197             231             15813
                8       228             268             56936

                            UST SPEEDUP
Number of threads       UST w/ TC       UST w/o TC
                1       33x             28x
                2       53x             48x
                4       80x             68x
                8       249x            212x

                            OUTPUT SIZE (MB)
Number of threads       UST         SystemTap     SystemTap/UST size ratio
                1       77          271           3.52
                2       153         554           3.62
                4       306         1097          3.58
                8       612         2214          3.61


-- Conclusions

For flight recorder tracing with 8 threads on an 8-core system, UST is
289 times faster than SystemTap with an LTTng kernel and 279 times
faster with a vanilla+utrace kernel.

When recording traces to disk under the same conditions, UST is 249
times faster than SystemTap with an LTTng kernel and 212 times faster
with a vanilla+utrace kernel.
Only a small part of the UST speedup over SystemTap is due to the more
compact size of its output (binary for UST vs text for SystemTap).

SystemTap does not scale for multithreaded applications running on
multi-core systems. UST scales linearly with the number of cores for
flight recorder tracing, and almost linearly when saving tracing output
to the page cache.
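
One way to quantify the scaling claim is to compare the per-event cost
at 8 threads with the cost at 1 thread, using the ns/event figures from
the results tables. This is a sketch for illustration; the helper name
growth and the data layout are assumptions, not part of the benchmark:

```python
# (ns/event at 1 thread, ns/event at 8 threads) from the results tables.
NS_PER_EVENT = {
    "flight recorder": {
        "UST w/ TC": (181, 197),
        "UST w/o TC": (225, 204),
        "SystemTap": (5836, 56958),
    },
    "to disk": {
        "UST w/ TC": (182, 228),
        "UST w/o TC": (211, 268),
        "SystemTap": (6112, 56936),
    },
}


def growth(one_thread_ns: int, eight_thread_ns: int) -> float:
    """Per-event cost at 8 threads relative to 1 thread (1.0 = flat)."""
    return round(eight_thread_ns / one_thread_ns, 2)


for mode, tracers in NS_PER_EVENT.items():
    for tracer, (one, eight) in tracers.items():
        print(f"{mode:<16}{tracer:<12}{growth(one, eight)}x")
```

A factor close to 1.0 means the per-event cost stays flat as threads are
added. The UST factors stay below 1.3, while the SystemTap factors are
above 9.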

This study shows that LTTng-UST and SystemTap are two tools with
complementary purposes. LTTng-UST is more efficient at extracting a high
volume of trace data, which allows a developer or a system engineer to
diagnose an unknown problem, whereas SystemTap is more targeted at
providing a quick interface for instrumenting specific problems.


