[ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
Julien Desfossez
julien.desfossez at polymtl.ca
Tue Feb 15 10:53:08 EST 2011
LTTng-UST vs SystemTap userspace tracing benchmarks
February 15th, 2011
Authors: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Julien Desfossez <julien.desfossez at polymtl.ca>
-- Introduction
The purpose of this benchmark is to compare the performance for
userspace tracing of SystemTap and LTTng-UST. The goal is to show that
the two tools are complementary since SystemTap doesn't seem to be able
to handle tracing applications with a high throughput of trace data.
-- Benchmark
10 million events generated per thread, number of threads vary. Each
event generates a time-stamp and contains a 4-byte integer value.
Synthetic workload: cache-hot test, function writing events called in
loop. On a 8-core Intel Xeon, (2x 4-core), E5405 at 2.0GHz, 16GB ram
Running Linux 2.6.37 (custom build, with utrace patches, debuginfo
enabled and LTTng trace clock available)
UST 0.11, hooking on user-space Tracepoints
* UST tuning : Normal (blocking) mode, 16 buffers, 4k each
* We test UST with the LTTng Trace Clock (w/ TC) and with the standard
clock infrastructure (w/o TC)
SystemTap 1.2-5 (from Debian package), hooking on DTrace user-space
static markup.
* SystemTap probe (stap testutrace.stp -F) :
probe process("./.libs/tracepoint_benchmark").mark("single_trace") {
printf("%d : %s\n", gettimeofday_ns(), $arg1);
}
-- Results
0) Baseline : running the program without any instrumentation
TOTAL CPU TIME
Number of threads baseline
1 0:0.33
2 0:0.33
4 0:0.33
8 0:0.33
1) Flight recorder tracing comparison UST vs SystemTap
TOTAL CPU TIME
Number of threads UST w/ TC UST w/o TC SystemTap
1 0:01.81 0:02.25 0:58.36
2 0:01.86 0:02.13 1:49.94
4 0:01.86 0:02.22 2:38.49
8 0:01.97 0:02.14 9:29.58
TOTAL CPU TIME (ns/event)
Number of threads UST w/ TC UST w/o TC SystemTap
1 181 225 5836
2 186 213 10994
4 186 222 15849
8 197 204 56958
UST SPEEDUP
Number of threads UST w/ TC UST w/o TC
1 32x 25x
2 59x 51x
4 85x 71x
8 289x 279x
2) Tracing to disk comparison UST vs SystemTap (trace output fits in
page cache)
TOTAL CPU TIME
Number of threads UST w/ TC UST w/o TC SystemTap
1 0:01.82 0:02.11 1:01.12 (128622 lost)
2 0:01.95 0:02.14 1:44.20 (397859 lost)
4 0:01.97 0:02.31 2:38.13 (360549 lost)
8 0:02.28 0:02.68 9:29.36 (158538 lost)
TOTAL CPU TIME (ns/event)
Number of threads UST w/ TC UST w/o TC SystemTap
1 182 211 6112
2 195 214 10420
4 197 231 15813
8 228 268 56936
UST SPEEDUP
Number of threads UST w/ TC UST w/o TC
1 33x 28x
2 53x 48x
4 80x 68x
8 249x 212x
OUTPUT SIZE (MB)
Number of threads UST SystemTap UST Output compression
1 77 271 3.52
2 153 554 3.62
4 306 1097 3.58
8 612 2214 3.61
-- Conclusions
For flight recorder tracing, UST is 289 times faster than SystemTap on
an 8-core system with a LTTng kernel and 279 times with a vanilla+utrace
kernel.
When recording traces to disk, UST is 249 times faster than SystemTap on
an 8-core system with a LTTng kernel and 212 times with a vanilla+utrace
kernel.
Only a small part of the UST speedup over SystemTap is due to the more
compressed size of its output (binary for UST vs text for SystemTap).
SystemTap does not scale for multithreaded applications running on
multi-core systems. UST scales linearly with the number of cores for
flight recorder tracing, and almost linearly when saving tracing output
to the page cache.
This study proves that LTTng-UST and SystemTap are two tools with a
complementary purpose. LTTng-UST is more efficient in extracting a high
volume of trace data which allows a developper or a system engineer to
diagnose an unknown problem, whereas SystemTap is more targetted to
provide a quick interface for instrumenting specific problems.
More information about the lttng-dev
mailing list