[ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks

Thu Feb 17 14:11:11 EST 2011

On 02/17/2011 09:33 AM, Julien Desfossez wrote:
> Hi,
> 
> On 02/16/2011 10:30 AM, Tom Tromey wrote:
>> Julien> 0) Baseline : running the program without any instrumentation
>> Julien> 1) Flight recorder tracing comparison UST vs SystemTap
>>
>> I'd be interested to also see the numbers when the probes are in place
>> in the source, but not enabled.  That is, what is the overhead of a
>> disabled probe?
> I disabled the probe by undefining HAVE_SYSTEMTAP, but I have the same
> results in flight recorder mode. Of course if the module is not loaded
> we have no overhead at all. This means the module is responsible for
> all the overhead, regardless of whether the probe is called or not.
> I would be really interested if you know why it happens (and how to fix it).
> This last test was done on a Fedora Core 14 (kernel
> 2.6.35.10-74.fc14.x86_64 with SystemTap 1.3-3).
> 
> If you want to test, the benchmark code is here :
> git://git.lttng.org/benchmarks.git

Your "testutrace.stp" is probing with process.function, which means
you're not using the compiled tracepoint at all, but rather a function
probe based on dwarf debuginfo.  So compiling !HAVE_SYSTEMTAP in this
case doesn't matter, the function still exists for the module to probe.
The correct form for SDT probes is a process.mark probe, as you quoted
in your original mail, in which case stap would fail to compile the
module for the !HAVE_SYSTEMTAP case as the marks don't exist.

In the general use case, a script can be conditional on the presence of
different probe types, as described in "man stapprobes".  For the
purpose of benchmarking I would avoid this, so we can be absolutely sure
of what's being probed.  But for reference, it can look like:
  probe process("foo").mark("myfn")!,
        process("foo").function("myfn")
  { ... }

Note also that there's about twice the overhead for process.function
versus process.mark.  With .mark, a NOP instruction is inserted for us
to place the debug breakpoint on.  As of the uprobes in stap 1.3, we can
skip the singlestep of probes on a NOP.  But for function probes, the
debug breakpoint is placed near the beginning of the function, likely on
a significant instruction, so it must be singlestepped.  The singlestep
means there are basically two traps per probe hit, so it really is a big
win to use process.mark instead.

Getting back to Tom's request, I think these are the variations that we
need to see for a fuller picture:

1) Baseline with NO instrumentation compiled in at all.  You may need
something like an asm("") in single_trace() to keep gcc from compiling
the loop away altogether.
1a) Same binary w/ probe process.function (showing that stap can probe
unmodified binaries, though I expect this to be slowest of all)

2) UST baseline: UST compiled in, but not active.
2a) Same binary w/ tracing activated, UST w/ TC
2b) Same binary w/ tracing activated, UST w/o TC
2c) etc. any other UST variant
* If the UST variations require different compilation, then split this
up and report active/inactive numbers for each build.

3) SDT baseline: stap SDT compiled in, but not active.
3a) Same binary w/ active probe process.mark

4) SDT-semaphore baseline: SDT compiled in and using a semaphore, not
active.  The semaphore is checked with
TRACEPOINT_BENCHMARK_SINGLE_TRACE_ENABLED(), so you could write
  if (..._ENABLED()) TRACE(...);
4a) Same binary w/ active probe process.mark

Thanks,

Josh