[ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks

Wed Feb 16 05:56:18 EST 2011

On Tue, 2011-02-15 at 17:00 +0000, Stefan Hajnoczi wrote:
> On Tue, Feb 15, 2011 at 4:26 PM, Frank Ch. Eigler <fche at redhat.com> wrote:
> > (One may imagine a future version of systemtap where scripts that
> > happen to independently probe single processes are executed with a
> > pure userspace backend, but this is not in our immediate roadmap.)
> 
> What is the fundamental mechanism that UST and SystemTap use for tracing?
> 
> e.g. Here's a guess:
> UST: a conditional function call within the same process
> SystemTap: a software interrupt on x86
> 
> I don't know the implementation details but would be interested in
> understanding this.

I don't know the precise implementation details for ltt. But for
SystemTap you could divide the "tracing" process into a couple of steps:

1) The probe marking. This is how you record where probes can be placed
   and how to get at the arguments/context of a probe. For userspace
   probes SystemTap mainly relies on two mechanisms:

 - DWARF debuginfo. This is the same mechanism debuggers use. It is a
   very low-level description of how the source program maps to the
   binary. Through it you can determine locations for probes based on
   source lines, function names, etc. and get a description of how to
   get at local variables and arguments. The advantage is that it is
   already there (when compiled with -g), so you don't need to do
   anything special. The downside is that it is pretty low level, so you
   do need to know a bit about the program structure before you can
   "trace" effectively. Recent advancements in gcc have made the DWARF
   debuginfo pretty reliable.

 - sdt markers. This is a mechanism also employed by dtrace (although
   the way the markers and arguments are embedded is slightly different,
   but that is an implementation detail). A program #includes
   <sys/sdt.h> and places PROBE markers in its source code to indicate
   "high-level events" and the relevant arguments for each event.
   The macros get translated to special code that places the name,
   address and where to find the arguments into a special ELF note.
   The advantage is that as a "trace user" you get an overview of the
   high-level events that might be interesting to introspect. The
   disadvantage is that the programmer needs to explicitly embed them in
   the program (but since dtrace and now gdb can also hook onto them
   they are getting used more and more).
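
To make the "special ELF note" a bit more concrete, here is a minimal
Python sketch that decodes the descriptor of one such note. It assumes
the layout used by later versions of <sys/sdt.h> (three addresses: probe
pc, base and semaphore, followed by three NUL-terminated strings:
provider, probe name and argument locations); the layout at the time of
this message may have differed. The function name, the sample bytes and
the "myapp"/"request_start" names are hypothetical.

```python
import struct


def parse_sdt_note_desc(desc, addr_size=8):
    """Decode one SDT note descriptor (assumed layout, see the text above)."""
    # Little-endian pc, base and semaphore addresses come first.
    fmt = "<3Q" if addr_size == 8 else "<3I"
    pc, base, semaphore = struct.unpack_from(fmt, desc, 0)
    # The remainder is three NUL-terminated strings.
    fields = desc[struct.calcsize(fmt):].split(b"\0")
    provider, name, args = (f.decode("ascii") for f in fields[:3])
    return {
        "provider": provider,
        "name": name,
        "pc": pc,
        "base": base,
        "semaphore": semaphore,
        # One "size@location" entry per argument, e.g. "-4@%edi".
        "args": args.split(),
    }


# Hypothetical descriptor, hand-built for illustration only.
example_desc = (
    struct.pack("<3Q", 0x400546, 0x400600, 0)
    + b"myapp\0request_start\0-4@%edi 8@%rsi\0"
)
probe = parse_sdt_note_desc(example_desc)
print(probe["provider"], probe["name"], hex(probe["pc"]), probe["args"])
```

The point is only that the name, the address and the argument locations
all travel with the binary, so a tracer can find them without debuginfo.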

2) The probe and context selection. In a systemtap (stap) script you
   list all places/events you want to place a probe on. These can be
   low-level kernel events (tracepoints, probes based on kernel
   debuginfo, timers, perf events, etc.) or user-level events (based on
   the DWARF debuginfo or sdt markers placed in the program). Then for
   each (group of) probe events, you write a handler listing the context
   you are interested in (variables, arguments, etc.). These can then be
   used to filter and/or log the event (see under 5. The actual "trace").

4) Hooking onto the probe. Based on the stap script you provide, the
   systemtap runtime decides which addresses to place probes on (or
   which event notifiers to hook into). It also extracts the location of
   each context variable and/or parameter used in the probe handler for
   that location. Currently, for each user space address derived (there
   can be several if the probe point is inlined in various places), it
   uses uprobes to place a breakpoint instruction at that location and
   registers a callback to the handler responsible for that probe
   event. All the nitty-gritty of placing the probes and handling the
   software interrupt is delegated to uprobes (which saves the full
   user/kernel/user roundtrip that is necessary with, for example,
   ptrace). uprobes is being pushed into the upstream kernel so it can
   be used by others like perf and gdb in the future. But you could
   imagine hooking being done through other mechanisms, like in-process
   function calls in the user process. If the code injection techniques
   of ltt are reusable that would be a very cool idea.
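
As a rough illustration of the bookkeeping described in this step, here
is a toy Python model (not how uprobes or the systemtap runtime are
implemented): a table maps each probed address to its handler
callbacks, and one probe point that was inlined in two places ends up
with two addresses sharing one handler. The class name, the addresses
and the "size" context variable are all made up for the example.

```python
class BreakpointTable:
    """Toy stand-in for the hooking layer: address -> handler callbacks."""

    def __init__(self):
        self.handlers = {}

    def register(self, addresses, handler):
        # An inlined probe point resolves to several addresses; each one
        # gets its own entry, all sharing the same handler.
        for addr in addresses:
            self.handlers.setdefault(addr, []).append(handler)

    def hit(self, addr, context):
        # Simulates execution reaching a probed address.
        for handler in self.handlers.get(addr, []):
            handler(addr, context)


seen = []
table = BreakpointTable()
table.register([0x400546, 0x4007A0],
               lambda addr, ctx: seen.append((addr, ctx["size"])))

table.hit(0x400546, {"size": 16})
table.hit(0x4007A0, {"size": 32})
table.hit(0x400999, {"size": 64})  # no probe at this address, nothing runs
print(seen)
```

Swapping the breakpoint for an in-process function call would only
change how hit() gets triggered; the address-to-handler mapping could
stay the same.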

5) The actual "trace"/data gathering step. Depending on the stap
   handler you wrote for the probe, the SystemTap runtime (called
   through the probe hook) will extract the context variables
   and/or parameters you are interested in. These are then used for
   filtering (based on the conditionals used in your handler), after
   which you can either assign derived values to global (script)
   variables or statistical containers, or log the event and/or
   some of the context. Basically you write a log or printf
   statement in your handler when you want to "trace" it. Depending
   on how you invoked stap, the output is then placed in a file or
   some buffer through procfs, relayfs, debugfs or ring_buffers.
   Alternatively you can write an "end" handler that just spits out
   the data you accumulated and stored in the script variables and
   statistics (so that nothing has to be output during the probe
   event itself, which saves data output and processing time).
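
The two styles from this step (log on every event versus accumulate and
report once at the end) can be sketched with a small Python model. This
only mimics the shape of a stap handler; the Stats class, the handler
names, the 1024 threshold and the sample events are all invented for
illustration.

```python
class Stats:
    """Toy statistical container: running count, sum, min and max."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.min = None
        self.max = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)


trace_log = []          # style 1: one log line per matching event
large_allocs = Stats()  # style 2: accumulate, report at the end


def log_handler(ctx):
    # Filter on a context variable, then log immediately.
    if ctx["size"] >= 1024:
        trace_log.append("alloc pid=%d size=%d" % (ctx["pid"], ctx["size"]))


def stats_handler(ctx):
    # Same filter, but only update the aggregate; no output here.
    if ctx["size"] >= 1024:
        large_allocs.add(ctx["size"])


def end_handler():
    # Runs once when tracing stops and reports the accumulated data.
    return "count=%d total=%d min=%s max=%s" % (
        large_allocs.count, large_allocs.total,
        large_allocs.min, large_allocs.max)


events = [{"pid": 1, "size": 64},
          {"pid": 1, "size": 4096},
          {"pid": 2, "size": 2048}]
for event in events:
    log_handler(event)
    stats_handler(event)

summary = end_handler()
print(trace_log)
print(summary)
```

The second style does a constant amount of work per event and produces
a single line of output, which is the saving mentioned above.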

Hope that helps. And if someone could give a similar overview of ltt
then we could see how we can more easily mix and match these various
steps in the future, since it seems the mechanisms used are nicely
complementary.

Cheers,

Mark