[lttng-dev] Hardware-assisted tracing in LTTng

Tue Apr 1 14:23:55 EDT 2014

Hi,

I'm finishing my master's degree and I want to share my results on what can be
done using STM, a hardware tracer present on newer ARM-based platforms.  It
really speeds up tracing, but requires a lot of changes if we want LTTng to
support it with something like:

$ lttng use-hardware stm

Most debugging hardware units provide execution tracing (i.e. recording every
branch during execution), which is not suited to event tracing.  However, some
other hardware solutions exist to accelerate event recording.  They provide
dedicated resources for writing and timestamping events.  Currently, two
hardware modules provide such functionality:
- "Data Acquisition Messages" on Freescale QorIQ processors;
- "System Trace Module" on some ARM-based chips (called "System Trace
  Macrocell" in new versions).
Since Data Acquisition Messages are not publicly documented, the following only
concerns STM.

Tests we ran on a Pandaboard showed that using STM to record small UST
tracepoints is 10x faster than using LTTng-UST.  Also, STM has many independent
"channels", so in most situations locks are not needed.

However, the characteristics of this hardware make special handling necessary.
In particular, the following are needed:
- a new way of writing traces *sequentially*;
- a new type of trace consumer (for dealing with hardware buffers);
- a new trace converter (STM format -> CTF).

About sequential trace
----------------------
Writing to STM must be sequential: to store a 4-integer array, you need to
write 4 times to the same address.  Thus, the classic "channel0_%d" files
cannot be written to STM as they are.  First, they are not written
sequentially: the header is updated once a full page has been written.  Second,
these files are full of padding and spaces, which would be really inefficient
with STM because each of these zeros has to be written explicitly (instead of
just moving a pointer as in libringbuffer).
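
To illustrate the sequential constraint, here is a minimal Python sketch.  It
does not touch real hardware: FakeStmChannel, write32 and emit_event are
hypothetical names, and the "id, length, payload" layout is only an assumption
chosen to show a packed stream with no padding and no header to go back to.

```python
import struct


class FakeStmChannel:
    """Stand-in for one memory-mapped STM channel (hypothetical)."""

    def __init__(self):
        self.words = []  # every 32-bit word written, in write order

    def write32(self, value):
        # On real hardware this would be a store to one fixed address.
        self.words.append(value & 0xFFFFFFFF)


def emit_event(channel, event_id, payload):
    """Write an event as a packed word stream: id, length, then payload."""
    channel.write32(event_id)
    channel.write32(len(payload))
    for value in payload:
        # A 4-integer array costs 4 writes to the same channel.
        channel.write32(value)


channel = FakeStmChannel()
emit_event(channel, 7, [10, 20, 30, 40])

# The resulting stream is strictly append-only: nothing is patched afterwards.
raw = struct.pack("<%dI" % len(channel.words), *channel.words)
```

Every word is appended in order and nothing is rewritten later, which is the
property a sequential trace format would have to guarantee.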

Another STM capability is to automatically timestamp messages using a hardware
clock, so LTTng software timestamping would not be needed when using hardware.
I don't know how much speedup can be expected from this.  Since the STM clock
and the LTTng clock (RDTSC) are different, some mechanism is needed to
synchronize them (for instance, sending a synchronization packet containing the
RDTSC value into STM every second).
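
One possible way to use such periodic packets at analysis time is linear
interpolation between the two nearest synchronization points.  The sketch
below is only an assumption about how this could be done; stm_to_lttng and the
(stm_ts, lttng_ts) pair layout are hypothetical, and the numbers are made up.

```python
from bisect import bisect_right


def stm_to_lttng(stm_ts, sync_points):
    """Map an STM timestamp to the LTTng clock by linear interpolation.

    sync_points is a list of (stm_ts, lttng_ts) pairs sorted by stm_ts, one
    per synchronization packet found in the trace.
    """
    if len(sync_points) < 2:
        raise ValueError("need at least two sync points")
    stm_values = [s for s, _ in sync_points]
    # Pick the segment containing stm_ts; clamp so that timestamps outside
    # the covered range are extrapolated from the nearest segment.
    i = bisect_right(stm_values, stm_ts) - 1
    i = max(0, min(i, len(sync_points) - 2))
    (s0, l0), (s1, l1) = sync_points[i], sync_points[i + 1]
    return l0 + (stm_ts - s0) * (l1 - l0) / (s1 - s0)


# Made-up sync packets: (STM clock value, LTTng clock value).
sync = [(1000, 50000), (2000, 150000), (3000, 250000)]
converted = stm_to_lttng(1500, sync)
```

Interpolating per segment absorbs slow drift between the two clocks, at the
cost of needing at least two synchronization packets in the trace.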

About trace consuming
---------------------
STM output is stored in a special buffer called ETB.  This buffer can either be
read from the host system, or remotely drained from another computer attached
with JTAG.  Both cases have users: some want to export the trace via JTAG to
reduce overhead, others don't have a JTAG connector.  So both cases should be
supported.

My approach would be a stand-alone trace consumer program, not part of the
LTTng consumers.  This way, both the on-chip and off-chip use cases would
result in a raw trace file, either created on the traced host or retrieved by
the remote monitoring machine.
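
As a rough sketch of what the on-chip side of such a consumer could do, the
Python below linearizes a snapshot of a circular buffer and appends it to a raw
file.  It assumes the buffer behaves like a ring exposing a write position and
a "wrapped" flag; drain_ring and save_raw_trace are hypothetical helpers, and
the actual ETB register access is deliberately left out.

```python
def drain_ring(buffer, write_pos, wrapped):
    """Return the buffer contents in the order they were written.

    buffer: bytes snapshot of the hardware buffer (hypothetical layout)
    write_pos: index where the next byte would have been written
    wrapped: True if the hardware overwrote old data at least once
    """
    if wrapped:
        # Oldest data starts at write_pos and runs to the end, then wraps.
        return buffer[write_pos:] + buffer[:write_pos]
    return buffer[:write_pos]


def save_raw_trace(path, chunks):
    """Append drained chunks to a raw trace file for later conversion."""
    with open(path, "ab") as out:
        for chunk in chunks:
            out.write(chunk)
```

For example, drain_ring(b"EFCD", 2, True) gives b"CDEF".  The off-chip case
would produce the same kind of raw file, so the converter would not need to
know where the trace was collected.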

About STM trace format
----------------------
Then, a trace converter would produce CTF from this raw trace file.  The STM
format (called STP) is too unusual to be described as CTF.  Also, processing
STP is quite CPU-intensive, so real-time decoding is not a good idea as it
would hurt performance.  Decoding at analysis time should be preferred.  The
STP format is not publicly documented, but I wrote code to decode it [1].  It
could be reused as a base for a decoder, for example a "stp2ctf" babeltrace
plugin.

Comments are welcome, in particular on these two questions:
1. Is it worth supporting STM in LTTng?
2. How to define a new sequential trace format that is forward-compatible with
   future event-tracing hardware?

Thanks,

Adrien Vergé

[1]: https://github.com/adrienverge/libcoresightomap4430