[ltt-dev] LTTng specialized probes

Mathieu Desnoyers compudj at krystal.dyndns.org
Tue Oct 7 20:07:27 EDT 2008


Hi Michael,

* Michael Davidson (md at google.com) wrote:
> Hi, Mathieu!
> 
> Jiaying forwarded this to me and I wanted to try to understand a
> little better exactly what direction you are headed in.
> 
> Like Martin, I am (at least) a little confused and since I wasn't
> involved in the discussions in Portland please forgive me if I am
> going over old ground here.
> 

No problem, I'll try to answer the best I can. Don't hesitate to ask for
clarifications if I am not clear enough.

> It seems to me that one of the key issues in getting good performance
> out of any kernel tracing system is that you have to record the data
> that you are trying to capture as quickly as possible. To me that
> means that during the initial recording of the data, to the maximum
> extent possible, you don't even look at it - you just slam it straight
> into your recording buffer and you are done (this should work well for
> the majority of cases where the data being captured are just scalar
> values - the more exotic the data the more processing it will need).
> 

Yes, I agree that the best performance is achieved by having a probe
which already "knows" how much data to save and just "does it", and this
is what we want for high-throughput events.

However, I pursue a supplementary goal in LTTng: I wanted to make it
trivial for kernel developers to add new events in the kernel. This can
be done by declaring a new marker, which has a format string that
permits automatic, vsnprintf-like parsing to save the data into the
trace buffer. The format strings are saved in the trace "metadata"
(only once); they describe the event fields. This provides flexible
and extensible event data. This is by no means the tracing "fast path".
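To illustrate, here is a minimal user-space sketch of such a two-pass,
vsnprintf-like serializer (the function names are made up, and only %d
and %ld are handled; the real serializer covers the full set of scalar
conversions and takes care of field alignment):

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* First pass: compute the payload size from the format string alone.
 * Only %d and %ld are handled in this sketch; alignment is ignored
 * for brevity. */
size_t probe_size(const char *fmt)
{
	size_t size = 0;
	const char *p;

	for (p = fmt; *p; p++) {
		if (*p != '%')
			continue;
		p++;
		if (!*p)
			break;
		if (*p == 'd') {
			size += sizeof(int);
		} else if (*p == 'l' && p[1] == 'd') {
			size += sizeof(long);
			p++;
		}
	}
	return size;
}

/* Second pass: copy the scalar arguments back-to-back into buf, in
 * host endianness, exactly as described by the format string. */
size_t probe_write(char *buf, const char *fmt, ...)
{
	va_list ap;
	size_t off = 0;
	const char *p;

	va_start(ap, fmt);
	for (p = fmt; *p; p++) {
		if (*p != '%')
			continue;
		p++;
		if (!*p)
			break;
		if (*p == 'd') {
			int v = va_arg(ap, int);
			memcpy(buf + off, &v, sizeof(v));
			off += sizeof(v);
		} else if (*p == 'l' && p[1] == 'd') {
			long v = va_arg(ap, long);
			memcpy(buf + off, &v, sizeof(v));
			off += sizeof(v);
			p++;
		}
	}
	va_end(ap);
	return off;
}
```

The size pass is what lets the tracer reserve buffer space up front;
the write pass then fills the reserved slot.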

> The internal binary format that is recorded in the kernel buffers is
> inherently machine and kernel specific.
> 

Yes, the binary data I write in the buffers is in the kernel endianness
and follows the machine-specific basic types. However, no compound types
are currently supported.

> You don't need to even think about putting the data into a canonical format
> until you are ready to export it out of the kernel. (Note that, in one
> of our likely use cases we might run tracing continuously but only
> extract snapshots of the data from the kernel occasionally - so most
> of the data that we collected would never even need to be put into
> canonical format).

Yes, one use-case is to keep it within the kernel so it can be later
exported (flight recorder mode). But this mode alone might need to
export the data from a half-crashed kernel, where we want to do the
minimum operations to get the data out; e.g. simply copying the data to
a serial link, without any formatting whatsoever. The second use-case
involves saving huge amounts of data to disk ("normal", or continuous,
tracing mode). This is where having a trace buffer representation which
follows the internal machine representation *and* which is
self-described matters, because the most efficient way to get the data
out of the kernel is to write the pages directly to disk, or send them
directly through the network, without any supplementary formatting.
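As an illustration of the zero-copy idea, here is a user-space sketch
(not LTTng code; the function name and file path are invented), where
splice() moves the bytes from a pipe, standing in for a trace channel,
to a file without copying them through the process's address space:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Move a small payload from a pipe to a file descriptor without
 * copying the data through user-space buffers, then verify the bytes
 * landed on disk. Returns 0 on success, -1 on failure. */
int splice_demo(const char *path)
{
	static const char payload[] = "trace data";
	const size_t n = sizeof(payload) - 1;
	char back[sizeof(payload)] = { 0 };
	int p[2], fd, ret = -1;

	if (pipe(p) < 0)
		return -1;
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		goto out_pipe;
	if (write(p[1], payload, n) != (ssize_t)n)
		goto out;
	/* The kernel hands the pipe's pages straight to the file;
	 * the data never transits this process's address space. */
	if (splice(p[0], NULL, fd, NULL, n, 0) != (ssize_t)n)
		goto out;
	if (pread(fd, back, n, 0) != (ssize_t)n)
		goto out;
	ret = memcmp(back, payload, n) ? -1 : 0;
out:
	close(fd);
out_pipe:
	close(p[0]);
	close(p[1]);
	return ret;
}
```

Because the buffer pages already hold the final on-disk representation,
no formatting step sits between the trace buffer and the output.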

> 
> The task of converting the internal binary format to something that
> can be exported is also machine and kernel specific and you need
> machine and kernel specific code to do it (although you can handle a
> lot of issues by providing some appropriate meta-data that goes along
> with the trace data and describes its format).

By exporting the marker format strings along with the trace in a special
"channel" (this is the metadata), and by exporting the endianness and
type sizes in the trace header, the userspace code which reads the trace
(LTTV) can be (and is) machine-independent. This is very useful in a lot
of use-cases, especially for embedded systems where the traced device
cannot be used for decoding (too little memory, slow link, slow CPU,
half-working kernel...).

Your statement above, however, applies if one wishes to export C
compound types (structures, unions, arrays) directly into the trace
buffers. Then a specialized decoder would be needed, but hopefully this
will be a specialized corner case. As long as we leave room for such a
decoder, I think nobody will suffer. This specialized decoder does not,
however, need to be kernel or machine-specific. It can expect the data,
e.g. a structure, to follow the C standard regarding alignment.
Therefore, the fields can be accessed in an architecture-independent
manner, given that we have the endianness and type size information
from the trace header.
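A sketch of why such decoding stays architecture-independent
(hypothetical helper, 32-bit fields only): with the endianness taken
from the trace header, the field is reassembled byte by byte, so the
decoding host's native byte order never matters:

```c
#include <stdint.h>

/* Reassemble a 32-bit field from raw trace bytes. big_endian comes
 * from the trace header, so the result is correct regardless of the
 * decoding host's own byte order. */
uint32_t trace_read_u32(const unsigned char *p, int big_endian)
{
	if (big_endian)
		return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
	return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}
```

The same byte buffer decodes to different values depending only on the
declared trace endianness, never on where the decoder runs.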

> 
> This conversion could be done in the kernel, by a user space program
> running on the same machine, or by a program running elsewhere.

You seem to assume that the same ABI the kernel runs in will be
available elsewhere, which might not be true in the embedded field. The
same applies to a 64-bit x86 kernel with a 32-bit userland. Therefore,
getting data out of the kernel should be done by a standardized ABI;
ideally following the kernel ABI for speed but, more importantly,
self-described. This is what LTTng does.

> While
> I agree that it is very convenient for small scale uses to be able to
> just do this in the kernel, once again it may not be appropriate when
> it is being used on a large scale with machines running real live
> workloads where CPU cycles are precious. In those cases it is
> important that a tracing solution does not preclude gathering binary
> trace data from machines in the most compact format possible, and
> doing the conversion to a canonical format elsewhere where the
> processing will not impact a live running system.

What I propose here is actually an optimization on top of the current
mechanism, which writes the data into the trace in a self-described yet
machine-specific, very compact and efficient binary format. This
optimization will consist of a few specialized callbacks which will
serialize the data of the most common event types in the trace buffers
following the same layout already followed by the fully dynamic (but
slow) serializer. Therefore, the event types which appear in the
high-throughput events will be supported by such high-speed trivial
serializers and we will have both flexibility (it will still be trivial
to add new event types) and speed (if one needs a specific event to be
handled very quickly, one just has to create a new specialized
serializer).
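For instance, here is a sketch of what such a specialized serializer
could look like (hypothetical event and names; alignment padding is
ignored for brevity): the layout is hard-coded, so no format-string
parsing happens at trace time:

```c
#include <stddef.h>
#include <string.h>

/* Specialized fast-path serializer for a hypothetical scheduler
 * event. The layout (one int followed by one long) is known up front,
 * so the fields are copied directly; it must produce exactly the same
 * byte layout as the generic parser would for "pid %d state %ld". */
size_t serialize_sched_event(char *buf, int pid, long state)
{
	size_t off = 0;

	memcpy(buf + off, &pid, sizeof(pid));
	off += sizeof(pid);
	memcpy(buf + off, &state, sizeof(state));
	off += sizeof(state);
	return off;
}
```

Since the generic and specialized paths emit identical layouts, the
trace reader needs no knowledge of which path recorded a given event.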

So I think flexibility and speed are in no way opposed goals; both can
be achieved.

Mathieu

> 
> md
> 
> 
> 
> On Tue, Oct 7, 2008 at 11:16 AM, Jiaying Zhang <jiayingz at google.com> wrote:
> 
> > Hi Michael,
> >
> > This email from Mathieu has some interesting performance results
> > on the overhead of format passing. It indeed added a lot of overhead.
> >
> > Jiaying
> >
> > Forwarded conversation
> > Subject: LTTng specialized probes
> > ------------------------
> >
> > From: *Mathieu Desnoyers* <compudj at krystal.dyndns.org>
> > Date: Mon, Oct 6, 2008 at 7:11 AM
> > To: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> > Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca, Martin
> > Bligh <mbligh at google.com>
> >
> >
> > Hi,
> >
> > I'm currently working towards getting LTTng in shape for what is
> > required for mainline. I got the "TLB-less" buffers and splice()
> > working last week. I then did some performance testing on the flight
> > recorder mode and noticed an optimization that's really worth doing:
> >
> > LTTng "ltt-serialize.c", which parses the format strings and formats
> > data into the trace buffers takes a lot of CPU time. I tried only
> > keeping the size calculation (first pass on the format string) and
> > disabling the real data write and basically got something like :
> >
> > (default LTTng instrumentation, very approximate numbers)
> >
> > tbench no tracing : ~1900MB/s
> >       Markers enabled : ~1800MB/s
> >       with size calculation : ~1400MB/s
> >       size calc + data write : ~950MB/s
> >
> > I then remembered I've done ltt-serialize in such a way that it can be
> > easily overridden by per-format string specialized callbacks.
> >
> > Therefore, it would be worthwhile to create such specialized serializers
> > so the common cases can be made much faster. I think it will have a very
> > significant impact on performance.
> >
> > It's simply a matter of creating a new .c kernel module in ltt/ and to
> > create structures similar to:
> >
> > ltt-serialize.c :
> >
> > struct ltt_available_probe default_probe = {
> >        .name = "default",
> >        .format = NULL,
> >        .probe_func = ltt_vtrace,
> >        .callbacks[0] = ltt_serialize_data,
> > };
> >
> > Give it a non-null format string (just giving the types expected by the
> > callback), a good name, and a callback function, which implements the
> > specialized serialization. Note that kernel/marker.c currently expects
> > the format string to match exactly the marker format string, including
> > the type names, which should be changed. The type verification should
> > only check that the %X parameters are the same (and that the same
> > number of arguments is expected).
> >
> > That should not be hard, but it's not what I plan to focus on next.
> > Is anyone willing to work on this?
> >
> > Mathieu
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> >
> > ----------
> > From: *Martin Bligh* <mbligh at google.com>
> > Date: Mon, Oct 6, 2008 at 8:14 AM
> > To: Mathieu Desnoyers <compudj at krystal.dyndns.org>
> > Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> > Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
> >
> >
> > Question ... It seems that strings are mandatory for markers at the moment.
> > I don't see why this is, and it seems significantly less efficient than
> > what
> > we discussed in Portland recently?
> >
> > ----------
> > From: *Mathieu Desnoyers* <compudj at krystal.dyndns.org>
> > Date: Mon, Oct 6, 2008 at 8:26 AM
> > To: Martin Bligh <mbligh at google.com>
> > Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> > Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
> >
> >
> > The idea is that someone who wants to add a new instrumentation site in
> > the Linux kernel does not have to write a specialized probe up front.
> > The format string parser will take care of writing the typed data into
> > the buffers (default behavior), but can still be overridden by a
> > specialized function which will expect the format string arguments and
> > serialize those into the buffers.
> >
> > About what we discussed in Portland and where Steven is currently going:
> > it does not provide any kind of binary standard to export the data
> > between different platforms, or even from a 64-bit kernel to a
> > 32-bit userland. Steven also clearly states that he doesn't care about
> > exporting this data to userspace in binary format. He wants a
> > supplementary layer to do this formatting, which I don't think will
> > produce the performance results we are looking for. Plus, I think
> > feeding the data through the kernel which recorded the information to
> > decode it is the wrong approach, especially when the system which
> > recorded such information is a small embedded device, where getting the
> > data _out_ is already non-trivial. Feeding it back in seems a bit crazy.
> >
> > Mathieu
> > --
> >
> > ----------
> > From: *Martin Bligh* <mbligh at google.com>
> > Date: Mon, Oct 6, 2008 at 8:37 AM
> > To: Mathieu Desnoyers <compudj at krystal.dyndns.org>
> > Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> > Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
> >
> >
> > On Mon, Oct 6, 2008 at 8:26 AM, Mathieu Desnoyers
> > OK, it seemed mandatory to me, but if it's not, that's good.
> > I know he wants the in-kernel parsing for ease-of-use, and getting things
> > upstream ... but it seemed to me that there was nothing in what he was
> > doing that made it impossible to get the data in binary form out to
> > userspace.
> > Exporting the buffers is obviously easy. I was under the impression you
> > were
> > recording strings in the buffers anyway, in which case I don't see why you
> > care, but I might be totally mistaken. Even so, it seems what we'd need
> > is just to make sure the buffer headers were exported, plus the decoding
> > functions - making C files that will link with both the kernel and into a
> > userspace library would be a little tricky, but not impossible?
> >
> > ----------
> > From: *Mathieu Desnoyers* <compudj at krystal.dyndns.org>
> > Date: Mon, Oct 6, 2008 at 8:56 AM
> > To: Martin Bligh <mbligh at google.com>
> > Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> > Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
> >
> >
> > Yes, exporting garbage to userspace is easy too ;) Making sense out of
> > it, especially without DWARF info, might be a bit more difficult.
> > The LTTng buffer format records those markers format strings
> > only once in a "metadata" channel so the mapping
> >
> > event id <-> marker name <-> format string
> >
> > can be extracted from the trace. We can therefore encode event size and
> > typing in this table and manage to leave that metadata out of the high
> > throughput tracing stream. By adding a layer that does not take
> > advantage of such indirection, Steven is actually reserving event IDs
> > for "internal use" when we could, in many cases, use those bits to put
> > the event IDs which map to the marker event table. By separating the
> > low-level event header management from the event ID registration
> > mechanism, we end up with a much less efficient solution.
> >
> > Also, by limiting the event reservation so events never cross a page
> > boundary, we are actually limiting the event size that can be exported
> > through such stream to 4kB. To me, 4kB non-contiguous pages should be
> > _one_ memory backend to use for the buffers (others being video memory
> > which survives hot reboots or linearly addressable buffer allocated at
> > boot time), which clearly does not have the same 4kB restrictions. I
> > therefore don't see why the higher-level buffer management primitives
> > (reserve/commit) should suffer from this specific lower-level buffer
> > limitation, especially given we can encapsulate writes so it's easy to
> > deal with page-crossing writes (cf. the vmap()-less buffers I posted last
> > week).
> > Linking 64-bit kernel objects into 32-bit userland executables seems
> > messy to me. And this is without considering cross-architecture concerns
> > (embedded developers with a small powerpc board but an x86 dev. machine
> > might want to look at the trace from a non-ABI compatible architecture).
> >
> > ----------
> > From: *Jiaying Zhang* <jiayingz at google.com>
> > Date: Mon, Oct 6, 2008 at 10:55 AM
> > To: Mathieu Desnoyers <compudj at krystal.dyndns.org>
> > Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> > ltt-dev at lists.casi.polymtl.ca, Martin Bligh <mbligh at google.com>
> >
> >
> > Thanks a lot for sharing these numbers! Looks like we should
> > use special probe functions for high-frequency tracing events.
> > Also, do you know why enabling markers adds so much overhead?
> >
> > Jiaying
> >

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68



