[ltt-dev] LTTng specialized probes

Tue Oct 7 17:28:32 EDT 2008

Hi, Mathieu!

Jiaying forwarded this to me and I wanted to try to understand a little
better exactly what direction you are headed in.

Like Martin, I am (at least) a little confused, and since I wasn't
involved in the discussions in Portland, please forgive me if I am going
over old ground here.

It seems to me that one of the key issues in getting good performance
out of any kernel tracing system is that you have to record the data
that you are trying to capture as quickly as possible. To me that means
that during the initial recording of the data, to the maximum extent
possible, you don't even look at it - you just slam it straight into
your recording buffer and you are done (this should work well for the
majority of cases where the data being captured are just scalar values -
the more exotic the data, the more processing it will need).
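
To make the "don't even look at it" idea concrete, here is a minimal
user-space sketch. Everything in it (struct trace_buf, trace_record_raw,
the 4096-byte buffer, the id/length header) is made up for illustration
and is not LTTng's buffer layout or API. The payload is copied in host
byte order and never inspected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size trace buffer; not LTTng's actual layout. */
struct trace_buf {
	unsigned char data[4096];
	size_t used;
};

/*
 * Copy an opaque payload into the buffer without inspecting it.
 * The record is a 2-byte event id, a 2-byte length, then the payload,
 * all in the byte order of the recording machine.
 * Returns 0 on success, -1 if the record does not fit.
 */
static int trace_record_raw(struct trace_buf *buf, uint16_t event_id,
			    const void *payload, uint16_t len)
{
	size_t need = sizeof(event_id) + sizeof(len) + len;

	if (need > sizeof(buf->data) - buf->used)
		return -1;

	memcpy(buf->data + buf->used, &event_id, sizeof(event_id));
	buf->used += sizeof(event_id);
	memcpy(buf->data + buf->used, &len, sizeof(len));
	buf->used += sizeof(len);
	memcpy(buf->data + buf->used, payload, len);
	buf->used += len;
	return 0;
}
```

A real in-kernel version would also need per-CPU buffers and a
reserve/commit scheme to be safe against interrupts; the sketch only
shows that the hot path can be a bounds check plus a few copies.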

The internal binary format that is recorded in the kernel buffers is
inherently machine and kernel specific.

You don't need to even think about putting the data into a canonical
format until you are ready to export it out of the kernel. (Note that in
one of our likely use cases we might run tracing continuously but only
extract snapshots of the data from the kernel occasionally, so most of
the data that we collected would never even need to be put into
canonical format.)

The task of converting the internal binary format to something that can
be exported is also machine and kernel specific, and you need machine
and kernel specific code to do it (although you can handle a lot of
issues by providing some appropriate meta-data that goes along with the
trace data and describes its format).
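
As a rough illustration of what such meta-data could buy you, the sketch
below decodes a raw record into canonical 64-bit values given only a
description of its fields. The structures and names (field_desc,
event_desc, decode_event) are hypothetical, not an existing LTTng
format, and only unsigned integer fields are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical meta-data: one entry per field of the raw record. */
struct field_desc {
	uint8_t size;		/* field width in bytes: 1, 2, 4 or 8 */
};

struct event_desc {
	int big_endian;		/* byte order of the recording machine */
	size_t nr_fields;
	const struct field_desc *fields;
};

/* Read one unsigned integer of the given width and byte order. */
static uint64_t read_uint(const unsigned char *p, size_t size, int big_endian)
{
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < size; i++) {
		/* Always consume the most significant byte first. */
		size_t idx = big_endian ? i : size - 1 - i;

		v = (v << 8) | p[idx];
	}
	return v;
}

/*
 * Decode a raw record into canonical 64-bit values using its meta-data.
 * Returns the number of bytes consumed, or 0 if the record is too short.
 */
static size_t decode_event(const struct event_desc *desc,
			   const unsigned char *raw, size_t raw_len,
			   uint64_t *out)
{
	size_t off = 0, i;

	for (i = 0; i < desc->nr_fields; i++) {
		size_t size = desc->fields[i].size;

		if (size > raw_len - off)
			return 0;
		out[i] = read_uint(raw + off, size, desc->big_endian);
		off += size;
	}
	return off;
}
```

Because the decoder depends only on the description and not on the
machine it runs on, the same code could run in the kernel, in user space
on the same box, or on a different architecture entirely.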

This conversion could be done in the kernel, by a user space program
running on the same machine, or by a program running elsewhere. While I
agree that it is very convenient for small scale uses to be able to just
do this in the kernel, once again it may not be appropriate when it is
being used on a large scale with machines running real live workloads
where CPU cycles are precious. In those cases it is important that a
tracing solution does not preclude gathering binary trace data from
machines in the most compact format possible, and doing the conversion
to a canonical format elsewhere where the processing will not impact a
live running system.

md

On Tue, Oct 7, 2008 at 11:16 AM, Jiaying Zhang <jiayingz at google.com> wrote:

> Hi Michael,
>
> This email from Mathieu has some interesting performance results
> on the overhead of format passing. It indeed added a lot of overhead.
>
> Jiaying
>
> Forwarded conversation
> Subject: LTTng specialized probes
> ------------------------
>
> From: *Mathieu Desnoyers* <compudj at krystal.dyndns.org>
> Date: Mon, Oct 6, 2008 at 7:11 AM
> To: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca, Martin
> Bligh <mbligh at google.com>
>
>
> Hi,
>
> I'm currently working towards getting LTTng in shape for what is
> required for mainline. I got the "TLB-less" buffers and splice()
> working last week. I then did some performance testing on the flight
> recorder mode and noticed an optimization that's really worth doing :
>
> LTTng "ltt-serialize.c", which parses the format strings and formats
> data into the trace buffers, takes a lot of CPU time. I tried only
> keeping the size calculation (first pass on the format string) and
> disabling the real data write and basically got something like :
>
> (default LTTng instrumentation, very approximate numbers)
>
> tbench throughput:
>       no tracing             : ~1900MB/s
>       markers enabled        : ~1800MB/s
>       with size calculation  : ~1400MB/s
>       size calc + data write : ~950MB/s
>
> I then remembered I've done ltt-serialize in such a way that it can be
> easily overridden by per-format string specialized callbacks.
>
> Therefore, it would be worthwhile to create such specialized serializers
> so the common cases can be made much faster. I think it will have a very
> significant impact on performance.
>
> It's simply a matter of creating a new .c kernel module in ltt/ and
> creating structures similar to:
>
> ltt-serialize.c :
>
> struct ltt_available_probe default_probe = {
>        .name = "default",
>        .format = NULL,
>        .probe_func = ltt_vtrace,
>        .callbacks[0] = ltt_serialize_data,
> };
>
> Give it a non-null format string (just giving the types expected by the
> callback), a good name, and a callback function, which implements the
> specialized serialization. Note that kernel/marker.c currently expects
> the format string to match exactly the marker format string, including
> the type names, which should be changed. The type verification should
> only check that the %X parameters are the same (and that the same
> number of arguments is expected).
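
For what it's worth, the relaxed check described here could look
something like the following. This is a plain user-space sketch, not the
code in kernel/marker.c; next_conv and formats_compatible are names
invented for the example, and the set of flag and length-modifier
characters is an assumption based on ordinary printf-style formats.

```c
#include <string.h>

/*
 * Find the next conversion specification in fmt and report its length.
 * Returns a pointer to its '%', or NULL if none remain. "%%" is skipped.
 */
static const char *next_conv(const char *fmt, size_t *len)
{
	const char *p;

	while ((p = strchr(fmt, '%')) != NULL) {
		size_t n;

		if (p[1] == '%') {	/* literal percent sign */
			fmt = p + 2;
			continue;
		}
		/* Flags, width, precision and length modifiers ... */
		n = 1 + strspn(p + 1, "#0- +'.0123456789hlLqjzt");
		/* ... followed by a single conversion character. */
		if (p[n] != '\0')
			n++;
		*len = n;
		return p;
	}
	return NULL;
}

/*
 * Return 1 if both formats carry the same conversions in the same
 * order, ignoring the field names around them; 0 otherwise.
 */
static int formats_compatible(const char *a, const char *b)
{
	size_t la, lb;

	for (;;) {
		a = next_conv(a, &la);
		b = next_conv(b, &lb);
		if (!a || !b)
			return !a && !b;	/* same number of arguments? */
		if (la != lb || strncmp(a, b, la) != 0)
			return 0;
		a += la;
		b += lb;
	}
}
```

With this, a probe registered with the format "%d %s" would be accepted
for a marker declared as "pid %d comm %s", while "pid %ld" would be
rejected against "pid %d".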
>
> That should not be hard, but it's not what I plan to focus on next.
> Is anyone willing to work on this?
>
> Mathieu
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>
> ----------
> From: *Martin Bligh* <mbligh at google.com>
> Date: Mon, Oct 6, 2008 at 8:14 AM
> To: Mathieu Desnoyers <compudj at krystal.dyndns.org>
> Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
>
>
> Question ... It seems that strings are mandatory for markers at the
> moment. I don't see why this is, and it seems significantly less
> efficient than what we discussed in Portland recently?
>
> ----------
> From: *Mathieu Desnoyers* <compudj at krystal.dyndns.org>
> Date: Mon, Oct 6, 2008 at 8:26 AM
> To: Martin Bligh <mbligh at google.com>
> Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
>
>
> The idea is that someone who wants to add a new instrumentation site in
> the Linux kernel does not have to write a specialized probe up front.
> The format string parser will take care of writing the typed data into
> the buffers (default behavior), but can still be overridden by a
> specialized function which will expect the format string arguments and
> serialize those into the buffers.
>
> About what we discussed in Portland and where Steven is currently going:
> it does not provide any kind of binary standard to export the data
> between different platforms or even from a 64-bit kernel to
> 32-bit userland. Steven also clearly states that he doesn't care about
> exporting this data to userspace in binary format. He wants a
> supplementary layer to do this formatting, which I don't think will
> produce the performance results we are looking for. Plus, I think
> feeding the data through the kernel which recorded the information to
> decode it is the wrong approach, especially when the system which
> recorded such information is a small embedded device, where getting the
> data _out_ is already non-trivial. Feeding it back in seems a bit crazy.
>
> Mathieu
> --
>
> ----------
> From: *Martin Bligh* <mbligh at google.com>
> Date: Mon, Oct 6, 2008 at 8:37 AM
> To: Mathieu Desnoyers <compudj at krystal.dyndns.org>
> Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
>
>
> On Mon, Oct 6, 2008 at 8:26 AM, Mathieu Desnoyers
> OK, it seemed mandatory to me, but if it's not, that's good.
> I know he wants the in-kernel parsing for ease-of-use, and getting things
> upstream ... but it seemed to me that there was nothing in what he was
> doing that made it impossible to get the data in binary form out to
> userspace.
> Exporting the buffers is obviously easy. I was under the impression you
> were recording strings in the buffers anyway, in which case I don't see
> why you care, but I might be totally mistaken. Even so, it seems what
> we'd need is just to make sure the buffer headers were exported, plus
> the decoding functions - making C files that will link with both the
> kernel and into a userspace library would be a little tricky, but not
> impossible?
>
> ----------
> From: *Mathieu Desnoyers* <compudj at krystal.dyndns.org>
> Date: Mon, Oct 6, 2008 at 8:56 AM
> To: Martin Bligh <mbligh at google.com>
> Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> Jiaying Zhang <jiayingz at google.com>, ltt-dev at lists.casi.polymtl.ca
>
>
> Yes, exporting garbage to userspace is easy too ;) Making sense out of
> it, especially without DWARF info, might be a bit more difficult.
> The LTTng buffer format records those marker format strings
> only once in a "metadata" channel so the mapping
>
> event id <-> marker name <-> format string
>
> can be extracted from the trace. We can therefore encode event size and
> typing in this table and manage to leave that metadata out of the high
> throughput tracing stream. By adding a layer that does not take
> advantage of such indirection, Steven is actually reserving event IDs
> for "internal use" when we could, in many cases, use those bits to put
> the event IDs which map to the marker event table. By separating the
> low-level event header management from the event ID registration
> mechanism, we are aiming at a much less efficient solution.
>
> Also, by limiting the event reservation so events never cross a page
> boundary, we are actually limiting the event size that can be exported
> through such stream to 4kB. To me, 4kB non-contiguous pages should be
> _one_ memory backend to use for the buffers (others being video memory
> which survives hot reboots or linearly addressable buffer allocated at
> boot time), which clearly does not have the same 4kB restrictions. I
> therefore don't see why the higher-level buffer management primitives
> (reserve/commit) should suffer from this specific lower-level buffer
> limitation, especially given we can encapsulate writes so it's easy to
> deal with page-crossing writes (c.f. vmap()-less buffers I posted last
> week).
> Linking 64-bit kernel objects into 32-bit userland executables seems
> messy to me. And this is without considering cross-architecture concerns
> (embedded developers with a small powerpc board but an x86 dev. machine
> might want to look at the trace from a non-ABI compatible architecture).
>
> ----------
> From: *Jiaying Zhang* <jiayingz at google.com>
> Date: Mon, Oct 6, 2008 at 10:55 AM
> To: Mathieu Desnoyers <compudj at krystal.dyndns.org>
> Cc: Michael Rubin <mrubin at google.com>, Jan Blunck <jblunck at suse.de>,
> ltt-dev at lists.casi.polymtl.ca, Martin Bligh <mbligh at google.com>
>
>
> Thanks a lot for sharing these numbers! Looks like we should
> use special probe functions for high-frequency tracing events.
> Also, do you know why enabling markers adds so much overhead?
>
> Jiaying