[lttng-dev] Tracepoints overhead in x86_64

Gianluca Borello g.borello at gmail.com
Fri Aug 30 17:47:52 EDT 2013


Hello,

This question is focused more on kernel tracepoints than on LTTng itself, so
feel free to point me at LKML if I'm too off topic.

I am looking for a way to trace all the system call activity 24/7 (and do
some very customized processing, so LTTng doesn't fit very well in the
picture), and I have the specific requirement that the overhead must be
extremely low, even in the worst case.

Being an LTTng user myself, I figured tracepoints would be the first natural
choice, so what I did was write a small kernel module that does nothing
but register an empty probe for "sys_enter" and "sys_exit", and I am a
bit concerned about the results that I obtained on my Intel Core i3 running
Linux 3.8.0.
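For context, a rough sketch of such a module follows (this is not my exact
code; the function names are placeholders, and it assumes the stock
sys_enter/sys_exit tracepoints from <trace/events/syscalls.h>):

/*
 * Sketch only: register empty probes on the kernel's sys_enter/sys_exit
 * tracepoints. Registering them is enough to set TIF_SYSCALL_TRACEPOINT
 * on every task and therefore force the traced syscall path.
 */
#include <linux/module.h>
#include <linux/tracepoint.h>
#include <trace/events/syscalls.h>

static void empty_sys_enter(void *data, struct pt_regs *regs, long id)
{
	/* intentionally empty: we only want the bare tracepoint cost */
}

static void empty_sys_exit(void *data, struct pt_regs *regs, long ret)
{
	/* intentionally empty */
}

static int __init empty_probes_init(void)
{
	int err;

	err = register_trace_sys_enter(empty_sys_enter, NULL);
	if (err)
		return err;

	err = register_trace_sys_exit(empty_sys_exit, NULL);
	if (err)
		unregister_trace_sys_enter(empty_sys_enter, NULL);

	return err;
}

static void __exit empty_probes_exit(void)
{
	unregister_trace_sys_exit(empty_sys_exit, NULL);
	unregister_trace_sys_enter(empty_sys_enter, NULL);
	tracepoint_synchronize_unregister();
}

module_init(empty_probes_init);
module_exit(empty_probes_exit);
MODULE_LICENSE("GPL");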

Basically this is my worst case:

while (1)
{
	close(5000);
}
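
A self-contained version of this micro-benchmark could look roughly like the
following (a sketch; the timing and counting details are assumptions, not
necessarily the exact program I used):

/* Sketch of a self-contained harness for the loop above. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	unsigned long long count = 0;
	time_t start = time(NULL);

	/* close() on an unused fd fails immediately with EBADF, so the
	 * loop measures little more than syscall entry/exit cost. */
	while (time(NULL) - start < 10) {
		int i;

		for (i = 0; i < 1000; i++)
			close(5000);
		count += 1000;
	}

	printf("%.1fM close()/s\n", count / 10.0 / 1e6);
	return 0;
}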

I let this run for 10 seconds, and these are the numbers that I get:

- without tracepoints: 13.1M close/s
- with tracepoints: 4.1M close/s

The overhead is far from negligible, and digging into the problem it seems
that when the tracepoints are enabled, the syscall path no longer goes
through "system_call_fastpath" (in arch/x86/kernel/entry_64.S), returning
with IRET instead of SYSRET (a relevant commit seems to be this:
https://github.com/torvalds/linux/commit/7bf36bbc5e0c09271f9efe22162f8cc3f8ebd3d2
).

This is the first time I have looked into these things, so understanding the
logic behind them is pretty hard for me, but I managed to write a quick and
dirty hack that simply forces calls to "trace_sys_enter" and "trace_sys_exit"
in the fast path (you can find the patch attached; I didn't have a lot of
time to spend on this, so it's pretty inefficient because it executes a bunch
of instructions even when the tracepoints are not enabled, and it has obvious
bugs if the ptrace code gets enabled, but it proves my point). These are the
results:

- without tracepoints (patched kernel): 11.5M close()/s
- with tracepoints (patched kernel): 9.6M close()/s

Of course my benchmark is an extreme situation, but measuring in a more
realistic scenario (using Apache ab to stress an nginx server) I can still
notice a difference:

- without tracepoints: 16K HTTP requests/s
- with tracepoints: 15.1K HTTP requests/s
- without tracepoints (patched kernel): 16K HTTP requests/s
- with tracepoints (patched kernel): 15.8K HTTP requests/s

That's a real 6% vs. 1% worst-case overhead with a syscall-intensive server
application, and it doesn't even count the cost of executing the bodies of
the probes themselves.

Has anyone ever faced this before? Am I just inexperienced with the topic
and stating the obvious? Are there any suggestions or documentation I
should look at?

Thank you for your help and for the amazing work on LTTng.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tracepoints_fast_syscall_path.patch
Type: application/octet-stream
Size: 3093 bytes
Desc: not available
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20130830/6afcc27d/attachment-0001.obj>

