<div dir="ltr"><div>Hello,<br></div><div><br></div><div>This question is more focused on tracepoints rather than LTTng, so feel free to point me at LKML if I'm too off topic.</div><div><br></div><div>I am looking for a way to trace all the system call activity 24/7 (and do some very customized processing, so LTTng doesn't fit very well in the picture), and I have the specific requirement that the overhead must be extremely slow, even in the worst case.</div>
Being an LTTng user myself, I figured tracepoints would be the natural first choice, so I wrote a small kernel module that does nothing but register an empty probe for "sys_enter" and "sys_exit", and I am a bit concerned about the results I obtained on my Intel Core i3 running Linux 3.8.0.
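The module is essentially the following (a minimal sketch rather than the exact code I ran; it assumes the 3.8-era tracepoint API, where tracepoint_probe_register() still takes the tracepoint name as a string, and the file/function names are made up):

/* tp_empty_probes.c - register empty probes on sys_enter/sys_exit.
 * Sketch only; assumes a ~3.8 kernel where tracepoint_probe_register()
 * takes the tracepoint name as a string (the API changed later). */
#include <linux/module.h>
#include <linux/ptrace.h>
#include <linux/tracepoint.h>

/* Same prototypes as the sys_enter/sys_exit tracepoints declared in
 * include/trace/events/syscalls.h, plus the leading data pointer that
 * the tracepoint infrastructure passes to every probe. */
static void probe_sys_enter(void *data, struct pt_regs *regs, long id)
{
}

static void probe_sys_exit(void *data, struct pt_regs *regs, long ret)
{
}

static int __init tp_empty_init(void)
{
        int ret;

        ret = tracepoint_probe_register("sys_enter",
                                        (void *)probe_sys_enter, NULL);
        if (ret)
                return ret;

        ret = tracepoint_probe_register("sys_exit",
                                        (void *)probe_sys_exit, NULL);
        if (ret)
                tracepoint_probe_unregister("sys_enter",
                                            (void *)probe_sys_enter, NULL);
        return ret;
}

static void __exit tp_empty_exit(void)
{
        tracepoint_probe_unregister("sys_exit", (void *)probe_sys_exit, NULL);
        tracepoint_probe_unregister("sys_enter", (void *)probe_sys_enter, NULL);
        /* make sure no probe is still running before the module goes away */
        tracepoint_synchronize_unregister();
}

module_init(tp_empty_init);
module_exit(tp_empty_exit);
MODULE_LICENSE("GPL");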
Basically this is my worst case (a fuller sketch of the timing harness is below):

while(1)
{
        close(5000);
}

I let this run for 10 seconds, and these are the numbers that I get:
- without tracepoints: 13.1M close()/s
- with tracepoints: 4.1M close()/s

The overhead is far from negligible, and digging into the problem it seems that when the tracepoints are enabled the system doesn't go through "system_call_fastpath" (in arch/x86/kernel/entry_64.S), using IRET instead of SYSRET. A relevant commit seems to be this one: https://github.com/torvalds/linux/commit/7bf36bbc5e0c09271f9efe22162f8cc3f8ebd3d2
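For reference, this is roughly the harness behind the numbers above (a sketch, not the exact code; the file name and the batching constant are arbitrary):

/* closebench.c - count how many close() calls go through in ~10 s.
 * close(5000) fails with EBADF but still does a full syscall
 * entry/exit, which is what we want to stress.
 * Build: gcc -O2 closebench.c -o closebench (add -lrt on older glibc). */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct timespec start, now;
        unsigned long long calls = 0;
        double elapsed;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                int i;

                /* batch the iterations so the timestamp reads don't
                 * pollute the measurement */
                for (i = 0; i < (1 << 20); i++)
                        close(5000);
                calls += 1 << 20;

                clock_gettime(CLOCK_MONOTONIC, &now);
                elapsed = (now.tv_sec - start.tv_sec) +
                          (now.tv_nsec - start.tv_nsec) / 1e9;
        } while (elapsed < 10.0);

        printf("%.1fM close()/s\n", calls / elapsed / 1e6);
        return 0;
}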
This is the first time I have looked into this code, so understanding the logic behind it is pretty hard for me, but I managed to write a quick and dirty hack that just forces a call to "trace_sys_enter" and "trace_sys_exit" in the fast path (the patch is attached; I didn't have a lot of time to spend on it, so it is pretty inefficient, since it executes a bunch of instructions even when the tracepoints are not enabled, and it has obvious bugs if the ptrace code gets enabled, but it proves my point). These are the results:
- without tracepoints (patched kernel): 11.5M close()/s
- with tracepoints (patched kernel): 9.6M close()/s

Of course my benchmark is an extreme situation, but measuring in a more realistic scenario (using Apache ab to stress an nginx server) I can still notice a difference:
- without tracepoints: 16K HTTP requests/s
- with tracepoints: 15.1K HTTP requests/s
- without tracepoints (patched kernel): 16K HTTP requests/s
- with tracepoints (patched kernel): 15.8K HTTP requests/s
That is a real worst-case overhead of roughly 6% with the stock kernel ((16K - 15.1K) / 16K ≈ 5.6%) versus roughly 1% with the patched one ((16K - 15.8K) / 16K ≈ 1.3%) for an intense server application, and that doesn't count the cost of executing the body of the probes themselves.

Has anyone faced this before? Am I just inexperienced with the topic and stating the obvious? Are there any suggestions or documentation I should look at?
Thank you for your help and for the amazing work on LTTng.