[lttng-dev] Userspace Tracing and Backtraces
François Doray
francois.doray at gmail.com
Tue Mar 17 14:50:27 EDT 2015
Hi,
I worked on two optimizations to libunwind:
- I replaced the global cache by a thread-local cache. This removes the
need for some locking.
- Instead of restoring all registers for every stack frame, I restore only
EBP, EIP and ESP.
The code is here:
https://github.com/fdoray/libunwind/tree/minimal_regs
Warning: This version of the library is no longer signal-safe, and it has
undefined behavior if more registers than EBP/EIP/ESP are required to
unwind a stack frame (this never happened in the few tests I have run so
far). Both limitations could easily be overcome in the future.
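A minimal sketch of the thread-local cache idea (hypothetical types and
names; the actual change is in the branch above):

#include <pthread.h>

/* Hypothetical per-thread unwind-info cache. */
struct unwind_cache {
  int valid;
  /* ... cached frame descriptions would go here ... */
};

/* Before: one global cache shared by all threads, guarded by a lock. */
static struct unwind_cache global_cache;
static pthread_mutex_t global_cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* After: one cache per thread; lookups need no locking at all. */
static __thread struct unwind_cache tls_cache;

static struct unwind_cache *get_cache(void)
{
  /* No pthread_mutex_lock()/unlock() pair on the fast path. */
  return &tls_cache;
}

The trade-off is memory: each thread carries its own copy of the cached
unwind information instead of sharing one table.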
Performance results, on x86_64:

unw_backtrace(), original libunwind:
Mean time per backtrace: 6130 ns / 80% of samples between 1479 and 13837 ns.

unw_backtrace(), modified libunwind:
Mean time per backtrace: 4255 ns / 80% of samples between 1526 and 5252 ns.

unw_step()/unw_get_reg() [1], original libunwind:
Mean time per backtrace: 43520 ns / 80% of samples between 13705 and 58782 ns.

unw_step()/unw_get_reg(), modified libunwind:
Mean time per backtrace: 5804 ns / 80% of samples between 2844 and 11325 ns.
Francois
[1] As in the example presented here:
http://www.nongnu.org/libunwind/man/libunwind(3).html
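For reference, the unw_step()/unw_get_reg() measurements follow the loop
from that man page:

#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

void show_backtrace(void)
{
  unw_cursor_t cursor;
  unw_context_t uc;
  unw_word_t ip, sp;

  unw_getcontext(&uc);          /* capture the current register state */
  unw_init_local(&cursor, &uc); /* local (same-process) unwinding */
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    printf("ip = %lx, sp = %lx\n", (long) ip, (long) sp);
  }
}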
On Mon, Mar 16, 2015 at 8:35 PM, Brian Robbins <brianrob at microsoft.com>
wrote:
> Hi All,
>
> This is great. Thank you very much for the information.
>
> -Brian
>
> From: Francis Giraldeau [mailto:francis.giraldeau at gmail.com]
> Sent: Thursday, March 12, 2015 11:34 AM
> To: Mathieu Desnoyers
> Cc: Brian Robbins; lttng-dev at lists.lttng.org
> Subject: Re: [lttng-dev] Userspace Tracing and Backtraces
>
> 2015-03-10 21:47 GMT-04:00 Mathieu Desnoyers <mathieu.desnoyers at efficios.com>:
>
> Francis: Did you define UNW_LOCAL_ONLY before including the libunwind
> header in your benchmarks? (see
> http://www.nongnu.org/libunwind/man/libunwind%283%29.html)
> That seems to change performance dramatically, according to the
> documentation.
>
> Yes, this is the case. Time to unwind is higher at the beginning
> (probably because the internal cache is being built), and it also varies
> with call-stack depth.
>
> Agreed on having the backtrace as a context. The main question left is
> to figure out if we want to call libunwind from within the traced
> application execution context.
>
> Unfortunately, libunwind is not reentrant wrt signals. This is already
> a good argument for not calling it from within a tracepoint. I wonder
> if the authors of libunwind would be open to making it signal-reentrant
> in the future (not by disabling signals, but rather by keeping a TLS
> nesting counter, and returning an error if nested, for performance
> considerations).
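>
> A minimal sketch of that guard (hypothetical names, not an existing
> libunwind API):
>
> static __thread int in_unwind; /* TLS nesting counter */
>
> int backtrace_nonreentrant(void)
> {
>   if (in_unwind)
>     return -1;   /* nested call from a signal handler: report an error */
>   in_unwind = 1;
>   /* ... walk the stack here ... */
>   in_unwind = 0;
>   return 0;
> }
>
> Because the counter is thread-local, a signal handler that interrupts an
> unwind on the same thread sees the flag set and bails out immediately.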
>
> The functions unw_init_local() and unw_step() are signal-safe [1]. The
> critical sections are protected with lock_acquire(), which blocks all
> signals before taking the mutex; this prevents recursion.
>
> #define lock_acquire(l,m)                               \
> do {                                                    \
>   SIGPROCMASK (SIG_SETMASK, &unwi_full_mask, &(m));     \
>   mutex_lock (l);                                       \
> } while (0)
>
> #define lock_release(l,m)                               \
> do {                                                    \
>   mutex_unlock (l);                                     \
>   SIGPROCMASK (SIG_SETMASK, &(m), NULL);                \
> } while (0)
>
> To understand the implications, I wrote a small program to study nested
> signals [2], where a signal is sent from within a signal handler, or
> where a segmentation fault occurs in a signal handler. Blocking a signal
> defers its delivery until it is unblocked, while ignored signals are
> discarded. Blocked signals that cannot be ignored get their default
> behaviour. This prevents a possible deadlock, say if a custom SIGSEGV
> handler nested inside lock_acquire() tried to take the same lock.
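>
> A tiny illustration of the blocked-versus-ignored difference (in the
> same spirit as [2]):
>
> #include <signal.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static void handler(int sig)
> {
>   (void) sig;
>   write(STDOUT_FILENO, "delivered\n", 10); /* async-signal-safe */
> }
>
> int main(void)
> {
>   sigset_t set, old;
>
>   signal(SIGUSR1, handler);
>   sigemptyset(&set);
>   sigaddset(&set, SIGUSR1);
>
>   sigprocmask(SIG_BLOCK, &set, &old);   /* blocked: stays pending */
>   raise(SIGUSR1);
>   puts("blocked, nothing delivered yet");
>   sigprocmask(SIG_SETMASK, &old, NULL); /* unblocked: delivered now */
>
>   signal(SIGUSR1, SIG_IGN);  /* ignored: simply discarded */
>   raise(SIGUSR1);
>   signal(SIGUSR1, handler);
>   raise(SIGUSR1);            /* delivered again */
>   return 0;
> }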
>
> Now, suppose that instead of blocking signals we used a per-thread
> mutex and returned whenever try_lock() failed. It would be faster, but
> from the user's point of view backtraces would be dropped at random. I
> would prefer it a bit slower, but reliable.
>
> In addition, could it be that TLS is not signal-safe [3]?
>
> or using the perf capture mechanism that you describe below?
>
> Perf peeks at userspace from kernel space; that is another story. I
> guess libunwind was never ported to the kernel because it is a large
> chunk of complicated code that performs a lot of I/O and computation,
> whereas copying a portion of the stack is all about KISS and low
> runtime overhead.
>
> If using libunwind does not work out, another alternative I would
> consider would be to copy the stack like perf is doing from the kernel.
> However, in the spirit of compacting trace data, I would be tempted to
> do the following if we go down that route: check each pointer-aligned
> address for its content. If it looks like a pointer to an executable
> memory area (library, executable, or JIT'd code), we keep it. Else, we
> zero this information (not needed). We can then do a RLE-alike
> compression on the zeroes, so we can keep the layout of the stack after
> uncompression.
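>
> A minimal sketch of that scan (is_executable_address() is a hypothetical
> helper that would consult the process memory map, e.g. /proc/self/maps):
>
> #include <stdint.h>
> #include <stddef.h>
>
> /* Hypothetical: returns non-zero if addr falls inside an executable
>  * mapping (library, executable, or JIT'd code). */
> extern int is_executable_address(uintptr_t addr);
>
> /* Zero every word of the copied stack that cannot be a return address;
>  * the runs of zeroes then compress very well with an RLE-alike scheme,
>  * and the stack layout is preserved after decompression. */
> void scrub_stack_copy(uintptr_t *words, size_t n)
> {
>   for (size_t i = 0; i < n; i++)
>     if (!is_executable_address(words[i]))
>       words[i] = 0;
> }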
>
> Interesting! For comparison, here is a perf event [4] which shows there
> is a lot of room for reducing the event size. We should check whether
> discarding the other saved register values on the stack impacts
> restoring the instruction pointer register. Doing the unwind offline
> also solves signal safety, and it should be fast and scalable.
>
> Francis
>
> [1] http://www.nongnu.org/libunwind/man/unw_init_local(3).html
> [2] https://gist.github.com/giraldeau/98f08161e83a7ab800ea
> [3] https://sourceware.org/glibc/wiki/TLSandSignals
> [4] http://pastebin.com/sByfXXAQ