[lttng-dev] [rp] [RFC] Userspace RCU library internal error handling

Thu Jun 21 15:03:06 EDT 2012

* Josh Triplett (josh at joshtriplett.org) wrote:
> On Thu, Jun 21, 2012 at 12:41:13PM -0400, Mathieu Desnoyers wrote:
> > Currently, liburcu calls "exit(-1)" upon internal consistency error.
> > This is not pretty, and usually frowned upon in libraries.
> 
> Agreed.
> 
> > One example of failure path where we use this is if pthread_mutex_lock()
> > would happen to fail within synchronize_rcu(). Clearly, this should
> > _never_ happen: it would typically be triggered only by memory
> > corruption (or other terrible things like that). That being said, we
> > clearly don't want to make "synchronize_rcu()" return errors like that
> > to the application, because it would complexify the application error
> > handling needlessly.
> 
> I think you can safely ignore any error conditions you know you can't
> trigger.  pthread_mutex_lock can only return an error under two
> conditions: an uninitialized mutex, or an error-checking mutex already
> locked by the current thread.  Neither of those can happen in this case.
> Given that, I'd suggest either calling pthread_mutex_lock and ignoring
> any possibility of error, or adding an assert.
> 
> > So instead of calling exit(-1), one possibility would be to do something
> > like this:
> > 
> > #include <signal.h>
> > #include <pthread.h>
> > #include <stdio.h>
> > 
> > #define urcu_die(fmt, ...)                      \
> >         do {    \
> >                 fprintf(stderr, fmt, ##__VA_ARGS__);    \
> >                 (void) pthread_kill(pthread_self(), SIGBUS);    \
> >         } while (0)
> > 
> > and call urcu_die(); in those "unrecoverable error" cases, instead of
> > calling exit(-1). Therefore, if an application chooses to trap those
> > signals, it can, which is otherwise not possible with a direct call to
> > exit().
> 
> It looks like you want to use signals as a kind of exception mechanism,
> to allow the application to clean up (though not to recover).  assert
> seems much clearer to me for "this can't happen" cases, and assert also
> generates a signal that the application can catch and clean up.

Within discussions with other LTTng developers, we considered the the
assert, but the thought that this case might be silently ignored on
production systems (which compile with assertions disabled) makes me
uncomfortable. This is why I would prefer a SIGBUS to an assertion.

Using assert() would be similar to turning the Linux kernel BUG_ON()
mechanism into no-ops on production systems because "it should never
happen" (tm) ;-)

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com