[lttng-dev] [rp] [RFC] Userspace RCU library internal error handling

Thu Jun 21 15:48:38 EDT 2012

* Josh Triplett (josh at joshtriplett.org) wrote:
> On Thu, Jun 21, 2012 at 03:03:06PM -0400, Mathieu Desnoyers wrote:
> > * Josh Triplett (josh at joshtriplett.org) wrote:
> > > On Thu, Jun 21, 2012 at 12:41:13PM -0400, Mathieu Desnoyers wrote:
> > > > Currently, liburcu calls "exit(-1)" upon internal consistency error.
> > > > This is not pretty, and usually frowned upon in libraries.
> > > 
> > > Agreed.
> > > 
> > > > One example of failure path where we use this is if pthread_mutex_lock()
> > > > would happen to fail within synchronize_rcu(). Clearly, this should
> > > > _never_ happen: it would typically be triggered only by memory
> > > > corruption (or other terrible things like that). That being said, we
> > > > clearly don't want to make "synchronize_rcu()" return errors like that
> > > > to the application, because it would complexify the application error
> > > > handling needlessly.
> > > 
> > > I think you can safely ignore any error conditions you know you can't
> > > trigger.  pthread_mutex_lock can only return an error under two
> > > conditions: an uninitialized mutex, or an error-checking mutex already
> > > locked by the current thread.  Neither of those can happen in this case.
> > > Given that, I'd suggest either calling pthread_mutex_lock and ignoring
> > > any possibility of error, or adding an assert.
> > > 
> > > > So instead of calling exit(-1), one possibility would be to do something
> > > > like this:
> > > > 
> > > > #include <signal.h>
> > > > #include <pthread.h>
> > > > #include <stdio.h>
> > > > 
> > > > #define urcu_die(fmt, ...)                      \
> > > >         do {    \
> > > >                 fprintf(stderr, fmt, ##__VA_ARGS__);    \
> > > >                 (void) pthread_kill(pthread_self(), SIGBUS);    \
> > > >         } while (0)
> > > > 
> > > > and call urcu_die(); in those "unrecoverable error" cases, instead of
> > > > calling exit(-1). Therefore, if an application chooses to trap those
> > > > signals, it can, which is otherwise not possible with a direct call to
> > > > exit().
> > > 
> > > It looks like you want to use signals as a kind of exception mechanism,
> > > to allow the application to clean up (though not to recover).  assert
> > > seems much clearer to me for "this can't happen" cases, and assert also
> > > generates a signal that the application can catch and clean up.
> > 
> > Within discussions with other LTTng developers, we considered the the
> > assert, but the thought that this case might be silently ignored on
> > production systems (which compile with assertions disabled) makes me
> > uncomfortable. This is why I would prefer a SIGBUS to an assertion.
> > 
> > Using assert() would be similar to turning the Linux kernel BUG_ON()
> > mechanism into no-ops on production systems because "it should never
> > happen" (tm) ;-)
> 
> Just don't define NDEBUG then. :)

Well, AFAIK, it is usual for some distribution packages to define
NDEBUG (maybe distro maintainers reading lttng-dev could confirm or
infirm this assumption ?). So in a context where upstream does not have
total control on the specific tweaks done by distro packages, I prefer
not to rely on NDEBUG not being defined to catch internal consistency
errors in the wild.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com