[ltt-dev] [rp] [RFC] URCU concurrent data structure API

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Wed Aug 17 22:23:33 EDT 2011


* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Wed, Aug 17, 2011 at 12:40:39PM -0400, Mathieu Desnoyers wrote:
> > Hi,
> > 
> > I'm currently trying to find a good way to present the cds_ data
> > structure APIs within URCU for data structures depending on RCU for
> > their synchronization. The main problem is that we have many flavors of
> > rcu_read_lock/unlock and call_rcu to deal with.
> > 
> > Various approaches are possible:
> > 
> > 1) The current approach: require that the callers pass call_rcu as
> >    parameter to data structure init functions, and require that the
> >    callers hold rcu_read_lock across API invocation.
> > 
> >    downsides: holds rcu read lock across busy-waiting loops (for longer
> >    than actually needed). Passing call_rcu as parameter and putting
> >    requirements on the lock held when calling the API complexify the API,
> >    and makes it impossible to inline call_rcu invocations.
> 
> The function-call overhead for call_rcu() should not be a big deal.

Actually, for my RCU red-black tree implementation, it might be a big
deal. As a benchmark, with only 1 RCU rbtree writer (the configuration
yielding the highest writer throughput, since writers have mutually
exclusive access to the rb tree during updates):

With call_rcu as a function:
 933217 insertion-removal pairs of random keys/s

I first implemented a simple per-thread cache of the struct
call_rcu_data pointer, because I figured that the call_rcu_data
structure lookup involves a lot of tests (see the sketch below).

With call_rcu as a function, with call_rcu_data pointer cache:
1203136 insertion-removal pairs of random keys/s (28.9% acceleration)

With call_rcu inlined, in addition to the call_rcu_data pointer cache:
1228737 insertion-removal pairs of random keys/s (31.7% acceleration overall)
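
For reference, here is roughly what this cache looks like. This is a
minimal sketch, assuming get_call_rcu_data() is the branch-heavy
lookup being cached; the variable and wrapper names below are
illustrative, not the actual symbols in the tree:

  /*
   * Minimal sketch of a per-thread call_rcu_data pointer cache.
   * cached_crdp and cached_get_call_rcu_data() are illustrative
   * names. A complete implementation would also invalidate the
   * cache if the thread's call_rcu_data assignment changes.
   */
  static __thread struct call_rcu_data *cached_crdp;

  static inline struct call_rcu_data *cached_get_call_rcu_data(void)
  {
          /* Fast path: pointer already cached for this thread. */
          if (cached_crdp)
                  return cached_crdp;
          /* Slow path: full lookup, performed once per thread. */
          cached_crdp = get_call_rcu_data();
          return cached_crdp;
  }

call_rcu() then uses this helper instead of redoing the full lookup on
every invocation; inlining call_rcu on top of that removes the
remaining function-call overhead on the enqueue fast path.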

The main reason why call_rcu is so important for this RCU RB tree is
that the tree's add/remove operations create copies and invoke
call_rcu for basically every node touched by each rotate/transplant
operation, which happens very often. I was targeting a rather
efficient xchg-based implementation of call_rcu, hence the high
frequency of call_rcu invocations.

> I am not all that concerned about an RCU read-side critical section
> covering the busy waiting -- my guess is that the busy waiting itself
> would become a problem long before the overly long RCU read-side
> critical section becomes a problem.

Fair point, although for multi-core RT embedded systems with a fairly
small amount of memory available, this might be a mostly theoretical
concern, but a concern nonetheless.

> 
> > 2) Require all callers to pass call_rcu *and* rcu_read_lock/unlock as
> >    parameter to data structure init function.
> > 
> >    downsides: both call_rcu and read lock/unlock become function calls
> >    (slower). Complexify the API.
> > 
> > 3) Don't require caller to pass anything rcu-related to data structure
> >    init. Would require to compile one instance of each data structure
> >    per RCU flavor shared object (like we're doing with call_rcu now).
> > 
> >    Downside: we would need to ship per-rcu-flavor version of each data
> >    structure.
> > 
> >    Upside: simple API, no rcu read-side lock around busy-waiting loops,
> >    ability to inline both call_rcu and rcu_read_lock/unlock within the
> >    data structure handling code.
> 
> If we do #3, it is best to make sure that different library functions
> making different RCU-flavor choices can be linked into a single program.
> More preprocessor trick, I guess...

Yep. Before getting your message, I actually implemented #3 in the
master branch. It really is not so complicated. But let's go over the
other solutions:
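
The gist of the preprocessor trick, in simplified form (illustrative
macro and symbol names; the actual mapping headers in the tree are
more complete):

  /*
   * Simplified sketch of the per-flavor symbol mapping. The CDS
   * library is compiled once per RCU flavor, with its public
   * symbols renamed so the resulting objects can coexist within
   * a single program. Illustrative names only.
   */
  #ifdef RCU_MEMBARRIER
  #define cds_rbtree_insert  cds_rbtree_insert_memb
  #define cds_rbtree_remove  cds_rbtree_remove_memb
  #include <urcu.h>           /* this flavor's rcu_read_lock/call_rcu */
  #elif defined(RCU_QSBR)
  #define cds_rbtree_insert  cds_rbtree_insert_qsbr
  #define cds_rbtree_remove  cds_rbtree_remove_qsbr
  #include <urcu-qsbr.h>
  #endif

Since each flavor build exports distinct symbol names, objects
compiled against different flavors can be linked into the same
program without clashing, which should address your linking concern.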

> 
> > There are probably others, but I think it gives an idea of the main
> > scenarios I consider. I start to like (3) more and more, and I'm tempted
> > to move to it, but I would really like feedback on this API matter
> > before I take any decision.
> 
> Actually, #1 seems simpler than #3 and with few major downsides.
> 
> I can imagine the following additional possibilities:
> 
>  4) Create a table mapping from the call_rcu() variant to the
>     corresponding rcu_read_lock() and rcu_read_unlock() variants.  If busy
>     waiting is required, look up the rcu_read_lock() and rcu_read_unlock()
>     variants and then call them.  (If you are busy waiting anyway,
>     who cares about a bit of extra overhead?)
> 
>     I don't see this as a reasonable initial alternative, but it would
>     be a decent place to migrate to if starting from #1 and the long
>     RCU read-side critical sections did somehow turn out to be a problem.
> 
>  5) Like #4, but map to a function that does rcu_read_unlock() followed
>     immediately by rcu_read_lock().

#4-5 are, IMHO, a complete no-go, because the semantics of the
following user code would become very unclear:

  rcu_read_lock();
  var = rcu_dereference(...);   /* (A) */
  urcu_rb_tree_insert(...);
  use(var);                     /* (B) */
  rcu_read_unlock();

In the code above, A and B would end up in different RCU read-side
critical sections just because urcu_rb_tree_insert() can release and
re-acquire the read-side lock, so a grace period could complete and
reclaim var's memory before its use at B. This is an unexpected
side-effect that I really would not want in the library.

> 
>  6) Like #1, but make the caller deal with deallocation.  Then the caller
>     gets to select which of the call_rcu() variants should be used.
>     (Yes, there might be a reason why this is a bad idea, I do need to
>     go review the implementations.)

I have to bring back my RCU red-black tree implementation as an
example for this one: there is often more than one call_rcu to invoke
per API call, so returning all the pointers that need to go through
call_rcu to the caller would become really cumbersome.
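
To make the problem concrete, option #6 would force an API shape
along these lines (a hypothetical sketch, not the actual rbtree API):
every operation hands back a variable-length chain of retired nodes,
and every caller has to drain it:

  /* Hypothetical sketch only -- not the actual rbtree API. */
  struct rb_node {
          struct rcu_head head;        /* for deferred reclaim */
          struct rb_node *retire_next; /* chains nodes retired by one op */
          /* ... key, children, color ... */
  };

  /* Each insert/remove appends every copied-out node to *retired. */
  void rbtree_remove(struct rb_tree *tree, struct rb_node *node,
                     struct rb_node **retired);

  static void free_rb_node(struct rcu_head *head); /* frees the node */

  void caller_remove(struct rb_tree *tree, struct rb_node *node)
  {
          struct rb_node *retired = NULL, *iter;

          rbtree_remove(tree, node, &retired);
          /* Drain the retire chain: one call_rcu per copied node. */
          while ((iter = retired) != NULL) {
                  retired = iter->retire_next;
                  call_rcu(&iter->head, free_rb_node);
          }
  }

Given that rotations and transplants retire several nodes per update,
pushing this bookkeeping onto every caller is exactly the kind of API
pollution I would like to avoid.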

> If #6 is impractical, I still like #1 with either #4 or #5 as fallbacks.
>
> So, what am I missing?

So far, it looks like there are strong arguments against #4, #5 and
#6. #2 seems too slow (function calls everywhere) and has a complex
API, which leaves us with #1 or #3.

#1 does not give the data structure library any fine-grained control
over the use of RCU read-side critical sections, which I think will
hurt us in the future (e.g. the complete inability to use
synchronize_rcu within the library -- not that I want to use it, but
at least it would be good to be able to if needed). #1 also requires a
function call for each call_rcu invocation, which introduces a
performance overhead that can become significant under frequent use
(e.g. the RCU red-black tree), and requires passing the call_rcu
function pointer to an init function, which pollutes the API.

So far, doing macro magic along the lines of solution #3 appears to be
the best approach we have. The current implementation (at the HEAD of
the urcu master branch) supports linking various flavors of the CDS
library within the same application (using different symbols for each
flavor).

Thoughts?

Thanks for your feedback!

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com



