[ltt-dev] [rp] [RFC] URCU concurrent data structure API

Thu Aug 18 13:24:00 EDT 2011

On Wed, Aug 17, 2011 at 10:23:33PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Wed, Aug 17, 2011 at 12:40:39PM -0400, Mathieu Desnoyers wrote:
> > > Hi,
> > > 
> > > I'm currently trying to find a good way to present the cds_ data
> > > structure APIs within URCU for data structures depending on RCU for
> > > their synchronization. The main problem is that we have many flavors of
> > > rcu_read_lock/unlock and call_rcu to deal with.
> > > 
> > > Various approaches are possible:
> > > 
> > > 1) The current approach: require that the callers pass call_rcu as
> > >    parameter to data structure init functions, and require that the
> > >    callers hold rcu_read_lock across API invocation.
> > > 
> > >    downsides: holds rcu read lock across busy-waiting loops (for longer
> > >    than actually needed). Passing call_rcu as parameter and putting
> > >    requirements on the lock held when calling the API complexify the API,
> > >    and makes it impossible to inline call_rcu invocations.
> > 
> > The function-call overhead for call_rcu() should not be a big deal.
> 
> Actually, for my RCU red-black tree implementation, it might be a big
> deal. As a benchmark, with 1 RCU rbtree writer only (which is a writer
> highest throughput, since writers have a mutually exclusive access to
> the rb tree during updates):
> 
> With call_rcu as a function:
>  933217 insertion-removal pairs of random keys/s
> 
> I first implemented a simple per-thread struct call_rcu_data cache,
> because I figured that doing the call_rcu_data structure lookup involves
> a lot of tests.
> 
> With call_rcu as a function, with call_rcu_data pointer cache:
> 1203136 insertion-removal pairs of random keys/s (28.9% acceleration)
> 
> With call_rcu inlined, in addition of the call_rcu_data pointer cache:
> 1228737 insertion-removal pairs of random keys/s (30% acceleration overall)
> 
> The main reason why call_rcu is so important for this RCU RB tree is
> that this tree add/remove operations use copies and call_rcu for
> basically all nodes of all rotates/transplant operation, which turns out
> to be done very often. I was targeting a rather efficient xchg-based
> implementation of call_rcu, hence the high-frequency of call_rcu usage.

Fair enough, but you still have the option of linking the nodes together
and handling them with a single call_rcu() invocation.  This would
amortize the overhead down to an acceptable level, right?

> > I am not all that concerned about an RCU read-side critical section
> > covering the busy waiting -- my guess is that the busy waiting itself
> > would become a problem long before the overly long RCU read-side
> > critical section becomes a problem.
> 
> Fair point, although for multi-core RT embedded systems having fairly
> low amount of memory available, this might be a "virtual" concern, but a
> concern in any case.
> 
> > 
> > > 2) Require all callers to pass call_rcu *and* rcu_read_lock/unlock as
> > >    parameter to data structure init function.
> > > 
> > >    downsides: both call_rcu and read lock/unlock become function calls
> > >    (slower). Complexify the API.
> > > 
> > > 3) Don't require caller to pass anything rcu-related to data structure
> > >    init. Would require to compile one instance of each data structure
> > >    per RCU flavor shared object (like we're doing with call_rcu now).
> > > 
> > >    Downside: we would need to ship per-rcu-flavor version of each data
> > >    structure.
> > > 
> > >    Upside: simple API, no rcu read-side lock around busy-waiting loops,
> > >    ability to inline both call_rcu and rcu_read_lock/unlock within the
> > >    data structure handling code.
> > 
> > If we do #3, it is best to make sure that different library functions
> > making different RCU-flavor choices can be linked into a single program.
> > More preprocessor trick, I guess...
> 
> Yep. Before getting your message, I actually implemented #3 in the
> master branch. It really is not so complicated. But let's see about the
> other solutions,

Understood -- I in fact did something very like #3 for the RCU
implementations themselves, so I cannot argue that it is cripplingly
complex.  ;-)

But it is worth looking into other approaches, because it would be
nice to avoid replicating this dirty trick across each and every data
structure that makes internal use of RCU.

> > > There are probably others, but I think it gives an idea of the main
> > > scenarios I consider. I start to like (3) more and more, and I'm tempted
> > > to move to it, but I would really like feedback on this API matter
> > > before I take any decision.
> > 
> > Actually, #1 seems simpler than #3 and with few major downsides.
> > 
> > I can imagine the following additional possibilities:
> > 
> >  4) Create a table mapping from the call_rcu() variant to the
> >     corresponding rcu_read_lock() and rcu_read_unlock() variants.  If busy
> >     waiting is required, look up the rcu_read_lock() and rcu_read_unlock()
> >     variants and then call them.  (If you are busy waiting anyway,
> >     who cares about a bit of extra overhead?)
> > 
> >     I don't see this as a reasonable initial alternative, but it would
> >     be a decent place to migrate to if starting from #1 and the long
> >     RCU read-side critical sections did somehow turn out to be a problem.
> > 
> >  5) Like #4, but map to a function that does rcu_read_unlock() followed
> >     immediately by rcu_read_lock().
> 
> #4-5 are, IMHO, a complete no-go, because the semantic of the following
> user code would become very unclear:
> 
>   rcu_read_lock();
>   var = rcu_dereference(...);   (A)
>   urcu_rb_tree_insert(...);
>   use(var);                     (B)
>   rcu_read_unlock();
> 
> In the code above, A and B would belong to a different RCU read-side
> C.S. just because urcu_rb_tree_insert() can release the read-side lock.
> This is an unexpected side-effect that I really would not want in the
> library.

One serviceable (though ugly) approach for #4 is as follows:

   rcu_read_lock();
   rcu_read_lock();
   var = rcu_dereference(...);   (A)
   urcu_rb_tree_insert(...);
   use(var);                     (B)
   rcu_read_unlock();
   rcu_read_unlock();

(And no, I don't particularly like this, just pointing out that it really
is a possibility.)

This defeats the libraries attempt to shorten RCU read-side critical
sections, but I am not sure that this is really all that big a deal.

> >  6) Like #1, but make the caller deal with deallocation.  Then the caller
> >     gets to select which of the call_rcu() variants should be used.
> >     (Yes, there might be a reason why this is a bad idea, I do need to
> >     go review the implementations.)
> 
> I have to bring back my RCU red-black tree implementation as an example
> for this one: there are often more than one call_rcu to invoke per API
> call, so returning the pointers to free through call_rcu would become
> really cumbersome.

Your RCU red-black tree could easily provide a free-up function that
was suitable for use with call_rcu(), and which did any needed following
of links connecting multiple nodes, right?  Yes, it would not be as
pretty as we might like, but it would not be all that ugly.

> > If #6 is impractical, I still like #1 with either #4 or #5 as fallbacks.
> >
> > So, what am I missing?
> 
> So far, it looks like there are strong arguments against 4-5-6. 2 seems
> too slow (calls everywhere) and has a complex API, which leaves us with
> 1 or 3.
> 
> #1 does not let the data structure library any fine-grained control over
> use of rcu read-side critical sections, which I think will hurt us in
> the future (e.g. complete inability to use synchronize_rcu within the
> library -- not that I want to use it, but at least it would be good to
> be able to if needed). #1 also does require to perform a function call
> to call_rcu, which presents a performance overhead that can become
> significant in the case of frequent use (e.g. rcu red-black tree), and
> requires to pass the call_rcu function pointer to an init function,
> which pollutes the API.

Actually, if you do use synchronize_rcu() anywhere in the library,
you need to explicitly prohibit the caller being in an RCU read-side
critical section.  So I am not so sure that doing synchronize_rcu()
in any library function is all that great an idea in the first
place, at least in the general case.

A library function doing synchronize_rcu() needs to be documented
and carefully handled to the same extent as would a library function
that does a blocking read across the network, right?

> So far, doing macro magic with solution #3 appears to be the best
> approach we have. The current implementation (in urcu master branch
> HEAD) supports linking various flavors of the CDS library within the
> same application (using different symbols for each one).
> 
> Thoughts ?

I am not yet convinced that we want to abandon #1 so quickly.  There
will probably be yet more user-level RCU algorithms produced, both
within this library and in other libraries, so minimizing the code
where the compiler needs to know which user-level algorithm is in
use seems to me to be a very good thing.

For example, if someone implements their own user-level RCU, approach
#1 would allow them to use that in conjunction with the cds_ algorithms
without even recompiling the urcu library.  In contrast, approach #3
would require them to add their implementation to the urcu library,
which would be a bad thing if their RCU implementation was so specialized
that we didn't want to accept it into urcu, right?

> Thanks for your feedback!

Definitely an interesting topic.  ;-)

							Thanx, Paul

> Mathieu
> 
> > 
> > 							Thanx, Paul
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com