[ltt-dev] [rp] [RFC] URCU concurrent data structure API

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Thu Aug 18 14:44:49 EDT 2011


* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Wed, Aug 17, 2011 at 10:23:33PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > On Wed, Aug 17, 2011 at 12:40:39PM -0400, Mathieu Desnoyers wrote:
> > > > Hi,
> > > > 
> > > > I'm currently trying to find a good way to present the cds_ data
> > > > structure APIs within URCU for data structures depending on RCU for
> > > > their synchronization. The main problem is that we have many flavors of
> > > > rcu_read_lock/unlock and call_rcu to deal with.
> > > > 
> > > > Various approaches are possible:
> > > > 
> > > > 1) The current approach: require that the callers pass call_rcu as
> > > >    parameter to data structure init functions, and require that the
> > > >    callers hold rcu_read_lock across API invocation.
> > > > 
> > > >    downsides: holds rcu read lock across busy-waiting loops (for longer
> > > >    than actually needed). Passing call_rcu as a parameter and putting
> > > >    requirements on the lock held when calling the API complexify the API
> > > >    and make it impossible to inline call_rcu invocations.
> > > 
> > > The function-call overhead for call_rcu() should not be a big deal.
> > 
> > Actually, for my RCU red-black tree implementation, it might be a big
> > deal. As a benchmark, with only 1 RCU rbtree writer (the configuration
> > with the highest writer throughput, since writers have mutually
> > exclusive access to the rb tree during updates):
> > 
> > With call_rcu as a function:
> >  933217 insertion-removal pairs of random keys/s
> > 
> > I first implemented a simple per-thread struct call_rcu_data cache,
> > because I figured that doing the call_rcu_data structure lookup involves
> > a lot of tests.
> > 
> > With call_rcu as a function, with call_rcu_data pointer cache:
> > 1203136 insertion-removal pairs of random keys/s (28.9% acceleration)
> > 
> > With call_rcu inlined, in addition to the call_rcu_data pointer cache:
> > 1228737 insertion-removal pairs of random keys/s (30% acceleration overall)
> > 
> > The main reason why call_rcu is so important for this RCU RB tree is
> > that the tree's add/remove operations use copies and call_rcu for
> > basically all nodes touched by every rotate/transplant operation, which
> > happens very often. I was targeting a rather efficient xchg-based
> > implementation of call_rcu, hence the high frequency of call_rcu usage.
> 
> Fair enough, but you still have the option of linking the nodes together
> and handling them with a single call_rcu() invocation.  This would
> amortize the overhead down to an acceptable level, right?

Sure, that could be arranged, but I would not like to add this kind of
complexity to already complex data containers. Having the ability to
just pass a pointer to call_rcu for deferred removal and then forget
about it (without having to pass a list around between functions) is
very convenient. If we want userspace RCU to remain easy to implement
data structures with, I don't think we should require developers to
handle extra internal lists on a per-call basis. I would probably not
have gone down the route I took with the RCU RB tree if I had not had
the convenient call_rcu() around.
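
To illustrate the "pass a pointer and forget about it" pattern, here is
a minimal sketch (the node type, field names and free function are made
up for the example; call_rcu() and struct rcu_head are the urcu ones):

  #include <urcu.h>            /* or any other urcu flavor header */
  #include <urcu-call-rcu.h>   /* call_rcu(), struct rcu_head */
  #include <urcu/compiler.h>   /* caa_container_of() */
  #include <stdlib.h>

  struct my_node {
          int key;
          struct rcu_head rcu_head;   /* embedded for deferred free */
  };

  static void free_my_node(struct rcu_head *head)
  {
          struct my_node *node =
                  caa_container_of(head, struct my_node, rcu_head);
          free(node);
  }

  /* Somewhere inside a rotate/transplant helper, after old_node has
   * been unlinked from the tree: */
  call_rcu(&old_node->rcu_head, free_my_node);
  /* old_node can simply be forgotten from here on; there is no list
   * to carry around between functions. */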

> 
> > > I am not all that concerned about an RCU read-side critical section
> > > covering the busy waiting -- my guess is that the busy waiting itself
> > > would become a problem long before the overly long RCU read-side
> > > critical section becomes a problem.
> > 
> > Fair point. For multi-core RT embedded systems with a fairly low
> > amount of memory available, this might be a "virtual" concern, but it
> > is a concern in any case.
> > 
> > > 
> > > > 2) Require all callers to pass call_rcu *and* rcu_read_lock/unlock as
> > > >    parameter to data structure init function.
> > > > 
> > > >    downsides: both call_rcu and read lock/unlock become function calls
> > > >    (slower). Complexifies the API.
> > > > 
> > > > 3) Don't require the caller to pass anything RCU-related to data
> > > >    structure init. Would require compiling one instance of each data
> > > >    structure per RCU-flavor shared object (like we're doing with
> > > >    call_rcu now).
> > > > 
> > > >    Downside: we would need to ship a per-RCU-flavor version of each
> > > >    data structure.
> > > > 
> > > >    Upside: simple API, no rcu read-side lock around busy-waiting loops,
> > > >    ability to inline both call_rcu and rcu_read_lock/unlock within the
> > > >    data structure handling code.
> > > 
> > > If we do #3, it is best to make sure that different library functions
> > > making different RCU-flavor choices can be linked into a single program.
> > > More preprocessor tricks, I guess...
> > 
> > Yep. Before getting your message, I actually implemented #3 in the
> > master branch. It really is not so complicated. But let's look at the
> > other solutions.
> 
> Understood -- I in fact did something very like #3 for the RCU
> implementations themselves, so I cannot argue that it is cripplingly
> complex.  ;-)
> 
> But it is worth looking into other approaches, because it would be
> nice to avoid replicating this dirty trick across each and every data
> structure that makes internal use of RCU.

Of course. Having the containers separated from the RCU flavor would
be very convenient, which makes me hesitate too.

> 
> > > > There are probably others, but I think it gives an idea of the main
> > > > scenarios I consider. I'm starting to like (3) more and more, and I'm
> > > > tempted to move to it, but I would really like feedback on this API
> > > > matter before I make any decision.
> > > 
> > > Actually, #1 seems simpler than #3, with few major downsides.
> > > 
> > > I can imagine the following additional possibilities:
> > > 
> > >  4) Create a table mapping from the call_rcu() variant to the
> > >     corresponding rcu_read_lock() and rcu_read_unlock() variants.  If busy
> > >     waiting is required, look up the rcu_read_lock() and rcu_read_unlock()
> > >     variants and then call them.  (If you are busy waiting anyway,
> > >     who cares about a bit of extra overhead?)
> > > 
> > >     I don't see this as a reasonable initial alternative, but it would
> > >     be a decent place to migrate to if starting from #1 and the long
> > >     RCU read-side critical sections did somehow turn out to be a problem.
> > > 
> > >  5) Like #4, but map to a function that does rcu_read_unlock() followed
> > >     immediately by rcu_read_lock().
> > 
> > #4-5 are, IMHO, a complete no-go, because the semantics of the following
> > user code would become very unclear:
> > 
> >   rcu_read_lock();
> >   var = rcu_dereference(...);   (A)
> >   urcu_rb_tree_insert(...);
> >   use(var);                     (B)
> >   rcu_read_unlock();
> > 
> > In the code above, A and B would belong to different RCU read-side
> > critical sections just because urcu_rb_tree_insert() can release the
> > read-side lock. This is an unexpected side-effect that I really would
> > not want in the library.
> 
> One serviceable (though ugly) approach for #4 is as follows:
> 
>    rcu_read_lock();
>    rcu_read_lock();
>    var = rcu_dereference(...);   (A)
>    urcu_rb_tree_insert(...);
>    use(var);                     (B)
>    rcu_read_unlock();
>    rcu_read_unlock();
> 
> (And no, I don't particularly like this, just pointing out that it really
> is a possibility.)
> 
> This defeats the library's attempt to shorten RCU read-side critical
> sections, but I am not sure that this is really all that big a deal.

I honestly don't think this one falls into the non-repulsing API
category. ;-)

> 
> > >  6) Like #1, but make the caller deal with deallocation.  Then the caller
> > >     gets to select which of the call_rcu() variants should be used.
> > >     (Yes, there might be a reason why this is a bad idea, I do need to
> > >     go review the implementations.)
> > 
> > I have to bring back my RCU red-black tree implementation as an example
> > for this one: there is often more than one call_rcu to invoke per API
> > call, so returning the pointers to be freed through call_rcu would
> > become really cumbersome.
> 
> Your RCU red-black tree could easily provide a free-up function that
> was suitable for use with call_rcu(), and which did any needed following
> of links connecting multiple nodes, right?  Yes, it would not be as
> pretty as we might like, but it would not be all that ugly.

About this, see my comment above about my wish to keep liburcu a
convenient breeding ground for new data structures, citing my
experience with the RCU RB tree as an example.

> 
> > > If #6 is impractical, I still like #1 with either #4 or #5 as fallbacks.
> > >
> > > So, what am I missing?
> > 
> > So far, it looks like there are strong arguments against #4, #5 and #6.
> > #2 seems too slow (function calls everywhere) and has a complex API,
> > which leaves us with #1 or #3.
> > 
> > #1 does not give the data structure library any fine-grained control
> > over the use of rcu read-side critical sections, which I think will
> > hurt us in the future (e.g. the complete inability to use
> > synchronize_rcu within the library -- not that I want to use it, but at
> > least it would be good to be able to if needed). #1 also requires a
> > function call for each call_rcu invocation, which presents a
> > performance overhead that can become significant with frequent use
> > (e.g. the rcu red-black tree), and requires passing the call_rcu
> > function pointer to an init function, which pollutes the API.
> 
> Actually, if you do use synchronize_rcu() anywhere in the library,
> you need to explicitly prohibit the caller from being in an RCU
> read-side critical section.  So I am not so sure that doing
> synchronize_rcu() in any library function is all that great an idea
> in the first place, at least in the general case.
> 
> A library function doing synchronize_rcu() needs to be documented
> and carefully handled to the same extent as would a library function
> that does a blocking read across the network, right?

Definitely agreed. synchronize_rcu should really be frowned upon in
data container libraries. I was mainly pointing out that having direct
access to the RCU library functions gives more freedom to the data
structure implementers.
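
To make the hazard concrete, here is the self-deadlock a
library-internal synchronize_rcu() would invite (cds_foo_remove() is
hypothetical, shown only for illustration):

  rcu_read_lock();
  p = rcu_dereference(...);
  cds_foo_remove(...);   /* internally calls synchronize_rcu(), which
                          * waits for all pre-existing readers --
                          * including this thread's still-active
                          * read-side C.S. -> deadlock. */
  rcu_read_unlock();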

> 
> > So far, doing macro magic with solution #3 appears to be the best
> > approach we have. The current implementation (in urcu master branch
> > HEAD) supports linking various flavors of the CDS library within the
> > same application (using different symbols for each one).
> > 
> > Thoughts ?
> 
> I am not yet convinced that we want to abandon #1 so quickly.  There
> will probably be yet more user-level RCU algorithms produced, both
> within this library and in other libraries, so minimizing the code
> where the compiler needs to know which user-level algorithm is in
> use seems to me to be a very good thing.

Yes, agreed.

> For example, if someone implements their own user-level RCU, approach
> #1 would allow them to use that in conjunction with the cds_ algorithms
> without even recompiling the urcu library.  In contrast, approach #3
> would require them to add their implementation to the urcu library,
> which would be a bad thing if their RCU implementation was so specialized
> that we didn't want to accept it into urcu, right?

One way to do that might be to provide an automated build system that
generates a .so per urcu flavor from any given data container. E.g., if
we can automatically append per-flavor suffixes to the cds_ symbols, we
could get somewhere. It would be like the map header trick, but
automated, as sketched below.
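
Roughly along the lines of what the map headers already do for the rcu_
symbols; a sketch of the idea (the cds_rbtree_* names and the library
names are invented for the example, the RCU_* flags are the ones the
flavors already use):

  /* hypothetical urcu/map/cds.h, included by the container code */
  #if defined(RCU_MEMBARRIER)
  #define cds_rbtree_insert  cds_rbtree_insert_memb
  #define cds_rbtree_remove  cds_rbtree_remove_memb
  #elif defined(RCU_MB)
  #define cds_rbtree_insert  cds_rbtree_insert_mb
  #define cds_rbtree_remove  cds_rbtree_remove_mb
  #elif defined(RCU_SIGNAL)
  #define cds_rbtree_insert  cds_rbtree_insert_sig
  #define cds_rbtree_remove  cds_rbtree_remove_sig
  #endif

The build system would then compile the same container .c file once per
flavor, each time with a different -DRCU_* flag, producing e.g.
liburcu-cds-memb.so, liburcu-cds-mb.so, and so on, without the container
author having to do anything by hand.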

> 
> > Thanks for your feedback!
> 
> Definitely an interesting topic.  ;-)

I think there is an important point to discuss with respect to this
question too: do we want to give the cds_ functions direct control over
locking, or do we want to leave that to the caller?

I can list at least one case where giving control over locking to the
cds_ functions would be useful: if we want the update side of the RCU RB
tree to eventually scale, we could try assigning sub-parts of the tree
to different mutexes, using one mutex for the first levels of the tree.
This would require the data structure to tie the mutexes to the actual
data, so there is really no way the application could do that itself.
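
As a purely hypothetical sketch of why this has to live inside the
container (the node layout is invented for the example):

  #include <pthread.h>

  struct rbtree_node {
          struct rbtree_node *left, *right, *parent;
          unsigned long key;
          pthread_mutex_t subtree_lock;  /* protects updates in the
                                          * subtree rooted here; only
                                          * used for nodes in the first
                                          * levels of the tree. */
  };

Which nodes carry a meaningful subtree_lock changes as the tree gets
rebalanced, so the lock-to-data mapping has to be maintained by the tree
code itself, not by the application.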

So if we choose to let data structures control locking, then the
decision to generate one cds .so per RCU flavor would start to make more
sense, because RCU can be seen as "locking" in some ways (although it is
more formally synchronization than locking).

Giving the application control over the RCU read-side is currently not
much of an issue: for the data structures we have today, we can afford
to let the application hold the rcu read-side lock without much
practical limitation. But I think we might run into limitations with
other data structures in the future, as my multi-lock tree example
shows.

Thoughts ?

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > 
> > > 							Thanx, Paul
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com