[lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Wed May 2 11:54:30 EDT 2012
On Wed, May 02, 2012 at 12:16:39AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > > > On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
> > > > > > > * Mathieu Desnoyers (mathieu.desnoyers at efficios.com) wrote:
> > > > > > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > > > > > > On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
> > > > > > > > > > Hi!
> > > > > > > > > >
> > > > > > > > > > After 1 year of development, with the last 6-7 months spent polishing
> > > > > > > > > > the API and testing the implementation, I think it is getting about time
> > > > > > > > > > to release the RCU lock-free hash table in a new Userspace RCU version
> > > > > > > > > > (0.7).
> > > > > > > > > >
> > > > > > > > > > I recently described the guarantees provided by the hash table in more
> > > > > > > > > > detail, and created tests for the uniqueness guarantee for traversals
> > > > > > > > > > performed concurrently with add_unique and add_replace operations. I
> > > > > > > > > > also added test modes that create long hash chains, to test corner-cases
> > > > > > > > > > of the hash table.
> > > > > > > > > >
> > > > > > > > > > One small thing I wonder is whether we should document that the hash
> > > > > > > > > > table update operations imply full memory barriers?
> > > > > > > > >
> > > > > > > > > The Linux kernel's rule seems good here -- if a hash-table operation is
> > > > > > > > > atomic and returns a value, it should imply a full barrier. So:
> > > > > > > > >
> > > > > > > > > cds_lfht_new(): No point in claiming barriers -- publishing the
> > > > > > > > > pointer to the hash table is where the barriers are important.
> > > > > > > > >
> > > > > > > > > cds_lfht_destroy(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_lookup(): Not an update (let alone an atomic update), no barriers.
> > > > > > > > >
> > > > > > > > > cds_lfht_next_duplicate(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_first(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_next(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_add(): Atomic update, but no return value, so no barrier
> > > > > > > > > implied.
> > > > > > > >
> > > > > > > > Yep, makes sense. We use cmpxchg internally to perform the update, but
> > > > > > > > it could make sense to eventually use a cmpxchg that has no memory
> > > > > > > > barriers to perform this update. So I agree on not providing a memory
> > > > > > > > barrier guarantee on the "add" operation, since it does not return any
> > > > > > > > value.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > cds_lfht_add_unique(): Atomic update that returns a value, so should
> > > > > > > > > imply a full memory barrier.
> > > > > > > >
> > > > > > > > add_unique is a bit special:
> > > > > > > >
> > > > > > > > - if it returns the node received as parameter, it means the add
> > > > > > > > succeeded, which implies an update, and thus a memory barrier.
> > > > > > >
> > > > > > > Hrm, thinking further: if we make the "add" operation not act as a
> > > > > > > full memory barrier, then a successful add_unique should not act as a
> > > > > > > full memory barrier either.
> > > > > >
> > > > > > Think of it as being similar to the Linux kernel's atomic_inc() and
> > > > > > atomic_add_return() primitives. The latter guarantees memory barriers
> > > > > > and the former does not.
> > > > >
> > > > > I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally
> > > > > different.
> > > > >
> > > > > atomic_inc:
> > > > > - input: increment
> > > > > - output: none
> > > > > - effect: increment target address value
> > > > >
> > > > > atomic_add_return:
> > > > > - input: increment
> > > > > - output: incremented value
> > > > > - effect: increment target address value, atomically reading the
> > > > > resulting value.
> > > > >
> > > > > hash table add:
> > > > > - input: new node to add
> > > > > - output: none
> > > > > - effect: atomically add the new node into the table
> > > > >
> > > > > hash table add_unique (success):
> > > > > - input: new node to add
> > > > > - output: (we just return whether the operation has succeeded)
> > > > > - effect: atomically add the new node into the table
> > > > >
> > > > > hash table add_unique (failure):
> > > > > - input: new node to try adding
> > > > > - output: (we just return whether the operation has succeeded)
> > > > > - effect: simple lookup (read)
> > > > >
> > > > > So as we see, the add_unique failure only performs a read. Adding a
> > > > > memory barrier before this read would require us to add a memory barrier
> > > > > also on the success path, which would degrade performance. The success
> > > > > path does: lookup failure, cmpxchg to add the node, retry if changed.
> > > > > Adding a memory barrier before the lookup would add an extra memory
> > > > > barrier in addition to the one located in the cmpxchg, and I don't think
> > > > > we want that overhead.
> > > >
> > > > Perhaps a better analogy is cmpxchg() in the Linux kernel. Some
> > > > architectures place a memory barrier before the operation
> > > > unconditionally, but place a memory barrier after it only if the
> > > > cmpxchg() succeeds. Of course, architectures whose cmpxchg
> > > > instruction itself implies a barrier don't need any explicit
> > > > memory-barrier instructions.
> > > >
> > > > I do see your point about the memory barrier in the failure case requiring
> > > > an additional memory barrier in the success case, however.
> > > >
> > > > > > > > - if it returns a different node than the one received as parameter, it
> > > > > > > > failed, which means it only performed a lookup, so there is no
> > > > > > > > guarantee that a memory barrier has been executed.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > cds_lfht_add_replace(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_replace(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_del(): Ditto.
> > > > > > >
> > > > > > > One more point: "del" is similar to add_unique: if it succeeds, we
> > > > > > > execute a full memory barrier. If it fails, no memory barrier is
> > > > > > > guaranteed. But if we want to relax the guarantees, we might not
> > > > > > > want to guarantee a memory barrier in any of the "del" cases.
> > > > > > >
> > > > > > > In the end, the only primitives for which I think it really makes sense
> > > > > > > to provide memory barrier semantics are add_replace and replace: they
> > > > > > > actually act as a higher-level "cmpxchg" over the hash table nodes.
> > > > > >
> > > > > > I believe that this should do the same -- memory barrier before, but no
> > > > > > memory barrier after on failure.
> > > > > >
> > > > > > Another approach is C++11, in which there are a couple of arguments to
> > > > > > the compare-and-swap primitive specifying the memory ordering constraints
> > > > > > for the success and failure cases, respectively.
> > > > >
> > > > > Hrm, I think adding this kind of flexibility might leave the API too
> > > > > cluttered with details that "usual cases" don't care about.
> > > >
> > > > ;-)
> > > >
> > > > > > Unless you expect use cases with lots of failing cds_lfht_del() and
> > > > > > add_unique() calls, the performance difference should not be significant.
> > > > >
> > > > > The problem is that I expect very, very frequent failing add_unique
> > > > > calls for use-cases like Binary Decision Diagrams (BDD) creation. This
> > > > > is why I don't think it is appropriate to put memory barriers on the
> > > > > failure cases, as these should have no more overhead than a simple
> > > > > lookup.
> > > > >
> > > > > And if we choose not to provide memory barriers for add_unique failure,
> > > > > we should probably do the same for "del" to keep a consistent behavior
> > > > > over the API.
> > > > >
> > > > > Thoughts?
> > > >
> > > > Given your BDD example, letting the failure cases avoid memory barriers
> > > > seems reasonable. If problems arise, the default interfaces can be made
> > > > fully ordered, with faster variants whose failure cases avoid memory
> > > > barriers.
> > >
> > > I think we agree on the failure cases of del and add_unique, which is
> > > good.
> > >
> > > Now, I'm thinking slightly further about the success cases of add,
> > > add_unique and del.
> > >
> > > Typically, when giving a node to add or add_unique for population into
> > > the hash table, we have to order the writes that initialize the node
> > > before its insertion into the hash table. So I think it would make sense
> > > to guarantee that we provide a memory barrier prior to add and
> > > add_unique (success) operations.
> > >
> > > For the "del" case, it's the other way around: if a deletion succeeds,
> > > we want to have a memory barrier between the removal flag write and the
> > > following "free()" of the element. So we need to ensure that there is a
> > > memory barrier after a successful deletion.
> > >
> > > If we think about it, "add_replace" acts as both an "add" and a "del":
> > > in all cases, we need a memory barrier before the add_replace, because
> > > we need to order the node initialization before the atomic insertion
> > > into the hash table; we also need a memory barrier after, but only in
> > > the case where add_replace replaced an old node, thus acting as a
> > > delete of that old node.
> > >
> > > Considering all this, the barrier semantic I'm proposing is:
> > >
> > > add:
> > > - full memory barrier before the insertion.
> > >
> > > add_unique:
> > > - success: full memory barrier before the insertion.
> > > - failure: no memory barrier.
> > >
> > > replace:
> > > - success: full memory barrier before the insertion, and after the
> > > removal of the old node.
> > > - failure: no memory barrier.
> > >
> > > add_replace:
> > > - returns an old node (replace): full memory barrier before the
> > > insertion, and after the removal of the old node.
> > > - returns NULL (add): full memory barrier before the insertion.
> > >
> > > del:
> > > - success: full memory barrier after the removal.
> > > - failure: no memory barrier.
> > >
> > > I think those are the minimal barriers we need to provide to ensure that
> > > users will not have to add barriers of their own to use the API.
> > >
> > > Thoughts?
> >
> > This means that some other CPU might see an add followed by a del as
> > happening in the opposite order. Is that really OK?
>
> Here I suppose you are talking about add and del that touch different
> nodes, since add/del touching the same node would modify the same memory
> location, so both operations would be seen in the right order by the
> other processors.
>
> I guess the question here is: how does the remote CPU observe this?
> Is it from a read-side traversal, or from update operations performed on
> the remote CPU? Or from lookups of arbitrary key values? Or from other
> update operations?
>
> For read-side traversal: given that traversal within a single hash
> chain goes from newer nodes to older nodes, the sequence "add" then
> "del" (adding a new node, then deleting an old one, e.g. from the same
> hash chain) first adds the new node at the beginning of the hash chain,
> and then removes the old node located later in the chain (in traversal
> order). In this case, if the "del" is published before the "add", this
> is entirely equivalent to a read-side traversal that misses the added
> node but sees the effect of the "del".
>
> In terms of read-side traversal, what we really seem to care about is
> "del" followed by "add" performed on different nodes (e.g. of the same
> hash chain): given that the "add" is at the beginning of the chain, and
> the "del" is somewhere within the chain, a read-side traversal that
> observes the "add" would _always_ be expected to observe the effect of
> the prior "del". Thankfully, "del" then "add" is ordered in the scheme
> I propose, thanks to the memory barrier after "del" and the one before
> "add".
>
> For traversal of the entire table (not just same hash chain), the
> traversal vs key order is random, so we cannot assume anything about the
> observation order of add/del of different keys.
Indeed, I was thinking in terms of a pair of lookups separated by
smp_mb().
> If we perform two lookups of different keys, without explicit memory
> barriers between the lookups, we already have no guarantee whatsoever
> that the published information will be brought into the observer CPU in
> the right order: IOW, the lookups of different keys can be reordered. So
> I guess we don't care about add vs del ordering on the publish side for
> that one.
>
> The other scenarios I see are related to update operations (add,
> add_unique, add_replace, replace, del). If a sequence of these
> operations is executed on the "observer" CPU, and expects to see the
> effect of the sequence of "add" then "del" performed on the "publisher"
> CPU in order, then, yes, an additional memory barrier would be required
> to order add before del.
>
> The question here is: do we care about providing ordering of add before
> del only for the sake of other update operations running on another CPU
> that expect to see their effects in order? I don't know, maybe.
>
> I guess the safest route would be to provide the ordering guarantee,
> given that there exists a way to observe this and possibly care about
> it. I wonder if it would be better to put the barrier after the
> add/add_unique (success), or before the "del".
>
> Also, do you think a write barrier would be sufficient, or should we
> keep full barriers?
A write barrier would be sufficient in the case where there were only
two threads observing each other. A full memory barrier would be needed
to prevent the assertion from firing in this sort of case (not sure that
this is exactly right, but something like this):

Initial contents: B, C

T0: add A; del B
T1: if (!lookup B) { add B; del C }
T2: r1 = lookup C; smp_mb(); r2 = lookup A

assert(r1 || r2);
Thanx, Paul
> Thanks,
>
> Mathieu
>
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > >
> > > >
> > > > Thanx, Paul
> > > >
> > > > > Thanks,
> > > > >
> > > > > Mathieu
> > > > >
> > > > >
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Mathieu
> > > > > > >
> > > > > > > >
> > > > > > > > Yep, makes sense.
> > > > > > > >
> > > > > > > > I'll add this documentation in the API.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > Mathieu
> > > > > > > >
> > > > > > > > >
> > > > > > > > > cds_lfht_is_node_deleted(): Not an update (let alone an atomic update),
> > > > > > > > > no barriers.
> > > > > > > > >
> > > > > > > > > cds_lfht_resize(): Atomic update, but no return value, so no barrier
> > > > > > > > > implied.
> > > > > > > > >
> > > > > > > > > cds_lfht_for_each(): Not an update (let alone an atomic update),
> > > > > > > > > no barriers.
> > > > > > > > >
> > > > > > > > > cds_lfht_for_each_duplicate(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_for_each_entry(): Ditto.
> > > > > > > > >
> > > > > > > > > cds_lfht_for_each_entry_duplicate(): Ditto.
> > > > > > > > >
> > > > > > > > > Thanx, Paul
> > > > > > > > >
> > > > > > > > > > I'm personally getting confident that the hash table API is clean
> > > > > > > > > > enough, and the implementation well tested, but I'd like to have your
> > > > > > > > > > thoughts on the readiness of the hash table for production use.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Mathieu
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Mathieu Desnoyers
> > > > > > > > > > Operating System Efficiency R&D Consultant
> > > > > > > > > > EfficiOS Inc.
> > > > > > > > > > http://www.efficios.com
> > > > > > > > > >
> > > > > > > > > > _______________________________________________
> > > > > > > > > > rp mailing list
> > > > > > > > > > rp at svcs.cs.pdx.edu
> > > > > > > > > > http://svcs.cs.pdx.edu/mailman/listinfo/rp
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > _______________________________________________
> > > > > > > > > lttng-dev mailing list
> > > > > > > > > lttng-dev at lists.lttng.org
> > > > > > > > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>