[lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

Wed May 2 00:16:39 EDT 2012

* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > > On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
> > > > > > * Mathieu Desnoyers (mathieu.desnoyers at efficios.com) wrote:
> > > > > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > > > > > On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
> > > > > > > > > Hi!
> > > > > > > > > 
> > > > > > > > > After 1 year of development, with the last 6-7 months spent polishing
> > > > > > > > > the API and testing the implementation, I think it is getting about time
> > > > > > > > > to release the RCU lock-free hash table in a new Userspace RCU version
> > > > > > > > > (0.7).
> > > > > > > > > 
> > > > > > > > > I recently described the guarantees provided by the hash table in more
> > > > > > > > > detail, and created tests for the uniqueness guarantee for traversals
> > > > > > > > > performed concurrently with add_unique and add_replace operations. I
> > > > > > > > > also added test modes that create long hash chains, to test corner-cases
> > > > > > > > > of the hash table.
> > > > > > > > > 
> > > > > > > > > One small thing I wonder is whether we should document that the hash
> > > > > > > > > table update operations imply full memory barriers ?
> > > > > > > > 
> > > > > > > > The Linux kernel's rule seems good here -- if a hash-table operation is
> > > > > > > > atomic and returns a value, it should imply a full barrier.  So:
> > > > > > > > 
> > > > > > > > cds_lfht_new(): No point in claiming barriers -- publishing the
> > > > > > > > 	pointer to the hash table is where the barriers are important.
> > > > > > > > 
> > > > > > > > cds_lfht_destroy(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_lookup(): Not an update (let alone an atomic update), no barriers.
> > > > > > > > 
> > > > > > > > cds_lfht_next_duplicate(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_first(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_next(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_add(): Atomic update, but no return value, so no barrier
> > > > > > > > 	implied.
> > > > > > > 
> > > > > > > Yep, makes sense. We use cmpxchg internally to perform the update, but
> > > > > > > it could make sense to eventually use a cmpxchg that has no memory
> > > > > > > barriers to perform this update. So I agree on not providing a memory
> > > > > > > barrier guarantee on the "add" operation, since it does not return any
> > > > > > > value.
> > > > > > > 
> > > > > > > > 
> > > > > > > > cds_lfht_add_unique(): Atomic update that returns a value, so should
> > > > > > > > 	imply a full memory barrier.
> > > > > > > 
> > > > > > > add_unique is a bit special:
> > > > > > > 
> > > > > > > - if it returns the node received as parameter, it means the add
> > > > > > >   succeeded, which imply an update, and thus a memory barrier.
> > > > > > 
> > > > > > Hrm, thinking further: if we make the "add" operation not act as a
> > > > > > full memory barrier, then the add_unique success should not act as a
> > > > > > full mb neither.
> > > > > 
> > > > > Think of it as being similar to the Linux kernel's atomic_inc() and
> > > > > atomic_add_return() primitives.  The latter guarantees memory barriers
> > > > > and the former does not.
> > > > 
> > > > I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally
> > > > different.
> > > > 
> > > > atomic_inc:
> > > >  - input: increment
> > > >  - output: none
> > > >  - effect: increment target address value
> > > > 
> > > > atomic_add_return:
> > > >  - input: increment
> > > >  - output: incremented value
> > > >  - effect: increment target address value, atomically reading the
> > > >    resulting value.
> > > > 
> > > > hash table add:
> > > >  - input: new node to add
> > > >  - output: none
> > > >  - effect: atomically add the new node into the table
> > > > 
> > > > hash table add_unique (success):
> > > >  - input: new node to add
> > > >  - output: (we just return whether the operation has succeeded)
> > > >  - effect: atomically add the new node into the table
> > > > 
> > > > hash table add_unique (failure):
> > > >  - input: new node to try adding
> > > >  - output: (we just return whether the operation has succeeded)
> > > >  - effect: simple lookup (read)
> > > > 
> > > > So as we see, the add_unique failure only performs a read. Adding a
> > > > memory barrier before this read would require us to add a memory barrier
> > > > also on the success path, which would degrade performance. The success
> > > > path does: lookup failure, cmpxchg to add the node, retry if changed.
> > > > Adding a memory barrier before the lookup would add an extra memory
> > > > barrier in addition to the one located in the cmpxchg, and I don't think
> > > > we want that overhead.
> > > 
> > > Perhaps a better analogy is cmpxchg() in the Linux kernel.  Some
> > > architectures place the memory barrier before unconditionally, but place
> > > a memory barrier after only if the cmpxchg() succeeds.  Of course,
> > > architectures for which the cmpxchg instruction implies a barrier
> > > don't need any explicit memory-barrier instructions.
> > > 
> > > I do see your point about the memory barrier in the failure case requiring
> > > an additional memory barrier in the success case, however.
> > > 
> > > > > > > - if it returns a different node than the one received as parameter, it
> > > > > > >   failed, and thus means that it only performed a lookup, so there is no
> > > > > > >   guarantee that a memory barrier has been executed.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > cds_lfht_add_replace(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_replace(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_del(): Ditto.
> > > > > > 
> > > > > > One more point: "del" is similar to add_unique: if it succeeds, we
> > > > > > execute a full memory barrier. If it fails, no memory barrier is
> > > > > > guaranteed. But if we want to make the guarantees relax, we might not
> > > > > > want to guarantee that a memory barrier is present in any of the "del"
> > > > > > cases.
> > > > > > 
> > > > > > In the end, the only primitives for which I think it really makes sense
> > > > > > to provide memory barrier semantic is the add_replace and replace : they
> > > > > > actually act as an higher-level "cmpxchg" over the hash table nodes.
> > > > > 
> > > > > I believe that this should do the same -- memory barrier before, but no
> > > > > memory barrier after on failure.
> > > > > 
> > > > > Another approach is C++11, in which there are a couple of arguments to
> > > > > the compare-and-swap primitive specifying the memory ordering constraints
> > > > > for the success and failure cases, respectively.
> > > > 
> > > > Hrm, I think adding this kind of flexibility might make the API too
> > > > clobbered by details that "usual cases" don't care about.
> > > 
> > > ;-)
> > > 
> > > > > Unless you expect use cases with lots of failing cds_lfht_del() and
> > > > > add_unique() calls, the performance difference should not be significant.
> > > > 
> > > > The problem is that I expect very, very frequent failing add_unique
> > > > calls for use-cases like Binary Decision Diagrams (BDD) creation. This
> > > > is why I don't think it is appropriate to put memory barriers on the
> > > > failure cases, as these should have no more overhead than a simple
> > > > lookup.
> > > > 
> > > > And if we choose not to provide memory barriers for add_unique failure,
> > > > we should probably do the same for "del" to keep a consistent behavior
> > > > over the API.
> > > > 
> > > > Thoughts ?
> > > 
> > > Given your BDD example, letting the failure cases avoid memory barriers
> > > seems reasonable.  If problems arise, the default interfaces can be
> > > fully ordered with faster versions where the failure cases avoid memory
> > > barriers.
> > 
> > I think we agree on the failure cases of del and add_unique, which is
> > good.
> > 
> > Now, I'm thinking slightly further about the success cases of add,
> > add_unique and del.
> > 
> > Typically, when giving a node to add or add_unique for population into
> > the hash table, we have to order the writes that initialize the node
> > before its insertion into the hash table. So I think it would make sense
> > to guarantee that we provide a memory barrier prior to add and
> > add_unique (success) operations.
> > 
> > For the "del" case, it's the other way around: if a deletion succeeds,
> > we want to have a memory barrier between the removal flag write and the
> > following "free()" of the element. So we need to ensure that there is a
> > memory barrier after a successful deletion.
> > 
> > If we think about it, the "add_replace" acts both as an "add" and "del":
> > in all cases, we need to have a memory barrier before the add_replace,
> > because we need to order the node initialization before the atomic
> > insertion into the hash table, and we also need a memory barrier after,
> > but only really in the case where add_replace replaced an old node, thus
> > acting as a delete of that old node.
> > 
> > Considering all this, the barrier semantic I'm proposing is:
> > 
> > add:
> > - full memory barrier before the insertion.
> > 
> > add_unique:
> > - success: full memory barrier before the insertion.
> > - failure: no memory barrier.
> > 
> > replace:
> > - success: full memory barrier before the insertion, and after the
> >   removal of the old node.
> > - failure: no memory barrier.
> > 
> > add_replace:
> > - returns an old node (replace): full memory barrier before the
> >   insertion, and after the removal of the old node.
> > - returns NULL (add): full memory barrier before the insertion.
> > 
> > del:
> > - success: full memory barrier after the removal.
> > - failure: no memory barrier.
> > 
> > I think those are the minimal barriers we need to provide to ensure that
> > users will not have to add barriers of their own to use the API.
> > 
> > Thoughts ?
> 
> This means that some other CPU might see an add followed by a del as
> happening in the opposite order.  Is that really OK?

Here I suppose you are talking about add and del that touch different
nodes, since add/del touching the same node would modify the same memory
location, so both operations would be seen in the right order by the
other processors.

I guess the question here is: how does the remote CPU observe this ?
Is it from a read-side traversal, or from update operations performed on
the remote CPU ? Or from lookups of arbitrary key values ? Or from other
update operations ?

For read-side traversal, given the traversal within a single hash chain
goes from newer nodes to older nodes, the sequence "add" then "del",
adding a new node, then deleting an old one (e.g. from the same hash
chain) first adds the new node to the beginning of the hash chain, and
then removes the old node that is later in the chain (in traversal
order). In this case, if the "del" is published before the "add", this
is entirely equivalent to have a read-side traversal that misses the
added node, but sees the effect of the "del".

In terms of read-side traversal, what we seem to care about is more the
"del" then "add" performed on different nodes (e.g. of the same hash
chain): given that the "add" is at the beginning of the chain, and "del"
is somewhere within the chain, a read-side traversal that observes the
"add" would _always_ observe the effect of a prior "del". Thankfully,
"del" then "add" is ordered in the scheme I propose, thanks to the
memory barrier after "del" and also the one before "add".

For traversal of the entire table (not just same hash chain), the
traversal vs key order is random, so we cannot assume anything about the
observation order of add/del of different keys.

If we perform two lookups of different keys, without explicit memory
barriers between the lookups, we already have no guarantee whatsoever
that the published information will be brought into the observer CPU in
the right order: IOW, the lookups of different keys can be reordered. So
I guess we don't care about add vs del ordering on the publish side for
that one.

The other scenarios I see are related to update operations (add,
add_unique, add_replace, replace, del). If a sequence of these
operations is executed on the "observer" CPU, and expects to see the
effect of the sequence of "add" then "del" performed on the "publisher"
CPU in order, then, yes, an additional memory barrier would be required
to order add before del.

The question here is: do we care about providing order of add before del
only for the sake of the other update operations running on another CPU
that would expect to see their effect in order ? I don't know, maybe.

I guess the safest route would be to provide the ordering guarantee,
given that there exists a way to observe this and possibly care about
it. I wonder if it would be better to put the barrier after the
add/add_unique (success), or before the "del".

Also, do you think a write barrier would be sufficient, or we should
keep full barriers ?

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Thanks,
> > > > 
> > > > Mathieu
> > > > 
> > > > 
> > > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > > > Thoughts ?
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Mathieu
> > > > > > 
> > > > > > > 
> > > > > > > Yep, makes sense.
> > > > > > > 
> > > > > > > I'll add this documentation in the API.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > Mathieu
> > > > > > > 
> > > > > > > > 
> > > > > > > > cds_lfht_is_node_deleted(): Not an update (let alone an atomic update),
> > > > > > > > 	no barriers.
> > > > > > > > 
> > > > > > > > cds_lfht_resize(): Atomic update, but no return value, so no barrier
> > > > > > > > 	implied.
> > > > > > > > 
> > > > > > > > cds_lfht_for_each(): Not an update (let alone an atomic update),
> > > > > > > > 	no barriers.
> > > > > > > > 
> > > > > > > > cds_lfht_for_each_duplicate(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_for_each_entry(): Ditto.
> > > > > > > > 
> > > > > > > > cds_lfht_for_each_entry_duplicate(): Ditto.
> > > > > > > > 
> > > > > > > > 							Thanx, Paul
> > > > > > > > 
> > > > > > > > > I'm personally getting confident that the hash table API is clean
> > > > > > > > > enough, and the implementation well tested, but I'd like to have your
> > > > > > > > > thoughts on the readiness of the hash table for production use.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Mathieu
> > > > > > > > > 
> > > > > > > > > -- 
> > > > > > > > > Mathieu Desnoyers
> > > > > > > > > Operating System Efficiency R&D Consultant
> > > > > > > > > EfficiOS Inc.
> > > > > > > > > http://www.efficios.com
> > > > > > > > > 
> > > > > > > > > _______________________________________________
> > > > > > > > > rp mailing list
> > > > > > > > > rp at svcs.cs.pdx.edu
> > > > > > > > > http://svcs.cs.pdx.edu/mailman/listinfo/rp
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > > lttng-dev mailing list
> > > > > > > > lttng-dev at lists.lttng.org
> > > > > > > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > > > > > > 
> > > > > > > -- 
> > > > > > > Mathieu Desnoyers
> > > > > > > Operating System Efficiency R&D Consultant
> > > > > > > EfficiOS Inc.
> > > > > > > http://www.efficios.com
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > lttng-dev mailing list
> > > > > > > lttng-dev at lists.lttng.org
> > > > > > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > > > > > 
> > > > > > -- 
> > > > > > Mathieu Desnoyers
> > > > > > Operating System Efficiency R&D Consultant
> > > > > > EfficiOS Inc.
> > > > > > http://www.efficios.com
> > > > > > 
> > > > > > _______________________________________________
> > > > > > rp mailing list
> > > > > > rp at svcs.cs.pdx.edu
> > > > > > http://svcs.cs.pdx.edu/mailman/listinfo/rp
> > > > > > 
> > > > > 
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > Operating System Efficiency R&D Consultant
> > > > EfficiOS Inc.
> > > > http://www.efficios.com
> > > > 
> > > > _______________________________________________
> > > > rp mailing list
> > > > rp at svcs.cs.pdx.edu
> > > > http://svcs.cs.pdx.edu/mailman/listinfo/rp
> > > > 
> > > 
> > > 
> > > _______________________________________________
> > > lttng-dev mailing list
> > > lttng-dev at lists.lttng.org
> > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 
> 
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com