[lttng-dev] [rp] [commit] rculfhash: document linearizability guarantees

Paul E. McKenney paulmck at linux.vnet.ibm.com
Thu Apr 19 20:25:59 EDT 2012


On Thu, Apr 19, 2012 at 08:02:57PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Wed, Apr 18, 2012 at 04:10:17PM -0400, Mathieu Desnoyers wrote:
> > > Hi,
> > > 
> > > FYI, I pushed extra documentation of the RCU lock-free Hash Table found
> > > in userspace RCU library master branch regarding the linearizability
> > > guarantees it provides. Feedback is welcome,
> > 
> 
> Hi Paul,
> 
> Your reply brings interesting thoughts, see below,
> 
> > Interesting!  Please see below for my take on it.
> > 
> > 							Thanx, Paul
> > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > commit 0f5543cb1780acef35878646e6cdc966f1406c18
> > > Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > > Date:   Wed Apr 18 16:05:29 2012 -0400
> > > 
> > >     rculfhash: document linearizability guarantees
> > >     
> > >     Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > > 
> > > diff --git a/rculfhash.c b/rculfhash.c
> > > index 6f470fd..d22b44d 100644
> > > --- a/rculfhash.c
> > > +++ b/rculfhash.c
> > > @@ -98,6 +98,33 @@
> > >   *   hash table nodes. These tables are invariant after they are
> > >   *   populated into the hash table.
> > >   *
> > > + * Linearizability Guarantees:
> > 
> > I suggest calling these ordering guarantees.  Otherwise, people will
> > glance at the heading and assume that the hash table is linearizable,
> > which from what I can see is not the case.  (And I do -not- recommend
> > making it be the case, just to be sure that there is no confusion.)
> 
> Ordering guarantees works for me, but I'm curious to know the reasons
> that make you think it's not linearizable. AFAIU, linearizability
> requires:
> 
> - That there is an atomic step at which each operation takes place,
>   located between the start and the end of each method (linearization
>   point).

The above can fail due to reordering of readers by either the CPU
or the compiler.

> - Progress guarantees for each method.
> 
> So looking at linearization points for updates:
> 
> - add: takes place atomically at the cmpxchg that links the new node
>   into the linked list.

Yep.
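
For concreteness, the cmpxchg linearization point described above can be sketched in plain C11 atomics. This is an illustrative model, not the rculfhash code itself; the names `node` and `try_link` are invented for the sketch:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	unsigned long key;
	_Atomic(struct node *) next;
};

/*
 * Link new_node after prev iff prev->next still equals expected_next.
 * A successful compare-exchange is the single atomic step at which
 * the "add" takes effect (its linearization point); failure means a
 * concurrent update changed the list and the caller must retry.
 */
static int try_link(struct node *prev, struct node *expected_next,
		    struct node *new_node)
{
	struct node *expected = expected_next;

	atomic_store_explicit(&new_node->next, expected_next,
			      memory_order_relaxed);
	return atomic_compare_exchange_strong(&prev->next, &expected,
					      new_node);
}
```

Note that the retry on the failure path is only ever triggered by another operation that itself made progress, which is where the lock-free progress guarantee discussed further down comes from.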

> - add_replace: takes place atomically at the uatomic_cmpxchg responsible
>   for linking the new node into the linked list, atomically replacing an
>   existing node by setting the "removed" flag in the replaced node's
>   next pointer at the same time as the new node is inserted.

Yep.

> - add_unique: takes place atomically at the cmpxchg that either
>   links the new node into the linked list, or otherwise detects
>   that the prev node's next pointer has changed, triggering a retry.

Yep.

> - del: takes place atomically by setting the "removed" flag into the
>   next pointer of the node to remove (atomic "or").

Yep.
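
The "removed" flag trick can likewise be modelled with a low-order tag bit and an atomic OR. Again a self-contained sketch with invented names (`logical_del`, `next_ptr`), not the library code:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define REMOVED_FLAG 1UL

struct node {
	unsigned long key;
	_Atomic uintptr_t next;	/* successor pointer | flag bit */
};

/*
 * Logical deletion: atomically OR the "removed" flag into the node's
 * next pointer.  The fetch_or is the linearization point of "del".
 * Returns 0 if this call removed the node, nonzero if the node was
 * already removed by a concurrent del.
 */
static int logical_del(struct node *node)
{
	uintptr_t old = atomic_fetch_or(&node->next, REMOVED_FLAG);

	return (old & REMOVED_FLAG) != 0;
}

/* Strip the flag bit to recover the successor pointer. */
static struct node *next_ptr(uintptr_t next_val)
{
	return (struct node *)(next_val & ~REMOVED_FLAG);
}
```

Because the flag lives in the same machine word as the next pointer, a concurrent add's cmpxchg on that word necessarily fails once the flag is set, which is what lets del and add serialize cleanly.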

> Linearization points for reads: those seem to apply to individual
> operations (e.g. lookup, get_first, get_next, get_next_duplicate):
> 
> - lookup: atomically reading the next pointer of the node prior to
>   either:
>   - the node containing the key we are looking for,
>   - the node containing the first key above the key we are looking for,
>   - the end of the list (next pointer is NULL).

Nope.  This could be reordered by either compiler or CPU with another
read operation.
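
The reordering objection is easiest to see against a minimal traversal sketch (plain C11 atomics, illustrative names, not the rculfhash code): each next-pointer load below is individually atomic and dependency-ordered within this one traversal, but nothing orders these loads against the loads performed by a separate, independent lookup.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	unsigned long key;
	_Atomic(struct node *) next;
};

/*
 * Return the first node with key >= target in a sorted list, or NULL.
 * Each load is atomic, and following the pointer we just loaded gives
 * dependency ordering within this traversal only; the compiler/CPU
 * remains free to reorder these loads against an unrelated read
 * operation in the same thread.
 */
static struct node *lookup_ge(struct node *head, unsigned long target)
{
	struct node *n;

	n = atomic_load_explicit(&head->next, memory_order_consume);
	while (n && n->key < target)
		n = atomic_load_explicit(&n->next, memory_order_consume);
	return n;
}
```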

> - get_first: atomically reading the head next pointer.

Ditto.

> - get_next: atomically reading the next pointer of the current node.

Ordered with respect to the corresponding get_first() due to dependency
ordering, also with respect to another get_next() or get_next_duplicate()
in the same traversal, but not ordered with respect to lookup().

> - get_next_duplicate: atomically reading the next pointer of the
>   current node.

Ditto.

> Progress guarantees:
> 
> Updates:
> 
> Each retry faced by an update is always caused by another concurrent
> operation that itself progresses (lock-freedom type of progress
> guarantee).
> 
> Reads:
> 
> Read operations never loop: the linked lists only go in one direction,
> from the head to the tail.

I agree with both of these.  Updates are lock-free (assuming use of
call_rcu() and sufficient memory) and readers are wait-free.

> We have to note that for linearizability, I used the basic read
> operations (which happen atomically) rather than sequences of such
> operations.
> 
> One item I'm not convinced could be called entirely linearizable is the
> resize operation: it happens lazily at some point in time that is not
> usually within the update operations. For that one, I'd be tempted
> to say that the hash table is not linearizable. Thoughts ?

The possibility of reordering of read operations means that it is not
linearizable even in the absence of resize operations.

> > > + *
> > > + * To discuss these guarantees, we first define "read" operations as any
> > > + * of the following operations surrounded by an RCU read-side lock/unlock
> > > + * pair:
> > > + *  - cds_lfht_lookup
> > > + *  - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
> > > + *  - cds_lfht_first followed by iteration with cds_lfht_next
> > > + *
> > > + * We define "write" operations as any of cds_lfht_add,
> > > + * cds_lfht_add_unique, cds_lfht_add_replace, cds_lfht_del.
> > > + *
> > > + * The following guarantees are offered by this hash table:
> > > + *
> > > + * A) "read" after "write" will always return the result of the latest
> > > + *    write.
> > 
> > Hmmm...  "Latest" by what measure?  What is the guarantee really
> > saying if there are a large number of writes and reads all involving
> > the same hash key?
> > 
> > My guess is that the guarantee is actually of the following form:
> > If there is ordering between a write and a later read, then the
> > read is guaranteed to see the write or some later write.
> 
> Yes, although I think reasoning in terms of "sequence of reads" might be
> more appropriate than "basic read operation". By that, I mean thinking
> in terms of iteration over the entire hash table, or iteration over a
> key and all its duplicates, rather than the basic lookup/get next/get
> next duplicate/get first operations.

Only if you have something explicitly ordering the reads, like an
explicit memory barrier or some such.  The reads in a given traversal
are ordered by dependency ordering, but only with respect to other
reads in that same traversal.

> > As I understand it, this guarantee requires that the read and write
> > operate on the same key.
> 
> Well, a sequence of "read" operations can be an iteration on the whole
> table (get first, then get_next until we reach the end of the table). So
> what I'm trying to say here is that if we have a sequence of read
> operations for which the first operation is ordered after a write, those
> are guaranteed to see this write or some later write.

Yes, but only for the reads in that same traversal or a later traversal
by this same thread.  No guarantees for an independent lookup() operation,
for example.

> Of course, if the sequence of read operations is executed across the
> write, it may or may not see the write.
> 
> And if the sequence of read operations is known to have its last read
> ordered before a write, it is ensured that it will not see this write.

Agreed.

> > This might be easier given a guarantee that writes involving a given
> > key are seen in order.  I believe that the hash table does provide this
> > guarantee and that it would be good to state it explicitly.
> 
> I'm wondering if the reasoning about a single "key" you are proposing
> takes into account that the hash table supports duplicate keys ?

If I understand correctly, the duplicate keys get hidden and uncovered
in order.  But I have not analyzed this case carefully.

> > > + * B) "write" after "read" will never be returned by the read.
> > 
> > There is a useful guarantee involving a pair of reads: If a pair
> > of reads of a given key are ordered (e.g., by a memory barrier),
> > then the second read will return the same value as the first
> > read or some later value.
> 
> Yep, e.g. if we use add_replace to replace a given node with a version
> that has an incremented counter, reads that are ordered with memory
> barriers are guaranteed to see this counter always incrementing.

Agreed.  Cross-thread ordering might be supplied by communication
with some other variable combined with appropriate memory barriers.
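
The "ordered pair of reads sees a non-decreasing value" guarantee can be demonstrated with a small two-thread model (plain C11 atomics plus pthreads; `slot` is a hypothetical stand-in for the counter carried by the node that add_replace keeps replacing, not the hash-table API):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

static _Atomic unsigned long slot;	/* models the replaced node's counter */
static _Atomic int done;

/* Models repeated add_replace of a node carrying an incremented counter. */
static void *writer(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++)
		atomic_fetch_add_explicit(&slot, 1, memory_order_relaxed);
	atomic_store(&done, 1);
	return NULL;
}

/*
 * Two reads of the same key, separated by an explicit full barrier:
 * the second read must return the value seen by the first read or a
 * later one, so the observed counter never goes backward.
 */
static void *reader(void *arg)
{
	(void)arg;
	while (!atomic_load(&done)) {
		unsigned long v1 = atomic_load_explicit(&slot,
						memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);
		unsigned long v2 = atomic_load_explicit(&slot,
						memory_order_relaxed);
		assert(v2 >= v1);
	}
	return NULL;
}
```

In the real hash table the barrier could be cmm_smp_mb() or, as noted above, ordering supplied by communication through some other variable.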

> > > + * C) It is guaranteed that after a grace period following a "del" and
> > > + *    "replace" operation, no reference to the removed items exists in
> > > + *    the hash table.
> > 
> > I would instead say something like: If a grace period separates a "del"
> > or "replace" operation and a subsequent read operation, then that reader
> > is guaranteed not to see the removed item.
> 
> Your statement is true, but the actual idea I want to convey is stronger
> than that: after that grace period, we guarantee that no read nor update
> operation can see the removed pointer (even just internally).

OK, then how about this?

	If a grace period separates a "del" or "replace" operation
	and a subsequent operation, then that subsequent operation is
	guaranteed not to see the removed item.

							Thanx, Paul

> > > + * D) Uniqueness guarantee: when using add_unique and/or add_replace to
> > > + *    insert nodes into the table, if there was previously one node or
> > > + *    less with the same key being inserted by one or more concurrent
> > > + *    add_unique and/or add_replace, all concurrent "read" performed on
> > > + *    the hash table are guaranteed to find one, and only one node with
> > > + *    that key.
> > 
> > How about: Given a hash table that does not contain duplicate items
> > for a given key, there will only be one item in the hash table after
> > an arbitrary sequence of add_unique and/or add_replace operations.
> > Note, however, that a pair of concurrent read operations might well
> > access two different items with that key.
> 
> Yes, that works,
> 
> Thanks for the feedback!
> 
> Mathieu
> 
> > 
> > >   * Bucket node tables:
> > >   *
> > >   * hash table	hash table	the last	all bucket node tables
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > > 
> > 
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
> 
> _______________________________________________
> rp mailing list
> rp at svcs.cs.pdx.edu
> http://svcs.cs.pdx.edu/mailman/listinfo/rp
> 



