[lttng-dev] [rp] [commit] rculfhash: document linearizability guarantees
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Thu Apr 19 20:25:59 EDT 2012
On Thu, Apr 19, 2012 at 08:02:57PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Wed, Apr 18, 2012 at 04:10:17PM -0400, Mathieu Desnoyers wrote:
> > > Hi,
> > >
> > > FYI, I pushed extra documentation of the RCU lock-free Hash Table found
> > > in userspace RCU library master branch regarding the linearizability
> > > guarantees it provides. Feedback is welcome,
> >
>
> Hi Paul,
>
> Your reply brings interesting thoughts, see below,
>
> > Interesting! Please see below for my take on it.
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > commit 0f5543cb1780acef35878646e6cdc966f1406c18
> > > Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > > Date: Wed Apr 18 16:05:29 2012 -0400
> > >
> > > rculfhash: document linearizability guarantees
> > >
> > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > >
> > > diff --git a/rculfhash.c b/rculfhash.c
> > > index 6f470fd..d22b44d 100644
> > > --- a/rculfhash.c
> > > +++ b/rculfhash.c
> > > @@ -98,6 +98,33 @@
> > > * hash table nodes. These tables are invariant after they are
> > > * populated into the hash table.
> > > *
> > > + * Linearizability Guarantees:
> >
> > I suggest calling these ordering guarantees. Otherwise, people will
> > glance at the heading and assume that the hash table is linearizable,
> > which from what I can see is not the case. (And I do -not- recommend
> > making it be the case, just to be sure that there is no confusion.)
>
> Ordering guarantees works for me, but I'm curious to know the reasons
> that make you think it's not linearizable. AFAIU, linearizability
> requires:
>
> - That there is an atomic step at which each operation takes place,
> located between the start and the end of each method (linearization
> point).
The above can fail due to reordering of readers by either the CPU
or the compiler.
> - Progress guarantees for each method.
>
> So looking at linearization points for updates:
>
> - add: takes place atomically at the cmpxchg that links the new node
> into the linked list.
Yep.
> - add_replace: takes place atomically at the uatomic_cmpxchg responsible
> for linking the new node into the linked list, replacing an existing
> node (add_replace) atomically by setting the "removed" flag into the
> replaced node's next pointer at the same time as the new node is
> inserted.
Yep.
> - add_unique: takes place atomically at the cmpxchg that either
> links the new node into the linked list, or otherwise detects
> that the prev node's next pointer has changed, triggering a retry.
Yep.
> - del: takes place atomically by setting the "removed" flag into the
> next pointer of the node to remove (atomic "or").
Yep.
> Linearization points for reads: those seem to apply to individual
> operations (e.g. lookup, get_first, get_next, get_next_duplicate):
>
> - lookup: atomically reading the next pointer of the node prior to
> either:
> - the node containing the key we are looking for,
> - the node containing the first key above the key we are looking for,
> - the end of the list (next pointer is NULL).
Nope. This could be reordered by either compiler or CPU with another
read operation.
> - get_first: atomically reading the head next pointer.
Ditto.
> - get_next: atomically reading the next pointer of the current node.
Ordered with respect to the corresponding get_first() due to dependency
ordering, also with respect to another get_next() or get_next_duplicate()
in the same traversal, but not ordered with respect to lookup().
> - get_next_duplicate: atomically reading the next pointer of the
> current node.
Ditto.
> Progress guarantees:
>
> Updates:
>
> Each retry faced by an update is always caused by another concurrent
> operation that itself progresses (lock-freedom type of progress
> guarantee).
>
> Reads:
>
> Read operations never loop: the linked lists only go in one direction,
> from the head to the tail.
I agree with both of these. Updates are lock-free (assuming use of
call_rcu() and sufficient memory) and readers are wait-free.
> We have to note that for linearizability, I used the basic read
> operations (which happen atomically) rather than sequences of such
> operations.
>
> One item I'm not convinced could be called entirely linearizable is the
> resize operation: it happens lazily, at some point in time that does not
> usually fall within the bounds of any update operation. For that one,
> I'd be tempted to say that the hash table is not linearizable. Thoughts?
The possibility of reordering of read operations means that it is not
linearizable even in the absence of resize operations.
> > > + *
> > > + * To discuss these guarantees, we first define "read" operations as any
> > > + * of the following operations surrounded by an RCU read-side lock/unlock
> > > + * pair:
> > > + * - cds_lfht_lookup
> > > + * - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
> > > + * - cds_lfht_first followed by iteration with cds_lfht_next
> > > + *
> > > + * We define "write" operations as any of cds_lfht_add,
> > > + * cds_lfht_add_unique, cds_lfht_add_replace, cds_lfht_del.
> > > + *
> > > + * The following guarantees are offered by this hash table:
> > > + *
> > > + * A) "read" after "write" will always return the result of the latest
> > > + * write.
> >
> > Hmmm... "Latest" by what measure? What is the guarantee really
> > saying if there are a large number of writes and reads all involving
> > the same hash key?
> >
> > My guess is that the guarantee is actually of the following form:
> > If there is ordering between a write and a later read, then the
> > read is guaranteed to see the write or some later write.
>
> Yes, although I think reasoning in terms of "sequences of reads" might
> be more appropriate than "basic read operations". By that, I mean
> thinking in terms of iteration over the entire hash table, or iteration
> over a key and all its duplicates, rather than the basic lookup/get
> next/get next duplicate/get first operations.
Only if you have something explicitly ordering the reads, like an
explicit memory barrier or some such. The reads in a given traversal
are ordered by dependency ordering, but only with respect to other
reads in that same traversal.
> > As I understand it, this guarantee requires that the read and write
> > operate on the same key.
>
> Well, a sequence of "read" operations can be an iteration over the whole
> table (get_first, then get_next until we reach the end of the table). So
> what I'm trying to say here is that if we have a sequence of read
> operations for which the first operation is ordered after a write, those
> are guaranteed to see this write or some later write.
Yes, but only for the reads in that same traversal or a later traversal
by this same thread. No guarantees for an independent lookup() operation,
for example.
> Of course, if the sequence of read operations is executed across the
> write, it may or may not see the write.
>
> And if the sequence of read operations is known to have its last read
> ordered before a write, it is ensured that it will not see this write.
Agreed.
> > This might be easier given a guarantee that writes involving a given
> > key are seen in order. I believe that the hash table does provide this
> > guarantee and that it would be good to state it explicitly.
>
> I'm wondering if the reasoning about a single "key" you are proposing
> takes into account that the hash table supports duplicate keys?
If I understand correctly, the duplicate keys get hidden and uncovered
in order. But I have not analyzed this case carefully.
> > > + * B) "write" after "read" will never be returned by the read.
> >
> > There is a useful guarantee involving a pair of reads: If a pair
> > of reads of a given key are ordered (e.g., by a memory barrier),
> > then the second read will return the same value as the first
> > read or some later value.
>
> Yep, e.g. if we use add_replace to replace a given node with a version
> that has an incremented counter, reads that are ordered with memory
> barriers are guaranteed to see this counter always incrementing.
Agreed. Cross-thread ordering might be supplied by communication
with some other variable combined with appropriate memory barriers.
> > > + * C) It is guaranteed that after a grace period following a "del" and
> > > + * "replace" operation, no reference to the removed items exists in
> > > + * the hash table.
> >
> > I would instead say something like: If a grace period separates a "del"
> > or "replace" operation and a subsequent read operation, then that reader
> > is guaranteed not to see the removed item.
>
> Your statement is true, but the actual idea I want to convey is stronger
> than that: after that grace period, we guarantee that no read nor update
> operation can see the removed pointer (even just internally).
OK, then how about this?
If a grace period separates a "del" or "replace" operation
and a subsequent operation, then that subsequent operation is
guaranteed not to see the removed item.
Thanx, Paul
> > > + * D) Uniqueness guarantee: when using add_unique and/or add_replace to
> > > + * insert nodes into the table, if there was previously one node or
> > > + * less with the same key being inserted by one or more concurrent
> > > + * add_unique and/or add_replace, all concurrent "read" performed on
> > > + * the hash table are guaranteed to find one, and only one node with
> > > + * that key.
> >
> > How about: Given a hash table that does not contain duplicate items
> > for a given key, there will only be one item in the hash table after
> > an arbitrary sequence of add_unique and/or add_replace operations.
> > Note, however, that a pair of concurrent read operations might well
> > access two different items with that key.
>
> Yes, that works,
>
> Thanks for the feedback!
>
> Mathieu
>
> >
> > > * Bucket node tables:
> > > *
> > > * hash table hash table the last all bucket node tables
> > >
> > > --
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > >
> >
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>
> _______________________________________________
> rp mailing list
> rp at svcs.cs.pdx.edu
> http://svcs.cs.pdx.edu/mailman/listinfo/rp
>