[lttng-dev] [commit] rculfhash: document linearizability guarantees

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Thu Apr 19 20:02:57 EDT 2012


* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Wed, Apr 18, 2012 at 04:10:17PM -0400, Mathieu Desnoyers wrote:
> > Hi,
> > 
> > FYI, I pushed extra documentation of the RCU lock-free Hash Table found
> > in userspace RCU library master branch regarding the linearizability
> > guarantees it provides. Feedback is welcome,
> 

Hi Paul,

Your reply brings interesting thoughts, see below,

> Interesting!  Please see below for my take on it.
> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > commit 0f5543cb1780acef35878646e6cdc966f1406c18
> > Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > Date:   Wed Apr 18 16:05:29 2012 -0400
> > 
> >     rculfhash: document linearizability guarantees
> >     
> >     Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > 
> > diff --git a/rculfhash.c b/rculfhash.c
> > index 6f470fd..d22b44d 100644
> > --- a/rculfhash.c
> > +++ b/rculfhash.c
> > @@ -98,6 +98,33 @@
> >   *   hash table nodes. These tables are invariant after they are
> >   *   populated into the hash table.
> >   *
> > + * Linearizability Guarantees:
> 
> I suggest calling these ordering guarantees.  Otherwise, people will
> glance at the heading and assume that the hash table is linearizable,
> which from what I can see is not the case.  (And I do -not- recommend
> making it be the case, just to be sure that there is no confusion.)

Ordering guarantees works for me, but I'm curious to know the reasons
that make you think it's not linearizable. AFAIU, linearizability
requires:

- That there is an atomic step at which each operation takes place,
  located between the start and the end of each method (linearization
  point).
- Progress guarantees for each method.

So looking at linearization points for updates:

- add: takes place atomically at the cmpxchg that links the new node
  into the linked list.

- add_replace: takes place atomically at the uatomic_cmpxchg responsible
  for linking the new node into the linked list; the existing node is
  replaced atomically because the "removed" flag is set in the replaced
  node's next pointer at the same time as the new node is inserted.

- add_unique: takes place atomically at the cmpxchg that either
  links the new node into the linked list, or detects that the
  previous node's next pointer has changed, triggering a retry.

- del: takes place atomically by setting the "removed" flag in the
  next pointer of the node to remove (atomic "or").
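
The "removed" flag trick used by del (and add_replace) can be sketched
with C11 atomics. This is an illustrative sketch, not the actual
rculfhash internals (the real code uses the uatomic_* primitives and the
names below are mine): the low bit of an aligned next pointer is
reserved as the removal mark, so a single atomic OR both publishes the
logical deletion and, by changing the pointer's value, makes any
concurrent cmpxchg on that same pointer fail and retry.

```c
#include <stdatomic.h>
#include <stdint.h>

#define REMOVED_FLAG ((uintptr_t)0x1)  /* low bit of the aligned next pointer */

struct node {
	_Atomic uintptr_t next;        /* pointer bits | REMOVED_FLAG */
};

/* Logical deletion: the single atomic OR is the linearization point.
 * Returns -1 if the node was already logically removed. */
static int node_del(struct node *node)
{
	uintptr_t old = atomic_fetch_or(&node->next, REMOVED_FLAG);

	return (old & REMOVED_FLAG) ? -1 : 0;
}

/* The pointer bits survive the flag: traversal still works. */
static struct node *next_ptr(struct node *node)
{
	return (struct node *)(atomic_load(&node->next) & ~REMOVED_FLAG);
}

static int is_removed(struct node *node)
{
	return atomic_load(&node->next) & REMOVED_FLAG;
}
```

Because the flag lives in the next pointer itself, an update that tries
to cmpxchg that pointer after the OR necessarily sees a value with the
flag set and fails, which is what makes removal and insertion mutually
atomic.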

Linearization points for reads: those seem to apply to individual
operations (e.g. lookup, get_first, get_next, get_next_duplicate):

- lookup: atomically reading the next pointer of the node prior to
  either:
  - the node containing the key we are looking for,
  - the node containing the first key above the key we are looking for,
  - the end of the list (next pointer is NULL).

- get_first: atomically reading the head next pointer.

- get_next: atomically reading the next pointer of the current node.

- get_next_duplicate: atomically reading the next pointer of the
  current node.

Progress guarantees:

Updates:

Each retry faced by an update is always caused by another concurrent
operation that itself progresses (lock-freedom type of progress
guarantee).

Reads:

Read operations never loop: the linked lists only go in one direction,
from the head to the tail.
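
The one-way traversal argument can be illustrated with a minimal,
single-threaded sketch (hypothetical names, not the real
cds_lfht_lookup; the removal mark is shown as a plain field rather than
a bit stolen from the next pointer): the reader only follows next
pointers toward the NULL-terminated tail, so it takes at most one step
per node and never has to retry.

```c
#include <stddef.h>

struct lnode {
	int key;
	int removed;           /* stands in for the "removed" flag */
	struct lnode *next;    /* sorted list, NULL-terminated tail */
};

/* Returns the first non-removed node with a matching key, or NULL.
 * The loop advances strictly toward the tail, so it visits each
 * node at most once: bounded, wait-free style progress for reads. */
static struct lnode *list_lookup(struct lnode *head, int key)
{
	for (struct lnode *n = head; n != NULL; n = n->next) {
		if (n->key > key)
			break;  /* sorted: key cannot appear further down */
		if (n->key == key && !n->removed)
			return n;
	}
	return NULL;
}
```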

Note that for linearizability, I considered the basic read operations
(which happen atomically) rather than sequences of such operations.

One item I'm not convinced could be called entirely linearizable is the
resize operation: it happens lazily, at some point in time that does not
usually fall within any single update operation. For that one, I'd be
tempted to say that the hash table is not linearizable. Thoughts?

> 
> > + *
> > + * To discuss these guarantees, we first define "read" operations as any
> > + * of the following operations surrounded by an RCU read-side lock/unlock
> > + * pair:
> > + *  - cds_lfht_lookup
> > + *  - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
> > + *  - cds_lfht_first followed by iteration with cds_lfht_next
> > + *
> > + * We define "write" operations as any of cds_lfht_add,
> > + * cds_lfht_add_unique, cds_lfht_add_replace, cds_lfht_del.
> > + *
> > + * The following guarantees are offered by this hash table:
> > + *
> > + * A) "read" after "write" will always return the result of the latest
> > + *    write.
> 
> Hmmm...  "Latest" by what measure?  What is the guarantee really
> saying if there are a large number of writes and reads all involving
> the same hash key?
> 
> My guess is that the guarantee is actually of the following form:
> If there is ordering between a write and a later read, then the
> read is guaranteed to see the write or some later write.

Yes, although I think reasoning in terms of a "sequence of reads" might
be more appropriate than a "basic read operation". By that, I mean
thinking in terms of iteration over the entire hash table, or iteration
over a key and all its duplicates, rather than the basic lookup/get
next/get next duplicate/get first operations.

> As I understand it, this guarantee requires that the read and write
> operate on the same key.

Well, a sequence of "read" operations can be an iteration over the whole
table (get first, then get_next until we reach the end of the table). So
what I'm trying to say here is that if we have a sequence of read
operations whose first operation is ordered after a write, that sequence
is guaranteed to see this write or some later write.

Of course, if the sequence of read operations is executed across the
write, it may or may not see the write.

And if the sequence of read operations is known to have its last read
ordered before a write, it is guaranteed not to see this write.

> 
> This might be easier given a guarantee that writes involving a given
> key are seen in order.  I believe that the hash table does provide this
> guarantee and that it would be good to state it explicitly.

I'm wondering if the reasoning about a single "key" you are proposing
takes into account that the hash table supports duplicate keys?

> 
> > + * B) "write" after "read" will never be returned by the read.
> 
> There is a useful guarantee involving a pair of reads: If a pair
> of reads of a given key are ordered (e.g., by a memory barrier),
> then the second read will return the same value as the first
> read or some later value.

Yep, e.g. if we use add_replace to replace a given node with a version
that has an incremented counter, reads that are ordered with memory
barriers are guaranteed to see this counter always incrementing.
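
This counter argument can be sketched as follows. The sketch is
illustrative only: it uses a single atomic slot instead of the hash
table, the names (vnode, replace_incremented) are mine, and C11
sequentially consistent atomics stand in for the explicit memory
barriers. The successful compare-exchange plays the role of
add_replace's linearization point, so a reader ordered after it can
only observe a counter at least as large as before.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct vnode {
	int counter;
};

/* Replace the published node with a copy carrying counter + 1.
 * The successful compare-exchange is the linearization point.
 * Returns the replaced node (in the real RCU scheme it would only
 * be freed after a grace period). */
static struct vnode *replace_incremented(struct vnode *_Atomic *slot)
{
	struct vnode *old, *new = malloc(sizeof(*new));

	do {
		old = atomic_load(slot);
		new->counter = old->counter + 1;
	} while (!atomic_compare_exchange_weak(slot, &old, new));
	return old;
}
```

Each replacement installs a strictly larger counter, which is why two
reads of the slot that are ordered with respect to each other can never
observe the counter going backwards.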

> 
> > + * C) It is guaranteed that after a grace period following a "del" and
> > + *    "replace" operation, no reference to the removed items exists in
> > + *    the hash table.
> 
> I would instead say something like: If a grace period separates a "del"
> or "replace" operation and a subsequent read operation, then that reader
> is guaranteed not to see the removed item.

Your statement is true, but the actual idea I want to convey is stronger
than that: after that grace period, we guarantee that no read or update
operation can see the removed pointer (even just internally).

> 
> > + * D) Uniqueness guarantee: when using add_unique and/or add_replace to
> > + *    insert nodes into the table, if there was previously one node or
> > + *    less with the same key being inserted by one or more concurrent
> > + *    add_unique and/or add_replace, all concurrent "read" performed on
> > + *    the hash table are guaranteed to find one, and only one node with
> > + *    that key.
> 
> How about: Given a hash table that does not contain duplicate items
> for a given key, there will only be one item in the hash table after
> an arbitrary sequence of add_unique and/or add_replace operations.
> Note, however, that a pair of concurrent read operations might well
> access two different items with that key.

Yes, that works,

Thanks for the feedback!

Mathieu

> 
> >   * Bucket node tables:
> >   *
> >   * hash table	hash table	the last	all bucket node tables
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


