[lttng-dev] [rp] rculfhash ordering guarantees

Mon Apr 23 22:54:54 EDT 2012

* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Mon, Apr 23, 2012 at 03:45:32PM -0400, Mathieu Desnoyers wrote:
> > Hi Paul,
> > 
> > Here is the updated text I plan for the next update. Comments are
> > welcome, thanks !
> 
> Looks much improved!  The inevitable questions and comments interspersed.
> 
> 							Thanx, Paul
> 
> > Mathieu
> > 
> >  * Ordering Guarantees:
> >  *
> >  * To discuss these guarantees, we first define "read" operation as any
> >  * of the the basic cds_lfht_lookup, cds_lfht_next_duplicate,
> >  * cds_lfht_first, cds_lfht_next operation.
> >  *
> >  * We define "read traversal" operation as any of the following
> >  * group of operations
> >  *  - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
> >  *  - cds_lfht_first followed iteration with cds_lfht_next
> 
> Can't cds_lfht_next() and cds_lfht_next_duplicate() be intermixed?
> Something like the following:
> 
> 	rcu_read_lock();
> 	p = cds_lfht_lookup(...);
> 	while (p != NULL) {
> 		do_something(p);
> 		if (p->dups_of_interest)
> 			p = cds_lfht_next_duplicate(...);
> 		else
> 			p = cds_lfht_next(...);
> 	}
> 	rcu_read_unlock();

Yes, they can technically be intermixed, but I doubt that the
"cds_lfht_next" after a lookup will be of any interest: it will get the
nodes with keys following the current node's key (in split-ordered
order).

A use-case that would make more sense would be:

- cds_lfht_first

- iterate on the table with cds_lfht_next(), until we find an
  interesting key.

- Then, iterate on all the duplicate of this key.

So I will document that both can be intermixed.

> 
> >  *
> >  * We define "write" operations as any of cds_lfht_add,
> >  * cds_lfht_add_unique, cds_lfht_add_replace, cds_lfht_del.
> >  *
> >  * We define "prior" and "later" node as nodes observable by reads and
> >  * read traversals respectively before and after a write or sequence of
> >  * write operations.
> >  *
> >  * It is implicit from the requirement of the read, read traversal, and
> >  * write operations that RCU read-side locks need to be held across
> >  * traversals, and between a read or read traversal and invocation of a
> >  * write that receives a node as argument.
> 
> This paragraph is implying a lot...  How about the following?
> 
> 	Hash-table operations are often chained, for example, the
> 	pointer returned by a cds_lfht_lookup() might be passed to a
> 	cds_lfht_next(), whose return value might in turn be passed to
> 	another hash-table operation.  This entire chain of operations
> 	must be enclosed by a pair of matching rcu_read_lock() and
> 	rcu_read_unlock() operations.

looks good, will update.

> 
> >  *
> >  * The following ordering guarantees are offered by this hash table:
> >  *
> >  * A.1) "read" after "write": if there is ordering between a write and a
> >  *      later read, then the read is guaranteed to see the write or some
> >  *      later write.
> >  * A.2) "read traversal" after "write": given that there is dependency
> >  *      ordering between reads in a "read traversal", if there is
> >  *      ordering between a write and the first read of the traversal,
> >  *      then the "read traversal" is guaranteed to see the write or
> >  *      some later write.
> >  * B.1) "write" after "read": if there is ordering between a read and a
> >  *      later write, then the read will never see the write.
> >  * B.2) "write" after "read traversal": given that there is dependency
> >  *      ordering between reads in a "read traversal", if there is
> >  *      ordering between the last read of the traversal and a later
> >  *      write, then the "read traversal" will never see the write.
> >  * C)   "write" while "read traversal": if a write occurs during a "read
> >  *      traversal", the traversal may, or may not, see the write.
> >  * D)   "write" vs "write": writes occur atomically between their
> >  *      invocation and the moment they return. There is a full memory
> >  *      barrier before and after the atomic "commit" point of each
> >  *      write.
> 
> This is a description of the implementation, while the others are guarantees.
> Of course, that begs the question of what the guarantee is...  Maybe the
> following?
> 
> D.1	"write" after "write": if there is ordering between a write and
> 	a later write, then the later write is guaranteed to see the
> 	effects of the first write.
> D.2	Concurrent "write" pairs: The system will assign an arbitrary order
> 	to any pair of concurrent conflicting writes.  Non-conflicting
> 	writes (for example, to different keys) are unordered.

yep, looks good.

I also documented more thoroughly the "cds_lfht_add_unique", which acts
as a write (if it returns success) or read (if it returns failure).

Here is the updated version:

 * Ordering Guarantees:
 *
 * To discuss these guarantees, we first define "read" operation as any
 * of the the basic cds_lfht_lookup, cds_lfht_next_duplicate,
 * cds_lfht_first, cds_lfht_next operation, as well as
 * cds_lfht_add_unique (failure). 
 *
 * We define "read traversal" operation as any of the following
 * group of operations
 *  - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
 *    (and/or cds_lfht_next, although less common).
 *  - cds_lfht_add_unique (failure) followed by iteration with
 *    cds_lfht_next_duplicate (and/or cds_lfht_next, although less
 *    common).
 *  - cds_lfht_first followed iteration with cds_lfht_next (and/or
 *    cds_lfht_next_duplicate, although less common).
 *
 * We define "write" operations as any of cds_lfht_add,
 * cds_lfht_add_unique (success), cds_lfht_add_replace, cds_lfht_del.
 *
 * When cds_lfht_add_unique succeeds (returns the node passed as
 * parameter), it acts as a "write" operation. When cds_lfht_add_unique
 * fails (returns a node different from the one passed as parameter), it
 * acts as a "read" operation. A cds_lfht_add_unique failure is a
 * cds_lfht_lookup "read" operation, therefore, any ordering guarantee
 * referring to "lookup" imply any of "lookup" or cds_lfht_add_unique
 * (failure).
 *
 * We define "prior" and "later" node as nodes observable by reads and
 * read traversals respectively before and after a write or sequence of
 * write operations.
 *
 * Hash-table operations are often chained, for example, the pointer
 * returned by a cds_lfht_lookup() might be passed to a cds_lfht_next(),
 * whose return value might in turn be passed to another hash-table
 * operation. This entire chain of operations must be enclosed by a
 * pair of matching rcu_read_lock() and rcu_read_unlock() operations.
 *
 * The following ordering guarantees are offered by this hash table:
 *
 * A.1) "read" after "write": if there is ordering between a write and a
 *      later read, then the read is guaranteed to see the write or some
 *      later write.
 * A.2) "read traversal" after "write": given that there is dependency
 *      ordering between reads in a "read traversal", if there is
 *      ordering between a write and the first read of the traversal,
 *      then the "read traversal" is guaranteed to see the write or
 *      some later write.
 * B.1) "write" after "read": if there is ordering between a read and a
 *      later write, then the read will never see the write.
 * B.2) "write" after "read traversal": given that there is dependency
 *      ordering between reads in a "read traversal", if there is
 *      ordering between the last read of the traversal and a later
 *      write, then the "read traversal" will never see the write.
 * C)   "write" while "read traversal": if a write occurs during a "read
 *      traversal", the traversal may, or may not, see the write.
 * D.1) "write" after "write": if there is ordering between a write and
 *      a later write, then the later write is guaranteed to see the
 *      effects of the first write.
 * D.2) Concurrent "write" pairs: The system will assign an arbitrary
 *      order to any pair of concurrent conflicting writes.
 *      Non-conflicting writes (for example, to different keys) are
 *      unordered.
 * E)   If a grace period separates a "del" or "replace" operation
 *      and a subsequent operation, then that subsequent operation is
 *      guaranteed not to see the removed item.
 * F)   Uniqueness guarantee: given a hash table that does not contain
 *      duplicate items for a given key, there will only be one item in
 *      the hash table after an arbitrary sequence of add_unique and/or
 *      add_replace operations. Note, however, that a pair of
 *      concurrent read operations might well access two different items
 *      with that key.
 * G.1) If a pair of lookups for a given key are ordered (e.g. by a
 *      memory barrier), then the second lookup will return the same
 *      node as the previous lookup, or some later node.
 * G.2) A "read traversal" that starts after the end of a prior "read
 *      traversal" (ordered by memory barriers) is guaranteed to see the
 *      same nodes as the previous traversal, or some later nodes.
 *
 * Progress guarantees:
 *
 * * Reads are wait-free. These operations always move forward in the
 *   hash table linked list, and this list has no loop.
 * * Writes are lock-free. Any retry loop performed by a write operation
 *   is triggered by progress made within another update operation.
 *

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com