[lttng-dev] [rp] rculfhash ordering guarantees

Mon Apr 23 23:26:40 EDT 2012

On Mon, Apr 23, 2012 at 10:54:54PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Mon, Apr 23, 2012 at 03:45:32PM -0400, Mathieu Desnoyers wrote:
> > > Hi Paul,
> > > 
> > > Here is the updated text I plan for the next update. Comments are
> > > welcome, thanks !
> > 
> > Looks much improved!  The inevitable questions and comments interspersed.
> > 
> > 							Thanx, Paul
> > 
> > > Mathieu
> > > 
> > >  * Ordering Guarantees:
> > >  *
> > >  * To discuss these guarantees, we first define "read" operation as any
> > >  * of the the basic cds_lfht_lookup, cds_lfht_next_duplicate,
> > >  * cds_lfht_first, cds_lfht_next operation.
> > >  *
> > >  * We define "read traversal" operation as any of the following
> > >  * group of operations
> > >  *  - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
> > >  *  - cds_lfht_first followed iteration with cds_lfht_next
> > 
> > Can't cds_lfht_next() and cds_lfht_next_duplicate() be intermixed?
> > Something like the following:
> > 
> > 	rcu_read_lock();
> > 	p = cds_lfht_lookup(...);
> > 	while (p != NULL) {
> > 		do_something(p);
> > 		if (p->dups_of_interest)
> > 			p = cds_lfht_next_duplicate(...);
> > 		else
> > 			p = cds_lfht_next(...);
> > 	}
> > 	rcu_read_unlock();
> 
> Yes, they can technically be intermixed, but I doubt that the
> "cds_lfht_next" after a lookup will be of any interest: it will get the
> nodes with keys following the current node's key (in split-ordered
> order).

I missed a 50% chance of doing something useful.  Good thing I didn't buy
a lottery ticket.  ;-)

> A use-case that would make more sense would be:
> 
> - cds_lfht_first
> 
> - iterate on the table with cds_lfht_next(), until we find an
>   interesting key.
> 
> - Then, iterate on all the duplicate of this key.
> 
> So I will document that both can be intermixed.

Sounds good!

> > >  *
> > >  * We define "write" operations as any of cds_lfht_add,
> > >  * cds_lfht_add_unique, cds_lfht_add_replace, cds_lfht_del.
> > >  *
> > >  * We define "prior" and "later" node as nodes observable by reads and
> > >  * read traversals respectively before and after a write or sequence of
> > >  * write operations.
> > >  *
> > >  * It is implicit from the requirement of the read, read traversal, and
> > >  * write operations that RCU read-side locks need to be held across
> > >  * traversals, and between a read or read traversal and invocation of a
> > >  * write that receives a node as argument.
> > 
> > This paragraph is implying a lot...  How about the following?
> > 
> > 	Hash-table operations are often chained, for example, the
> > 	pointer returned by a cds_lfht_lookup() might be passed to a
> > 	cds_lfht_next(), whose return value might in turn be passed to
> > 	another hash-table operation.  This entire chain of operations
> > 	must be enclosed by a pair of matching rcu_read_lock() and
> > 	rcu_read_unlock() operations.
> 
> looks good, will update.
> 
> > 
> > >  *
> > >  * The following ordering guarantees are offered by this hash table:
> > >  *
> > >  * A.1) "read" after "write": if there is ordering between a write and a
> > >  *      later read, then the read is guaranteed to see the write or some
> > >  *      later write.
> > >  * A.2) "read traversal" after "write": given that there is dependency
> > >  *      ordering between reads in a "read traversal", if there is
> > >  *      ordering between a write and the first read of the traversal,
> > >  *      then the "read traversal" is guaranteed to see the write or
> > >  *      some later write.
> > >  * B.1) "write" after "read": if there is ordering between a read and a
> > >  *      later write, then the read will never see the write.
> > >  * B.2) "write" after "read traversal": given that there is dependency
> > >  *      ordering between reads in a "read traversal", if there is
> > >  *      ordering between the last read of the traversal and a later
> > >  *      write, then the "read traversal" will never see the write.
> > >  * C)   "write" while "read traversal": if a write occurs during a "read
> > >  *      traversal", the traversal may, or may not, see the write.
> > >  * D)   "write" vs "write": writes occur atomically between their
> > >  *      invocation and the moment they return. There is a full memory
> > >  *      barrier before and after the atomic "commit" point of each
> > >  *      write.
> > 
> > This is a description of the implementation, while the others are guarantees.
> > Of course, that begs the question of what the guarantee is...  Maybe the
> > following?
> > 
> > D.1	"write" after "write": if there is ordering between a write and
> > 	a later write, then the later write is guaranteed to see the
> > 	effects of the first write.
> > D.2	Concurrent "write" pairs: The system will assign an arbitrary order
> > 	to any pair of concurrent conflicting writes.  Non-conflicting
> > 	writes (for example, to different keys) are unordered.
> 
> yep, looks good.
> 
> I also documented more thoroughly the "cds_lfht_add_unique", which acts
> as a write (if it returns success) or read (if it returns failure).
> 
> Here is the updated version:

Much better!  Just a couple of remaining nits, one of them my fault rather
than yours.

							Thanx, Paul

>  * Ordering Guarantees:
>  *
>  * To discuss these guarantees, we first define "read" operation as any
>  * of the the basic cds_lfht_lookup, cds_lfht_next_duplicate,
>  * cds_lfht_first, cds_lfht_next operation, as well as
>  * cds_lfht_add_unique (failure). 
>  *
>  * We define "read traversal" operation as any of the following
>  * group of operations
>  *  - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
>  *    (and/or cds_lfht_next, although less common).
>  *  - cds_lfht_add_unique (failure) followed by iteration with
>  *    cds_lfht_next_duplicate (and/or cds_lfht_next, although less
>  *    common).
>  *  - cds_lfht_first followed iteration with cds_lfht_next (and/or
>  *    cds_lfht_next_duplicate, although less common).
>  *
>  * We define "write" operations as any of cds_lfht_add,
>  * cds_lfht_add_unique (success), cds_lfht_add_replace, cds_lfht_del.
>  *
>  * When cds_lfht_add_unique succeeds (returns the node passed as
>  * parameter), it acts as a "write" operation. When cds_lfht_add_unique
>  * fails (returns a node different from the one passed as parameter), it
>  * acts as a "read" operation. A cds_lfht_add_unique failure is a
>  * cds_lfht_lookup "read" operation, therefore, any ordering guarantee
>  * referring to "lookup" imply any of "lookup" or cds_lfht_add_unique
>  * (failure).
>  *
>  * We define "prior" and "later" node as nodes observable by reads and
>  * read traversals respectively before and after a write or sequence of
>  * write operations.
>  *
>  * Hash-table operations are often chained, for example, the pointer
>  * returned by a cds_lfht_lookup() might be passed to a cds_lfht_next(),
>  * whose return value might in turn be passed to another hash-table
>  * operation. This entire chain of operations must be enclosed by a
>  * pair of matching rcu_read_lock() and rcu_read_unlock() operations.

Sigh.  I obviously wasn't thinking very clearly when I wrote this.
Someone is going to confuse chained operations with hash chains.  One way
to resolve this potential source of confusion is to replace "chain" with
"cascade" as follows:

	Hash-table operations are often cascaded, for example, the
	pointer returned by a cds_lfht_lookup() might be passed to a
	cds_lfht_next(), whose return value might in turn be passed to
	another hash-table operation. This entire cascaded series of
	operations must be enclosed by a pair of matching rcu_read_lock()
	and rcu_read_unlock() operations.

>  *
>  * The following ordering guarantees are offered by this hash table:
>  *
>  * A.1) "read" after "write": if there is ordering between a write and a
>  *      later read, then the read is guaranteed to see the write or some
>  *      later write.
>  * A.2) "read traversal" after "write": given that there is dependency
>  *      ordering between reads in a "read traversal", if there is
>  *      ordering between a write and the first read of the traversal,
>  *      then the "read traversal" is guaranteed to see the write or
>  *      some later write.
>  * B.1) "write" after "read": if there is ordering between a read and a
>  *      later write, then the read will never see the write.
>  * B.2) "write" after "read traversal": given that there is dependency
>  *      ordering between reads in a "read traversal", if there is
>  *      ordering between the last read of the traversal and a later
>  *      write, then the "read traversal" will never see the write.
>  * C)   "write" while "read traversal": if a write occurs during a "read
>  *      traversal", the traversal may, or may not, see the write.
>  * D.1) "write" after "write": if there is ordering between a write and
>  *      a later write, then the later write is guaranteed to see the
>  *      effects of the first write.
>  * D.2) Concurrent "write" pairs: The system will assign an arbitrary
>  *      order to any pair of concurrent conflicting writes.
>  *      Non-conflicting writes (for example, to different keys) are
>  *      unordered.
>  * E)   If a grace period separates a "del" or "replace" operation
>  *      and a subsequent operation, then that subsequent operation is
>  *      guaranteed not to see the removed item.
>  * F)   Uniqueness guarantee: given a hash table that does not contain
>  *      duplicate items for a given key, there will only be one item in
>  *      the hash table after an arbitrary sequence of add_unique and/or
>  *      add_replace operations. Note, however, that a pair of
>  *      concurrent read operations might well access two different items
>  *      with that key.
>  * G.1) If a pair of lookups for a given key are ordered (e.g. by a
>  *      memory barrier), then the second lookup will return the same
>  *      node as the previous lookup, or some later node.
>  * G.2) A "read traversal" that starts after the end of a prior "read
>  *      traversal" (ordered by memory barriers) is guaranteed to see the
>  *      same nodes as the previous traversal, or some later nodes.

And we should explicitly state that concurrent reads are unordered.
For example, if a pair of reads to the same key run concurrently with
an insertion of that same key, the reads remain unordered regardless
of their return values.  In other words, you cannot rely on the values
returns by the reads to deduce ordering.

>  *
>  * Progress guarantees:
>  *
>  * * Reads are wait-free. These operations always move forward in the
>  *   hash table linked list, and this list has no loop.
>  * * Writes are lock-free. Any retry loop performed by a write operation
>  *   is triggered by progress made within another update operation.
>  *
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>