[lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

Sun May 6 11:56:40 EDT 2012

On Sat, May 05, 2012 at 02:22:12PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > [...]
> > > > > > 
> > > > > > A write barrier would be sufficient in the case where there were only
> > > > > > two threads observing each other.  A full memory barrier would be needed
> > > > > > to prevent the assertion from firing in this sort of case (not sure that
> > > > > > this is exactly right, but something like this):
> > > > > > 
> > > > > > Initial contents: B, C
> > > > > > 
> > > > > > T0: add A; del B
> > > > > > T1: if (!lookup B) { add B; del C }
> > > > > > T2: r1 = lookup C; smp_mb(); r2 = lookup A
> > > > > > 
> > > > > > assert(lookup C || lookup A);
> > > > > 
> > > > > What you are bringing here as counter-example is, I think, transitivity.
> > > > 
> > > > Yep!
> > > > 
> > > > > I'm trying to figure out how your example could fail, and I cannot see
> > > > > how. Follows a detail of the closest scenario I get to failing is the
> > > > > following, but it does not fail. After that, I'm proposing a different
> > > > > scenario, which I think will be more appropriate for the current topic.
> > > > > 
> > > > > Attempted detail of your scenario:
> > > > > 
> > > > > T0                 T1                 T2
> > > > > 
> > > > > add A
> > > > > wmb
> > > > >   (add A globally
> > > > >    observable)
> > > > 
> > > > Almost.  All that this really guarantees is that if someone sees
> > > > the "add B", they will also see the "add A".
> > > > 
> > > > > del B
> > > > >   (del B globally
> > > > >    observable)
> > > > >                    (add A NOT brought
> > > > >                     into cache)
> > > > >                    (del B brought into
> > > > >                     cache)
> > > > >                    read B
> > > > >                    (test false)
> > > > >                    add B
> > > > >                    wmb
> > > > >                    (add B globally
> > > > >                     observable)
> > > > >                    del C
> > > > >                    (del C globally
> > > > >                     observable)
> > > > 
> > > > Here, if someone sees the "del C", they will see the "add B", and
> > > > they also will have lost the opportunity to modify B before T1
> > > > reads from it and modifies it.
> > > > 
> > > > >                                       (add A NOT brought
> > > > >                                        into cache)
> > > > >                                       (del C brought into
> > > > >                                        cache)
> > > > >                                       read C -> not there.
> > > > >                                       mb
> > > > >                                       (add A brought
> > > > >                                        into cache)
> > > > >                                       read A -> there -> success.
> > > > 
> > > > So we see that C is not there.  We know that B would be there if
> > > > we looked at it.  But we don't look at B, we look at A.  But the
> > > > ordering back to T0's "add A" requires transitivity, which wmb
> > > > does not guarantee.
> > > 
> > > OK, got it!
> > > 
> > > > 
> > > > > If I look at the "transitivity" section in Linux memory-barriers.txt, I
> > > > > notice that the example is mainly around using read barrier around loads
> > > > > rather than general barrier. Let's see if I can modify that example to
> > > > > come up with an example error case:
> > > > > 
> > > > > Initial content: empty
> > > > > 
> > > > > T0: add X
> > > > > T1: r1 = lookup X; smp_mb; r2 = lookup Y
> > > > > T2: add Y; r3 = lookup X
> > > > > 
> > > > > assert( !(r1 && !r2 && !r3) )
> > > > > 
> > > > > The key thing here is that if the barrier in T2 after "add Y" is a
> > > > > smp_wmb rather than a smp_mb, this could allow the "r3 = lookup X" to be
> > > > > reordered before add Y, thus allowing the assertion to fail.
> > > > 
> > > > Your example is simpler, and demonstrates the need just as well, so
> > > > let's go with your example.
> > > > 
> > > > > I think it would be more intuitive for users if lookups vs updates
> > > > > performed on the same thread are ordered with full memory barriers.
> > > > > Given that we don't want to add extra barriers in read operations, it
> > > > > would make sense to guarantee full memory barriers before and after
> > > > > updates.
> > > > > 
> > > > > So how about we use full memory barriers before and after each of: add,
> > > > > del (success), add_unique (success), replace, and add_replace ? If we
> > > > > ever want to relax those ordering guarantees, then we can always add new
> > > > > update APIs with a "weaker" ordering.
> > > > > 
> > > > > Thoughts ?
> > > > 
> > > > That line of reasoning makes a lot of sense to me!
> > > 
> > > Sounds good.
> > > 
> > > Here is what I propose, thoughts ?
> > 
> > Just to make sure I understand -- the reason that the "del" functions
> > say "no memory barrier" instead of "acts like rcu_dereference()" is
> > that the "del" functions don't return anything.
> 
> Hrm, you bring an interesting point. I think we should change the two
> "rcu_dereference()" in _cds_lfht_del for "CMM_LOAD_SHARED()". The
> difference between the two is that CMM_LOAD_SHARED() does not imply a
> read barrier between the read and following uses of the data pointed to
> by the pointer read.
> 
> Same thing for "cds_lfht_is_node_deleted": the rcu_dereference() should
> be changed for a CMM_LOAD_SHARED(), because we never use the loaded
> pointer as a pointer to other data. There are a few other locations
> where the pointer is only used for its flags.
> 
> Here is what I propose:
> 
> diff --git a/rculfhash.c b/rculfhash.c
> index b9f795f..6e27436 100644
> --- a/rculfhash.c
> +++ b/rculfhash.c
> @@ -923,7 +923,7 @@ int _cds_lfht_replace(struct cds_lfht *ht, unsigned long size,
>  	bucket = lookup_bucket(ht, size, bit_reverse_ulong(old_node->reverse_hash));
>  	_cds_lfht_gc_bucket(bucket, new_node);
> 
> -	assert(is_removed(rcu_dereference(old_node->next)));
> +	assert(is_removed(CMM_LOAD_SHARED(old_node->next)));

This is good.

>  	return 0;
>  }
> 
> @@ -1061,7 +1061,7 @@ int _cds_lfht_del(struct cds_lfht *ht, unsigned long size,
>  	 * logical removal flag). Return -ENOENT if the node had
>  	 * previously been removed.
>  	 */
> -	next = rcu_dereference(node->next);
> +	next = CMM_LOAD_SHARED(node->next);
>  	if (caa_unlikely(is_removed(next)))
>  		return -ENOENT;
>  	assert(!is_bucket(next));

As long as "next" is not dereferenced anywhere, this is good.  Which
appears to be the case.

> @@ -1082,7 +1082,7 @@ int _cds_lfht_del(struct cds_lfht *ht, unsigned long size,
>  	bucket = lookup_bucket(ht, size, bit_reverse_ulong(node->reverse_hash));
>  	_cds_lfht_gc_bucket(bucket, node);
> 
> -	assert(is_removed(rcu_dereference(node->next)));
> +	assert(is_removed(CMM_LOAD_SHARED(node->next)));

This one is fine as well.

>  	/*
>  	 * Last phase: atomically exchange node->next with a version
>  	 * having "REMOVAL_OWNER_FLAG" set. If the returned node->next
> @@ -1510,7 +1510,7 @@ void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
>  		}
>  		node = clear_flag(next);
>  	}
> -	assert(!node || !is_bucket(rcu_dereference(node->next)));
> +	assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));

As is this one.

>  	iter->node = node;
>  	iter->next = next;
>  }
> @@ -1543,7 +1543,7 @@ void cds_lfht_next_duplicate(struct cds_lfht *ht, cds_lfht_match_fct match,
>  		}
>  		node = clear_flag(next);
>  	}
> -	assert(!node || !is_bucket(rcu_dereference(node->next)));
> +	assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));

Same here.

>  	iter->node = node;
>  	iter->next = next;
>  }
> @@ -1565,7 +1565,7 @@ void cds_lfht_next(struct cds_lfht *ht, struct cds_lfht_iter *iter)
>  		}
>  		node = clear_flag(next);
>  	}
> -	assert(!node || !is_bucket(rcu_dereference(node->next)));
> +	assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));

And here.

>  	iter->node = node;
>  	iter->next = next;
>  }
> @@ -1668,7 +1668,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct cds_lfht_node *node)
> 
>  int cds_lfht_is_node_deleted(struct cds_lfht_node *node)
>  {
> -	return is_removed(rcu_dereference(node->next));
> +	return is_removed(CMM_LOAD_SHARED(node->next));

And also here.

>  }
> 
>  static
> 
> If it's ok for you, I will first commit this change, and then commit the
> memory barrier documentation with your reviewed-by.

Looks good!

							Thanx, Paul

> Thanks!
> 
> Mathieu
> 
> > 
> > Assuming my understanding is correct:
> > 
> > Reviewed-by: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
> > 
> > 							Thanx, Paul
> > 
> > > diff --git a/urcu/rculfhash.h b/urcu/rculfhash.h
> > > index 2d8a310..2938e5e 100644
> > > --- a/urcu/rculfhash.h
> > > +++ b/urcu/rculfhash.h
> > > @@ -203,6 +203,7 @@ void cds_lfht_count_nodes(struct cds_lfht *ht,
> > >   *
> > >   * Call with rcu_read_lock held.
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function acts as a rcu_dereference() to read the node pointer.
> > >   */
> > >  void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
> > >  		cds_lfht_match_fct match, const void *key,
> > > @@ -226,6 +227,7 @@ void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
> > >   * node returned by a previous cds_lfht_next.
> > >   * Call with rcu_read_lock held.
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function acts as a rcu_dereference() to read the node pointer.
> > >   */
> > >  void cds_lfht_next_duplicate(struct cds_lfht *ht,
> > >  		cds_lfht_match_fct match, const void *key,
> > > @@ -239,6 +241,7 @@ void cds_lfht_next_duplicate(struct cds_lfht *ht,
> > >   * Output in "*iter". *iter->node set to NULL if table is empty.
> > >   * Call with rcu_read_lock held.
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function acts as a rcu_dereference() to read the node pointer.
> > >   */
> > >  void cds_lfht_first(struct cds_lfht *ht, struct cds_lfht_iter *iter);
> > > 
> > > @@ -252,6 +255,7 @@ void cds_lfht_first(struct cds_lfht *ht, struct cds_lfht_iter *iter);
> > >   * pointing to the last table node.
> > >   * Call with rcu_read_lock held.
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function acts as a rcu_dereference() to read the node pointer.
> > >   */
> > >  void cds_lfht_next(struct cds_lfht *ht, struct cds_lfht_iter *iter);
> > > 
> > > @@ -264,6 +268,8 @@ void cds_lfht_next(struct cds_lfht *ht, struct cds_lfht_iter *iter);
> > >   * This function supports adding redundant keys into the table.
> > >   * Call with rcu_read_lock held.
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function issues a full memory barrier before and after its
> > > + * atomic commit.
> > >   */
> > >  void cds_lfht_add(struct cds_lfht *ht, unsigned long hash,
> > >  		struct cds_lfht_node *node);
> > > @@ -288,6 +294,12 @@ void cds_lfht_add(struct cds_lfht *ht, unsigned long hash,
> > >   * to add keys into the table, no duplicated keys should ever be
> > >   * observable in the table. The same guarantee apply for combination of
> > >   * add_unique and add_replace (see below).
> > > + *
> > > + * Upon success, this function issues a full memory barrier before and
> > > + * after its atomic commit. Upon failure, this function acts like a
> > > + * simple lookup operation: it acts as a rcu_dereference() to read the
> > > + * node pointer. The failure case does not guarantee any other memory
> > > + * barrier.
> > >   */
> > >  struct cds_lfht_node *cds_lfht_add_unique(struct cds_lfht *ht,
> > >  		unsigned long hash,
> > > @@ -321,6 +333,9 @@ struct cds_lfht_node *cds_lfht_add_unique(struct cds_lfht *ht,
> > >   * schemes will never generate duplicated keys. It also allows us to
> > >   * guarantee that a combination of add_replace and add_unique updates
> > >   * will never generate duplicated keys.
> > > + *
> > > + * This function issues a full memory barrier before and after its
> > > + * atomic commit.
> > >   */
> > >  struct cds_lfht_node *cds_lfht_add_replace(struct cds_lfht *ht,
> > >  		unsigned long hash,
> > > @@ -352,6 +367,10 @@ struct cds_lfht_node *cds_lfht_add_replace(struct cds_lfht *ht,
> > >   *
> > >   * The semantic of replacement vs lookups is the same as
> > >   * cds_lfht_add_replace().
> > > + *
> > > + * Upon success, this function issues a full memory barrier before and
> > > + * after its atomic commit. Upon failure, this function does not issue
> > > + * any memory barrier.
> > >   */
> > >  int cds_lfht_replace(struct cds_lfht *ht,
> > >  		struct cds_lfht_iter *old_iter,
> > > @@ -377,6 +396,9 @@ int cds_lfht_replace(struct cds_lfht *ht,
> > >   * After successful removal, a grace period must be waited for before
> > >   * freeing the memory reserved for old node (which can be accessed with
> > >   * cds_lfht_iter_get_node).
> > > + * Upon success, this function issues a full memory barrier before and
> > > + * after its atomic commit. Upon failure, this function does not issue
> > > + * any memory barrier.
> > >   */
> > >  int cds_lfht_del(struct cds_lfht *ht, struct cds_lfht_node *node);
> > > 
> > > @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct cds_lfht_node *node);
> > >   * function.
> > >   * Call with rcu_read_lock held.
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function does not issue any memory barrier.
> > >   */
> > >  int cds_lfht_is_node_deleted(struct cds_lfht_node *node);
> > > 
> > > @@ -400,6 +423,7 @@ int cds_lfht_is_node_deleted(struct cds_lfht_node *node);
> > >   * @new_size: update to this hash table size.
> > >   *
> > >   * Threads calling this API need to be registered RCU read-side threads.
> > > + * This function does not (necessarily) issue memory barriers.
> > >   */
> > >  void cds_lfht_resize(struct cds_lfht *ht, unsigned long new_size);
> > > 
> > > @@ -407,6 +431,7 @@ void cds_lfht_resize(struct cds_lfht *ht, unsigned long new_size);
> > >   * Note: it is safe to perform element removal (del), replacement, or
> > >   * any hash table update operation during any of the following hash
> > >   * table traversals.
> > > + * These functions act as rcu_dereference() to read the node pointers.
> > >   */
> > >  #define cds_lfht_for_each(ht, iter, node)				\
> > >  	for (cds_lfht_first(ht, iter),					\
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > > 
> > > _______________________________________________
> > > rp mailing list
> > > rp at svcs.cs.pdx.edu
> > > http://svcs.cs.pdx.edu/mailman/listinfo/rp
> > > 
> > 
> > 
> > _______________________________________________
> > lttng-dev mailing list
> > lttng-dev at lists.lttng.org
> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>