[ltt-dev] [rp-private] [RFC] Lock-free RCU queue for userspace RCU library

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Tue Jul 13 11:35:16 EDT 2010


* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Mon, Jul 12, 2010 at 12:32:23AM -0400, Mathieu Desnoyers wrote:
> > Hi,
> > 
> > I looked a bit more closely at Paolo's lock-free RCU queue implementation and
> > found the idea great, but there were a few irritating points with it. Mainly,
> > implementing a whole memory allocation scheme for a data structure seemed
> > overkill and error-prone. Also, I prefer to decouple the implementation of RCU
> > data structures from memory allocation whenever possible, ideally making it
> > possible to put the list heads directly within already allocated data
> > structures, just like the rcu list. In addition, the very fact that it allocated
> > memory on the enqueue meant that we would have to deal with -ENOMEM upon
> > enqueue. Also, relying on memory allocation/deferred free directly within the
> > data structure implementation means that it is not exactly lock-free, as we may
> > have locks in the underlying memory allocator implementation. This means the
> > enqueue path would also not be signal-safe.
> > 
> > So, I think I found a neat way to handle lock-free rcu queues with reference
> > counting instead. Note that the "dummy" node is really needed, otherwise we end
> > up in situations where the "empty" queue is hard to deal with racelessly. FYI,
> > the ABA problem is dealt with by using rcu read locks. Comments on the code
> > below are very welcome.
> 
> I believe that you need to explicitly requeue the dummy node when
> it is dequeued, and I don't see code below to do this.  Also, the
> dequeue function should be more explicit in its header comment: don't
> modify the node until a grace period passes.  I don't see how the
> reference count helps, possibly indicating a failure of imagination
> on my part.

OK, I see that a major subtlety seems to have eluded you. I'll try to explain
how the reference counting actually replaces the dummy node below.

> 
> See below for details.
> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > 
> > /*
> >  * rculfqueue.h
> >  *
> >  * Userspace RCU library - Lock-Free RCU Queue
> >  *
> >  * Copyright 2010 - Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> >  *
> >  * This library is free software; you can redistribute it and/or
> >  * modify it under the terms of the GNU Lesser General Public
> >  * License as published by the Free Software Foundation; either
> >  * version 2.1 of the License, or (at your option) any later version.
> >  *
> >  * This library is distributed in the hope that it will be useful,
> >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >  * Lesser General Public License for more details.
> >  *
> >  * You should have received a copy of the GNU Lesser General Public
> >  * License along with this library; if not, write to the Free Software
> >  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> >  */
> > 
> > #include <urcu/urcu_ref.h>
> > #include <assert.h>
> > 
> > /*
> >  * Lock-free RCU queue using reference counting. This implementation keeps a
> >  * dummy head node to ensure we can always update the queue locklessly. Given
> >  * that this is a queue, the dummy head node must always advance as we dequeue
> >  * entries. Therefore, we keep a reference count on each entry we are
> >  * dequeueing, so they can be kept as dummy head node until the next dequeue, at
> >  * which point their reference count will be decremented.
> >  */
> > 
> > #define URCU_LFQ_PERMANENT_REF		128
> > 
> > struct rcu_lfq_node {
> > 	struct rcu_lfq_node *next;
> > 	struct urcu_ref ref;
> > };
> 
> The rcu_lfq_node is expected to be a field in a larger data structure
> that is queued, correct?

Yep. Typical use is with container_of().
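
Something like this (a quick sketch with made-up names; I define
container_of() locally in case the urcu headers do not already provide an
equivalent):

#include <stddef.h>

/* Kernel-style container_of(). */
#define container_of(ptr, type, member) \
	((type *) ((char *) (ptr) - offsetof(type, member)))

/* Hypothetical queued object. */
struct myobj {
	int data;
	struct rcu_lfq_node node;	/* embedded queue node */
};

/* Recover the enclosing object from a node returned by rcu_lfq_dequeue(). */
static inline struct myobj *myobj_from_node(struct rcu_lfq_node *n)
{
	return container_of(n, struct myobj, node);
}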

> 
> > struct rcu_lfq_queue {
> > 	struct rcu_lfq_node *head, *tail;
> > 	struct rcu_lfq_node init;	/* Dummy initialization node */
> > };
> > 
> > void rcu_lfq_node_init(struct rcu_lfq_node *node)
> > {
> > 	node->next = NULL;
> > 	urcu_ref_init(&node->ref);
> > }
> > 
> > void rcu_lfq_init(struct rcu_lfq_queue *q)
> > {
> > 	rcu_lfq_node_init(&q->init);
> > 	/* Make sure the initial node is never freed. */
> > 	urcu_ref_set(&q->init.ref, URCU_LFQ_PERMANENT_REF);
> > 	q->head = q->tail = &q->init;
> > }
> > 
> > void rcu_lfq_enqueue(struct rcu_lfq_queue *q, struct rcu_lfq_node *node)
> > {
> > 	urcu_ref_get(&node->ref);
> 
> OK, you acquire a reference before enqueuing and release the reference
> after dequeueing.  How is this helping?  Because the reference is
> initialized to 1, the dequeue will not invoke release.  This means that
> the caller is still responsible for waiting for an RCU grace period
> before freeing, correct?

Well, it's a bit more subtle than that. Enqueue increments the reference
count of the node being added, and each dequeue decrements the refcount of
the current "dummy" head node. It does _not_ decrement the refcount of the
node it returns, because that node will still be used as the dummy head
until the next dequeue call. Note that each dequeue operation moves the
dummy head forward (which is what allows us to unref the old dummy head).

So I initialize the queue with an "init" dummy head, which is only used at
the beginning. It has a "permanent" reference to make sure put operations
will never try to free it. As we enqueue/dequeue elements in the list, the
node most recently returned by dequeue is used as the dummy head.

So instead of requeueing a dummy head (which could cause problems, because a
dummy element with a single address would not meet the RCU read-lock ABA
protection requirement of not reusing addresses until a grace period has
passed), we ensure that we always have a dummy head by using the previously
returned node.
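
To make the caller-side contract concrete, here is roughly how I expect a
dequeue-and-reclaim sequence to look (reusing the hypothetical myobj and
container_of() from the sketch above; a call_rcu()-style deferred free would
work just as well as the synchronize_rcu() shown here):

#include <stdio.h>	/* printf() */
#include <stdlib.h>	/* free() */
#include <urcu.h>	/* synchronize_rcu() */

/*
 * Release callback passed to rcu_lfq_dequeue()/urcu_ref_put(). It runs
 * when the refcount of a node drops to zero.
 */
static void myobj_release(struct urcu_ref *ref)
{
	struct rcu_lfq_node *node =
		container_of(ref, struct rcu_lfq_node, ref);
	struct myobj *obj = container_of(node, struct myobj, node);

	/*
	 * Readers may still reference this node from within an RCU
	 * read-side critical section, so wait for a grace period
	 * before reclaiming the memory.
	 */
	synchronize_rcu();
	free(obj);
}

/* Consumer side: dequeue one element, use it, then drop our reference. */
static void myobj_consume_one(struct rcu_lfq_queue *q)
{
	struct rcu_lfq_node *node;
	struct myobj *obj;

	node = rcu_lfq_dequeue(q, myobj_release);
	if (!node)
		return;		/* queue empty */
	obj = container_of(node, struct myobj, node);
	printf("dequeued %d\n", obj->data);
	/*
	 * Done with the entry: drop our reference. The node keeps
	 * serving as the queue's dummy head, so the release (and the
	 * actual free) only happens once the *next* dequeue has also
	 * dropped the dummy-head reference. The node must not be
	 * modified or reused before its refcount reaches zero.
	 */
	urcu_ref_put(&node->ref, myobj_release);
}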

> 
> > 	/*
> > 	 * uatomic_cmpxchg() implicit memory barrier orders earlier stores to
> > 	 * node before publication.
> > 	 */
> > 
> > 	rcu_read_lock();
> > 	for (;;) {
> > 		struct rcu_lfq_node *tail = rcu_dereference(q->tail);
> > 		struct rcu_lfq_node *next;
> > 
> > 		/*
> > 		 * Typically expect tail to be NULL.
> > 		 */
> > 		next = uatomic_cmpxchg(&tail->next, NULL, node);
> > 		if (next == NULL) {
> > 			/*
> > 			 * Tail was at the end of queue, we successfully
> > 			 * appended to it.
> > 			 * Now move tail (another enqueue might beat
> > 			 * us to it, that's fine).
> > 			 */
> > 			uatomic_cmpxchg(&q->tail, tail, node);
> > 			rcu_read_unlock();
> > 			return;
> > 		} else {
> > 			/*
> > 			 * Failure to append to current tail. Help moving tail
> > 			 * further and retry.
> > 			 */
> > 			uatomic_cmpxchg(&q->tail, tail, next);
> 
> What prevents this node from being dequeued before the tail pointer
> moves off of it?  The sequence of events that I am concerned about is
> as follows:
> 
> o	CPU 0 enqueues element A on an empty list (which contains only
> 	the dummy element).
> 
> o	CPU 1 dequeues the dummy element (and presumably somehow requeues
> 	it, but I don't see how this happens).

We cannot dequeue the dummy element when the list is empty: the dummy
element has a NULL next pointer, which is what indicates that the list is
empty. In this case, however, given that element A has been enqueued, we can
dequeue A and unref the dummy init head.

>                                              Assuming that requeueing
> 	of the dummy element actually occurs, suppose that CPU 1 is
> 	preempted just before it would have advanced the tail pointer.
> 
> o	CPU 2 dequeues element A from the list.  The caller presumably can
> 	neither free it or requeue it until a grace period passes.  But 
> 	the caller must also refrain from modifying the ->next pointer
> 	during this time, as doing so would break the helper in this case.
> 

So let's rewrite the scenario considering the moving-dummy-head scheme:

 o    CPU 0 enqueues element A on an empty list (which contains only the dummy
      init element).

 o    CPU 1 dequeues element A and unrefs the dummy init head (element A
      becomes the new dummy head).

 o    CPU 2 dequeues from the list, only to find out that the dummy head ->next
      is NULL. Queue is therefore empty.

 o    CPU 0 eventually sets the tail to element A.


Now let's try a more elaborate scenario, to see whether the dummy head can
be unref'd before the tail is set to it:

 o    CPU 0 enqueues element A on an empty list (which contains only the dummy
      init element).

 o    CPU 1 tries to enqueue element B on the list (which contains dummy
      init -> A). However, tail->next is not NULL (because CPU 0 has not
      moved the tail yet), so it helps moving the tail and retries.

 o    CPU 2 dequeues element A and unrefs the dummy init head (element A
      becomes the new dummy head).

 o    CPU 3 dequeues from the list, only to find out that the dummy head ->next
      is NULL. Queue is therefore empty.

 o    CPU 0 eventually tries setting the tail to element A. This fails,
      because CPU 1 already helped move the tail to A.


So, basically, an enqueuer cannot enqueue an element as long as the tail is
not up to date. Therefore, by keeping a refcount on each element after it is
dequeued (and dropping that refcount at the following dequeue), we ensure
that an entry is kept valid between the moment it is added to the list and
the moment the tail is updated.

> 	This might well be what you were getting at with your statement
> 	about not reusing the lfq node in the rcu_lfq_dequeue() comment
> 	header, but more clarity would be good.  ;-)

I think a lot of my explanations here could be added to the code as comments.

> 
> So this seems OK.  But suppose we have the other ordering:
> 
> o	CPU 0 enqueues element A on an empty list, but is preempted before
> 	updating the ->tail pointer.
> 
> o	CPU 1 dequeues, but gets the dummy element, which it presumably
> 	requeues as follows:
> 
> 	o	next = uatomic_cmpxchg(&tail->next, NULL, node);
> 
> 	o	But this fails, since the dummy's ->next pointer references
> 		the half-enqueued element A.
> 
> 	o	So it tries to help: uatomic_cmpxchg(&q->tail, tail, next);
> 		This should move the tail ahead to element A, which
> 		should allow the next pass through the loop to succeed.
> 
> So this seems OK as well.

Please see the scenarios above, which I hope address your concern.

> 
> How are you testing/validating this?  ;-)

Only testing at the moment, with urcu:

tests/test_urcu_lfq (lock-free queue)
tests/test_urcu_lfs (lock-free stack)
tests/test_urcu_wfs (wait-free stack push/blocking pop)

By varying the number of enqueuers/dequeuers, I can create different
scenarios. I empty the data structures at the end of the test, and check
whether the numbers of enqueues/dequeues or pushes/pops match (printing a
WARNING if not). I noticed that adding artificial delays within the
algorithms helps find races much more quickly. Typically, adding a delay
only in the enqueue, or only in the dequeue, seems to work best. I still do
this by hand though; it could eventually be automated.
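
For reference, the kind of delay I sprinkle in by hand looks roughly like
this (the RCU_RACE_DELAY_US knob is just an illustration, it is not in the
tree):

/*
 * Compile with e.g. -DRCU_RACE_DELAY_US=100 and call race_delay() at
 * the spot under test, for instance in rcu_lfq_enqueue() right after
 * the successful cmpxchg on tail->next, so that the window where
 * q->tail lags behind the real tail stays open much longer.
 */
#ifdef RCU_RACE_DELAY_US
#include <unistd.h>
#define race_delay()	usleep(RCU_RACE_DELAY_US)
#else
#define race_delay()	do { } while (0)
#endif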

> 
> > 			continue;
> 
> The above "continue" is unnecessary, right?

True, although I find it a bit clearer to put something in each branch
(either a return or a continue): it makes it easier to verify that no case
was missed.

> 
> > 		}
> > 	}
> > }
> > 
> > /*
> >  * The entry returned by dequeue must be taken care of by doing a urcu_ref_put,
> >  * which calls the release primitive when the reference count drops to zero. A
> >  * grace period must be waited before performing the actual memory reclamation
> >  * in the release primitive.
> >  * The entry lfq node returned by dequeue must no be re-used before the
> 
> s/no/not/
> 
> >  * reference count reaches zero.
> >  */
> > struct rcu_lfq_node *
> > rcu_lfq_dequeue(struct rcu_lfq_queue *q, void (*release)(struct urcu_ref *))
> > {
> > 	rcu_read_lock();
> > 	for (;;) {
> > 		struct rcu_lfq_node *head = rcu_dereference(q->head);
> > 		struct rcu_lfq_node *next = rcu_dereference(head->next);
> 
> And head->next is guaranteed to exist due to the dummy node always being
> present, right?

As I explained above, it is guaranteed to exist thanks to the previously
dequeued node which is kept around as a dummy head until the next dequeue.

> 
> > 		if (next) {
> > 			if (uatomic_cmpxchg(&q->head, head, next) == head) {
> > 				rcu_read_unlock();
> > 				urcu_ref_put(&head->ref, release);
> 
> This drops the reference count to one, not to zero, right?

Well, it depends. Note that we drop the refcount of "head", not "next". So after
enqueueing the first node, when we dequeue, we would try to unref the "init
dummy" node. Because its refcount is permanent (128), we decrement it to 127.

Then, when we enqueue some more and dequeue again, we decrement the refcount
of the first dequeued element, which at that point can be either 2 or 1,
depending on whether or not the caller of the first dequeue has already
dropped its reference. So we decrement it from 2 to 1 if the caller has not
yet unref'd it, or from 1 to 0 if the caller has already unref'd it.
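
To make the bookkeeping concrete, here is the refcount trace for a fresh
queue and two elements A and B:

 o    rcu_lfq_init(): the init dummy has refcount 128; head = tail = init.

 o    enqueue A: urcu_ref_get() brings A from 1 (its initial value) to 2;
      the queue is init -> A.

 o    first dequeue returns A: puts the init dummy (128 -> 127); A, still
      at refcount 2, is now the dummy head.

 o    enqueue B: B goes from 1 to 2; the queue is A -> B.

 o    second dequeue returns B: puts A (2 -> 1); B becomes the dummy head.

 o    the caller of the first dequeue eventually puts A (1 -> 0): release
      is invoked, and A can be reclaimed after a grace period.

The last two puts can happen in either order; whichever comes second brings
A to zero and triggers the release.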

> 
> > 				return next;
> 
> What if local variable "next" points to the dummy node?  I don't understand
> how this works unless we requeue it.

It can't, because by definition the "dummy" node I use here is always the
very head of the list. I don't "requeue" dummy nodes; it's the previously
dequeued node that "becomes" the dummy node until the next dequeue.

Thanks for the thorough review !

Mathieu

> 
> > 			} else {
> > 				/* Concurrently pushed, retry */
> > 				continue;
> > 			}
> > 		} else {
> > 			/* Empty */
> > 			rcu_read_unlock();
> > 			return NULL;
> > 		}
> > 	}
> > }
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> > _______________________________________________
> > rp-private mailing list
> > rp-private at svcs.cs.pdx.edu
> > http://svcs.cs.pdx.edu/mailman/listinfo/rp-private

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com



