[ltt-dev] [rp-private] [RFC] Lock-free RCU queue for userspace RCU library

Paul E. McKenney paulmck at linux.vnet.ibm.com
Fri Jul 16 16:45:29 EDT 2010


On Tue, Jul 13, 2010 at 11:35:16AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > On Mon, Jul 12, 2010 at 12:32:23AM -0400, Mathieu Desnoyers wrote:
> > > Hi,
> > > 
> > > I looked a bit more closely at Paolo's lock-free RCU queue implementation and
> > > found the idea great, but there were a few irritating points with it. Mainly,
> > > implementing a whole memory allocation scheme for a data structure seemed
> > > overkill and error-prone. Also, I prefer to decouple the implementation of RCU
> > > data structures from memory allocation whenever possible, ideally making it
> > > possible to put the list heads directly within already allocated data
> > > structures, just like the rcu list. In addition, the very fact that it allocated
> > > memory on the enqueue meant that we would have to deal with -ENOMEM upon
> > > enqueue. Also, relying on memory allocation/deferred free directly within the
> > > data structure implementation means that it is not exactly lock-free, as we may
> > > have locks in the underlying memory allocator implementation. This means the
> > > enqueue path would also not be signal-safe.
> > > 
> > > So, I think I found a neat way to handle lock-free rcu queues with reference
> > > counting instead. Note that the "dummy" node is really needed, otherwise we end
> > > up in situations where the "empty" queue is hard to deal with racelessly. FYI,
> > > the ABA problem is dealt with by using rcu read locks. Comments on the code
> > > below are very welcome.
> > 
> > I believe that you need to explicitly requeue the dummy node when
> > it is dequeued, and I don't see code below to do this.  Also, the
> > dequeue function should be more explicit in its header comment: don't
> > modify the node until a grace period passes.  I don't see how the
> > reference count helps, possibly indicating a failure of imagination
> > on my part.
> 
> OK, I see that a major subtlety seems to have eluded you. I'll try to explain
> how the reference counting actually replaces the dummy node below.

I was indeed assuming that the close-in-time reference-count manipulations
were to the same object, which does not appear to be the case.

> > See below for details.
> > 
> > 							Thanx, Paul
> > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > 
> > > /*
> > >  * rculfqueue.h
> > >  *
> > >  * Userspace RCU library - Lock-Free RCU Queue
> > >  *
> > >  * Copyright 2010 - Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> > >  *
> > >  * This library is free software; you can redistribute it and/or
> > >  * modify it under the terms of the GNU Lesser General Public
> > >  * License as published by the Free Software Foundation; either
> > >  * version 2.1 of the License, or (at your option) any later version.
> > >  *
> > >  * This library is distributed in the hope that it will be useful,
> > >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > >  * Lesser General Public License for more details.
> > >  *
> > >  * You should have received a copy of the GNU Lesser General Public
> > >  * License along with this library; if not, write to the Free Software
> > >  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> > >  */
> > > 
> > > #include <urcu/urcu_ref.h>
> > > #include <assert.h>
> > > 
> > > /*
> > >  * Lock-free RCU queue using reference counting. This implementation keeps a
> > >  * dummy head node to ensure we can always update the queue locklessly. Given
> > >  * that this is a queue, the dummy head node must always advance as we dequeue
> > >  * entries. Therefore, we keep a reference count on each entry we are
> > >  * dequeueing, so they can be kept as dummy head node until the next dequeue, at
> > >  * which point their reference count will be decremented.
> > >  */
> > > 
> > > #define URCU_LFQ_PERMANENT_REF		128
> > > 
> > > struct rcu_lfq_node {
> > > 	struct rcu_lfq_node *next;
> > > 	struct urcu_ref ref;
> > > };
> > 
> > The rcu_lfq_node is expected to be a field in a larger data structure
> > that is queued, correct?
> 
> Yep. Typical use is with container_of().

Good!
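
For the archives, here is a minimal sketch of that typical use (mystruct,
q, release, and use() are made-up names for illustration):

	struct mystruct {
		int value;			/* user payload */
		struct rcu_lfq_node node;	/* embedded queue linkage */
	};

	/* Enqueue side: pass the embedded node. */
	struct mystruct *s = malloc(sizeof(*s));
	rcu_lfq_node_init(&s->node);
	s->value = 42;
	rcu_lfq_enqueue(&q, &s->node);

	/* Dequeue side: recover the enclosing structure. */
	struct rcu_lfq_node *np = rcu_lfq_dequeue(&q, release);
	if (np != NULL) {
		struct mystruct *m = container_of(np, struct mystruct, node);
		use(m->value);
	}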

> > > struct rcu_lfq_queue {
> > > 	struct rcu_lfq_node *head, *tail;
> > > 	struct rcu_lfq_node init;	/* Dummy initialization node */
> > > };
> > > 
> > > void rcu_lfq_node_init(struct rcu_lfq_node *node)
> > > {
> > > 	node->next = NULL;
> > > 	urcu_ref_init(&node->ref);
> > > }
> > > 
> > > void rcu_lfq_init(struct rcu_lfq_queue *q)
> > > {
> > > 	rcu_lfq_node_init(&q->init);
> > > 	/* Make sure the initial node is never freed. */
> > > 	urcu_ref_set(&q->init.ref, URCU_LFQ_PERMANENT_REF);
> > > 	q->head = q->tail = &q->init;
> > > }
> > > 
> > > void rcu_lfq_enqueue(struct rcu_lfq_queue *q, struct rcu_lfq_node *node)
> > > {
> > > 	urcu_ref_get(&node->ref);
> > 
> > OK, you acquire a reference before enqueuing and release the reference
> > after dequeueing.  How is this helping?  Because the reference is
> > initialized to 1, the dequeue will not invoke release.  This means that
> > the caller is still responsible for waiting for an RCU grace period
> > before freeing, correct?
> 
> Well, it's a bit more subtle than that. Enqueue increments the reference count,
> and Dequeue decrements the refcount of the current "dummy" head node on each
> dequeue. It does _not_ decrement the refcount of the node it returns, because
> the node will still be used as a dummy head until the next dequeue call. Note
> that each dequeue operation moves the dummy head forward (which allows us to
> unref the old dummy head).
> 
> So I initialize the queue with an "init" dummy head, which is only used at the
> beginning. It has a "permanent" reference to make sure put operations will not
> try to free it. As we queue/dequeue elements in the list, the previous node
> returned by dequeue is used as the dummy head.
> 
> So instead of requeueing a dummy head (which could cause problems, because a
> dummy element with a single address would not meet the RCU read-lock ABA
> protection requirement that addresses not be reused until a grace period has
> passed), we ensure that we always have a dummy head by using the previously
> returned node.

So the sequence of events that you are concerned about is as follows?

o	The queue consists of the dummy element followed by element A.

o	CPU 0 picks up the header pointer (to the dummy element)
	and the pointer to element A in preparation to dequeue,
	and is then preempted.

o	CPU 1 dequeues the dummy element, requeues it, then dequeues
	element A.  The list now contains only the dummy element.

o	CPU 1 enqueues element B. The list now contains the dummy
	element followed by element B.

o	CPU 0 does its compare-and-swap, which links the old element A
	back into the list.  Ouch!!!

This certainly reinforces my prejudice in favor of wait-free enqueues
combined with blocking dequeues!  ;-)
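
For reference, the wait-free enqueue I have in mind is roughly the
xchg-based scheme behind urcu's wfqueue. A sketch with illustrative
names, not the library's exact code:

	struct wfq_node {
		struct wfq_node *next;
	};

	struct wfq {
		struct wfq_node *head;
		struct wfq_node **tail;	/* points to &last_node->next */
	};

	void wfq_enqueue(struct wfq *q, struct wfq_node *node)
	{
		struct wfq_node **old_tail;

		node->next = NULL;
		/* xchg always succeeds, so enqueue never retries. */
		old_tail = uatomic_xchg(&q->tail, &node->next);
		/*
		 * Publish the node.  A dequeuer that sees NULL here
		 * must wait until this store becomes visible, which is
		 * why the matching dequeue is blocking.
		 */
		STORE_SHARED(*old_tail, node);
	}

The enqueue has no failure path and no retry loop; the price is paid
entirely on the dequeue side.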

> > > 	/*
> > > 	 * uatomic_cmpxchg() implicit memory barrier orders earlier stores to
> > > 	 * node before publication.
> > > 	 */
> > > 
> > > 	rcu_read_lock();
> > > 	for (;;) {
> > > 		struct rcu_lfq_node *tail = rcu_dereference(q->tail);
> > > 		struct rcu_lfq_node *next;
> > > 
> > > 		/*
> > > 		 * Typically expect tail->next to be NULL.
> > > 		 */
> > > 		next = uatomic_cmpxchg(&tail->next, NULL, node);
> > > 		if (next == NULL) {
> > > 			/*
> > > 			 * Tail was at the end of queue, we successfully
> > > 			 * appended to it.
> > > 			 * Now move tail (another enqueue might beat
> > > 			 * us to it, that's fine).
> > > 			 */
> > > 			uatomic_cmpxchg(&q->tail, tail, node);
> > > 			rcu_read_unlock();
> > > 			return;
> > > 		} else {
> > > 			/*
> > > 			 * Failure to append to current tail. Help moving tail
> > > 			 * further and retry.
> > > 			 */
> > > 			uatomic_cmpxchg(&q->tail, tail, next);
> > 
> > What prevents this node from being dequeued before the tail pointer
> > moves off of it?  The sequence of events that I am concerned about is
> > as follows:
> > 
> > o	CPU 0 enqueues element A on an empty list (which contains only
> > 	the dummy element).
> > 
> > o	CPU 1 dequeues the dummy element (and presumably somehow requeues
> > 	it, but I don't see how this happens).
> 
> We cannot dequeue the dummy element if the list is empty: the dummy element will
> have a NULL next pointer, which indicates an empty list. However, in this case,
> given that we have enqueued element A, we can dequeue A and unref the dummy
> init head.

And then the dummy is never used again, correct?  The head pointer
references the next pointer of the last element that was dequeued,
which cannot be freed (or otherwise reused) until some other element
is enqueued.  So the last item dequeued acts like a list header, and
thus is in some sense still a part of the queue.

Hmmm...

> >                                              Assuming that requeueing
> > 	of the dummy element actually occurs, suppose that CPU 1 is
> > 	preempted just before it would have advanced the tail pointer.
> > 
> > o	CPU 2 dequeues element A from the list.  The caller presumably can
> > 	neither free it or requeue it until a grace period passes.  But 
> > 	the caller must also refrain from modifying the ->next pointer
> > 	during this time, as doing so would break the helper in this case.
> > 
> 
> So let's rewrite the scenario considering the moving-dummy-head scheme:
> 
>  o    CPU 0 enqueues element A on an empty list (which contains only the dummy
>       init element).
> 
>  o    CPU 1 dequeues element A, unref dummy init head. (element A is the new
>       dummy head).
> 
>  o    CPU 2 dequeues from the list, only to find out that the dummy head ->next
>       is NULL. Queue is therefore empty.
> 
>  o    CPU 0 eventually sets the tail to element A.
> 
> 
> Now let's try a more elaborate scheme, where we would try to see if the dummy
> head can be unref before we set the tail to it:
> 
>  o    CPU 0 enqueues element A on an empty list (which contains only the dummy
>       init element).
> 
>  o    CPU 1 tries to enqueue element B on the list (which contains dummy init -> A).
>       However, tail->next is not NULL (because CPU 0 did not push the tail yet),
>       so help moving the tail and retry.
> 
>  o    CPU 2 dequeues element A, unref dummy init head. (element A is the new
>       dummy head).
> 
>  o    CPU 3 dequeues from the list, only to find out that the dummy head ->next
>       is NULL. Queue is therefore empty.
> 
>  o    CPU 0 eventually tries setting the tail to element A. This fails, because
>       CPU 1 helped out moving the tail to A.
> 
> 
> So, basically, an enqueuer cannot enqueue an element as long as the tail is not
> up to date. Therefore, by keeping the refcount on each element after it is
> dequeued (dropping the refcount at the following dequeue), we ensure that
> an entry is kept valid between the moment we add it to the list and the moment
> we update the tail.
> 
> > 	This might well be what you were getting at with your statement
> > 	about not reusing the lfq node in the rcu_lfq_dequeue() comment
> > 	header, but more clarity would be good.  ;-)
> 
> I think a lot of my explanations here could be added to the code as comments.
> 
> > 
> > So this seems OK.  But suppose we have the other ordering:
> > 
> > o	CPU 0 enqueues element A on an empty list, but is preempted before
> > 	updating the ->tail pointer.
> > 
> > o	CPU 1 dequeues, but gets the dummy element, which it presumably
> > 	requeues as follows:
> > 
> > 	o	next = uatomic_cmpxchg(&tail->next, NULL, node);
> > 
> > 	o	But this fails, since the dummy's ->next pointer references
> > 		the half-enqueued element A.
> > 
> > 	o	So it tries to help: uatomic_cmpxchg(&q->tail, tail, next);
> > 		This should move the tail ahead to element A, which
> > 		should allow the next pass through the loop to succeed.
> > 
> > So this seems OK as well.
> 
> Please see scenarios above, which I hope should take your concern into account.

Definitely looks like it would work in single-queue use...

> > How are you testing/validating this?  ;-)
> 
> Only testing at the moment, with urcu:
> 
> tests/test_urcu_lfq (lock-free queue)
> tests/test_urcu_lfs (lock-free stack)
> tests/test_urcu_wfs (wait-free stack push/blocking pop)
> 
> By varying the number of enqueuers/dequeuers, I can manage to create different
> scenarios. I empty the data structures at the end of the test, and check if the
> number of enqueue/dequeue or push/pop match (print a WARNING if not). I noticed
> that adding artificial delays within the algorithms helps finding races much
> more quickly. Typically, adding a delay only in the enqueue, or only in the
> dequeue seems to work best. I still do this by hand though. It could eventually
> be automated.

Fair enough!
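
For the archives, the delay injection can be as simple as the following
(a sketch; the RACE_TEST and RACE_WINDOW names are made up):

	#include <poll.h>

	#ifdef RACE_TEST
	/* Widen the race window at a chosen point in the algorithm. */
	#define RACE_WINDOW()	((void) poll(NULL, 0, 1))	/* ~1ms */
	#else
	#define RACE_WINDOW()	do { } while (0)
	#endif

Dropping a RACE_WINDOW() by hand between the tail->next cmpxchg and the
q->tail update in the enqueue path, for example, should make the
help-the-tail case fire almost every time.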

> > > 			continue;
> > 
> > The above "continue" is unnecessary, right?
> 
> True, although I find it a bit clearer to put something in each branch (either a
> return or a continue); it makes it easier to check that we did not miss any case.

OK.

> > > 		}
> > > 	}
> > > }
> > > 
> > > /*
> > >  * The entry returned by dequeue must be taken care of by doing a urcu_ref_put,
> > >  * which calls the release primitive when the reference count drops to zero. A
> > >  * grace period must elapse before the actual memory reclamation is performed
> > >  * in the release primitive.
> > >  * The entry lfq node returned by dequeue must no be re-used before the
> > 
> > s/no/not/
> > 
> > >  * reference count reaches zero.
> > >  */
> > > struct rcu_lfq_node *
> > > rcu_lfq_dequeue(struct rcu_lfq_queue *q, void (*release)(struct urcu_ref *))
> > > {
> > > 	rcu_read_lock();
> > > 	for (;;) {
> > > 		struct rcu_lfq_node *head = rcu_dereference(q->head);
> > > 		struct rcu_lfq_node *next = rcu_dereference(head->next);
> > 
> > And head->next is guaranteed to exist due to the dummy node always being
> > present, right?
> 
> As I explained above, it is guaranteed to exist thanks to the previously
> dequeued node, which is kept around as a dummy head until the next dequeue.

Yep!

> > 
> > > 		if (next) {
> > > 			if (uatomic_cmpxchg(&q->head, head, next) == head) {
> > > 				rcu_read_unlock();
> > > 				urcu_ref_put(&head->ref, release);
> > 
> > This drops the reference count to one, not to zero, right?
> 
> Well, it depends. Note that we drop the refcount of "head", not "next". So after
> enqueueing the first node, when we dequeue, we would try to unref the "init
> dummy" node. Because its refcount is permanent (128), we decrement it to 127.

And the dummy node is never used again, so 128 is larger than needed.

But I won't complain about 128.  ;-)
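
For readers without urcu/urcu_ref.h handy, the semantics assumed
throughout are kref-like; paraphrasing (not the exact header):

	struct urcu_ref {
		long refcount;	/* updated atomically */
	};

	static inline void urcu_ref_set(struct urcu_ref *ref, long val)
	{
		uatomic_set(&ref->refcount, val);
	}

	static inline void urcu_ref_init(struct urcu_ref *ref)
	{
		urcu_ref_set(ref, 1);	/* creator holds one reference */
	}

	static inline void urcu_ref_get(struct urcu_ref *ref)
	{
		uatomic_add(&ref->refcount, 1);
	}

	static inline void urcu_ref_put(struct urcu_ref *ref,
			void (*release)(struct urcu_ref *))
	{
		if (uatomic_sub_return(&ref->refcount, 1) == 0)
			release(ref);	/* last reference dropped */
	}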

> Then, when we enqueue some more and dequeue again, we decrement the refcount of
> the first dequeued element, which can be either 2 or 1. It all depends on whether
> or not this element has already been unref'd by the first dequeue call. So we
> decrement it from 2 to 1 if it has not been unref'd by the first dequeue, or we
> decrement it from 1 to 0 if the first dequeue caller has already unref'd it.

OK, yes, the comment says that the caller must urcu_ref_put().
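
To make the calling convention concrete, a minimal sketch of such a
release function (this is the freenode used in the pipeline sketches
further down; mystruct is the hypothetical wrapper from above, and a
real implementation would more likely defer the free than block here):

	static void freenode(struct urcu_ref *ref)
	{
		struct rcu_lfq_node *node =
			container_of(ref, struct rcu_lfq_node, ref);
		struct mystruct *s = container_of(node, struct mystruct, node);

		/*
		 * Per the dequeue header comment: wait for a grace
		 * period before the node's memory can be reclaimed.
		 */
		synchronize_rcu();
		free(s);
	}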

> > > 				return next;
> > 
> > What if local variable "next" points to the dummy node?  I don't understand
> > how this works unless we requeue it.
> 
> It can't, because by definition the "dummy" node I use here is always the very
> head of the list. I don't "requeue" dummy nodes; it's the previously dequeued
> node that "becomes" a dummy node until the next dequeue.
> 
> Thanks for the thorough review !

OK, so now that there is some chance that I actually understand the
algorithm...  ;-)

Suppose that an application has a pipelined sequence of processing
steps, with a queue between each.  Let's have three steps and two queues
(S1, S2, and S3 for steps and Q1 and Q2 for queues).

Given the above implementation, the code for S2 must do something
like the following:

	for (;;) {
		while ((np = rcu_lfq_dequeue(&Q1, freenode)) != NULL) {
			do_something_with(np);
			np1 = copy_node(np);
			rcu_lfq_enqueue(&Q2, np1);
			urcu_ref_put(&np->ref, freenode);
		}
	}

Yes, you can reduce the copy overhead by having np and np1 reference
bare nodes with a pointer to the real data, so only a minimal amount
of data need be copied.
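
Concretely, something like the following (all names hypothetical):

	struct bare_node {
		struct rcu_lfq_node node;	/* per-hop queue linkage */
		struct payload *data;		/* shared, never copied */
	};

	static struct rcu_lfq_node *copy_node(struct rcu_lfq_node *np)
	{
		struct bare_node *old = container_of(np, struct bare_node, node);
		struct bare_node *new = malloc(sizeof(*new));

		if (new == NULL)
			return NULL;	/* -ENOMEM on enqueue, as Mathieu noted */
		rcu_lfq_node_init(&new->node);
		new->data = old->data;	/* only the pointer moves */
		return &new->node;
	}

But even then, every hop costs an allocation, a free, and a grace
period, which is the overhead I would like to avoid.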

In contrast, given an implementation that (re)uses a dummy element and that
has a wait-free enqueue with blocking dequeue, you could do the following:

	for (;;) {
		while ((np = rcu_lfq_dequeue(&Q1, freenode)) != NULL) {
			do_something_with(np);
			rcu_lfq_enqueue(&Q2, np);
		}
	}

Only S3 need wait for a grace period, given that there are no cycles in
the queueing network.

This sort of thing is one reason that I am OK with hardware transactional
memory that has small transactions.  ;-)

							Thanx, Paul

> Mathieu
> 
> > 
> > > 			} else {
> > > 				/* Concurrently pushed, retry */
> > > 				continue;
> > > 			}
> > > 		} else {
> > > 			/* Empty */
> > > 			rcu_read_unlock();
> > > 			return NULL;
> > > 		}
> > > 	}
> > > }
> > > 
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
> 
> _______________________________________________
> rp-private mailing list
> rp-private at svcs.cs.pdx.edu
> http://svcs.cs.pdx.edu/mailman/listinfo/rp-private



