[ltt-dev] [rp-private] [RFC] Lock-free RCU queue for userspace RCU library

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Fri Jul 16 17:04:39 EDT 2010


* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
[...]
> So the sequence of events that you are concerned about is as follows?
> 
> o	The queue consists of the dummy element followed by element A.
> 
> o	CPU 0 picks up the header pointer (to the dummy element)
> 	and the pointer to element A in preparation to dequeue,
> 	and is then preempted.
> 
> o	CPU 1 dequeues the dummy element, requeues it, then dequeues
> 	element A.  The list now contains only the dummy element.
> 
> o	CPU 1 enqueues element B. The list now contains the dummy
> 	element followed by element B.
> 
> o	CPU 0 does its compare-and-swap, which links the old element A
> 	back into the list.  Ouch!!!
> 
> This certainly reinforces my prejudice in favor of wait-free enqueues
> combined with blocking dequeues!  ;-)

Yes, the failure mode involves re-inserting the "same" dummy element at the
wrong moment while a concurrent read-then-cmpxchg sequence is in progress,
which expects to use the dummy element's address as an indicator that nothing
changed underneath.
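
To make the interleaving concrete, here is a minimal sketch of the dequeue
fast path in question (simplified; only the names appearing in the fragments
quoted below are taken from the actual RFC code):

	rcu_read_lock();
	head = rcu_dereference(q->head);	/* the dummy's address */
	next = rcu_dereference(head->next);	/* element A; preempted here */

	/*
	 * Meanwhile another thread dequeues the dummy, requeues it,
	 * dequeues element A and enqueues element B.  q->head holds the
	 * same dummy address again, so the cmpxchg cannot tell that
	 * anything changed underneath...
	 */
	if (next) {
		if (uatomic_cmpxchg(&q->head, head, next) == head) {
			/* ...and links the stale element A back in: ABA. */
			rcu_read_unlock();
			urcu_ref_put(&head->ref, release);
			return next;
		}
	}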


> > We cannot dequeue the dummy element if the list is empty. The dummy element will
> > have a NULL next, which indicates that the list is empty. However, in this case,
> > given that we have enqueued element A, then we can dequeue A and unref the dummy
> > init head.
> 
> And then the dummy is never used again, correct?  The head pointer
> references the next pointer of the last element that was dequeued,
> which cannot be freed (or otherwise reused) until some other element
> is enqueued.  So the last item dequeued acts like a list header, and
> thus is in some sense still a part of the queue.

Yep, that's why I keep a refcount on it.
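
For reference, the node layout this implies is roughly the following (a
sketch only; the field names are assumptions based on the calls quoted
below, not the exact RFC declarations):

	struct rcu_lfq_node {
		struct rcu_lfq_node *next;
		struct urcu_ref ref;	/* keeps the last dequeued node
					   alive while it acts as dummy */
	};

	struct rcu_lfq_queue {
		struct rcu_lfq_node *head;	/* current dummy */
		struct rcu_lfq_node *tail;
	};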

> 
> Hmmm...
>

[...]
 
> > Please see scenarios above, which I hope should take your concern into account.
> 
> Definitely looks like it would work in single-queue use...
>
[...] 
> > > 
> > > > 		if (next) {
> > > > 			if (uatomic_cmpxchg(&q->head, head, next) == head) {
> > > > 				rcu_read_unlock();
> > > > 				urcu_ref_put(&head->ref, release);
> > > 
> > > This drops the reference count to one, not to zero, right?
> > 
> > Well, it depends. Note that we drop the refcount of "head", not "next". So after
> > enqueueing the first node, when we dequeue, we would try to unref the "init
> > dummy" node. Because its refcount is permanent (128), we decrement it to 127.
> 
> And the dummy node is never used again, so 128 is larger than needed.
> 
> But I won't complain about 128.  ;-)

Yeah, that's pretty much arbitrary.
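
(Hypothetically, setting up the init dummy boils down to something like the
line below; the exact call is an assumption, only the value 128 comes from
the discussion above.)

	/* The init dummy is only ever unreferenced once, so any initial
	 * count > 1 keeps its release callback from ever firing. */
	urcu_ref_set(&init_dummy.ref, 128);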

> > > > 				return next;
> > > 
> > > What if local variable "next" points to the dummy node?  I don't understand
> > > how this works unless we requeue it.
> > 
> > It can't, because by definition the "dummy" node I use here is always the very
> > head of the list. I don't "requeue" dummy nodes; it's the previously dequeued
> > node that "becomes" the dummy node until the next dequeue.
> > 
> > Thanks for the thorough review !
> 
> OK, so now that there is some chance that I actually understand the
> algorithm...  ;-)

;-)

> 
> Suppose that an application has a pipelined sequence of processing
> steps, with a queue between each.  Let's have three steps and two queues
> (S1, S2, and S3 for steps and Q1 and Q2 for queues).

Yeah, I think this is a downside of this approach, especially if we can't find
a way to cascade the calls to the urcu_ref_put "freenode" handlers. I am
thinking along the lines that a "freenode" callback, in the middle of a
cascade, does not necessarily have to free the node. It could decrement the
next-level refcount (or something like that); see the rough sketch below.
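
Purely as a sketch of that idea (all the names here are hypothetical, and
this is not worked out): each pipeline stage could have its own refcount
embedded in the node, and the release handler of one stage would forward
the put to the next stage instead of freeing:

	/* Hypothetical: one linkage + refcount per pipeline stage. */
	struct mynode {
		struct rcu_lfq_node q1_node;	/* used while on Q1 */
		struct rcu_lfq_node q2_node;	/* used while on Q2 */
		/* ... payload ... */
	};

	static void q2_release(struct urcu_ref *ref)
	{
		/* Last level of the cascade: actually free the node. */
		free(caa_container_of(ref, struct mynode, q2_node.ref));
	}

	static void q1_release(struct urcu_ref *ref)
	{
		struct mynode *node =
			caa_container_of(ref, struct mynode, q1_node.ref);

		/*
		 * Middle of the cascade: do not free, decrement the
		 * next level's refcount instead.
		 */
		urcu_ref_put(&node->q2_node.ref, q2_release);
	}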

We also have to consider that batch dequeueing is very often preferable (e.g.
batched RCU callback execution), which argues in favor of a blocking dequeue.
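
For instance, a blocking-dequeue design makes a splice-style batch API
natural. Hypothetically (rcu_lfq_dequeue_all() is an invented name, not part
of the RFC), the consumer side could then look like:

	for (;;) {
		/*
		 * Splice all queued nodes out in one operation, then
		 * walk the batch locally without further atomic ops.
		 */
		np = rcu_lfq_dequeue_all(&q);
		while (np) {
			next = np->next;
			do_something_with(np);
			np = next;
		}
	}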

> 
> Given the above implementation, the code for S2 must do something
> like the following:
> 
> 	for (;;) {
> 		while (np = rcu_lfq_dequeue(&Q1, freenode)) {
> 			do_something_with(np);
> 			np1 = copy_node(np);
> 			rcu_lfq_enqueue(&Q2, np1);
> 			urcu_ref_put(&np->ref, freenode);
> 		}
> 	}
> 
> Yes, you can reduce the copy overhead by having np and np1 reference
> bare nodes with a pointer to the real data, so only a minimal amount
> of data need be copied.

This is where I'd like to come up with a clever scheme that handles the
hand-off through cascaded refcounting rather than doing the copy. I think it
might be doable, but nothing comes to mind at this very moment, which is busy
packing and organising my vacation! ;) I'll be offline for 2 weeks starting
this evening, back on August 2nd.

Thanks !

Mathieu

> 
> In contrast, given an implementation that (re)uses a dummy element and that
> has a wait-free enqueue with blocking dequeue, you could do the following:
> 
> 	for (;;) {
> 		while (np = rcu_lfq_dequeue(&Q1, freenode)) {
> 			do_something_with(np);
> 			rcu_lfq_enqueue(&Q2, np);
> 		}
> 	}
> 
> Only S3 need wait for a grace period, given that there are no cycles in
> the queueing network.
> 
> This sort of thing is one reason that I am OK with hardware transactional
> memory that has small transactions.  ;-)
> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
