[ltt-dev] [rp-private] [RFC] Lock-free RCU queue for userspace RCU library
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Mon Jul 12 20:01:59 EDT 2010
On Mon, Jul 12, 2010 at 12:32:23AM -0400, Mathieu Desnoyers wrote:
> Hi,
>
> I looked a bit more closely at Paolo's lock-free RCU queue implementation and
> found the idea great, but there were a few irritating points with it. Mainly,
> implementing a whole memory allocation scheme for a data structure seemed
> overkill and error-prone. Also, I prefer to decouple the implementation of RCU
> data structures from memory allocation whenever possible, ideally making it
> possible to put the list heads directly within already allocated data
> structures, just like the rcu list. In addition, the very fact that it allocated
> memory on the enqueue meant that we would have to deal with -ENOMEM upon
> enqueue. Also, relying on memory allocation/deferred free directly within the
> data structure implementation means that it is not exactly lock-free, as we may
> have locks in the underlying memory allocator implementation. This means the
> enqueue path would also not be signal-safe.
>
> So, I think I found a neat way to handle lock-free rcu queues with reference
> counting instead. Note that the "dummy" node is really needed, otherwise we end
> up in situations where the "empty" queue is hard to deal with racelessly. FYI,
> the ABA problem is dealt with by using rcu read locks. Comments on the code
> below are very welcome.
I believe that you need to explicitly requeue the dummy node when
it is dequeued, and I don't see code below to do this. Also, the
dequeue function should be more explicit in its header comment: don't
modify the node until a grace period passes. I don't see how the
reference count helps, possibly indicating a failure of imagination
on my part.
See below for details.
Thanx, Paul
> Thanks,
>
> Mathieu
>
>
> /*
> * rculfqueue.h
> *
> * Userspace RCU library - Lock-Free RCU Queue
> *
> * Copyright 2010 - Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> *
> * This library is free software; you can redistribute it and/or
> * modify it under the terms of the GNU Lesser General Public
> * License as published by the Free Software Foundation; either
> * version 2.1 of the License, or (at your option) any later version.
> *
> * This library is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> * Lesser General Public License for more details.
> *
> * You should have received a copy of the GNU Lesser General Public
> * License along with this library; if not, write to the Free Software
> * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> */
>
> #include <urcu/urcu_ref.h>
> #include <assert.h>
>
> /*
> * Lock-free RCU queue using reference counting. This implementation keeps a
> * dummy head node to ensure we can always update the queue locklessly. Given
> * that this is a queue, the dummy head node must always advance as we dequeue
> * entries. Therefore, we keep a reference count on each entry we are
> * dequeueing, so they can be kept as dummy head node until the next dequeue, at
> * which point their reference count will be decremented.
> */
>
> #define URCU_LFQ_PERMANENT_REF 128
>
> struct rcu_lfq_node {
> struct rcu_lfq_node *next;
> struct urcu_ref ref;
> };
The rcu_lfq_node is expected to be a field in a larger data structure
that is queued, correct?
> struct rcu_lfq_queue {
> struct rcu_lfq_node *head, *tail;
> struct rcu_lfq_node init; /* Dummy initialization node */
> };
>
> void rcu_lfq_node_init(struct rcu_lfq_node *node)
> {
> node->next = NULL;
> urcu_ref_init(&node->ref);
> }
>
> void rcu_lfq_init(struct rcu_lfq_queue *q)
> {
> rcu_lfq_node_init(&q->init);
> /* Make sure the initial node is never freed. */
> urcu_ref_set(&q->init.ref, URCU_LFQ_PERMANENT_REF);
> q->head = q->tail = &q->init;
> }
>
> void rcu_lfq_enqueue(struct rcu_lfq_queue *q, struct rcu_lfq_node *node)
> {
> urcu_ref_get(&node->ref);
OK, you acquire a reference before enqueuing and release the reference
after dequeueing. How is this helping? Because the reference is
initialized to 1, the dequeue will not invoke release. This means that
the caller is still responsible for waiting for an RCU grace period
before freeing, correct?
> /*
> * uatomic_cmpxchg() implicit memory barrier orders earlier stores to
> * node before publication.
> */
>
> rcu_read_lock();
> for (;;) {
> struct rcu_lfq_node *tail = rcu_dereference(q->tail);
> struct rcu_lfq_node *next;
>
> /*
> * Typically expect tail to be NULL.
> */
> next = uatomic_cmpxchg(&tail->next, NULL, node);
> if (next == NULL) {
> /*
> * Tail was at the end of queue, we successfully
> * appended to it.
> * Now move tail (another enqueue might beat
> * us to it, that's fine).
> */
> uatomic_cmpxchg(&q->tail, tail, node);
> rcu_read_unlock();
> return;
> } else {
> /*
> * Failure to append to current tail. Help moving tail
> * further and retry.
> */
> uatomic_cmpxchg(&q->tail, tail, next);
What prevents this node from being dequeued before the tail pointer
moves off of it? The sequence of events that I am concerned about is
as follows:
o CPU 0 enqueues element A on an empty list (which contains only
the dummy element).
o CPU 1 dequeues the dummy element (and presumably somehow requeues
it, but I don't see how this happens). Assuming that requeueing
of the dummy element actually occurs, suppose that CPU 1 is
preempted just before it would have advanced the tail pointer.
o CPU 2 dequeues element A from the list. The caller presumably can
neither free it or requeue it until a grace period passes. But
the caller must also refrain from modifying the ->next pointer
during this time, as doing so would break the helper in this case.
This might well be what you were getting at with your statement
about not reusing the lfq node in the rcu_lfq_dequeue() comment
header, but more clarity would be good. ;-)
So this seems OK. But suppose we have the other ordering:
o CPU 0 enqueues element A on an empty list, but is preempted before
updating the ->tail pointer.
o CPU 1 dequeues, but gets the dummy element, which it presumably
requeues as follows:
o next = uatomic_cmpxchg(&tail->next, NULL, node);
o But this fails, since the dummy's ->next pointer references
the half-enqueued element A.
o So it tries to help: uatomic_cmpxchg(&q->tail, tail, next);
This should move the tail ahead to element A, which
should allow the next pass through the loop to succeed.
So this seems OK as well.
How are you testing/validating this? ;-)
> continue;
The above "continue" is unnecessary, right?
> }
> }
> }
>
> /*
> * The entry returned by dequeue must be taken care of by doing a urcu_ref_put,
> * which calls the release primitive when the reference count drops to zero. A
> * grace period must be waited before performing the actual memory reclamation
> * in the release primitive.
> * The entry lfq node returned by dequeue must no be re-used before the
s/no/not/
> * reference count reaches zero.
> */
> struct rcu_lfq_node *
> rcu_lfq_dequeue(struct rcu_lfq_queue *q, void (*release)(struct urcu_ref *))
> {
> rcu_read_lock();
> for (;;) {
> struct rcu_lfq_node *head = rcu_dereference(q->head);
> struct rcu_lfq_node *next = rcu_dereference(head->next);
And head->next is guaranteed to exist due to the dummy node always being
present, right?
> if (next) {
> if (uatomic_cmpxchg(&q->head, head, next) == head) {
> rcu_read_unlock();
> urcu_ref_put(&head->ref, release);
This drops the reference count to one, not to zero, right?
> return next;
What if local variable "next" points to the dummy node? I don't understand
how this works unless we requeue it.
> } else {
> /* Concurrently pushed, retry */
> continue;
> }
> } else {
> /* Empty */
> rcu_read_unlock();
> return NULL;
> }
> }
> }
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>
> _______________________________________________
> rp-private mailing list
> rp-private at svcs.cs.pdx.edu
> http://svcs.cs.pdx.edu/mailman/listinfo/rp-private
More information about the lttng-dev
mailing list