[ltt-dev] lock-free data structures (was Re: [PATCH 12/12] centralize definition of BITS_PER_LONG)

Fri Feb 19 14:25:04 EST 2010

>> Some time we should also add double-long compare-and-swap, that's very
>> useful for lock-free lists.
>
> Do you have pointers to papers describing this double-wide CAS
> linked-list structure ?

I found 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.8494&rep=rep1&type=pdf 
(page 4) which is a lock-free queue.

The two words are used like this:

   struct gen_ptr {
     long gen;
     void *ptr;
   }

where gen is always incremented whenever ptr is changed.

> Yes, good idea! Although this won't be available on all architectures.
> We might have to think of a mutex-based compatibility layer for these.

You don't need a mutex if you use it for a lock-free queue.  You're just 
better off providing two version of the queue, one for double-word CAS 
and a not-really-lock-free one for simple CAS.

The elementary operation in that algorithm are reading a struct gen_ptr 
in a way that can be compared-and-swapped later, and doing a 
compare-and-swap that checks consistency of generation and pointer and 
increments the generation.  If you have double-word CAS, everything is 
very simple because consistency is taken care of by the double-word CAS. 
  So with double-word CAS the above are

   void *atomicRead (struct gen_ptr *val)
   {
       return val->ptr;
   }

and

   long atomicCmpXchgIncrGeneration (struct gen_ptr *val, void *old_ptr,
                                     void *new_ptr)
   {
       int old_gen = val->gen;
       struct gen_ptr read =
           atomicCmpXchgDouble (val, old_gen, old_ptr,
                                old_gen + 1, new_ptr)
       return read.gen;
     }

but even without double-word CAS you can do it because if the 
generations match, so will the pointers.  So the CAS will be on the 
generation in this case, and you use the low-order bit as a tag bit for 
"operation in progress".  It would go something like this (untested):

   struct gen_ptr atomicRead (struct gen_ptr *val)
   {
       struct gen_ptr read;
       for (;;) {
           read.gen = val->gen;
           if (read.gen & 1) {
               /* Unlucky.  */
               continue;
           } else {
               read.ptr = val->ptr;
               rmb ();
               if (read.gen == val->gen)
                   return read;
           }
       }
   }

   long
   atomicCmpXchgWithGeneration (struct gen_ptr *val, long old_gen,
                                void *new_ptr)
   {
       assert (!(old_gen & 1));
       int read_gen = CAS (&val->gen, old_gen, old_gen + 3);
       if (read_gen == old_gen) {
           /* We are the only ones that have access to val->ptr here.
              Since the low bit is 1, other threads will lock on
              atomicRead and fail the CAS in their
              atomicCmpXchgWithGeneration.  Update val->ptr and untag
              the generation number.  */
           val->ptr = new_ptr;
           wmb ();
           val->gen--;
       }
       return read_gen;
   }

Now, while the above is certainly fun, I don't think it belongs in 
liburcu.  However...

> I wonder if we could use a clever RCU structure
> to mimick the double-wide CAS.

I don't know, I think lock-free queues and RCU are used for very 
different access patterns (and if you need lock-free, chances are that 
lock-free will kill your cache and not give you _that much_ performance, 
so maybe you'd better rethink everything...).  However I'm not expert at 
all.

Still, related to this, the paper above gives only half of the story. 
The lock-free queue needs a free list traditionally, and the paper 
mentions using a lock-free stack as a free list (BTW the lock-free stack 
doesn't need double-word CAS).  But then, the nodes on the freelist can 
never be returned to the OS because you can always race with the 
operations on the lock-free queue.

So, maybe RCU could be used to detect quiescent periods when there is no 
operation being performed on the lock-free queue.  *Both enqueue and 
dequeue* would be wrapped in rcu_read_{,un}lock(), unlike normal RCU 
algorithms because this one is lock free.  Then, in dequeue you'd 
replace the paper's line D19 with

     defer_rcu(free, head);

Hmm...

Paolo