[ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)

Wed Feb 11 03:58:52 EST 2009

* Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
> Mathieu Desnoyers wrote:
> > 
> > I just did a mb() version of the urcu :
> > 
> > (uncomment CFLAGS+=-DDEBUG_FULL_MB in the Makefile)
> > 
> > Time per read : 48.4086 cycles
> > (about 6-7 times slower, as expected)
> > 
> 
> I have read many of Paul's papers
> (http://www.rdrop.com/users/paulmck/RCU/)
> and I know Paul worked hard to remove memory barriers from the
> RCU read side in the kernel. His work is significant.
> 
> But, I think:
> 1) Userspace RCU's read side can afford the latency of a
> memory barrier (including an atomic operation).
>    Userspace does not access shared data as frequently as the kernel does,
> and a userspace read side is not as fast as the kernel's anyway.
> 
> 2) Userspace uses RCU for RCU's advantages, not to save a few CPU cycles
>    (http://lwn.net/Articles/263130/).
>    One of the most important advantages is being lock-free.
> 
> 
> If my thinking is right, the following suggestion may also be worthwhile:
> 
> Use an "all-system" RCU for userspace RCU.
> 
> The all-system RCU is QRCU, which was implemented by Paul:
> http://lwn.net/Articles/223752/
> 
> Any system that has mechanisms equivalent to atomic_op,
> __wait_event, wake_up, and mutex can implement QRCU.
> So most systems can implement QRCU, which is why I call QRCU the all-system RCU.
> 
> Obviously, we can implement a portable QRCU quite simply on top of NPTL,
> and the read lock is:
> 	for (;;) {
> 		int idx = qp->completed & 0x1;
> 		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
> 			return idx;
> 	}
> "atomic_inc_not_zero" is most likely called only once, so it is fast enough.
> 

Hi Lai,

There are a few reasons why we need RCU in userspace for tracing:

- We need very fast per-cpu read-side synchronization for data structure
  handling. Updates are rare (enabling/disabling tracing). Therefore,
  your argument about userspace not needing "fast" rcu does not hold in
  this case. Note that LTTng has the performance it has today in the
  kernel because I made sure to use no memory barriers when unnecessary
  and because I used the minimal amount of atomic operations required.
  Those represent costly synchronization primitives on quite a few
  architectures.
- Being lock-free (atomic). To trace code executed in signal handlers,
  we need to be able to nest over any user code. With the solution you
  propose above, the busy-loop in the read lock does not seem to be
  signal-safe: if it nests over a writer, it could busy-loop forever.
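
For reference, here is a rough sketch of how the quoted read lock might look
in portable C11, under a few assumptions: the struct layout and the
qrcu_read_trylock()/qrcu_read_unlock() names are hypothetical, the writer side
and its wakeup are omitted, and the retry bound is my own addition rather than
part of QRCU. It only illustrates that each attempt costs at least one atomic
read-modify-write, and that a context which cannot wait for the interrupted
thread (a signal handler, for instance) would need some way out of the loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace QRCU state; field names follow the quoted snippet. */
struct qrcu {
	atomic_int completed;	/* low bit selects the active counter */
	atomic_int ctr[2];	/* per-index reader counts */
};

/* Increment *v unless it is zero. Returns true if the increment happened. */
static bool atomic_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Bounded variant of the quoted read lock (assumption: not part of QRCU).
 * Returns the counter index on success, or -1 after max_tries attempts.
 */
static int qrcu_read_trylock(struct qrcu *qp, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		int idx = atomic_load(&qp->completed) & 0x1;

		if (atomic_inc_not_zero(&qp->ctr[idx]))
			return idx;
	}
	return -1;
}

/* Drop the reader reference; waking a waiting writer is omitted here. */
static void qrcu_read_unlock(struct qrcu *qp, int idx)
{
	atomic_fetch_sub(&qp->ctr[idx], 1);
}
```

A caller that gets -1 back has to decide what to do (a tracer would probably
drop the event), which is exactly the kind of cost the barrier-free read side
avoids.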

Mathieu

> Lai.
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68