[ltt-dev] Choice of wait scheme for UP-aware userspace RCU

Sat Sep 26 19:10:39 EDT 2009

Hi !

I recently tackled the task to make userspace RCU more efficient on
uniprocessor. Busy-waiting did not seem to be the best solution around.

So I end up with 4 possible choices. Given I cannot specialize the
library for UP (don't want to end up confusing my users), I need to
consider both SMP and UP performance in the same build. They are all
available under a branch in the git://lttng.org/userspace-rcu.git tree.

We have the choice between :

  * urcu/busyloop

Busy-waiting approach. Works well on 2+ processors, but not quite
efficient on UP machines or in scenario where there is more threads than
processors. This is what we had last week.

For all the following approaches, I use a busy-loop for a fixed number
of cycles (to support SMP), and if this fails, we go for the "slow
path".

  * urcu/futex

Approach using futex wait in synchronize_rcu() and futex wake at the end
of the rcu_read_lock() (or after marking Q.S. in QSBR). My preliminary
tests shows that it would very likely be appropriate for QSBR, because
the wait period is long enough. Also, there is no addition to the fast
path in this case.

However, for signal and mb-based RCU, we need to add a test after the
read-side C.S., which impact read-side performances a tiny bit. I'd like
to avoid that.

Downside: using sys_futex() on the read-side makes the read-side
non-completely wait-free, because the Linux kernel sys_futex
implementation uses a hash bucket spinlock.

  * urcu/yield

This approach is basic enough: busy-looping for a fixed amount of cycles
while waiting for readers (to deal with SMP systems), and then call
sched_yield. The periods for which the writer waits for a specific
reader is marked, so the reader can notice it when it calls
rcu_read_lock() and call sched_yield(). Similar solution for QSBR.

  * urcu/qsbr/adaptusleep

This is an approach I thought could help dealing with the long qsbr Q.S.
delays without using sys_futex: using timestamps at each
rcu_quiescent_state(), and use the duration of the previous window to
approximate a usleep delay. The readers call sched_yield() after
changing their Q.S. and when going offline if they notice that the
writer thread is waiting for them.

However, it appears to be slower than the sys_futex approach, mainly
because we only approximate the sleep time. sys_futex puts us back on
the runqueue ASAP. It also shares the downside of sys_futex(): doing a
gettimeofday in rcu_quiescent_state() needs to take a read sequence
lock, which kills wait-free guarantees on the read-side. However, we
could mitigate this problem by changing the kernel implementation to a
RCU-based scheme (I already have one ready, but no time to deploy it).

* Preliminary conclusions

QSBR:

I think we could afford the sys_futex for QSBR, and specify that within
the kernel syscall implementations we might find use of non wait-free
algorithms. This solution really gave the best results on my UP machine.

URCU mb/signal:

I don't like the added runtime cost of the extra check added in
rcu_read_unlock() for the sys_futex approach. I had very similar
performance results using the urcu/yield approach, which only depends on
calling sched_yield() both in synchronize_rcu() and rcu_read_lock()
(without adding any data read to the fast path, just a branch).

Comments ? Please feel free to play with the testbenchs on your own
machines, ideally by putting cpus offline with
   /sys/devices/system/cpu*/online

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68