[ltt-dev] Choice of wait scheme for UP-aware userspace RCU

Josh Triplett josh at joshtriplett.org
Sat Sep 26 19:34:10 EDT 2009


On Sat, Sep 26, 2009 at 07:10:39PM -0400, Mathieu Desnoyers wrote:
> I recently tackled the task of making userspace RCU more efficient on
> uniprocessor systems. Busy-waiting did not seem to be the best solution
> around.
> 
> So I ended up with 4 possible choices. Given that I cannot specialize
> the library for UP (I don't want to end up confusing my users), I need
> to consider both SMP and UP performance in the same build. They are all
> available under a branch in the git://lttng.org/userspace-rcu.git tree.

You still have the option of detecting the number of processors at
runtime (to disable busywaits on uniprocessor systems, for instance).
You can also make decisions based on the number of threads registered to
urcu.
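
For example (just a sketch; the helper name is made up and is not a
liburcu symbol):

#include <unistd.h>

/*
 * Illustrative helper: decide once, at library init time, whether
 * spinning is worth doing at all.
 */
static int has_multiple_cpus(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	/* On error, be conservative and assume a single CPU. */
	return ncpus > 1;
}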

> We have the choice between :
> 
>   * urcu/busyloop
> 
> Busy-waiting approach. Works well on 2+ processors, but not very
> efficient on UP machines or in scenarios where there are more threads
> than processors. This is what we had last week.
> 
> For all the following approaches, I use a busy-loop for a fixed number
> of cycles (to support SMP), and if this fails, we go for the "slow
> path".

You could cut out the busyloop on UP at runtime.

Also, on x86, you could include a pause instruction in your busyloop,
which will improve performance on multicore/multithread systems.
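
Something along these lines, perhaps (rough sketch, assuming x86 and gcc
inline asm; the loop bound and the reader_done() predicate are
placeholders, not urcu symbols):

static inline void cpu_relax(void)
{
	asm volatile("rep; nop" : : : "memory");	/* i.e. "pause" */
}

static int wait_a_bit(int (*reader_done)(void))
{
	int i;

	/* Bounded spin so UP or oversubscribed systems don't spin forever. */
	for (i = 0; i < 4096; i++) {
		if (reader_done())
			return 1;	/* fast path succeeded */
		cpu_relax();
	}
	return 0;	/* caller falls back to the slow path */
}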

>   * urcu/futex
> 
> Approach using futex wait in synchronize_rcu() and futex wake at the end
> of the rcu_read_lock() (or after marking Q.S. in QSBR). My preliminary
> tests show that it would very likely be appropriate for QSBR, because
> the wait period is long enough. Also, there is no addition to the fast
> path in this case.
> 
> However, for signal and mb-based RCU, we need to add a test after the
> read-side C.S., which impacts read-side performance a tiny bit. I'd like
> to avoid that.
> 
> Downside: using sys_futex() on the read-side makes the read-side
> non-completely wait-free, because the Linux kernel sys_futex
> implementation uses a hash bucket spinlock.

We could potentially fix that in the kernel.  Ideally sys_futex should
become entirely wait-free.  (In the cases where it doesn't need to
block, obviously. ;) )
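
For reference, the wait/wake pattern you describe would look roughly like
this (only a sketch, not your actual code; gp_futex, the barriers and the
error handling are all simplified):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int gp_futex;

/* Writer: called from synchronize_rcu() once the busy-loop gives up. */
static void wait_gp(void)
{
	gp_futex = -1;
	__sync_synchronize();
	/* The real code must re-check the reader state here before
	 * sleeping, otherwise a wakeup could be missed. */
	syscall(SYS_futex, &gp_futex, FUTEX_WAIT, -1, NULL, NULL, 0);
}

/* Reader: after the read-side C.S. (or after marking a Q.S. in QSBR),
 * only issue the syscall if the writer announced that it is waiting. */
static void wake_up_gp(void)
{
	__sync_synchronize();
	if (gp_futex == -1) {
		gp_futex = 0;
		syscall(SYS_futex, &gp_futex, FUTEX_WAKE, 1, NULL, NULL, 0);
	}
}

The conditional wake is what keeps the common read-side case syscall-free.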

>   * urcu/yield
> 
> This approach is basic enough: busy-looping for a fixed number of cycles
> while waiting for readers (to deal with SMP systems), and then calling
> sched_yield. The periods during which the writer waits for a specific
> reader are marked, so the reader can notice this when it calls
> rcu_read_lock() and call sched_yield() in turn. A similar solution is
> used for QSBR.

You should not call sched_yield.  Among other things, it has
unpredictable semantics across kernel versions, and it fundamentally
seems worse than blocking on an appropriate event.

>   * urcu/qsbr/adaptusleep
> 
> This is an approach I thought could help deal with the long qsbr Q.S.
> delays without using sys_futex: taking timestamps at each
> rcu_quiescent_state(), and using the duration of the previous window to
> approximate a usleep delay. The readers call sched_yield() after
> changing their Q.S. and when going offline if they notice that the
> writer thread is waiting for them.
> 
> However, it appears to be slower than the sys_futex approach, mainly
> because we only approximate the sleep time. sys_futex puts us back on
> the runqueue ASAP. It also shares the downside of sys_futex(): doing a
> gettimeofday in rcu_quiescent_state() needs to take a read sequence
> lock, which kills wait-free guarantees on the read-side. However, we
> could mitigate this problem by changing the kernel implementation to an
> RCU-based scheme (I already have one ready, but no time to deploy it).

This approach sounds fairly sensible for the case where we don't have
futexes, with the exception that you shouldn't call sched_yield.
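
The adaptive part could look roughly like this (sketch only; the names
are made up, real code needs per-reader state, and I have left the
sched_yield calls out for the reason above):

#include <sys/time.h>
#include <unistd.h>

static unsigned long qs_period_usec = 100;	/* seeded guess */
static struct timeval last_qs;

/* Reader side: called from rcu_quiescent_state(). */
static void note_quiescent_state(void)
{
	struct timeval now;

	gettimeofday(&now, NULL);
	if (last_qs.tv_sec || last_qs.tv_usec)
		qs_period_usec = (now.tv_sec - last_qs.tv_sec) * 1000000UL
				 + (now.tv_usec - last_qs.tv_usec);
	last_qs = now;
}

/* Writer side: sleep for roughly one reader Q.S. period per iteration. */
static void adaptive_wait(int (*reader_quiescent)(void))
{
	while (!reader_quiescent())
		usleep(qs_period_usec);
}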

> * Preliminary conclusions
> 
> QSBR:
> 
> I think we could afford sys_futex for QSBR, and document that the kernel
> syscall implementation may internally use non-wait-free algorithms. This
> solution really gave the best results on my UP machine.

Sounds sensible.

- Josh Triplett



