[ltt-dev] Choice of wait scheme for UP-aware userspace RCU

Sat Sep 26 21:34:57 EDT 2009

* Josh Triplett (josh at joshtriplett.org) wrote:
> On Sat, Sep 26, 2009 at 07:10:39PM -0400, Mathieu Desnoyers wrote:
> > I recently tackled the task to make userspace RCU more efficient on
> > uniprocessor. Busy-waiting did not seem to be the best solution around.
> > 
> > So I end up with 4 possible choices. Given I cannot specialize the
> > library for UP (don't want to end up confusing my users), I need to
> > consider both SMP and UP performance in the same build. They are all
> > available under a branch in the git://lttng.org/userspace-rcu.git tree.
> 
> You still have the option of detecting the number of processors at
> runtime (to disable busywaits on uniprocessor systems, for instance).
> You can also make decisions based on the number of threads registered to
> urcu.
> 

If I end up in a situation where a serious tradeoff would have to be
made, I'd consider this option. However, things like:

- cpu hotplug while the program is running
- thread creation/removal while the program is running
- the fact that it's when the number of _busy_ threads is higher than
  the available cpu resources that we can see busy-waiting as
  not-so-good. So it's not strictly dependent on the number of cpus vs
  number of threads.
- finding the number of online cpus portably might be a bit awkward (but
  I know at least x86 has a vsyscall for that, but again, cpu hotplug..)

So in any case, I think it won't hurt to deal with the "waiting" problem
using non busy-looping solutions. Even if we loose a few reads/s to do
so, the increased support for a very wide range of workloads seems to
outweight small overhead.

> > We have the choice between :
> > 
> >   * urcu/busyloop
> > 
> > Busy-waiting approach. Works well on 2+ processors, but not quite
> > efficient on UP machines or in scenario where there is more threads than
> > processors. This is what we had last week.
> > 
> > For all the following approaches, I use a busy-loop for a fixed number
> > of cycles (to support SMP), and if this fails, we go for the "slow
> > path".
> 
> You could cut out the busyloop on UP at runtime.
> 

I could cut it in the following situations:
- If there is only one thread in the program (in which case we could
  just skip the waiting phase altogether, as with tiny rcu).
- If there is only one CPU in the system.

But again.. cpu hotplug and thread addition/removal make me a bit
uncomfortable.

OTOH, we can afford to be "wrong" about the number of cpus for a period
of time, because this is "just" an optimization. As long as we don't
remove memory barriers nor algorithm parts that are needed for
correctness.

> Also, on x86, you could include a pause instruction in your busyloop,
> which will improve performance on multicore/multithread systems.
> 

I use rep; nop already, is that what you have in mind ? It's a
"cpu_relax()", similar to the linux kernel.

> >   * urcu/futex
> > 
> > Approach using futex wait in synchronize_rcu() and futex wake at the end
> > of the rcu_read_lock() (or after marking Q.S. in QSBR). My preliminary
> > tests shows that it would very likely be appropriate for QSBR, because
> > the wait period is long enough. Also, there is no addition to the fast
> > path in this case.
> > 
> > However, for signal and mb-based RCU, we need to add a test after the
> > read-side C.S., which impact read-side performances a tiny bit. I'd like
> > to avoid that.
> > 
> > Downside: using sys_futex() on the read-side makes the read-side
> > non-completely wait-free, because the Linux kernel sys_futex
> > implementation uses a hash bucket spinlock.
> 
> We could potentially fix that in the kernel.  Ideally sys_futex should
> become entirely wait-free.  (In the cases where it doesn't need to
> block, obviously. ;) )

Yes, especially for futex wakeup operations.

> 
> >   * urcu/yield
> > 
> > This approach is basic enough: busy-looping for a fixed amount of cycles
> > while waiting for readers (to deal with SMP systems), and then call
> > sched_yield. The periods for which the writer waits for a specific
> > reader is marked, so the reader can notice it when it calls
> > rcu_read_lock() and call sched_yield(). Similar solution for QSBR.
> 
> You should not call sched_yield.  Among other things, it has
> unpredictable semantics across kernel versions, and it fundamentally
> seems worse than blocking on an appropriate event.

Hrm ok. You're about to convince me that my "futex" approach should
probably be used for mb and signal RCU also, even though it adds a small
overhead on the read-side.

> 
> >   * urcu/qsbr/adaptusleep
> > 
> > This is an approach I thought could help dealing with the long qsbr Q.S.
> > delays without using sys_futex: using timestamps at each
> > rcu_quiescent_state(), and use the duration of the previous window to
> > approximate a usleep delay. The readers call sched_yield() after
> > changing their Q.S. and when going offline if they notice that the
> > writer thread is waiting for them.
> > 
> > However, it appears to be slower than the sys_futex approach, mainly
> > because we only approximate the sleep time. sys_futex puts us back on
> > the runqueue ASAP. It also shares the downside of sys_futex(): doing a
> > gettimeofday in rcu_quiescent_state() needs to take a read sequence
> > lock, which kills wait-free guarantees on the read-side. However, we
> > could mitigate this problem by changing the kernel implementation to a
> > RCU-based scheme (I already have one ready, but no time to deploy it).
> 
> This approach sounds fairly sensible for the case where we don't have
> futexes, with the exception that you shouldn't call sched_yield.

Yes, that would make sense for QSBR, but I would not read gettimeofday,
even though it's a vsyscall on most architectures and fairly cheap, at
each rcu_read_unlock().

By the wait, another solution, proposed by Thomas Gleixner (in a
slightly different context: call_rcu()/worker thread), was to use
cond vars from the pthreads. However, it does not fit well in this case,
where we try to have the read-side threads waking up synchronize_rcu().
This would imply taking a mutex on the rcu_read_unlock() fastpath, which
I'm not even remotely considering doing for obvious performance
and scalability degradation reasons.

> 
> > * Preliminary conclusions
> > 
> > QSBR:
> > 
> > I think we could afford the sys_futex for QSBR, and specify that within
> > the kernel syscall implementations we might find use of non wait-free
> > algorithms. This solution really gave the best results on my UP machine.
> 
> Sounds sensible.
> 

.. and futex probably makes sense for urcu mb/signal too, even if it
adds a bit of overhead.

I'm just unsure about the fallback solution when there is no
sys_futex(). The adaptive G.P. delay makes sense for QSBR, but not
really for mb/signal urcu, because the C.S. duration is typically just
too small. That would only leave sched_yield() (yuk !) for cases when
sys_futex() (or equivalent) is not available.

Further thoughts ?

Thanks for the comments,

Mathieu

> - Josh Triplett

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68