[ltt-dev] Choice of wait scheme for UP-aware userspace RCU

Josh Triplett josh at joshtriplett.org
Sat Sep 26 21:43:09 EDT 2009


On Sat, Sep 26, 2009 at 09:34:57PM -0400, Mathieu Desnoyers wrote:
> * Josh Triplett (josh at joshtriplett.org) wrote:
> > On Sat, Sep 26, 2009 at 07:10:39PM -0400, Mathieu Desnoyers wrote:
> > > We have the choice between :
> > > 
> > >   * urcu/busyloop
> > > 
> > > Busy-waiting approach. Works well on 2+ processors, but not very
> > > efficient on UP machines or in scenarios where there are more threads
> > > than processors. This is what we had last week.
> > > 
> > > For all the following approaches, I use a busy-loop for a fixed number
> > > of cycles (to support SMP), and if this fails, we go for the "slow
> > > path".
> > 
> > You could cut out the busyloop on UP at runtime.
> > 
> 
> I could cut it in the following situations:
> - If there is only one thread in the program (in which case we could
>   just skip the waiting phase altogether, as with tiny rcu).
> - If there is only one CPU in the system.
> 
> But again... CPU hotplug and thread addition/removal make me a bit
> uncomfortable.
> 
> OTOH, we can afford to be "wrong" about the number of CPUs for a period
> of time, because this is "just" an optimization, as long as we don't
> remove memory barriers or algorithm parts that are needed for
> correctness.

Exactly.  CPU hotplug doesn't happen very often, so you can definitely
afford to assume the number of CPUs won't change for a while, as long as
you'll still run correctly if it does.
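
For example, something as simple as the following would do for the
runtime check.  This is just a sketch with a made-up helper name, not
actual urcu code; the cached count is only a hint, so CPU hotplug can
cost at most a few wasted (or skipped) spin iterations, never
correctness:

    #include <unistd.h>

    /* Hypothetical helper: cache the online CPU count and treat it
     * purely as a hint.  The racy one-time init is benign; nothing
     * here is needed for correctness. */
    static int cached_ncpus;

    static int probably_smp(void)
    {
            if (!cached_ncpus)
                    cached_ncpus = (int) sysconf(_SC_NPROCESSORS_ONLN);
            return cached_ncpus > 1;
    }

The wait path could then skip the busy-loop phase entirely when
probably_smp() returns 0 and go straight to the slow path.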

> > Also, on x86, you could include a pause instruction in your busyloop,
> > which will improve performance on multicore/multithread systems.
> > 
> 
> I already use rep; nop; is that what you have in mind? It's a
> "cpu_relax()", similar to the Linux kernel's.

Yeah, exactly.
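
For anyone following along, "rep; nop" is the encoding of the PAUSE
instruction, and the wrapper looks roughly like this (illustrative, not
the actual urcu arch header):

    /* Spin-wait hint, mirroring the kernel's cpu_relax().  "rep; nop"
     * encodes the x86 PAUSE instruction; on other architectures this
     * degrades to a plain compiler barrier. */
    static inline void loop_relax(void)
    {
    #if defined(__i386__) || defined(__x86_64__)
            __asm__ __volatile__ ("rep; nop" : : : "memory");
    #else
            __asm__ __volatile__ ("" : : : "memory");
    #endif
    }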

> > >   * urcu/futex
> > > 
> > > Approach using futex wait in synchronize_rcu() and futex wake at the end
> > > of the read-side critical section (or after marking a Q.S. in QSBR). My
> > > preliminary tests show that it would very likely be appropriate for QSBR,
> > > because the wait period is long enough. Also, there is no addition to the
> > > fast path in this case.
> > > 
> > > However, for signal and mb-based RCU, we need to add a test after the
> > > read-side C.S., which impacts read-side performance a tiny bit. I'd like
> > > to avoid that.
> > > 
> > > Downside: using sys_futex() on the read-side makes the read-side not
> > > completely wait-free, because the Linux kernel sys_futex implementation
> > > uses a hash-bucket spinlock.
> > 
> > We could potentially fix that in the kernel.  Ideally sys_futex should
> > become entirely wait-free.  (In the cases where it doesn't need to
> > block, obviously. ;) )
> 
> Yes, especially for futex wakeup operations.

Right.
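
For readers of the list, the shape of the futex scheme under discussion
is roughly the sketch below.  Names and details are illustrative, not
the exact urcu implementation: the writer spins for a bounded time,
advertises that it is about to block, and calls FUTEX_WAIT; the reader
issues FUTEX_WAKE only when it sees that a writer advertised.

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Illustrative flag: -1 means "a waiter may be blocked in futex". */
    static int32_t gp_futex;

    static long futex_call(int32_t *uaddr, int op, int32_t val)
    {
            return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Writer slow path, entered only after the bounded busy-loop fails. */
    static void wait_gp(void)
    {
            __atomic_store_n(&gp_futex, -1, __ATOMIC_SEQ_CST);
            /* The real code re-checks reader state here before sleeping,
             * so a wakeup cannot be missed. */
            futex_call(&gp_futex, FUTEX_WAIT, -1);
    }

    /* Reader side (quiescent state in QSBR, or end of read-side C.S.):
     * the wake syscall is issued only when a writer advertised, so the
     * common case stays a single load and an untaken branch. */
    static void wake_up_gp(void)
    {
            if (__atomic_load_n(&gp_futex, __ATOMIC_SEQ_CST) == -1) {
                    __atomic_store_n(&gp_futex, 0, __ATOMIC_SEQ_CST);
                    futex_call(&gp_futex, FUTEX_WAKE, 1);
            }
    }

FUTEX_WAIT only sleeps if gp_futex still equals -1 when the kernel
checks it, so a reader that clears the flag and wakes first simply makes
the writer's wait return immediately.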

> > >   * urcu/yield
> > > 
> > > This approach is basic enough: busy-looping for a fixed number of cycles
> > > while waiting for readers (to deal with SMP systems), and then calling
> > > sched_yield(). The periods during which the writer waits for a specific
> > > reader are marked, so the reader can notice this when it calls
> > > rcu_read_lock() and call sched_yield() as well. Similar solution for QSBR.
> > 
> > You should not call sched_yield.  Among other things, it has
> > unpredictable semantics across kernel versions, and it fundamentally
> > seems worse than blocking on an appropriate event.
> 
> Hrm ok. You're about to convince me that my "futex" approach should
> probably be used for mb and signal RCU also, even though it adds a small
> overhead on the read-side.

The futex approach seems the most correct in principle; we should just
focus on making the read-side overhead go away.
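
Concretely, the read-side cost in question is one load and a branch that
is essentially never taken, something like the following at the end of
the read-side critical section (placeholder names, reusing the
gp_futex/wake_up_gp sketch above; the real flavors keep a registered
per-reader counter rather than this simplified nesting count):

    static __thread int reader_nesting;

    static inline void reader_unlock_sketch(void)
    {
            __atomic_sub_fetch(&reader_nesting, 1, __ATOMIC_SEQ_CST);
            /* The only addition for the futex scheme: */
            if (__builtin_expect(
                    __atomic_load_n(&gp_futex, __ATOMIC_RELAXED) == -1, 0))
                    wake_up_gp();
    }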

> > > * Preliminary conclusions
> > > 
> > > QSBR:
> > > 
> > > I think we could afford sys_futex for QSBR, and document that the kernel's
> > > syscall implementation may itself use non-wait-free algorithms. This
> > > solution really gave the best results on my UP machine.
> > 
> > Sounds sensible.
> > 
> 
> ... and futex probably makes sense for urcu mb/signal too, even if it
> adds a bit of overhead.
> 
> I'm just unsure about the fallback solution when there is no
> sys_futex(). The adaptive G.P. delay makes sense for QSBR, but not
> really for mb/signal urcu, because the C.S. duration is typically just
> too small. That would only leave sched_yield() (yuk!) for cases where
> sys_futex() (or equivalent) is not available.

You don't need to worry *too* much about the lack of sys_futex().
Arguably, as long as you stay no less efficient than locking, switching
to urcu represents a portable performance win.
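
And if sys_futex() really is missing, a mutex/condvar pair gives the
same block/wake behaviour at roughly the cost of ordinary locking, which
is the baseline I have in mind.  A portable sketch (my own, not
something from urcu); the reader would still only reach the wake
function behind the cheap flag test shown earlier:

    #include <pthread.h>

    static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t gp_cond = PTHREAD_COND_INITIALIZER;
    static int gp_waiting;

    /* Writer fallback: block until a reader signals progress. */
    static void wait_gp_fallback(void)
    {
            pthread_mutex_lock(&gp_lock);
            gp_waiting = 1;
            /* A real implementation re-checks reader state here. */
            while (gp_waiting)
                    pthread_cond_wait(&gp_cond, &gp_lock);
            pthread_mutex_unlock(&gp_lock);
    }

    /* Reader fallback: called only when the writer advertised. */
    static void wake_gp_fallback(void)
    {
            pthread_mutex_lock(&gp_lock);
            if (gp_waiting) {
                    gp_waiting = 0;
                    pthread_cond_signal(&gp_cond);
            }
            pthread_mutex_unlock(&gp_lock);
    }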

- Josh Triplett