[ltt-dev] Choice of wait scheme for UP-aware userspace RCU

Josh Triplett josh at joshtriplett.org
Sat Sep 26 22:33:02 EDT 2009


On Sat, Sep 26, 2009 at 10:07:26PM -0400, Mathieu Desnoyers wrote:
> * Josh Triplett (josh at joshtriplett.org) wrote:
> > On Sat, Sep 26, 2009 at 09:34:57PM -0400, Mathieu Desnoyers wrote:
> > > * Josh Triplett (josh at joshtriplett.org) wrote:
> > > > On Sat, Sep 26, 2009 at 07:10:39PM -0400, Mathieu Desnoyers wrote:
> > > > > We have the choice between:
> > > > > 
> > > > >   * urcu/busyloop
> > > > > 
> > > > > Busy-waiting approach. Works well on 2+ processors, but not very
> > > > > efficient on UP machines or in scenarios where there are more
> > > > > threads than processors. This is what we had last week.
> > > > > 
> > > > > For all the following approaches, I use a busy-loop for a fixed
> > > > > number of cycles (to support SMP), and if that fails, we fall
> > > > > back to the "slow path".
> > > > 
> > > > You could cut out the busyloop on UP at runtime.
> > > > 
> > > 
> > > I could cut it in the following situations:
> > > - If there is only one thread in the program (in which case we could
> > >   just skip the waiting phase altogether, as with tiny rcu).
> > > - If there is only one CPU in the system.
> > > 
> > > But again... CPU hotplug and thread addition/removal make me a bit
> > > uncomfortable.
> > > 
> > > OTOH, we can afford to be "wrong" about the number of CPUs for a
> > > period of time, because this is "just" an optimization, as long as
> > > we don't remove memory barriers or algorithm parts that are needed
> > > for correctness.
> > 
> > Exactly.  CPU hotplug doesn't happen very often, so you can definitely
> > afford to assume the number of CPUs won't change for a while, as long as
> > you'll still run correctly if it does.
> 
> Hrm, tempting. If you have an idea about the API, it would be welcome.
> The vgetcpu() vsyscall is not exactly what I need, and opening
> /proc/cpuinfo seems a bit hackish.

Not entirely portable, but most programs I've seen use
sysconf(_SC_NPROCESSORS_ONLN) for this.  And since you only need it for
a potential optimization...
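
Untested sketch of what I have in mind; spin_for_reader() and its
arguments are made-up names for illustration, not anything in urcu:

#include <unistd.h>

static long cached_ncpus;

/* Cache the online CPU count.  Going stale across CPU hotplug is fine:
 * the count only gates an optimization, never correctness. */
static long get_ncpus(void)
{
	if (cached_ncpus < 1) {
		cached_ncpus = sysconf(_SC_NPROCESSORS_ONLN);
		if (cached_ncpus < 1)
			cached_ncpus = 1;	/* sysconf failed: assume UP */
	}
	return cached_ncpus;
}

/* Spin a bounded number of times waiting for *done to become nonzero,
 * but only when a second CPU could actually make progress meanwhile.
 * Returns nonzero if the condition was met while spinning. */
static int spin_for_reader(volatile int *done, int max_spins)
{
	int i;

	if (get_ncpus() < 2)
		return 0;	/* UP: go straight to the slow path */
	for (i = 0; i < max_spins; i++)
		if (*done)
			return 1;
	return 0;
}

Since the count is read once and cached for the process lifetime, a CPU
hotplug just means we spin (or skip spinning) suboptimally afterwards;
correctness is untouched.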

> > > > > * Preliminary conclusions
> > > > > 
> > > > > QSBR:
> > > > > 
> > > > > I think we could afford the sys_futex for QSBR, and specify
> > > > > that within the kernel syscall implementations we might find
> > > > > uses of non-wait-free algorithms. This solution really gave the
> > > > > best results on my UP machine.
> > > > 
> > > > Sounds sensible.
> > > > 
> > > 
> > > ... and futex probably makes sense for urcu mb/signal too, even if
> > > it adds a bit of overhead.
> > > 
> > > I'm just unsure about the fallback solution when there is no
> > > sys_futex(). The adaptive grace-period (G.P.) delay makes sense for
> > > QSBR, but not really for mb/signal urcu, because the
> > > critical-section (C.S.) duration is typically just too small. That
> > > would leave only sched_yield() (yuck!) for cases where sys_futex()
> > > (or an equivalent) is not available.
> > 
> > You don't need to worry *too* much about the lack of sys_futex().
> > Arguably, as long as you stay no less efficient than locking, switching
> > to urcu represents a portable performance win.
> 
> I'd be tempted to use the futex all the time, and when it is
> unavailable, fall back on the "oh-so-bad" sched_yield() for
> compatibility. That is reliably simple, although the resulting
> performance may vary. But I would be concerned about a more elaborate
> compatibility scheme, where subtle bugs would only turn up in some
> very specific builds of the library, and would hence be hard to test
> on a regular basis.

What about using the futex when available and condition variables when
not? Sure, it costs more, but it gives you a simple, portable solution
with no sched_yield() and no periodic wakeups.
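
Untested sketch of the pair I mean; wait_gp()/wake_gp() are made-up
names, and a real version would need the proper memory barriers around
the store to the futex word:

#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifdef __NR_futex
#include <linux/futex.h>

/* Block in the kernel until *addr != val.  FUTEX_WAIT returns
 * immediately if the comparison fails, so a wakeup racing with us is
 * never lost; loop to absorb spurious returns. */
static void wait_gp(volatile int *addr, int val)
{
	while (*addr == val)
		syscall(__NR_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void wake_gp(volatile int *addr)
{
	syscall(__NR_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

#else	/* no sys_futex: fall back on a condition variable */

static pthread_mutex_t gp_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gp_cond = PTHREAD_COND_INITIALIZER;

static void wait_gp(volatile int *addr, int val)
{
	pthread_mutex_lock(&gp_mutex);
	while (*addr == val)
		pthread_cond_wait(&gp_cond, &gp_mutex);
	pthread_mutex_unlock(&gp_mutex);
}

static void wake_gp(volatile int *addr)
{
	/* The caller stores to *addr before calling this; taking
	 * gp_mutex here closes the check-then-sleep race in wait_gp(). */
	(void)addr;
	pthread_mutex_lock(&gp_mutex);
	pthread_cond_broadcast(&gp_cond);
	pthread_mutex_unlock(&gp_mutex);
}

#endif

Both variants expose the same two calls, so the grace-period code never
needs to know which one it got at build time. The condvar side pays an
extra lock/unlock per wakeup, but there is no polling and no yield.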

- Josh Triplett



