[ltt-dev] Choice of wait scheme for UP-aware userspace RCU

Mathieu Desnoyers mathieu.desnoyers at polymtl.ca
Sat Sep 26 22:07:26 EDT 2009


* Josh Triplett (josh at joshtriplett.org) wrote:
> On Sat, Sep 26, 2009 at 09:34:57PM -0400, Mathieu Desnoyers wrote:
> > * Josh Triplett (josh at joshtriplett.org) wrote:
> > > On Sat, Sep 26, 2009 at 07:10:39PM -0400, Mathieu Desnoyers wrote:
> > > > We have the choice between:
> > > > 
> > > >   * urcu/busyloop
> > > > 
> > > > Busy-waiting approach. Works well on 2+ processors, but not quite
> > > > efficient on UP machines or in scenarios where there are more
> > > > threads than processors. This is what we had last week.
> > > > 
> > > > For all the following approaches, I use a busy-loop for a fixed number
> > > > of cycles (to support SMP), and if this fails, we go for the "slow
> > > > path".
> > > 
> > > You could cut out the busyloop on UP at runtime.
> > > 
> > 
> > I could cut it in the following situations:
> > - If there is only one thread in the program (in which case we could
> >   just skip the waiting phase altogether, as with tiny rcu).
> > - If there is only one CPU in the system.
> > 
> > But again.. cpu hotplug and thread addition/removal make me a bit
> > uncomfortable.
> > 
> > OTOH, we can afford to be "wrong" about the number of cpus for a period
> > of time, because this is "just" an optimization. As long as we don't
> > remove memory barriers or algorithm parts that are needed for
> > correctness.
> 
> Exactly.  CPU hotplug doesn't happen very often, so you can definitely
> afford to assume the number of CPUs won't change for a while, as long as
> you'll still run correctly if it does.

Hrm, tempting. If you have an idea about the API, it would be welcome.
The vgetcpu() vsyscall is not exactly what I need, and opening
/proc/cpuinfo seems a bit hackish.
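
For what it's worth, one way to query the online CPU count at runtime
would be sysconf(). The helper name below is hypothetical, and the value
can go stale across CPU hotplug, but that is fine here since it is only
used as an optimization hint, never for correctness:

```c
#include <unistd.h>

/* Hypothetical helper: query the number of online CPUs.  The result
 * may be stale across CPU hotplug, which is acceptable because it is
 * only used to decide whether busy-looping is worthwhile. */
static int urcu_nr_cpus(void)
{
        long n = sysconf(_SC_NPROCESSORS_ONLN);

        return n > 0 ? (int)n : 1;      /* assume UP if the query fails */
}
```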

> 
> > > Also, on x86, you could include a pause instruction in your busyloop,
> > > which will improve performance on multicore/multithread systems.
> > > 
> > 
> > I already use rep; nop; is that what you have in mind? It's a
> > "cpu_relax()", similar to the Linux kernel's.
> 
> Yeah, exactly.
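
For reference, the cpu_relax() idiom discussed above looks roughly like
this (a sketch; "rep; nop" encodes the x86 PAUSE instruction, which
hints to an SMT sibling that the loop is a spin-wait, and a plain
compiler barrier is assumed as the fallback on other architectures):

```c
/* Spin-wait hint: PAUSE on x86, compiler barrier elsewhere. */
#if defined(__i386__) || defined(__x86_64__)
static inline void cpu_relax(void)
{
        __asm__ __volatile__ ("rep; nop" : : : "memory");
}
#else
static inline void cpu_relax(void)
{
        __asm__ __volatile__ ("" : : : "memory");
}
#endif
```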
> 
> > > >   * urcu/futex
> > > > 
> > > > Approach using futex wait in synchronize_rcu() and futex wake at the end
> > > > of rcu_read_unlock() (or after marking Q.S. in QSBR). My preliminary
> > > > tests show that it would very likely be appropriate for QSBR, because
> > > > the wait period is long enough. Also, there is no addition to the fast
> > > > path in this case.
> > > > 
> > > > However, for signal- and mb-based RCU, we need to add a test after the
> > > > read-side C.S., which impacts read-side performance a tiny bit. I'd like
> > > > to avoid that.
> > > > 
> > > > Downside: using sys_futex() on the read-side makes the read-side
> > > > non-completely wait-free, because the Linux kernel sys_futex
> > > > implementation uses a hash bucket spinlock.
> > > 
> > > We could potentially fix that in the kernel.  Ideally sys_futex should
> > > become entirely wait-free.  (In the cases where it doesn't need to
> > > block, obviously. ;) )
> > 
> > Yes, especially for futex wakeup operations.
> 
> Right.
> 
> > > >   * urcu/yield
> > > > 
> > > > This approach is basic enough: busy-looping for a fixed number of cycles
> > > > while waiting for readers (to deal with SMP systems), and then calling
> > > > sched_yield(). The periods during which the writer waits for a specific
> > > > reader are marked, so the reader can notice this when it calls
> > > > rcu_read_lock() and call sched_yield() itself. Similar solution for QSBR.
> > > 
> > > You should not call sched_yield.  Among other things, it has
> > > unpredictable semantics across kernel versions, and it fundamentally
> > > seems worse than blocking on an appropriate event.
> > 
> > Hrm ok. You're about to convince me that my "futex" approach should
> > probably be used for mb and signal RCU also, even though it adds a small
> > overhead on the read-side.
> 
> The futex approach seems the most correct in principle; we should just
> focus on making the read-side overhead go away.
> 

Just so you know, here is what I add on the fast path:

1) a memory barrier

2) execution of:

static inline void wake_up_gp(void)
{
        /* -1 means synchronize_rcu() is (or is about to be) blocked on
         * gp_futex, waiting for readers. */
        if (unlikely(atomic_read(&gp_futex) == -1)) {
                atomic_set(&gp_futex, 0);
                /* wake the thread blocked in synchronize_rcu() */
                futex(&gp_futex, FUTEX_WAKE, 1,
                      NULL, NULL, 0);
        }
}

So on the fast path, we just have a volatile read of gp_futex, a test
and a branch. That should be cheap enough. It's the memory barrier we
must be careful about.
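
(The futex() call above is presumably a thin wrapper around the raw
system call, since glibc exposes no futex() function. A minimal sketch
of such a wrapper:)

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Thin wrapper around sys_futex; glibc provides no futex() symbol.
 * Argument order follows the kernel interface. */
static inline int futex(int *uaddr, int op, int val,
                        const struct timespec *timeout,
                        int *uaddr2, int val3)
{
        return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}
```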

Well, on QSBR, we just don't care, because it's done once in a blue
moon.

For signal-based RCU, it's turned into a simple compiler barrier.
Not a problem there either.

It's mostly for mb-based RCU, which already has 2 mb per C.S., that
I know we're losing a bit of performance: it would need one more.

e.g. (note: reader_barrier() -> smp_mb() for the mb-urcu)


static inline void _rcu_read_lock(void)
{
        long tmp;

        tmp = urcu_active_readers;
        /* urcu_gp_ctr = RCU_GP_COUNT | (~RCU_GP_CTR_BIT or RCU_GP_CTR_BIT) */
        if (likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
                _STORE_SHARED(urcu_active_readers, _LOAD_SHARED(urcu_gp_ctr));
                /*
                 * Set active readers count for outermost nesting level before
                 * accessing the pointer. See force_mb_all_threads().
                 */
                reader_barrier();
        } else {
                _STORE_SHARED(urcu_active_readers, tmp + RCU_GP_COUNT);
        }
}

static inline void _rcu_read_unlock(void)
{
        long tmp;

        tmp = urcu_active_readers;
        /*
         * Finish using rcu before decrementing the pointer.
         * See force_mb_all_threads().
         */
        if (likely((tmp & RCU_GP_CTR_NEST_MASK) == RCU_GP_COUNT)) {
                reader_barrier();
                _STORE_SHARED(urcu_active_readers,
                              urcu_active_readers - RCU_GP_COUNT);
                /* write urcu_active_readers before read futex */
                reader_barrier();
                wake_up_gp();
        } else {
                _STORE_SHARED(urcu_active_readers,
                              urcu_active_readers - RCU_GP_COUNT);
        }
}

As you can see, we need to appear to be out of our C.S. before waking up
synchronize_rcu(); otherwise we could end up in a situation where it
waits forever (given that no more readers would come).
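
To make the ordering constraint concrete, here is a self-contained toy
of the announce-then-recheck protocol (illustrative names, not the real
urcu code, using GCC/Clang atomic builtins and the raw futex syscall):
the "writer" sets gp_futex to -1 to announce that it will sleep,
re-checks whether the reader is still active, and only then calls
FUTEX_WAIT; the "reader" clears its activity flag first, then checks
gp_futex and issues the FUTEX_WAKE:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <unistd.h>

static int gp_futex;
static int active_readers;      /* stand-in for the real reader counter */

static long sys_futex(int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Reader: leave the C.S. first, *then* check whether a waiter needs a
 * wake-up, mirroring the order in _rcu_read_unlock() above. */
static void *reader(void *arg)
{
        usleep(10000);                          /* pretend to be in a C.S. */
        __atomic_store_n(&active_readers, 0, __ATOMIC_SEQ_CST);
        /* write active_readers before reading gp_futex */
        if (__atomic_load_n(&gp_futex, __ATOMIC_SEQ_CST) == -1) {
                __atomic_store_n(&gp_futex, 0, __ATOMIC_SEQ_CST);
                sys_futex(&gp_futex, FUTEX_WAKE, 1);
        }
        return NULL;
}

/* Writer slow path: announce the intent to sleep, re-check, then wait.
 * The re-check is essential: the reader may have exited (and skipped
 * the wake-up) just before we stored -1. */
static void wait_for_reader(void)
{
        __atomic_store_n(&gp_futex, -1, __ATOMIC_SEQ_CST);
        if (__atomic_load_n(&active_readers, __ATOMIC_SEQ_CST) == 0) {
                __atomic_store_n(&gp_futex, 0, __ATOMIC_SEQ_CST);
                return;                         /* already quiescent */
        }
        while (__atomic_load_n(&gp_futex, __ATOMIC_SEQ_CST) == -1)
                sys_futex(&gp_futex, FUTEX_WAIT, -1);
}
```

If the reader checked gp_futex before clearing its flag, the writer
could observe the reader as active, sleep, and never be woken.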

> > > > * Preliminary conclusions
> > > > 
> > > > QSBR:
> > > > 
> > > > I think we could afford the sys_futex for QSBR, and specify that within
> > > > the kernel syscall implementations we might find use of non-wait-free
> > > > algorithms. This solution really gave the best results on my UP machine.
> > > 
> > > Sounds sensible.
> > > 
> > 
> > .. and futex probably makes sense for urcu mb/signal too, even if it
> > adds a bit of overhead.
> > 
> > I'm just unsure about the fallback solution when there is no
> > sys_futex(). The adaptive G.P. delay makes sense for QSBR, but not
> > really for mb/signal urcu, because the C.S. duration is typically just
> > too small. That would only leave sched_yield() (yuk !) for cases when
> > sys_futex() (or equivalent) is not available.
> 
> You don't need to worry *too* much about the lack of sys_futex().
> Arguably, as long as you stay no less efficient than locking, switching
> to urcu represents a portable performance win.

I'd be tempted to use futex all the time and, if it is unavailable, fall
back on the "oh-so-bad" sched_yield() for compatibility. It's reliably
simple, although the resulting performance may vary. But I would be
concerned about a more evolved compatibility scheme, where subtle bugs
would only turn up in some very specific builds of the library, and
would hence be hard to test on a regular basis.
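
One way that "futex or sched_yield" choice could look (hypothetical
names; detection done once, at first use): probe for sys_futex with a
no-op wake, and have the wait helper degrade to yielding when the
syscall is absent. Only the wait side needs a fallback; a missing wake
is harmless when nobody can possibly sleep.

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sched.h>
#include <unistd.h>

static int has_futex = -1;      /* -1: unknown, 0: absent, 1: present */

/* Probe once: a FUTEX_WAKE on an address with no waiters succeeds
 * (returns 0) iff the kernel implements sys_futex. */
static int compat_futex_supported(void)
{
        if (has_futex == -1) {
                int dummy = 0;

                has_futex = syscall(SYS_futex, &dummy, FUTEX_WAKE, 1,
                                    NULL, NULL, 0) == 0;
        }
        return has_futex;
}

/* Wait until *uaddr != val, or just yield once on futex-less systems. */
static void compat_futex_wait(int *uaddr, int val)
{
        if (compat_futex_supported())
                syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
        else
                sched_yield();          /* the "oh-so-bad" fallback */
}
```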

Thanks,

Mathieu


> 
> - Josh Triplett

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
