[ltt-dev] Choice of wait scheme for UP-aware userspace RCU
Mathieu Desnoyers
mathieu.desnoyers at polymtl.ca
Sat Sep 26 22:07:26 EDT 2009
* Josh Triplett (josh at joshtriplett.org) wrote:
> On Sat, Sep 26, 2009 at 09:34:57PM -0400, Mathieu Desnoyers wrote:
> > * Josh Triplett (josh at joshtriplett.org) wrote:
> > > On Sat, Sep 26, 2009 at 07:10:39PM -0400, Mathieu Desnoyers wrote:
> > > > We have the choice between :
> > > >
> > > > * urcu/busyloop
> > > >
> > > > Busy-waiting approach. Works well on 2+ processors, but not very
> > > > efficient on UP machines or in scenarios where there are more threads
> > > > than processors. This is what we had last week.
> > > >
> > > > For all the following approaches, I use a busy-loop for a fixed number
> > > > of cycles (to support SMP), and if the reader is still not done after
> > > > that, we fall back to the "slow path".
> > >
> > > You could cut out the busyloop on UP at runtime.
> > >
> >
> > I could cut it in the following situations:
> > - If there is only one thread in the program (in which case we could
> > just skip the waiting phase altogether, as with tiny rcu).
> > - If there is only one CPU in the system.
> >
> > But again, CPU hotplug and thread addition/removal make me a bit
> > uncomfortable.
> >
> > OTOH, we can afford to be "wrong" about the number of CPUs for a period
> > of time, because this is "just" an optimization, as long as we don't
> > remove memory barriers or algorithm parts that are needed for
> > correctness.
>
> Exactly. CPU hotplug doesn't happen very often, so you can definitely
> afford to assume the number of CPUs won't change for a while, as long as
> you'll still run correctly if it does.
Hrm, tempting. If you have an idea about the API, it would be welcome.
vgetcpu() vsyscall is not exactly what I need. Opening /proc/cpuinfo
seems a bit hackish.
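The least hackish thing I can think of would be sysconf(), cached and
re-checked only once in a while, e.g. (a sketch; the recheck period is
arbitrary, and staleness across hotplug is harmless since this only
gates an optimization):

#include <unistd.h>

static int nr_cpus_cached;
static unsigned long nr_cpus_checks;

static int get_nr_cpus(void)
{
	/* Racy by design: a stale value only affects the heuristic. */
	if (nr_cpus_cached == 0 || ++nr_cpus_checks % 4096 == 0)
		nr_cpus_cached = (int) sysconf(_SC_NPROCESSORS_ONLN);
	return nr_cpus_cached;
}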
>
> > > Also, on x86, you could include a pause instruction in your busyloop,
> > > which will improve performance on multicore/multithread systems.
> > >
> >
> > I already use rep; nop, is that what you have in mind? It's a
> > "cpu_relax()", similar to the Linux kernel's.
>
> Yeah, exactly.
>
> > > > * urcu/futex
> > > >
> > > > Approach using futex wait in synchronize_rcu() and futex wake at the end
> > > > of rcu_read_unlock() (or after marking Q.S. in QSBR). My preliminary
> > > > tests show that it would very likely be appropriate for QSBR, because
> > > > the wait period is long enough. Also, there is no addition to the fast
> > > > path in this case.
> > > >
> > > > However, for signal and mb-based RCU, we need to add a test after the
> > > > read-side C.S., which impacts read-side performance a tiny bit. I'd like
> > > > to avoid that.
> > > >
> > > > Downside: using sys_futex() on the read-side makes the read-side not
> > > > completely wait-free, because the Linux kernel sys_futex implementation
> > > > uses a hash bucket spinlock.
> > >
> > > We could potentially fix that in the kernel. Ideally sys_futex should
> > > become entirely wait-free. (In the cases where it doesn't need to
> > > block, obviously. ;) )
> >
> > Yes, especially for futex wakeup operations.
>
> Right.
>
> > > > * urcu/yield
> > > >
> > > > This approach is basic enough: busy-loop for a fixed number of cycles
> > > > while waiting for readers (to deal with SMP systems), then call
> > > > sched_yield(). The period during which the writer waits for a specific
> > > > reader is marked, so the reader can notice it when it calls
> > > > rcu_read_lock() and call sched_yield() itself. A similar solution
> > > > applies to QSBR.
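For reference, the idea looks roughly like this (a sketch;
gp_writer_waiting is an illustrative name, and KICK_READER_LOOPS /
cpu_relax() reuse the names from the earlier spin-loop sketch):

#include <sched.h>

static volatile int gp_writer_waiting;	/* set while the writer is blocked */

/* Writer side: bounded spin, then advertise that we are waiting and yield. */
static void wait_for_reader(volatile long *active)
{
	int loops = 0;

	while (*active) {
		if (++loops >= KICK_READER_LOOPS) {
			gp_writer_waiting = 1;
			sched_yield();
		} else {
			cpu_relax();
		}
	}
	gp_writer_waiting = 0;
}

/* Reader side, in rcu_read_lock(): give the CPU back if the writer is stuck. */
static inline void reader_yield_if_writer_waits(void)
{
	if (gp_writer_waiting)
		sched_yield();
}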
> > >
> > > You should not call sched_yield. Among other things, it has
> > > unpredictable semantics across kernel versions, and it fundamentally
> > > seems worse than blocking on an appropriate event.
> >
> > Hrm ok. You're about to convince me that my "futex" approach should
> > probably be used for mb and signal RCU also, even though it adds a small
> > overhead on the read-side.
>
> The futex approach seems the most correct in principle; we should just
> focus on making the read-side overhead go away.
>
Just so you know, here is what I add on the fast path:
1) a memory barrier
2) execution of:
static inline void wake_up_gp(void)
{
	if (unlikely(atomic_read(&gp_futex) == -1)) {
		/* A waiter is sleeping in synchronize_rcu(): reset and wake it. */
		atomic_set(&gp_futex, 0);
		futex(&gp_futex, FUTEX_WAKE, 1,
		      NULL, NULL, 0);
	}
}
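(Side note: the futex() call above is not a libc function; it is a thin
wrapper over the raw syscall, roughly as in this sketch:)

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static inline int futex(int32_t *uaddr, int op, int32_t val,
			const struct timespec *timeout,
			int32_t *uaddr2, int32_t val3)
{
	return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}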
So on the fast path, we just have a volatile read of gp_futex, a test
and a branch. That should be cheap enough. It's the memory barrier we
must be careful about.
Well, on QSBR, we just don't care, because it's done once in a blue
moon.
For the signal-based RCU, it's turned into a simple compiler barrier.
Not a problem there either.
It's mostly with mb-based RCU, which already has 2 mb per C.S., that
I know we lose a bit of performance: it would need one more.
E.g. (note: reader_barrier() -> smp_mb() for the mb-urcu):
static inline void _rcu_read_lock(void)
{
	long tmp;

	tmp = urcu_active_readers;
	/* urcu_gp_ctr = RCU_GP_COUNT | (~RCU_GP_CTR_BIT or RCU_GP_CTR_BIT) */
	if (likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
		_STORE_SHARED(urcu_active_readers, _LOAD_SHARED(urcu_gp_ctr));
		/*
		 * Set active readers count for outermost nesting level before
		 * accessing the pointer. See force_mb_all_threads().
		 */
		reader_barrier();
	} else {
		_STORE_SHARED(urcu_active_readers, tmp + RCU_GP_COUNT);
	}
}
static inline void _rcu_read_unlock(void)
{
	long tmp;

	tmp = urcu_active_readers;
	/*
	 * Finish using rcu before decrementing the pointer.
	 * See force_mb_all_threads().
	 */
	if (likely((tmp & RCU_GP_CTR_NEST_MASK) == RCU_GP_COUNT)) {
		reader_barrier();
		_STORE_SHARED(urcu_active_readers,
			      urcu_active_readers - RCU_GP_COUNT);
		/* write urcu_active_readers before read futex */
		reader_barrier();
		wake_up_gp();
	} else {
		_STORE_SHARED(urcu_active_readers,
			      urcu_active_readers - RCU_GP_COUNT);
	}
}
As you see, we need to appear to be out of our C.S. before waking up
synchronize_rcu(), otherwise we could end up in a situation where it
waits forever (given that no more readers would come).
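For reference, the matching slow path on the write-side conceptually
looks like this (a sketch; reader_still_active() stands in for the
per-reader check, and the actual code keeps this in a loop):

static void wait_gp(void)
{
	/*
	 * Announce that we are about to sleep *before* re-checking the
	 * reader. A reader exiting its C.S. after our check is then
	 * guaranteed to see -1 and issue the FUTEX_WAKE above.
	 */
	atomic_set(&gp_futex, -1);
	smp_mb();	/* write gp_futex before reading the reader state */
	if (!reader_still_active())
		return;
	/* Sleep until a reader resets gp_futex to 0 and wakes us up. */
	futex(&gp_futex, FUTEX_WAIT, -1, NULL, NULL, 0);
}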
> > > > * Preliminary conclusions
> > > >
> > > > QSBR:
> > > >
> > > > I think we can afford the sys_futex for QSBR, and document that the
> > > > kernel syscall implementations we rely on may use non-wait-free
> > > > algorithms. This solution really gave the best results on my UP machine.
> > >
> > > Sounds sensible.
> > >
> >
> > .. and futex probably makes sense for urcu mb/signal too, even if it
> > adds a bit of overhead.
> >
> > I'm just unsure about the fallback solution when there is no
> > sys_futex(). The adaptive G.P. delay makes sense for QSBR, but not
> > really for mb/signal urcu, because the C.S. duration is typically just
> > too small. That would only leave sched_yield() (yuk!) for cases where
> > sys_futex() (or equivalent) is not available.
>
> You don't need to worry *too* much about the lack of sys_futex().
> Arguably, as long as you stay no less efficient than locking, switching
> to urcu represents a portable performance win.
I'd be tempted to use futex all the time, and if it is unavailable, fall
back to the "oh-so-bad" sched_yield() for compatibility. It's reliably
simple, although the resulting performance may vary. But I would be
concerned about using a more elaborate compatibility scheme, where
subtle bugs would only turn up in some very specific builds of the
library, and would hence be hard to test on a regular basis.
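Concretely, something along these lines (a sketch, using the futex()
wrapper from above; the ENOSYS probe and the names are illustrative):

#include <errno.h>
#include <sched.h>
#include <stdint.h>

static int have_futex = 1;

static void futex_wait_or_yield(int32_t *uaddr, int32_t val)
{
	if (have_futex) {
		if (futex(uaddr, FUTEX_WAIT, val, NULL, NULL, 0) >= 0
		    || errno != ENOSYS)
			return;
		have_futex = 0;		/* no sys_futex on this kernel */
	}
	sched_yield();			/* the "oh-so-bad" fallback */
}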
Thanks,
Mathieu
>
> - Josh Triplett
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68