[ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
Mathieu Desnoyers
compudj at krystal.dyndns.org
Fri Feb 13 12:33:52 EST 2009
* Linus Torvalds (torvalds at linux-foundation.org) wrote:
>
>
> On Fri, 13 Feb 2009, Mathieu Desnoyers wrote:
> >
> > I created also
> >
> > _STORE_SHARED()
> > _LOAD_SHARED()
> >
> > which identify the variables which need to have cache flush done before
> > (load) or after (store). So we get both speed and identification when
> > needed (if we need to do batch updates linked with a single cache flush).
> > e.g.
>
> The thing is, THAT JUST ABSOLUTELY SUCKS.
>
> Lookie here - we don't want to flush the cache at every load of a shared
> variable. There's no reason to. If you don't care about the ordering, you
> might as well get the old values. That's what memory ordering _means_, for
> chrissake! In the absence of locks, loads may get stale values. It's that
> easy.
>
> A lot of code wants to access multiple variables, and they are potentially
> nearby, and in the same cacheline. Making them all use _LOAD_SHARED() adds
> absolutely no value - and makes it MUCH MUCH SLOWER.
>
Hrm, I think there is a misunderstanding here, because _LOAD_SHARED() is
not much more than a simple comment.
The whole idea behind _LOAD_SHARED() is that it does not translate into
any different assembly output than a standard load. So no, it cannot
possibly be slower. It has no more side effects than a simple comment in
the code, and that is its purpose: to identify those variables. So if we
find a code path doing
	_STORE_SHARED(x, v);
	smp_mc();
	while (_LOAD_SHARED(z) != val)
		cpu_relax();
we can verify the code's correctness very easily:
- A write cache flush is required after _STORE_SHARED.
- A read cache flush is required before _LOAD_SHARED.
- Read cache flushes are required to happen eventually between the
  _LOAD_SHARED calls in the loop.
It is basically equivalent to eventually having:
	_STORE_ORDERED(x, v);
	smp_mb();
	_LOAD_ORDERED(z);
instead of relying on a comment around smp_mb() stating which variables
it orders. I would understand if you disliked it, but I find it rather
useful to have this information in the source code at the variable
access rather than formulated as a comment around the barrier. Actually,
having both the barrier comment *and* this identification seems rather
good for code review.
> So what's the answer?
>
> I already outlined it: either you use locks (which will do the magic for
> you), or you use memory barriers. In no case do you make the access magic,
> unless you have a compiler issue where you are afraid that the compiler
> would turn it into _multiple_ accesses and potentially get inconsistent
> results.
>
> So the point about ACCESS_ONCE() is not, and never has been, about
> re-ordering. We know that the CPU may re-order the accesses and give us
> stale values (or values from the "future" wrt the other accesses around
> it). That's not the point. The point of ACCESS_ONCE() is that we get
> exactly _one_ value, and not two different ones (or none at all) because
> of the compiler either re-loading it several times or not re-loading it at
> all.
>
> Anybody who confuses ACCESS_ONCE() with ordering is simply confused.
>
> And we don't want to make any "load with cache flush" either. Which side
> should the cache flush be on? Before? After? Both? Atomically? There is no
> sane semantics for that.
>
We might want to simply scrap the "safe and slow" versions without
underscores (LOAD_SHARED, STORE_SHARED), which contain smp_rmc and
smp_wmc statements within the macro. But Paul insisted that he likes
having the proper memory ordering/cache coherency enforced within the
accessor macros. Personally, I see much more value in the simple
"comment-only" versions _LOAD_SHARED/_STORE_SHARED matched with an
explicit cache flush statement, because in a lot of cases we will want
to do a batch of reads/writes between cache flushes. Note that memory
barriers are already implicit in a lot of kernel primitives, namely
rcu_dereference, cmpxchg, spinlock, ... so this is debatable, I guess.
> The only remaining sane semantics is to depend on memory barriers, and
> then make a magic memory barrier that is extra weak and doesn't order
> anything at all, but just says "synchronize very weakly".
I agree completely. What I am proposing here is just to add syntactic
sugar to better identify the variables related to those extra-weak
barriers.
>
> And I think we have that in "cpu_relax()". Because if you have somebody
> doing shared memory accesses in a loop without any memory barriers or
> locks or anything (ie the _ordering_ doesn't matter, only that some value
> has been seen), then dang it, I can't see how you can _possibly_ use
> anything else than that "cpu_relax()" somewhere in that loop.
>
It must also be matched with the equivalent write flush barrier on the
write side, so hiding this deep within cpu_relax() only on the read side
seems to hide a lot of what the cores must perform to exchange the data
properly. (OK, we don't care about write cache flushes for Blackfin
specifically, but I don't see why we should not start thinking about
what the non-coherent caches of small embedded devices can bring.)
It's also worth noting that Paul and I have no agenda to push anything
into the mainline kernel to enforce "wmc"-type cache flush barriers. We
are merely trying to find the best semantics to express our userspace
RCU algorithm, and I happened to notice this pitfall concerning
non-coherent cache architectures. But ideally it would be good to stay
in sync with the Linux kernel's primitives, so your criticism is much
appreciated.
Thanks,
Mathieu
> Linus
>
> _______________________________________________
> ltt-dev mailing list
> ltt-dev at lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68