[ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)

Mathieu Desnoyers compudj at krystal.dyndns.org
Fri Feb 13 12:33:52 EST 2009


* Linus Torvalds (torvalds at linux-foundation.org) wrote:
> 
> 
> On Fri, 13 Feb 2009, Mathieu Desnoyers wrote:
> > 
> > I created also
> > 
> > _STORE_SHARED()
> > _LOAD_SHARED()
> > 
> > which identify the variables which need to have cache flush done before
> > (load) or after (store). So we get both speed and identification when
> > needed (if we need to do batch updates linked with a single cache flush).
> > e.g.
> 
> The thing is, THAT JUST ABSOLUTELY SUCKS.
> 
> Lookie here - we don't want to flush the cache at every load of a shared 
> variable. There's no reason to. If you don't care about the orderign, you 
> might as well get the old values. That's what memory ordering _means_, for 
> chissake! In the absense of locks, loads may get stale values. It's that 
> easy.
> 
> A lot of code wants to access multiple variables, and they are potentially 
> nearby, and in the same cacheline. Making them all use _LOAD_SHARED() adds 
> absolutely no value - and makes it MUCH MUCH SLOWER.
> 

Hrm, I think there is a misunderstanding here, because _LOAD_SHARED() is
not much more than a simple comment.

The whole idea behind _LOAD_SHARED() is that it does not translate in
any different assembly output than a standard load. So no, it cannot be
possibly slower. It has no more side-effect than a simple comment in the
code, and that's its purpose : to identify those variables. So if we
find a code path doing

  _STORE_SHARED(x, v);
  smp_mc();
  while (_LOAD_SHARED(z) != val)
    cpu_relax();

We can verify very easily the code correctness :

A write cache flush is required after _STORE_SHARED
A read cache flush is required before _LOAD_SHARED
Read cache flushes are required to happen eventually between
  _LOAD_SHARED in the loop.

It's basically the same as having something like an eventual :

_STORE_ORDERED(x, v);
smp_mb();
_LOAD_ORDERED(z);

Instead of relying on a comment around smp_mb(); stating which variables
it orders. I would understand if you dislike it, but I find it rather
useful to have this information in the source code around the variable
access rather than formulated as a comment around the barrier. Actually
having both the barrier comment *and* this identification seems rather
good for code review.

> So what's the answer?
> 
> I already outlined it: either you use locks (which will do the magic for 
> you), or you use memory barriers. In no case do you make the access magic, 
> unless you have a compiler issue where you are afraid that the compiler 
> would turn it into _multiple_ accesses and potentially get inconsistent 
> results.
> 
> So the point about ACCESS_ONCE() is not, and never has been, about 
> re-ordering. We know that the CPU may re-order the accesses and give us 
> stale values (or values from the "future" wrt the other accesses around 
> it). That's not the point. The point of ACCESS_ONCE() is that we get 
> exactly _one_ value, and not two different ones (or none at all) because 
> of the compiler either re-loading it several times or not re-loading it at 
> all.
> 
> Anybody who confuses ACCESS_ONCE() with ordering is simply confused.
> 
> And we don't want to make any "load with cache flush" either. Which side 
> should the cache flush be on? Before? After? Both? Atomically? There is no 
> sane semantics for that.
> 

We might want to simply scrap the "safe and slow" version without
underscores (LOAD_SHARED, STORE_SHARED) which contain smp_rmc and
smp_wmc statements within the macro. But Paul insisted that he likes
having the proper memory ordering/cache coherency enforced within the
accessor macros. Personnally, I see much more value in the simple
"comment-only" versions _LOAD_SHARED/_STORE_SHARED matched with an
explicit cache flush statement because in a lot of cases, we will want
to do a batch of read/writes between cache flushes. Note that memory
barriers are already implicit in a lot of kernel primitives, namely
rcu_dereference, cmpxchg, spinlock, ... so this is debatable I guess.

> The only remaining sane semantics is to depend on memory barriers, and 
> then make a magic memory barrier that is extra weak and doesn't order 
> anythign at all, but just says "syncronize very weakly".

I agree completely. What I am proposing here is just to add syntaxic
sugar to better identify the variables related to those extra weak
barriers.

> 
> And I think we have that in "cpu_relax()". Because if you have somebody 
> doing shared memory accesses in a loop without any memory barriers or 
> locks or anything (ie the _ordering_ doesn't matter, only that some value 
> has been seen), then dang it, I can't see how you can _possibly_ use 
> anything else than that "cpu_relax()" somewhere in that loop.
> 

It must also be matched with the equivalent write flush barrier at the
write side, so hiding this deep within cpu_relax() only at the read-side
seems to hide a lot of what must be performed by the cores to exchange
the data properly. (ok we don't care about write cache flush for
Blackfin particularly, but I don't see why we should not start thinking
about what non-coherent caches small embedded devices can bring)

It's also worth noting that Paul and I have no agenda to push anything
into the mainline kernel to enforce anything like "wmc"-type cache flush
barriers. We are barely trying to find the best semantic to express our
userspace RCU algorithm, and I happen to have noticed this loophole
about non-coherent cache architectures. But ideally it would be good to
stay in sync with the Linux kernel's primitives, so this is why your
criticism is much appreciated.

Thanks,

Mathieu

> 			Linus
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev at lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68




More information about the lttng-dev mailing list