[lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?

Fri Apr 16 12:01:39 EDT 2021

On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote:
> > Hi Paul, Will, Peter,
> > 
> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
> > is able to break rcu_dereference. This seems to be taken care of by
> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
> > 
> > In the liburcu user-space library, we have this comment near rcu_dereference() in
> > include/urcu/static/pointer.h:
> > 
> >  * The compiler memory barrier in CMM_LOAD_SHARED() ensures that value-speculative
> >  * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform the
> >  * data read before the pointer read by speculating the value of the pointer.
> >  * Correct ordering is ensured because the pointer is read as a volatile access.
> >  * This acts as a global side-effect operation, which forbids reordering of
> >  * dependent memory operations. Note that such concern about dependency-breaking
> >  * optimizations will eventually be taken care of by the "memory_order_consume"
> >  * addition to forthcoming C++ standard.
> > 
> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in
> > liburcu as a public API before READ_ONCE() existed in the Linux kernel)
> > 
> > Peter tells me the "memory_order_consume" is not something which can be used today.
> > Any information on its status at C/C++ standard levels and implementation-wise ?

Actually, you really can use memory_order_consume.  All current
implementations will compile it as if it was memory_order_acquire.
This will work correctly, but may be slower than you would like on ARM,
PowerPC, and so on.

On things like x86, the penalty is forgone optimizations, so less
of a problem there.

> > Pragmatically speaking, what should we change in liburcu to ensure we don't generate
> > broken code when LTO is enabled ? I suspect there are a few options here:
> > 
> > 1) Fail to build if LTO is enabled,
> > 2) Generate slower code for rcu_dereference, either on all architectures or only
> >    on weakly-ordered architectures,
> > 3) Generate different code depending on whether LTO is enabled or not. AFAIU this would only
> >    work if every compile unit is aware that it will end up being optimized with LTO. Not sure
> >    how this could be done in the context of user-space.
> > 4) [ Insert better idea here. ]

Use memory_order_consume if LTO is enabled.  That will work now, and
might generate good code in some hoped-for future.

> > Thoughts ?
> 
> Using memory_order_acquire is safe; and is basically what Will did for
> ARM64.
> 
> The problematic tranformations are possible even without LTO, although
> less likely due to less visibility, but everybody agrees they're
> possible and allowed.
> 
> OTOH we do not have a positive sighting of it actually happening (I
> think), we're all just being cautious and not willing to debug the
> resulting wreckage if it does indeed happen.

And yes, you can also use memory_order_acquire.

							Thanx, Paul