[lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?

Fri Apr 16 14:40:08 EDT 2021

----- On Apr 16, 2021, at 12:01 PM, paulmck paulmck at kernel.org wrote:

> On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote:
>> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote:
>> > Hi Paul, Will, Peter,
>> > 
>> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
>> > is able to break rcu_dereference. This seems to be taken care of by
>> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
>> > 
>> > In the liburcu user-space library, we have this comment near
>> > rcu_dereference() in include/urcu/static/pointer.h:
>> > 
>> >  * The compiler memory barrier in CMM_LOAD_SHARED() ensures that
>> >  * value-speculative optimizations (e.g. VSS: Value Speculation
>> >  * Scheduling) does not perform the data read before the pointer read
>> >  * by speculating the value of the pointer.
>> >  * Correct ordering is ensured because the pointer is read as a volatile access.
>> >  * This acts as a global side-effect operation, which forbids reordering of
>> >  * dependent memory operations. Note that such concern about dependency-breaking
>> >  * optimizations will eventually be taken care of by the "memory_order_consume"
>> >  * addition to forthcoming C++ standard.
>> > 
>> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in
>> > liburcu as a public API before READ_ONCE() existed in the Linux kernel)
>> > 
>> > Peter tells me the "memory_order_consume" is not something which can be
>> > used today. Any information on its status at C/C++ standard levels and
>> > implementation-wise ?
> 
> Actually, you really can use memory_order_consume.  All current
> implementations will compile it as if it was memory_order_acquire.
> This will work correctly, but may be slower than you would like on ARM,
> PowerPC, and so on.
> 
> On things like x86, the penalty is forgone optimizations, so less
> of a problem there.

OK

> 
>> > Pragmatically speaking, what should we change in liburcu to ensure we
>> > don't generate broken code when LTO is enabled ? I suspect there are a
>> > few options here:
>> > 
>> > 1) Fail to build if LTO is enabled,
>> > 2) Generate slower code for rcu_dereference, either on all architectures or only
>> >    on weakly-ordered architectures,
>> > 3) Generate different code depending on whether LTO is enabled or not.
>> >    AFAIU this would only work if every compile unit is aware that it
>> >    will end up being optimized with LTO. Not sure how this could be
>> >    done in the context of user-space.
>> > 4) [ Insert better idea here. ]
> 
> Use memory_order_consume if LTO is enabled.  That will work now, and
> might generate good code in some hoped-for future.

In the context of a user-space library, how does one check whether LTO is enabled with
preprocessor directives ? A quick test with gcc suggests that builds with and without
-flto cannot be distinguished from a preprocessor point of view, e.g. the output of both

gcc --std=c11 -O2 -dM -E - < /dev/null
and
gcc --std=c11 -O2 -flto -dM -E - < /dev/null

is exactly the same. Am I missing something here ?

If we accept using memory_order_consume unconditionally in both C and C++ code, starting
from C11 and C++11, the following code snippet could do the trick:

#define CMM_ACCESS_ONCE(x) (*(__volatile__  __typeof__(x) *)&(x))
#define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p)

#if defined (__cplusplus)
# if __cplusplus >= 201103L
#  include <atomic>
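/* Caveat: the cast builds a temporary std::atomic from a plain read of x, so the consume load applies to that temporary. */
#  define rcu_dereference(x)    ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume)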
# else
#  define rcu_dereference(x)    CMM_LOAD_SHARED(x)
# endif
#else
# if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L)
#  include <stdatomic.h>
#  define rcu_dereference(x)    atomic_load_explicit(&(x), memory_order_consume)
# else
#  define rcu_dereference(x)    CMM_LOAD_SHARED(x)
# endif
#endif

This uses the volatile approach prior to C11/C++11, and moves to memory_order_consume
afterwards. Since current implementations compile memory_order_consume as
memory_order_acquire, this will bring a performance penalty on weakly-ordered
architectures even when -flto is not specified.
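
As an illustration of the consume-load approach, here is a minimal sketch that assumes
the GCC/Clang __atomic_load_n() and __atomic_store_n() builtins. Those builtins accept
plain (non-_Atomic) objects, whereas the C11 generic functions are specified for atomic
types. The names rcu_dereference_consume(), struct node, shared_head, publish() and
read_value() are hypothetical and not part of liburcu:

```c
#include <stddef.h>

/*
 * Hypothetical macro, not the liburcu API.
 * Assumption: the compiler provides the GCC/Clang __atomic builtins.
 */
#define rcu_dereference_consume(p) \
	__atomic_load_n(&(p), __ATOMIC_CONSUME)

struct node {
	int value;
};

/* Plain pointer shared between the updater and readers (not _Atomic). */
static struct node *shared_head;

/* Updater side: initialize *n first, then publish it with release ordering. */
static void publish(struct node *n)
{
	__atomic_store_n(&shared_head, n, __ATOMIC_RELEASE);
}

/*
 * Reader side: load the pointer with consume ordering, then dereference it.
 * Returns -1 when nothing has been published yet.
 */
static int read_value(void)
{
	struct node *n = rcu_dereference_consume(shared_head);

	return n != NULL ? n->value : -1;
}
```

A reader calls read_value() after some other thread has called publish(). Per Paul's
remark above, current compilers are expected to treat the consume load as an acquire load.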

This pushes the burden onto compiler developers to eventually implement an efficient
memory_order_consume.

Is that acceptable ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com