[lttng-dev] [Userspace RCU] - rcu_dereference() memory ordering
Olivier Dion
odion at efficios.com
Mon Oct 21 15:53:04 EDT 2024
Hi Paul,
In liburcu, `rcu_dereference()' is implemented either as a volatile
access with `CMM_LOAD_SHARED()' followed by a memory barrier for
dependent loads, or as an atomic load with the CONSUME memory ordering
(configurable by users on a per-compilation-unit basis).
However, it is my understanding that the CONSUME memory ordering
semantics are deficient [0] and that current compilers therefore promote
it to the ACQUIRE memory ordering. This is somewhat inefficient (see
benchmarks at the end) on weakly-ordered architectures [1]:
rcu_dereference_consume:
        sub     sp, sp, #16
        add     x1, sp, 8
        str     x0, [sp, 8]
        ldar    x0, [x1]        ;; Load acquire
        add     sp, sp, 16
        ret
rcu_dereference_relaxed:
        sub     sp, sp, #16
        add     x1, sp, 8
        str     x0, [sp, 8]
        ldr     x0, [x1]        ;; Load
        add     sp, sp, 16
        ret
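For reference, a minimal sketch of what such wrappers can look like with
the GCC atomic builtins (my own reconstruction, not the exact source
from [1]):

/* Sketch (assumed): the only difference between the two functions is
 * the memory ordering requested for the load. */
void *rcu_dereference_consume(void **p)
{
        /* CONSUME is promoted to ACQUIRE by current compilers -> ldar on ARM64. */
        return __atomic_load_n(p, __ATOMIC_CONSUME);
}

void *rcu_dereference_relaxed(void **p)
{
        /* RELAXED -> plain ldr on ARM64. */
        return __atomic_load_n(p, __ATOMIC_RELAXED);
}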
I had a discussion with Mathieu about this: using the RELAXED memory
ordering (on every architecture except Alpha) plus a compiler barrier
would not prevent compiler value-speculation optimizations (e.g. VSS:
Value Speculation Scheduling).
Consider the following code:
#define cmm_barrier() asm volatile ("" : : : "memory")
#define rcu_dereference(p) __atomic_load_n(&(p), __ATOMIC_RELAXED)
// Assume QSBR flavor
#define rcu_read_lock() do { } while (0)
#define rcu_read_unlock() do { } while (0)
struct foo {
        long x;
};
struct foo *foo;
extern void do_stuff(long);
// Assume that global pointer `foo' is never NULL for simplicity.
void func(void)
{
        struct foo *a, *b;

        rcu_read_lock(); {
                a = rcu_dereference(foo);
                do_stuff(a->x);
        } rcu_read_unlock();

        cmm_barrier();

        rcu_read_lock(); {
                b = rcu_dereference(foo);
                if (a == b)
                        do_stuff(b->x);
        } rcu_read_unlock();
}
and the resulting assembler on ARM64 (GCC 14.2.0) [2]:
func:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        adrp    x19, .LANCHOR0
        add     x19, x19, :lo12:.LANCHOR0
        ldr     x20, [x19]      ;; a = rcu_dereference | <-- here ...
        ldr     x0, [x20]       ;; a->x
        bl      do_stuff
        ldr     x0, [x19]       ;; b = rcu_dereference
        cmp     x20, x0
        beq     .L5
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
.L5:
        ldr     x0, [x20]       ;; b->x | can be reordered up to ...
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        b       do_stuff
foo:
        .zero   8
From my understanding of the ARM memory model and its ISA, the
processor is within its rights to reorder the `ldr x0, [x20]' in `.L5'
up to its dependency at `ldr x20, [x19]', which happens before the RCU
dereference of `b'.
This looks similar to what Mathieu described here [3].
Our proposed solution is to keep using the CONSUME memory ordering by
default, thereby guaranteeing correctness in all cases. However, to
allow for better performance, users can opt in to "traditional" volatile
access instead of the atomic builtins for `rcu_dereference()', as long
as pointer comparisons are either avoided or done through the `ptr_eq'
wrapper proposed by Mathieu [3].
Thus, `rcu_dereference()' would be defined as something like:
#ifdef URCU_DEREFERENCE_USE_VOLATILE
# define rcu_dereference(p)     __extension__ ({                \
                __typeof__(p) _p1 = CMM_LOAD_SHARED(p);         \
                cmm_smp_rmc();                                  \
                (_p1); })
#else
# define rcu_dereference(p) uatomic_load(&(p), CMM_CONSUME)
#endif
and would yield the following when using `cmm_ptr_eq' (ARM64, GCC 14.2.0) [4]:
func:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        adrp    x20, .LANCHOR0
        ldr     x19, [x20, #:lo12:.LANCHOR0]    ;; a = rcu_dereference
        ldr     x0, [x19]                       ;; a->x
        bl      do_stuff
        ldr     x2, [x20, #:lo12:.LANCHOR0]     ;; b = rcu_dereference | <-- here ...
        mov     x0, x19         ;; side effect of cmm_ptr_eq: forces the use of more registers
        mov     x1, x2          ;; ... and more registers
        cmp     x0, x1
        beq     .L5
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
.L5:
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ldr     x0, [x2]        ;; b->x | can be re-ordered up to ...
        b       do_stuff
foo:
        .zero   8
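For reference, a `cmm_ptr_eq'-style wrapper could look something like
the following sketch (modelled on the approach in [3]; the actual
definition in liburcu may differ):

/* Sketch: hide both pointers from the optimizer so that, after the
 * comparison, the compiler cannot substitute one register for the
 * other, which would otherwise break the address dependency that
 * rcu_dereference() relies on. */
static inline int cmm_ptr_eq(const volatile void *a, const volatile void *b)
{
        asm volatile ("" : "+r" (a));   /* hide `a' */
        asm volatile ("" : "+r" (b));   /* hide `b' */
        return a == b;
}

In the example above, `if (a == b)' then becomes `if (cmm_ptr_eq(a, b))',
which is what forces the extra register moves seen before the `cmp'.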
The overall pros & cons of selecting volatile access for
`rcu_dereference()':
Pro:
- Yields better performance on weakly-ordered architectures for all
  `rcu_dereference()' calls.
Cons:
- Users would need to use `cmm_ptr_eq' for pointer comparisons, even on
  strongly-ordered architectures.
- `cmm_ptr_eq' can increase register pressure, resulting in possible
  register spilling.
Here is a benchmark summary; you can find more details in the attached
file.
CPU: AArch64 Cortex-A57
Program run under perf. Tight loop of the above example, 1 000 000 000
iterations.
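The driver loop looks something like the following sketch (hypothetical;
the exact harness is in the attached benchmark file):

/* Hypothetical benchmark driver: initialize the global pointer and call
 * func() in a tight loop, measured under perf. */
static struct foo foo_storage;

void do_stuff(long x)
{
        asm volatile ("" : : "r" (x));  /* keep the loaded value alive */
}

int main(void)
{
        foo = &foo_storage;
        for (long i = 0; i < 1000000000L; i++)
                func();
        return 0;
}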
Variants are:
- Baseline v0.14.1:: rcu_dereference() implemented with
  CMM_ACCESS_ONCE(). Pointer comparisons with the `==' operator.
- Volatile access:: rcu_dereference() implemented with
  CMM_ACCESS_ONCE(). Pointer comparisons with `cmm_ptr_eq'.
- Atomic builtins:: rcu_dereference() implemented with
  __atomic_load_n CONSUME. Pointer comparisons with `cmm_ptr_eq'.
All variants were compiled with _LGPL_SOURCE.
| Variant         | Time [s]    | Cycles        | Instructions   | Branch misses |
|-----------------+-------------+---------------+----------------+---------------|
| Baseline        | 4.217609351 | 8 015 627 017 | 15 008 330 513 |        26 607 |
|-----------------+-------------+---------------+----------------+---------------|
| Volatile access | +10.95 %    | +11.14 %      | +6.25 %        | +10.81 %      |
| Atomic builtins | +423.18 %   | +425.94 %     | +6.87 %        | +188.37 %     |
Any thoughts on that?
Thanks,
Olivier
[0] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[1] https://godbolt.org/z/xxqGPjaxK
[2] https://godbolt.org/z/cPzxq7PKb
[3] https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
[4] https://godbolt.org/z/979jnccc9
[Attachment: 20241018T121818--benchmarks-urcu__urcu.org (text/x-org, 5843 bytes)
<https://lists.lttng.org/pipermail/lttng-dev/attachments/20241021/1d2cc7f9/attachment.bin>]
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com