[lttng-dev] [Userspace RCU] - rcu_dereference() memory ordering
Olivier Dion
odion at efficios.com
Mon Oct 21 15:53:04 EDT 2024
Hi Paul,
In liburcu, `rcu_dereference()' is implemented either as a volatile
access with `CMM_LOAD_SHARED()' followed by a memory barrier for
dependent loads, or as an atomic load with the CONSUME memory ordering
(configurable by users on a per-compilation-unit basis).
However, it is my understanding that the CONSUME memory ordering
semantics are deficient [0] and that current compilers therefore promote
it to the ACQUIRE memory ordering. This is somewhat inefficient (see
benchmarks at the end) on weakly-ordered architectures [1]:
rcu_dereference_consume:
        sub     sp, sp, #16
        add     x1, sp, 8
        str     x0, [sp, 8]
        ldar    x0, [x1]        ;; Load acquire
        add     sp, sp, 16
        ret
rcu_dereference_relaxed:
        sub     sp, sp, #16
        add     x1, sp, 8
        str     x0, [sp, 8]
        ldr     x0, [x1]        ;; Load
        add     sp, sp, 16
        ret
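For reference, a minimal sketch of what such wrappers can look like with
the GCC atomic builtins (my own reconstruction, not the exact source
from [1]):

/* Sketch (assumed): the only difference between the two functions is
 * the memory ordering requested for the load. */
void *rcu_dereference_consume(void **p)
{
        /* CONSUME is promoted to ACQUIRE by current compilers -> ldar on ARM64. */
        return __atomic_load_n(p, __ATOMIC_CONSUME);
}

void *rcu_dereference_relaxed(void **p)
{
        /* RELAXED -> plain ldr on ARM64. */
        return __atomic_load_n(p, __ATOMIC_RELAXED);
}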
I had a discussion with Mathieu about this: using the RELAXED memory
ordering (on every architecture except Alpha) plus a compiler barrier
would not prevent compiler value-speculation optimizations (e.g. VSS:
Value Speculation Scheduling).
Consider the following code:
#define cmm_barrier() asm volatile ("" : : : "memory")
#define rcu_dereference(p) __atomic_load_n(&(p), __ATOMIC_RELAXED)
// Assume QSBR flavor
#define rcu_read_lock() do { } while (0)
#define rcu_read_unlock() do { } while (0)
struct foo {
        long x;
};
struct foo *foo;
extern void do_stuff(long);
// Assume that global pointer `foo' is never NULL for simplicity.
void func(void)
{
        struct foo *a, *b;

        rcu_read_lock(); {
                a = rcu_dereference(foo);
                do_stuff(a->x);
        } rcu_read_unlock();

        cmm_barrier();

        rcu_read_lock(); {
                b = rcu_dereference(foo);
                if (a == b)
                        do_stuff(b->x);
        } rcu_read_unlock();
}
and the resulting assembler on ARM64 (GCC 14.2.0) [2]:
func:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        adrp    x19, .LANCHOR0
        add     x19, x19, :lo12:.LANCHOR0
        ldr     x20, [x19]      ;; a = rcu_dereference | <-- here ...
        ldr     x0, [x20]       ;; a->x
        bl      do_stuff
        ldr     x0, [x19]       ;; b = rcu_dereference
        cmp     x20, x0
        beq     .L5
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
.L5:
        ldr     x0, [x20]       ;; b->x | can be reordered up to ...
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        b       do_stuff
foo:
        .zero   8
From my understanding of the ARM memory model and its ISA, the
processor is within its rights to reorder the `ldr x0, [x20]' in `.L5'
up to its dependency at `ldr x20, [x19]', which happens before the RCU
dereference of `b'.
This looks similar to what Mathieu described here [3].
Our proposed solution is to keep using the CONSUME memory ordering by
default, thereby guaranteeing correctness in all cases. However, to
allow for better performance, users can opt in to "traditional" volatile
access instead of the atomic builtins for `rcu_dereference()', as long
as pointer comparisons are either avoided or done through the `ptr_eq'
wrapper proposed by Mathieu [3].
Thus, `rcu_dereference()' would be defined as something like:
#ifdef URCU_DEREFERENCE_USE_VOLATILE
# define rcu_dereference(p)     __extension__ ({                \
                __typeof__(p) _p1 = CMM_LOAD_SHARED(p);         \
                cmm_smp_rmc();                                  \
                (_p1); })
#else
# define rcu_dereference(p) uatomic_load(&(p), CMM_CONSUME)
#endif
and would yield the following when using `cmm_ptr_eq' (ARM64, GCC 14.2.0) [4]:
func:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        adrp    x20, .LANCHOR0
        ldr     x19, [x20, #:lo12:.LANCHOR0]    ;; a = rcu_dereference
        ldr     x0, [x19]                       ;; a->x
        bl      do_stuff
        ldr     x2, [x20, #:lo12:.LANCHOR0]     ;; b = rcu_dereference | <-- here ...
        mov     x0, x19         ;; side effect of cmm_ptr_eq: forces the use of more registers
        mov     x1, x2          ;; ... and more registers
        cmp     x0, x1
        beq     .L5
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
.L5:
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ldr     x0, [x2]        ;; b->x | can be re-ordered up to ...
        b       do_stuff
foo:
        .zero   8
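For reference, a `cmm_ptr_eq'-style wrapper could look something like
the following sketch (modelled on the approach in [3]; the actual
definition in liburcu may differ):

/* Sketch: hide both pointers from the optimizer so that, after the
 * comparison, the compiler cannot substitute one register for the
 * other, which would otherwise break the address dependency that
 * rcu_dereference() relies on. */
static inline int cmm_ptr_eq(const volatile void *a, const volatile void *b)
{
        asm volatile ("" : "+r" (a));   /* hide `a' */
        asm volatile ("" : "+r" (b));   /* hide `b' */
        return a == b;
}

In the example above, `if (a == b)' then becomes `if (cmm_ptr_eq(a, b))',
which is what forces the extra register moves seen before the `cmp'.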
The overall pros & cons of selecting volatile access for
`rcu_dereference()':
Pro:
- Yields better performance on weakly-ordered architectures for all
  `rcu_dereference()' calls.
Cons:
- Users would need to use `cmm_ptr_eq' for pointer comparisons, even on
  strongly-ordered architectures.
- `cmm_ptr_eq' can increase register pressure, resulting in possible
  register spilling.
Here is a benchmark summary; you can find more details in the attached
file.
CPU: AArch64 Cortex-A57
Program run under perf. Tight loop of the above example, 1 000 000 000
iterations.
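The driver loop looks something like the following sketch (hypothetical;
the exact harness is in the attached benchmark file):

/* Hypothetical benchmark driver: initialize the global pointer and call
 * func() in a tight loop, measured under perf. */
static struct foo foo_storage;

void do_stuff(long x)
{
        asm volatile ("" : : "r" (x));  /* keep the loaded value alive */
}

int main(void)
{
        foo = &foo_storage;
        for (long i = 0; i < 1000000000L; i++)
                func();
        return 0;
}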
Variants are:
- Baseline v0.14.1:: rcu_dereference() implemented with
  CMM_ACCESS_ONCE(). Pointer comparisons with the `==' operator.
- Volatile access:: rcu_dereference() implemented with
  CMM_ACCESS_ONCE(). Pointer comparisons with `cmm_ptr_eq'.
- Atomic builtins:: rcu_dereference() implemented with
  __atomic_load_n CONSUME. Pointer comparisons with `cmm_ptr_eq'.
All variants were compiled with _LGPL_SOURCE.
| Variant         | Time [s]    | Cycles        | Instructions   | Branch misses |
|-----------------+-------------+---------------+----------------+---------------|
| Baseline        | 4.217609351 | 8 015 627 017 | 15 008 330 513 |        26 607 |
|-----------------+-------------+---------------+----------------+---------------|
| Volatile access | +10.95 %    | +11.14 %      | +6.25 %        | +10.81 %      |
| Atomic builtins | +423.18 %   | +425.94 %     | +6.87 %        | +188.37 %     |
Any thoughts on that?
Thanks,
Olivier
[0] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[1] https://godbolt.org/z/xxqGPjaxK
[2] https://godbolt.org/z/cPzxq7PKb
[3] https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
[4] https://godbolt.org/z/979jnccc9
[Attachment: 20241018T121818--benchmarks-urcu__urcu.org (text/x-org, 5843 bytes)
<https://lists.lttng.org/pipermail/lttng-dev/attachments/20241021/1d2cc7f9/attachment.bin>]
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com