[Userspace RCU] - rcu_dereference() memory ordering

Olivier Dion odion at efficios.com
Thu Nov 21 13:13:29 EST 2024


On Thu, 21 Nov 2024, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:
> On 2024-10-21 19:35, Paul E. McKenney wrote:
>> On Mon, Oct 21, 2024 at 03:53:04PM -0400, Olivier Dion wrote:
[...] 
>> How much of the added "Volatile access" overhead is due to the volatile
>> load and how much to the cmm_ptr_eq?  Many use cases do not need to
>> compare pointers, except maybe against NULL.  Or against a sentinel.
>> In both cases, an equality comparison means no dereferencing, so no
>> problems.
>
> Olivier will prepare benchmarks without the cmm_ptr_eq() so we can isolate
> the overhead contribution of volatile vs atomic builtins more
> specifically.

Here is the micro-benchmark without the pointer comparison.  A tight loop
of rcu_dereference() was run 1 000 000 000 times:

Hardware:

  ARM Cortex-A57

Overview:

 | Implementation | Instructions   | Cycles         | Branch misses | Task clock (ms) | Insn/cycle |
 |----------------+----------------+----------------+---------------+-----------------+------------|
 | Volatile (V)   | 10 006 366 281 | 6 011 214 706  | 21 168        | 3 159.60        |       1.66 |
 | Atomic (A)     | 10 020 098 136 | 21 081 007 289 | 46 091        | 11 039.38       |       0.48 |
 |----------------+----------------+----------------+---------------+-----------------+------------|
 | Δ (A / V - 1)  | 0.14 %         | 250.69 %       | 117.74 %      | 249.39 %        |   -71.08 % |
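
For reference, the Δ row can be reproduced from the two rows above it.
The helper below is only a sketch: the name rel_delta_pct() is made up
for illustration and is not part of the benchmark.

```c
/*
 * Relative change of the atomic (a) measurement against the volatile (v)
 * one, in percent: (a / v - 1) * 100.  Hypothetical helper, illustration
 * only.
 */
static double rel_delta_pct(double a, double v)
{
	return (a / v - 1.0) * 100.0;
}
```

Feeding it the cycle counts from the table (21 081 007 289 and
6 011 214 706) gives the 250.69 % shown in the Cycles column.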

Volatile:

         0000000000000860 <func>:
          860:   90000100        adrp    x0, 20000 <__libc_start_main@GLIBC_2.34>
          864:   91012001        add     x1, x0, #0x48
          868:   f9402400        ldr     x0, [x0, #72]   ;; rcu_dereference()
          86c:   f9400000        ldr     x0, [x0]
          870:   f9000420        str     x0, [x1, #8]
          874:   d65f03c0        ret

          3,159.60 msec task-clock                       #    0.999 CPUs utilized
                 3      context-switches                 #    0.949 /sec
                 0      cpu-migrations                   #    0.000 /sec
                42      page-faults                      #   13.293 /sec
     6,011,214,706      cycles                           #    1.903 GHz
    10,006,366,281      instructions                     #    1.66  insn per cycle
   <not supported>      branches
            21,168      branch-misses

       3.161819264 seconds time elapsed

       3.161902000 seconds user
       0.000000000 seconds sys

Atomic:

        0000000000000860 <func>:
         860:   90000100        adrp    x0, 20000 <__libc_start_main@GLIBC_2.34>
         864:   91012000        add     x0, x0, #0x48
         868:   c8dffc01        ldar    x1, [x0]        ;; rcu_dereference()
         86c:   f9400021        ldr     x1, [x1]
         870:   f9000401        str     x1, [x0, #8]
         874:   d65f03c0        ret

         11,039.38 msec task-clock                       #    1.000 CPUs utilized
                20      context-switches                 #    1.812 /sec
                 0      cpu-migrations                   #    0.000 /sec
                43      page-faults                      #    3.895 /sec
    21,081,007,289      cycles                           #    1.910 GHz
    10,020,098,136      instructions                     #    0.48  insn per cycle
   <not supported>      branches
            46,091      branch-misses

      11.042103521 seconds time elapsed

      11.041847000 seconds user
       0.000000000 seconds sys
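
For readers who want to reproduce the two listings, here is a minimal
sketch of the two load flavors in C.  It is a reconstruction, not the
benchmark source: the names (struct item, gp, sink, func_volatile,
func_atomic) are assumptions, and the atomic flavor assumes an acquire
load through the GCC/Clang __atomic_load_n() builtin.  GCC also
implements __ATOMIC_CONSUME as acquire, so either order would be
expected to produce the ldar seen above.

```c
/* Hypothetical reconstruction; all names are illustrative only. */
struct item {
	long value;
};

static struct item the_item = { 42 };
static struct item *gp = &the_item;	/* stands in for the RCU-protected pointer */
static long sink;			/* keeps the dependent load from being optimized out */

/* Volatile flavor: a plain load forced by a volatile access (ldr on AArch64). */
void func_volatile(void)
{
	struct item *p = *(struct item * volatile *)&gp;

	sink = p->value;
}

/* Atomic flavor: a load-acquire through the compiler builtin (ldar on AArch64). */
void func_atomic(void)
{
	struct item *p = __atomic_load_n(&gp, __ATOMIC_ACQUIRE);

	sink = p->value;
}
```

Both functions do the same work at the C level (load the pointer,
dereference it, store the result), so any difference in the counters
comes from the load instruction chosen for the pointer itself.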

[...]
-- 
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

