[Userspace RCU] - rcu_dereference() memory ordering
Olivier Dion
odion at efficios.com
Thu Nov 21 13:13:29 EST 2024
On Thu, 21 Nov 2024, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:
> On 2024-10-21 19:35, Paul E. McKenney wrote:
>> On Mon, Oct 21, 2024 at 03:53:04PM -0400, Olivier Dion wrote:
[...]
>> How much of the added "Volatile access" overhead is due to the volatile
>> load and how much to the cmm_ptr_eq? Many use cases do not need to
>> compare pointers, except maybe against NULL. Or against a sentinel.
>> In both cases, an equality comparison means no dereferencing, so no
>> problems.
>
> Olivier will prepare benchmarks without the cmm_ptr_eq() so we can isolate
> the overhead contribution of volatile vs atomic builtins more
> specifically.
Here is the micro-benchmark without the pointer comparison: a tight loop
of rcu_dereference() run 1 000 000 000 times.
Hardware:
ARM Cortex-A57
Overview:
| Implementation |   Instructions |         Cycles | Branch misses | Task clock (ms) | Insn/cycle |
|----------------+----------------+----------------+---------------+-----------------+------------|
| Volatile (V)   | 10 006 366 281 |  6 011 214 706 |        21 168 |        3 159.60 |       1.66 |
| Atomic (A)     | 10 020 098 136 | 21 081 007 289 |        46 091 |       11 039.38 |       0.48 |
|----------------+----------------+----------------+---------------+-----------------+------------|
| Δ (A / V - 1)  |         0.14 % |       250.69 % |      117.74 % |        249.39 % |   -71.08 % |
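The Δ row is the relative change of the atomic figure over the volatile
one, in percent. A small sketch of that arithmetic follows; the helper
name delta_percent is hypothetical and not part of the benchmark:

```c
/*
 * Relative change of the atomic (A) figure over the volatile (V) one,
 * expressed in percent: (A / V - 1) * 100.
 * Hypothetical helper, for illustration only.
 */
static double delta_percent(double a, double v)
{
	return (a / v - 1.0) * 100.0;
}
```

Fed the two cycle counts from the table (21 081 007 289 and
6 011 214 706), this gives about 250.69, matching the Cycles column.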
Volatile:
0000000000000860 <func>:
860: 90000100 adrp x0, 20000 <__libc_start_main@GLIBC_2.34>
864: 91012001 add x1, x0, #0x48
868: f9402400 ldr x0, [x0, #72] ;; rcu_dereference()
86c: f9400000 ldr x0, [x0]
870: f9000420 str x0, [x1, #8]
874: d65f03c0 ret
3,159.60 msec task-clock # 0.999 CPUs utilized
3 context-switches # 0.949 /sec
0 cpu-migrations # 0.000 /sec
42 page-faults # 13.293 /sec
6,011,214,706 cycles # 1.903 GHz
10,006,366,281 instructions # 1.66 insn per cycle
<not supported> branches
21,168 branch-misses
3.161819264 seconds time elapsed
3.161902000 seconds user
0.000000000 seconds sys
Atomic:
0000000000000860 <func>:
860: 90000100 adrp x0, 20000 <__libc_start_main@GLIBC_2.34>
864: 91012000 add x0, x0, #0x48
868: c8dffc01 ldar x1, [x0] ;; rcu_dereference()
86c: f9400021 ldr x1, [x1]
870: f9000401 str x1, [x0, #8]
874: d65f03c0 ret
11,039.38 msec task-clock # 1.000 CPUs utilized
20 context-switches # 1.812 /sec
0 cpu-migrations # 0.000 /sec
43 page-faults # 3.895 /sec
21,081,007,289 cycles # 1.910 GHz
10,020,098,136 instructions # 0.48 insn per cycle
<not supported> branches
46,091 branch-misses
11.042103521 seconds time elapsed
11.041847000 seconds user
0.000000000 seconds sys
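For readers who want to reproduce the two listings, the compared
variants could look roughly like the sketch below. This is a
reconstruction under assumptions, not the actual benchmark source: the
names (struct node, shared, sink, deref_volatile, deref_atomic) are
hypothetical, and the real liburcu macros differ. The atomic variant
uses the GCC/Clang __atomic_load_n builtin with __ATOMIC_CONSUME, which
both compilers currently treat as acquire; that is consistent with the
ldar seen in the atomic listing.

```c
/* Hypothetical names throughout; not liburcu's actual macros. */
struct node {
	long value;
};

static struct node *shared;	/* pointer published by an updater */
static long sink;		/* keeps the loaded value observable */

/*
 * Variant V: plain volatile load. Ordering of the subsequent
 * dereference relies on the hardware address dependency.
 */
static inline struct node *deref_volatile(struct node **p)
{
	return *(struct node * volatile *)p;
}

/*
 * Variant A: atomic builtin with consume ordering. GCC and Clang
 * promote consume to acquire, hence a load-acquire on AArch64.
 */
static inline struct node *deref_atomic(struct node **p)
{
	return __atomic_load_n(p, __ATOMIC_CONSUME);
}

/* One benchmark iteration: load the pointer, dereference it, store the result. */
void func_volatile(void)
{
	sink = deref_volatile(&shared)->value;
}

void func_atomic(void)
{
	sink = deref_atomic(&shared)->value;
}
```

Calling either function 1 000 000 000 times in a loop under perf stat,
with shared pointing at a valid node, would approximate the measurement
above.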
[...]
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com