[lttng-dev] liburcu rcu_xchg_pointer and rcu_cmpxchg_pointer ARM32 barriers
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Mon Dec 5 23:06:41 UTC 2016
On Mon, Dec 05, 2016 at 11:01:10PM +0000, Mathieu Desnoyers wrote:
> ----- On Dec 5, 2016, at 5:35 PM, Paul E. McKenney paulmck at linux.vnet.ibm.com wrote:
>
> > On Mon, Dec 05, 2016 at 02:14:47PM +0000, Mathieu Desnoyers wrote:
> >> Hi Paul,
> >>
> >> So about the liburcu rcu_xchg_pointer() barriers, here is the current
> >> situation:
> >>
> >> rcu_xchg_pointer is implemented as:
> >>
> >> #define _rcu_xchg_pointer(p, v)                         \
> >>         __extension__                                   \
> >>         ({                                              \
> >>                 __typeof__(*p) _________pv = (v);       \
> >>                 if (!__builtin_constant_p(v) ||         \
> >>                         ((v) != NULL))                  \
> >>                         cmm_wmb();                      \
> >>                 uatomic_xchg(p, _________pv);           \
> >>         })
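> >>
> >> For context, a minimal publisher-side usage sketch (mynode, gp and
> >> update() are hypothetical names, just for illustration):
> >>
> >> #include <stdlib.h>
> >> #include <urcu.h>
> >>
> >> struct mynode {
> >>         int data;
> >> };
> >>
> >> static struct mynode *gp;       /* RCU-protected pointer */
> >>
> >> static void update(int value)
> >> {
> >>         struct mynode *newp, *oldp;
> >>
> >>         newp = malloc(sizeof(*newp));
> >>         newp->data = value;
> >>         /* Publication: the store to newp->data must not be
> >>          * reordered after the store to gp, which is why
> >>          * rcu_xchg_pointer() needs release semantics. */
> >>         oldp = rcu_xchg_pointer(&gp, newp);
> >>         synchronize_rcu();      /* wait for pre-existing readers */
> >>         free(oldp);
> >> }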
> >>
> >> So we actually add a write barrier before the uatomic_xchg(),
> >> which should not be required if we consider that uatomic_xchg()
> >> *should* imply a full barrier before/after.
> >>
> >> But in reality, the ARM32 uatomic_xchg() does not fulfill its
> >> contract, because __sync_lock_test_and_set() provides only an
> >> acquire barrier [1]. So the extra cmm_wmb() is what has saved
> >> us here for rcu_xchg_pointer().
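> >>
> >> To make the hazard concrete (reusing the hypothetical mynode/gp
> >> from the sketch above, plus <assert.h>): with an acquire-only
> >> exchange and no cmm_wmb() workaround, nothing orders the
> >> initialization of the new node before its publication:
> >>
> >> static void buggy_writer(void)
> >> {
> >>         struct mynode *newp = malloc(sizeof(*newp));
> >>
> >>         newp->data = 42;                        /* A: initialize */
> >>         (void) rcu_xchg_pointer(&gp, newp);     /* B: publish */
> >>         /* An acquire-only xchg lets the CPU commit A after B. */
> >> }
> >>
> >> static void reader(void)
> >> {
> >>         struct mynode *p;
> >>
> >>         rcu_read_lock();
> >>         p = rcu_dereference(gp);
> >>         if (p)
> >>                 assert(p->data == 42);  /* may fail if A lands after B */
> >>         rcu_read_unlock();
> >> }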
> >>
> >> The code currently generated by rcu_xchg_pointer() looks like:
> >>
> >> 11000:  f3bf 8f5f   dmb   sy                      @ cmm_wmb() from the macro
> >> 11004:  e857 ef00   ldrex lr, [r7]
> >> 11008:  e847 0300   strex r3, r0, [r7]
> >> 1100c:  2b00        cmp   r3, #0
> >> 1100e:  d1f9        bne.n 11004 <thr_writer+0x70>
> >> 11010:  f3bf 8f5b   dmb   ish                     @ acquire barrier from __sync_lock_test_and_set()
> >>
> >>
> >> Looking at the cmpxchg variant:
> >>
> >> #define _rcu_cmpxchg_pointer(p, old, _new)                      \
> >>         __extension__                                           \
> >>         ({                                                      \
> >>                 __typeof__(*p) _________pold = (old);           \
> >>                 __typeof__(*p) _________pnew = (_new);          \
> >>                 if (!__builtin_constant_p(_new) ||              \
> >>                         ((_new) != NULL))                       \
> >>                         cmm_wmb();                              \
> >>                 uatomic_cmpxchg(p, _________pold, _________pnew); \
> >>         })
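> >>
> >> The cmpxchg variant serves conditional publication, e.g. installing
> >> a node only if the slot is still empty (again with the hypothetical
> >> mynode/gp from the sketch above):
> >>
> >> /* Returns NULL if newp was installed, or the node that was
> >>  * already published by a concurrent updater. */
> >> static struct mynode *try_publish(struct mynode *newp)
> >> {
> >>         return rcu_cmpxchg_pointer(&gp, NULL, newp);
> >> }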
> >>
> >> We also notice a cmm_wmb() before what should already imply a full
> >> barrier (uatomic_cmpxchg()). The latter is implemented with
> >> __sync_val_compare_and_swap_N, which should imply a full barrier
> >> based on [1] (which is as vague as it gets). Looking at the
> >> generated code, we indeed have two barriers before the ll/sc loop:
> >>
> >> 11000:  f3bf 8f5f   dmb   sy                      @ cmm_wmb() from the macro
> >> 11004:  f3bf 8f5b   dmb   ish                     @ leading barrier from __sync_val_compare_and_swap()
> >> 11008:  e857 ef00   ldrex lr, [r7]
> >> 1100c:  45c6        cmp   lr, r8
> >> 1100e:  d103        bne.n 11018 <thr_writer+0x84>
> >> 11010:  e847 0300   strex r3, r0, [r7]
> >> 11014:  2b00        cmp   r3, #0
> >> 11016:  d1f7        bne.n 11008 <thr_writer+0x74>
> >> 11018:  f3bf 8f5b   dmb   ish                     @ trailing barrier from __sync_val_compare_and_swap()
> >>
> >> So for stable-0.8 and stable-0.9, I would be tempted to err on
> >> the safe side and simply add the missing cmm_smp_mb() within
> >> uatomic_xchg() before the __sync_lock_test_and_set().
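> >>
> >> A sketch of that stable-branch fix, assuming the generic
> >> gcc-builtin implementation of uatomic_xchg() (the actual file
> >> layout and the set of supported sizes may differ):
> >>
> >> static inline __attribute__((always_inline))
> >> unsigned long _uatomic_exchange(void *addr, unsigned long val, int len)
> >> {
> >>         switch (len) {
> >>         case 4:
> >>                 /* __sync_lock_test_and_set() is only an acquire
> >>                  * barrier; add the release half so that
> >>                  * uatomic_xchg() implies the full barrier its
> >>                  * contract promises. */
> >>                 cmm_smp_mb();
> >>                 return __sync_lock_test_and_set_4((uint32_t *) addr, val);
> >>         /* other sizes handled the same way, elided here */
> >>         }
> >>         return 0;       /* unsupported size */
> >> }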
> >>
> >> For the master branch, in addition to adding the missing cmm_smp_mb()
> >> to uatomic_xchg(), we could remove the redundant cmm_wmb() in
> >> rcu_cmpxchg_pointer and rcu_xchg_pointer.
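> >>
> >> For instance, _rcu_xchg_pointer() could then shrink to something
> >> like this untested sketch, relying on uatomic_xchg() to provide
> >> the full barrier:
> >>
> >> #define _rcu_xchg_pointer(p, v)                         \
> >>         __extension__                                   \
> >>         ({                                              \
> >>                 __typeof__(*p) _________pv = (v);       \
> >>                 uatomic_xchg(p, _________pv);           \
> >>         })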
> >>
> >> Thoughts?
> >
> > Seems reasonable to me. It is the x86 guys who might have objections,
> > given that the extra barrier costs them but has no effect. ;-)
>
> This barrier is only added to the asm-specific code of uatomic_xchg and uatomic_cmpxchg(),
> and has no impact on x86, so we should be good.
>
> Actually, removing the explicit wmb() from rcu_cmpxchg_pointer() and rcu_xchg_pointer()
> will even speed up those operations on x86.
Even better!
Thanx, Paul