[lttng-dev] [PATCH] urcu: avoid false sharing for rcu_gp_ctr

Paul E. McKenney paulmck at linux.vnet.ibm.com
Mon Dec 10 10:04:18 EST 2012


On Fri, Dec 07, 2012 at 12:22:52PM -0500, Mathieu Desnoyers wrote:
> * Lai Jiangshan (eag0628 at gmail.com) wrote:
> > On Saturday, December 8, 2012, Mathieu Desnoyers wrote:
> > 
> > > * Lai Jiangshan (eag0628 at gmail.com) wrote:
> > > > we can define rcu_gp_ctr and registry with the aligned attribute,
> > > > but it is not a reliable way.
> > > >
> > > > We need only this:
> > > > unsigned long rcu_gp_ctr __attribute((aligned and padded(don't put other
> > > > var next to it except the futex)))
> > >
> > > In which situation would this be unreliable ?
> > 
> > 
> > 
> > int a;
> > int b __attribute__((aligned));
> > int c;
> > 
> > b and c will be in the same cache line. Even if we define c as aligned
> > too, the compiler/linker may put a next to b, so a and b end up in the
> > same cache line.
> 
> So if our goal is to have rcu_gp_ctr and rcu_gp_futex on the same cache
> line, which is different from that of the registry, we could do:
> 
> typeA rcu_gp_ctr __attribute__((aligned(...)));
> typeB rcu_gp_futex;
> typeC registry __attribute__((aligned(...)));
> 
> I would expect the compiler won't typically reorder rcu_gp_futex and
> registry, but I guess the C99 standard gives no guarantee that this
> will always hold.
> 
> I guess this is a case where we could bump the library version number
> and do things properly.
> 
> Let's think a bit more about it, anyone else has comments on this ?

If you really want the alignment and padding, they should go into a
structure.  ;-)
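A sketch of what that structure could look like, assuming a 64-byte cache line (CACHE_LINE_SIZE and the RCU_GP_COUNT stand-in are illustrative here; liburcu has its own definitions for both):

```c
#include <stdint.h>

/* Grouping ctr and futex in one aligned struct guarantees they share
 * a cache line. The aligned attribute on the TYPE also rounds sizeof
 * up to a multiple of the alignment, so whatever the linker places
 * next (e.g. the registry) cannot land on the same line. */
#define CACHE_LINE_SIZE 64

struct urcu_gp {
	unsigned long ctr;	/* grace-period counter: RCU_GP_CTR_PHASE + count */
	int32_t futex;		/* wakeup futex, accessed together with ctr */
} __attribute__((aligned(CACHE_LINE_SIZE)));

/* 1 stands in for the urcu-internal RCU_GP_COUNT initializer. */
struct urcu_gp rcu_gp = { .ctr = 1 };
```

Unlike the aligned-variable approach, the padding implied by the aligned struct type makes the layout independent of what the compiler or linker places around the object.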

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > 
> > >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > >
> > > > On Saturday, December 8, 2012, Mathieu Desnoyers wrote:
> > > >
> > > > > * Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
> > > > > > @rcu_gp_ctr and @registry share the same cache line, which causes
> > > > > > false sharing and slows down both the read side and the update side.
> > > > > >
> > > > > > Fix: use different cache lines for them.
> > > > > >
> > > > > > Although rcu_gp_futex is updated less often than rcu_gp_ctr, the two
> > > > > > are almost always accessed at the same time, so we also move
> > > > > > rcu_gp_futex to the cache line of rcu_gp_ctr to reduce the cache-line
> > > > > > usage and cache misses of the read side.
> > > > >
> > > > > Hi Lai,
> > > > >
> > > > > I agree on the goal: placing registry and rcu_gp_ctr on different
> > > > > cache lines. And yes, it makes sense to put rcu_gp_ctr and rcu_gp_futex
> > > > > on the same cache line. I agree that the next patch is fine too
> > > > > (keeping qsbr and the other urcu flavors similar). This is indeed what
> > > > > I try to ensure myself.
> > > > >
> > > > > I'm just concerned that this patch seems to break ABI compatibility
> > > > > for liburcu: the read side, within applications, would have to be
> > > > > recompiled. So I guess we should decide whether we do this change in
> > > > > a way that does not break the ABI (e.g. not introducing a structure),
> > > > > or whether we choose to bump the library version number.
> > > > >
> > > > > Thoughts ?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Mathieu
> > > > >
> > > > > >
> > > > > >
> > > > > > test: (4X6=24 CPUs)
> > > > > >
> > > > > > Before patch:
> > > > > >
> > > > > > [root at localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
> > > > > > SUMMARY ./tests/test_urcu_mb      testdur   20 nr_readers  20 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads   2100285330 nr_writes      3390219 nr_ops   2103675549
> > > > > > [root at localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
> > > > > > SUMMARY ./tests/test_urcu_mb      testdur   20 nr_readers  20 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads   1619868562 nr_writes      3529478 nr_ops   1623398040
> > > > > > [root at localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
> > > > > > SUMMARY ./tests/test_urcu_mb      testdur   20 nr_readers  20 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads   1949067038 nr_writes      3469334 nr_ops   1952536372
> > > > > >
> > > > > >
> > > > > > After patch:
> > > > > >
> > > > > > [root at localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
> > > > > > SUMMARY ./tests/test_urcu_mb      testdur   20 nr_readers  20 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads   3380191848 nr_writes      4903248 nr_ops   3385095096
> > > > > > [root at localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
> > > > > > SUMMARY ./tests/test_urcu_mb      testdur   20 nr_readers  20 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads   3397637486 nr_writes      4129809 nr_ops   3401767295
> > > > > >
> > > > > > Signed-off-by: Lai Jiangshan <laijs at cn.fujitsu.com>
> > > > > > ---
> > > > > > diff --git a/urcu.c b/urcu.c
> > > > > > index 15def09..436d71c 100644
> > > > > > --- a/urcu.c
> > > > > > +++ b/urcu.c
> > > > > > @@ -83,16 +83,7 @@ void __attribute__((destructor)) rcu_exit(void);
> > > > > >  #endif
> > > > > >
> > > > > >  static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;
> > > > > > -
> > > > > > -int32_t rcu_gp_futex;
> > > > > > -
> > > > > > -/*
> > > > > > - * Global grace period counter.
> > > > > > - * Contains the current RCU_GP_CTR_PHASE.
> > > > > > - * Also has a RCU_GP_COUNT of 1, to accelerate the reader fast path.
> > > > > > - * Written to only by writer with mutex taken. Read by both writer and readers.
> > > > > > - */
> > > > > > -unsigned long rcu_gp_ctr = RCU_GP_COUNT;
> > > > > > +struct urcu_gp rcu_gp = { .ctr = RCU_GP_COUNT };
> > > > > >
> > > > > >  /*
> > > > > >   * Written to only by each individual reader. Read by both the reader and the
> > > > > > @@ -217,8 +208,8 @@ static void wait_gp(void)
> > > > > >  {
> > > > > >       /* Read reader_gp before read futex */
> > > > > >       smp_mb_master(RCU_MB_GROUP);
> > > > > > -     if (uatomic_read(&rcu_gp_futex) == -1)
> > > > > > -             futex_async(&rcu_gp_futex, FUTEX_WAIT, -1,
> > > > > > +     if (uatomic_read(&rcu_gp.futex) == -1)
> > > > > > +             futex_async(&rcu_gp.futex, FUTEX_WAIT, -1,
> > > > > >                     NULL, NULL, 0);
> > > > > >  }
> > > > > >
> > > > > > @@ -232,12 +223,12 @@ static void wait_for_readers(struct cds_list_head *input_readers,
> > > > > >       /*
> > > > > >        * Wait for each thread URCU_TLS(rcu_reader).ctr to either
> > > > > >        * indicate quiescence (not nested), or observe the current
> > > > > > -      * rcu_gp_ctr value.
> > > > > > +      * rcu_gp.ctr value.
> > > > > >        */
> > > > > >       for (;;) {
> > > > > >               wait_loops++;
> > > > > >               if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> > > > > > -                     uatomic_dec(&rcu_gp_futex);
> > > > > > +                     uatomic_dec(&rcu_gp.futex);
> > > > > >                       /* Write futex before read reader_gp */
> > > > > >                       smp_mb_master(RCU_MB_GROUP);
> > > > > >               }
> > > > > > @@ -270,7 +261,7 @@ static void wait_for_readers(struct cds_list_head *input_readers,
> > > > > >                       if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> > > > > >                               /* Read reader_gp before write futex */
> > > > > >                               smp_mb_master(RCU_MB_GROUP);
> > > > > > -                             uatomic_set(&rcu_gp_futex, 0);
> > > > > > +                             uatomic_set(&rcu_gp.futex, 0);
> > > > > >                       }
> > > > > >                       break;
> > > > > >               } else {
> > > > > > @@ -289,7 +280,7 @@ static void wait_for_readers(struct cds_list_head *input_readers,
> > > > > >                       if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> > > > > >                               /* Read reader_gp before write futex */
> > > > > > lttng-dev at lists.lttng.org
> > > > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > > > >
> > >
> > > --
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > >
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
> 




More information about the lttng-dev mailing list