[ltt-dev] [rp] [URCU RFC patch 3/3] call_rcu: remove delay for wakeup scheme

Paul E. McKenney paulmck at linux.vnet.ibm.com
Mon Jun 6 17:39:27 EDT 2011


On Mon, Jun 06, 2011 at 02:29:01PM -0700, Phil Howard wrote:
> On Mon, Jun 6, 2011 at 12:41 PM, Paul E. McKenney
> <paulmck at linux.vnet.ibm.com> wrote:
> > On Mon, Jun 06, 2011 at 03:21:07PM -0400, Mathieu Desnoyers wrote:
> >> * Mathieu Desnoyers (mathieu.desnoyers at efficios.com) wrote:
> >> > I notice that the "poll(NULL, 0, 10);" delay is executed both for the RT
> >> > and non-RT code.  So given that my goal is to get the call_rcu thread to
> >> > GC memory as quickly as possible to diminish the overhead of cache
> >> > misses, I decided to try removing this delay for !RT: the call_rcu
> >> > thread then wakes up ASAP when the thread invoking call_rcu wakes it. My
> >> > updates jump to 76349/s (getting there!) ;).
> >> >
> >> > This improvement can be explained by the lower delay between call_rcu and
> >> > the execution of its callback, which decreases the amount of cache used
> >> > and therefore provides better cache locality.
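> >> >
> >> > In control-flow terms, the hunk below changes the !RT branch of the
> >> > dispatch loop from (writing "queue is empty" for the cbs.head/cbs.tail
> >> > test):
> >> >
> >> >	if (queue is empty)
> >> >		call_rcu_wait(crdp);	/* sleep until call_rcu() wakes us */
> >> >	poll(NULL, 0, 10);		/* ...then always sleep another 10ms */
> >> >
> >> > to:
> >> >
> >> >	if (queue is empty)
> >> >		call_rcu_wait(crdp);	/* woken as soon as a callback is queued */
> >> >	else
> >> >		poll(NULL, 0, 10);	/* only batch when work is already pending */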
> >>
> >> I just wonder if it's worth it: removing this delay from the !RT
> >> call_rcu thread can cause a high rate of synchronize_rcu() calls. So
> >> although there might be an advantage in terms of update rate, it will
> >> likely cause extra cache-line bounces between the call_rcu threads and
> >> the reader threads.
> >>
> >> test_urcu_rbtree 7 1 20 -g 1000000
> >>
> >> With the delay in the call_rcu thread:
> >> search:  1842857 items/reader thread/s (7 reader threads)
> >> updates:   21066 items/s (1 update thread)
> >> ratio: 87 search/update
> >>
> >> Without the delay in the call_rcu thread:
> >> search:  3064285 items/reader thread/s (7 reader threads)
> >> updates:   45096 items/s (1 update thread)
> >> ratio: 68 search/update
> >>
> >> So basically, adding the delay doubles the update performance, at the
> >> cost of being 33% slower for reads. My first thought is that if an
> >> application has very frequent updates, then maybe it wants to have fast
> >> updates because the update throughput is then important. If the
> >> application has infrequent updates, then the reads will be fast anyway,
> >> because rare call_rcu invocations will trigger fewer cache-line bounces
> >> between readers and writers. Any other thoughts on this trade-off and
> >> how to deal with it?
> >
> > One approach would be to let the user handle it using real-time
> > priority adjustment.  Another approach would be to let the user
> > specify the wait time in milliseconds, and skip the poll() system
> > call if the specified wait time is zero.
> >
> > The latter seems more sane to me.  It also allows the user to
> > specify (say) 10000 milliseconds for cases where there is a
> > lot of memory and where amortizing synchronize_rcu() overhead
> > across a large number of updates is important.
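> >
> > A rough sketch of that second option (the variable and helper names
> > below are illustrative only, not the actual userspace-rcu API) might
> > look like:
> >
> >	#include <poll.h>
> >
> >	/* Hypothetical per-call_rcu-thread knob: batching delay in
> >	 * milliseconds; 0 means "dispatch immediately, never poll()". */
> >	static int callback_wait_msec = 10;
> >
> >	static void callback_batch_delay(void)
> >	{
> >		if (callback_wait_msec > 0)
> >			(void) poll(NULL, 0, callback_wait_msec);
> >		/* wait time of zero: skip the system call entirely */
> >	}
> >
> > The same knob then covers the large-batch case: setting it to (say)
> > 10000 amortizes synchronize_rcu() over many more callbacks.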
> >
> > Other thoughts?
> >
> >                                                Thanx, Paul
> 
> If synchronize_rcu is used to time memory reclamation, then trading
> memory for overhead is a valid way to think of this timing. But if
> synchronize_rcu is required inside an update for other purposes (e.g.
> my RBTree algorithm or Josh's hash table resize), then the trade-off
> needs to include synchronize_rcu overhead vs. update throughput.

But this patch does not affect synchronize_rcu(), just call_rcu().
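
To make the distinction concrete, a minimal sketch of the two reclamation
styles (header names and the element type are illustrative, not code from
the tree):

	#include <stdlib.h>
	#include <urcu.h>		/* synchronize_rcu(), caa_container_of() */
	#include <urcu-call-rcu.h>	/* call_rcu(), struct rcu_head */

	struct foo {
		int value;
		struct rcu_head rcu_head;
	};

	static void free_foo(struct rcu_head *head)
	{
		free(caa_container_of(head, struct foo, rcu_head));
	}

	/* Asynchronous: returns immediately; the call_rcu thread invokes
	 * free_foo() after a grace period.  This patch only affects how
	 * promptly that thread gets around to it. */
	static void remove_async(struct foo *old)
	{
		call_rcu(&old->rcu_head, free_foo);
	}

	/* Synchronous: the updater itself waits out a grace period before
	 * freeing, so the cost lands directly on update latency. */
	static void remove_sync(struct foo *old)
	{
		synchronize_rcu();
		free(old);
	}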

That said, your point on raw results vs. ratio in your other email
is well-taken.

						Thanx, Paul

> -phil
> 
> >
> >> Thanks,
> >>
> >> Mathieu
> >>
> >>
> >> >
> >> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> >> > ---
> >> >  urcu-call-rcu-impl.h |    3 ++-
> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > Index: userspace-rcu/urcu-call-rcu-impl.h
> >> > ===================================================================
> >> > --- userspace-rcu.orig/urcu-call-rcu-impl.h
> >> > +++ userspace-rcu/urcu-call-rcu-impl.h
> >> > @@ -242,7 +242,8 @@ static void *call_rcu_thread(void *arg)
> >> >             else {
> >> >                     if (&crdp->cbs.head == _CMM_LOAD_SHARED(crdp->cbs.tail))
> >> >                             call_rcu_wait(crdp);
> >> > -                   poll(NULL, 0, 10);
> >> > +                   else
> >> > +                           poll(NULL, 0, 10);
> >> >             }
> >> >     }
> >> >     call_rcu_lock(&crdp->mtx);
> >> >
> >>
> >> --
> >> Mathieu Desnoyers
> >> Operating System Efficiency R&D Consultant
> >> EfficiOS Inc.
> >> http://www.efficios.com
> >



