[lttng-dev] userspace rcu flavor improvements
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Mon Nov 19 12:05:19 EST 2012
* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Mon, Nov 19, 2012 at 03:52:18PM +0800, Lai Jiangshan wrote:
> > On 11/18/2012 12:16 AM, Mathieu Desnoyers wrote:
> > > Here are a couple of improvements for all userspace RCU flavors. Many
> > > thanks to Alan Stern for his suggestions.
> >
> > It makes urcu like SRCU. (sync_rcu = check zero + flip + check zero)
> > If I have time, I may port more SRCU code to urcu.
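(For readers following the thread: the "check zero + flip + check zero"
structure Lai mentions looks roughly like the sketch below. This is an
illustration only, not the SRCU nor the urcu code; the names are made up,
and memory ordering and the subtle races a real implementation must handle
are elided.)

#include <stdatomic.h>

static atomic_int readers[2];	/* active readers per index */
static atomic_int active_idx;	/* index new readers pick up (0 or 1) */

int read_lock(void)		/* returns the index to pass to read_unlock() */
{
	int idx = atomic_load(&active_idx);

	atomic_fetch_add(&readers[idx], 1);
	return idx;
}

void read_unlock(int idx)
{
	atomic_fetch_sub(&readers[idx], 1);
}

static void wait_for_zero(int idx)
{
	/* A real implementation sleeps or uses a futex rather than spinning. */
	while (atomic_load(&readers[idx]) != 0)
		;
}

void synchronize(void)
{
	int idx = atomic_load(&active_idx);

	wait_for_zero(idx ^ 1);			/* check zero (inactive index) */
	atomic_store(&active_idx, idx ^ 1);	/* flip the active index */
	wait_for_zero(idx);			/* check zero (old index) */
}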
>
> I am sure that this is obvious to everyone, but I cannot help restating
> it. There is one important difference between user code and kernel code,
> though. In the kernel, we track by CPU, so one of SRCU's big jobs is
> to track multiple tasks using the same CPU. This opens the possibility
> of preemption, which is one of the things that complicates SRCU's design.
>
> In contrast, user-mode RCU tracks tasks without multiplexing. This
> allows simplifications that are similar to those that could be achieved
> in the kernel if we were willing to disable preemption across the entire
> SRCU read-side critical section.
>
> So although I am all for user-mode RCU taking advantage of any technology
> we have at hand, we do need to be careful to avoid needless complexity.
Very good point! Indeed, when considering modifications to URCU, I will
be weighing all of these elements:
- Added complexity (verification cost),
+ Speedup,
+ Lower latency,
+ Better scalability,
+ Lower power consumption.
So yes, I'm all for improving URCU synchronisation, but I might be
reluctant to pull modifications that increase complexity significantly
without very substantial benefits.
>
> > > Patch 8/8 is only done for qsbr so far, and proposed as RFC. I'd like to
> > > try and benchmark other approaches to concurrent grace periods too.
>
> The concurrent grace periods are the big win, in my opinion. ;-)
I've done some basic benchmarking on the approach taken by patch 8/8,
and it leads to a very interesting scalability improvement and speedup,
e.g., on a 24-core AMD machine, with a write-heavy scenario (4 reader
threads, 20 updater threads, each updater using synchronize_rcu()):
* Serialized grace periods:
./test_urcu_qsbr 4 20 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads 20251412728 nr_writes 1826331 nr_ops 20253239059
* Batched grace periods:
./test_urcu_qsbr 4 20 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads 15141994746 nr_writes 9382515 nr_ops 15151377261
That is a 9382515/1826331 = 5.13x speedup in grace-period (write) throughput.
Of course, we can see that readers have slowed down (from about 20.3G to
15.1G reads over the test, roughly a 25% drop), probably due to the
increased update traffic, given there is no change whatsoever to the
read-side code.
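(For reference, the kind of updater/reader pattern this benchmark
exercises looks roughly like the code below. The data structure and its
field are made up for illustration and error handling is omitted; only
the urcu-qsbr API calls are the real thing.)

#define _LGPL_SOURCE
#include <urcu-qsbr.h>
#include <stdlib.h>

struct blob { int payload; };		/* made-up RCU-protected data */
static struct blob *shared;

/* Updater: publish a new version, wait for a grace period, reclaim. */
void update(int value)
{
	struct blob *new_blob = malloc(sizeof(*new_blob));
	struct blob *old_blob;

	new_blob->payload = value;
	old_blob = rcu_xchg_pointer(&shared, new_blob);
	synchronize_rcu();		/* wait for pre-existing readers */
	free(old_blob);
}

/* Reader thread: registers itself and periodically announces a
 * quiescent state so that grace periods can complete. */
void *reader(void *arg)
{
	rcu_register_thread();
	for (;;) {
		struct blob *p;

		rcu_read_lock();	/* no-op in the QSBR flavor */
		p = rcu_dereference(shared);
		if (p)
			(void) p->payload;
		rcu_read_unlock();
		rcu_quiescent_state();	/* announce quiescence */
	}
	rcu_unregister_thread();	/* not reached in this sketch */
	return NULL;
}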
Now let's look at the penalty of managing the waiter stack in the
single-updater case. With 4 readers and a single updater:
* Serialized grace periods:
./test_urcu_qsbr 4 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 19240784755 nr_writes 2130839 nr_ops 19242915594
* Batched grace periods:
./test_urcu_qsbr 4 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 19160162768 nr_writes 2253068 nr_ops 19162415836
2253068 vs 2137036 writes -> a couple of runs show that this difference
is lost in the noise for a single updater.
So given that implementing a truly "concurrent" approach to grace
periods would take a while and would add a lot of complexity, I am
tempted to merge the batching approach: it does not add complexity to
the synchronization algorithm itself, and it already shows an
interesting speedup. Moreover, we can easily remove batching if it
turns out not to be needed in the future.
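Roughly, the batching idea looks like the sketch below (illustration
only, not the patch 8/8 code; the mutex/condvar structure and all names
are simplifications): concurrent synchronize_rcu() callers push
themselves onto a waiter stack, the first one becomes the leader,
detaches the whole batch, runs a single grace period on behalf of every
batched waiter, and then wakes them all.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct gp_waiter {
	struct gp_waiter *next;
	pthread_cond_t cond;
	bool done;
};

static pthread_mutex_t stack_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t leader_lock = PTHREAD_MUTEX_INITIALIZER;
static struct gp_waiter *waiter_stack;	/* pending updaters */

/* Hypothetical hook: waits until all pre-existing readers are done. */
extern void run_one_grace_period(void);

void synchronize_rcu_batched(void)
{
	struct gp_waiter self = { .next = NULL, .done = false };
	struct gp_waiter *batch, *w;
	bool leader;

	pthread_cond_init(&self.cond, NULL);

	pthread_mutex_lock(&stack_lock);
	leader = (waiter_stack == NULL);	/* first waiter leads */
	self.next = waiter_stack;
	waiter_stack = &self;
	if (!leader) {
		/* A leader will run the grace period for our batch. */
		while (!self.done)
			pthread_cond_wait(&self.cond, &stack_lock);
		pthread_mutex_unlock(&stack_lock);
		pthread_cond_destroy(&self.cond);
		return;
	}
	pthread_mutex_unlock(&stack_lock);

	/* Serialize leaders: one grace period at a time. */
	pthread_mutex_lock(&leader_lock);

	/*
	 * Detach the batch before starting the grace period: only
	 * updaters enqueued before it starts can be satisfied by it.
	 */
	pthread_mutex_lock(&stack_lock);
	batch = waiter_stack;
	waiter_stack = NULL;
	pthread_mutex_unlock(&stack_lock);

	run_one_grace_period();

	/* Wake every batched waiter (including ourself, trivially). */
	pthread_mutex_lock(&stack_lock);
	for (w = batch; w; w = w->next) {
		w->done = true;
		pthread_cond_signal(&w->cond);
	}
	pthread_mutex_unlock(&stack_lock);

	pthread_mutex_unlock(&leader_lock);
	pthread_cond_destroy(&self.cond);
}

With this structure, the 20 updaters issuing back-to-back
synchronize_rcu() calls share grace periods instead of serializing
them, which is where the write throughput gain above comes from.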
Thoughts?
Thanks,
Mathieu
>
> Thanx, Paul
>
> > > Feedback is welcome,
> > >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > >
> >
>
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com