[lttng-dev] [PATCH] urcu-mb/signal/membarrier: batch concurrent synchronize_rcu()

Lai Jiangshan laijs at cn.fujitsu.com
Sun Nov 25 22:47:06 EST 2012


I'm also interested in the results of:
./test_urcu 12 12 20
./test_urcu 16 8 20
./test_urcu 20 4 20
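(Argument order, as inferred from the SUMMARY lines quoted below, appears
to be: ./test_urcu <nr_readers> <nr_writers> <testdur>, so these runs sweep
from a balanced reader/writer mix toward progressively fewer updaters.)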


On 11/26/2012 11:22 AM, Mathieu Desnoyers wrote:
> Here are benchmarks on batching of synchronize_rcu(), and it leads to
> very interesting scalability improvement and speedups, e.g., on a
> 24-core AMD, with a write-heavy scenario (4 readers threads, 20 updater
> threads, each updater using synchronize_rcu()):
> 
> * Serialized grace periods:
> ./test_urcu 4 20 20
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers  20 wdelay      0
> nr_reads    714598368 nr_writes      5032889 nr_ops    719631257
> 
> * Batched grace periods:
> 
> ./test_urcu 4 20 20
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers  20 wdelay      0
> nr_reads    611848168 nr_writes      9877965 nr_ops    621726133
> 
> That is a 9877965/5032889 = 1.96x speedup in update throughput for 20
> updaters.
> 
> Of course, we can see that readers have slowed down. Since there is no
> change whatsoever to the read-side code, this is probably due to the
> increased update traffic.
> 
> Now let's look at the penalty of managing the waiter stack in the
> single-updater case. With 4 readers and a single updater:
> 
> * Serialized grace periods :
> 
> ./test_urcu 4 1 20
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers   1 wdelay      0
> nr_reads    241959144 nr_writes     11146189 nr_ops    253105333
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers   1 wdelay      0
> nr_reads    257131080 nr_writes     12310537 nr_ops    269441617
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers   1 wdelay      0
> nr_reads    259973359 nr_writes     12203025 nr_ops    272176384
> 
> * Batched grace periods :
> 
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers   1 wdelay      0
> nr_reads    298926555 nr_writes     14018748 nr_ops    312945303
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers   1 wdelay      0
> nr_reads    272411290 nr_writes     12832166 nr_ops    285243456
> SUMMARY ./test_urcu               testdur   20 nr_readers   4
> rdur       0 wdur      0 nr_writers   1 wdelay      0
> nr_reads    267511858 nr_writes     12822026 nr_ops    280333884
> 
> Serialized vs. batched results look similar, with batched possibly even
> slightly faster, but this is probably caused by NUMA affinity.
> 
> CC: Paul E. McKenney <paulmck at linux.vnet.ibm.com>
> CC: Lai Jiangshan <laijs at cn.fujitsu.com>
> CC: Alan Stern <stern at rowland.harvard.edu>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> ---
> diff --git a/urcu.c b/urcu.c
> index e6ff0f3..836bad9 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -43,6 +43,7 @@
>  #include "urcu/tls-compat.h"
>  
>  #include "urcu-die.h"
> +#include "urcu-wait.h"
>  
>  /* Do not #define _LGPL_SOURCE to ensure we can emit the wrapper symbols */
>  #undef _LGPL_SOURCE
> @@ -106,6 +107,12 @@ DEFINE_URCU_TLS(unsigned int, rcu_rand_yield);
>  
>  static CDS_LIST_HEAD(registry);
>  
> +/*
> + * Queue keeping threads awaiting a grace period. Contains
> + * struct gp_waiters_thread objects.
> + */
> +static DEFINE_URCU_WAIT_QUEUE(gp_waiters);
> +
>  static void mutex_lock(pthread_mutex_t *mutex)
>  {
>  	int ret;
> @@ -306,9 +313,31 @@ void synchronize_rcu(void)
>  {
>  	CDS_LIST_HEAD(cur_snap_readers);
>  	CDS_LIST_HEAD(qsreaders);
> +	DEFINE_URCU_WAIT_NODE(wait, URCU_WAIT_WAITING);
> +	struct urcu_waiters waiters;
> +
> +	/*
> +	 * Add ourself to the gp_waiters queue of threads awaiting a
> +	 * grace period. Perform the grace period only if we are the
> +	 * first thread added to the queue.
> +	 */
> +	if (urcu_wait_add(&gp_waiters, &wait) != 0) {
> +		/* Not first in queue: will be awakened by another thread. */
> +		urcu_adaptative_busy_wait(&wait);
> +		/* Order following memory accesses after grace period. */
> +		cmm_smp_mb();
> +		return;
> +	}
> +	/* We won't need to wake ourself up */
> +	urcu_wait_set_state(&wait, URCU_WAIT_RUNNING);
>  
>  	mutex_lock(&rcu_gp_lock);
>  
> +	/*
> +	 * Move all waiters into our local queue.
> +	 */
> +	urcu_move_waiters(&waiters, &gp_waiters);
> +
>  	if (cds_list_empty(&registry))
>  		goto out;
>  
> @@ -374,6 +403,13 @@ void synchronize_rcu(void)
>  	smp_mb_master(RCU_MB_GROUP);
>  out:
>  	mutex_unlock(&rcu_gp_lock);
> +
> +	/*
> +	 * Wake up waiters only after we have completed the grace period
> +	 * and have ensured the memory barriers at the end of the grace
> +	 * period have been issued.
> +	 */
> +	urcu_wake_all_waiters(&waiters);
>  }
>  
>  /*
> 
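
For anyone who has not read urcu-wait.h, here is a minimal, self-contained
sketch of the batching pattern the patch relies on. This is an illustration
using C11 atomics, not the actual urcu implementation; the names (gp_waiter,
gp_waiter_add, do_one_grace_period, synchronize_rcu_batched) are invented
for the example. The idea: the first thread to enqueue itself becomes the
grace-period leader, detaches the whole queue in one atomic exchange, runs
a single grace period, then wakes every queued thread.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One node per thread calling synchronize_rcu_batched(). */
struct gp_waiter {
	struct gp_waiter *next;
	atomic_bool gp_done;		/* set by the batch leader */
};

/* Shared LIFO of threads waiting for a grace period. */
static _Atomic(struct gp_waiter *) gp_waiters_head;

/* Push ourself; return true if the queue was empty (we are the leader). */
static bool gp_waiter_add(struct gp_waiter *w)
{
	struct gp_waiter *old = atomic_load(&gp_waiters_head);

	do {
		w->next = old;
	} while (!atomic_compare_exchange_weak(&gp_waiters_head, &old, w));
	return old == NULL;
}

/* Placeholder for the real grace-period machinery in urcu.c. */
void do_one_grace_period(void);

void synchronize_rcu_batched(void)
{
	struct gp_waiter self = { .next = NULL };
	struct gp_waiter *batch, *p;

	atomic_init(&self.gp_done, false);

	if (!gp_waiter_add(&self)) {
		/*
		 * Not first in the queue: a leader runs the grace
		 * period on our behalf. The real code replaces this
		 * spin with an adaptative busy-wait backed by futex.
		 */
		while (!atomic_load_explicit(&self.gp_done,
					     memory_order_acquire))
			;
		return;
	}

	/*
	 * Leader: atomically take ownership of every waiter queued so
	 * far (including ourself). Threads arriving after this point
	 * start a new batch with a new leader.
	 */
	batch = atomic_exchange(&gp_waiters_head, (struct gp_waiter *)NULL);

	do_one_grace_period();	/* one grace period covers the batch */

	/* Release the batch; the release store orders the GP first. */
	for (p = batch; p != NULL; p = p->next)
		atomic_store_explicit(&p->gp_done, true,
				      memory_order_release);
}

The key property is that the single atomic exchange empties the queue, so a
burst of N concurrent synchronize_rcu() calls is served by one grace period,
while an uncontended caller pays only one push, one exchange, and one
(self-directed) wakeup, which matches the near-identical single-updater
numbers above.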