[ltt-dev] [PATCH] urcu-qsbr: avoid useless futex wakeups and burning CPU for long grace periods

Fri Aug 12 14:08:19 EDT 2011

* Paolo Bonzini (pbonzini at redhat.com) wrote:
> On 08/09/2011 11:12 PM, Mathieu Desnoyers wrote:
>> The reader will see the index->waiting set to 1, reset it to 0, and
>> observe a gp_futex that is not -1 (yet, due to reordering of memory
>> accesses), thus return immediately.
>
> You're right.  I set gp_futex before index->waiting exactly for this  
> reason, but I was confused by the uatomic name in uatomic_set and didn't  
> add the memory barrier.
>
>> Please note that I am extremely careful when doing changes to the wakeup
>> code, and I went as far as to create a CPU-level promela model to make
>> sure the wakeups would never be missed in any allowed memory
>> access/instruction execution orders.
>>
>> So adding complexity to this code path for the sake of accelerating the
>> speed of synchronize_rcu(), which is already considered as a slow path,
>> does not appear as an immediate win.
>
> It is not accelerating synchronize_rcu().  It does two things:
>
> 1) by using futexes, it avoids burning CPU when a grace period is long.  
> It is actually effective even if the grace period is _not_ so long: 100 
> walks through the thread list take less than a millisecond, and you do 
> not want to call rcu_quiescent_state() that often;
>
> 2) once you're always using futexes, if you have frequent quiescent  
> states in one thread and more rare quiescent states in another, the  
> former thread will uselessly call FUTEX_WAKE on each quiescent state,  
> even though it is already in the next grace period.

OK, so this might benefit to URCU implementations other than qsbr too,
right ?

>
>> But I might be open to a simpler fix that just changes all the
>>
>> -                     if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
>>
>> to
>>
>> +                     if (wait_loops>= RCU_QS_ACTIVE_ATTEMPTS) {
>>
>> Because it does not alter the gp_futex and memory barriers semantics.
>> You could do that change to both urcu-qsbr.c and urcu.c. I would be
>> interested if this alone helps performance ?
>
> That doesn't work.  The first read-side thread that calls FUTEX_WAKE  
> will set gp_futex to zero, and you will never get a wakeup from a second  
> read-side thread.

I think I did not convey my idea fully:

this would take care of re-decrementing the gp_futex value for the first
wait attempt and the following ones:

                if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
                        uatomic_dec(&gp_futex);
                        /* Write futex before read reader_gp */
                        cmm_smp_mb();
                }
[...]

and this would be waiting for a wakeup:

                       if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
                                wait_gp();
                        } else {

But I agree that this does not handle readers with quite different
period length very well, because short-lived periods will trigger a lot
of useless futex wakeups.

>
>> If we choose to also alter the gp_futex behavior, we'll definitely have
>> to update the Promela model and make sure than can never lead to missed
>> wakeups.
>
> Makes sense.  Where are the models?

userspace rcu tree, formal-model branch

in futex-wakeup/

you'll notice that the model has error-injection checks, to find out if
it actually detects missed wakeups.

>  I attach a patch with the correct  
> memory barrier for completeness.
>
> Paolo

> From f3afba5236999ca0cfd4a71829ccdc4140676675 Mon Sep 17 00:00:00 2001
> From: Paolo Bonzini <pbonzini at redhat.com>
> Date: Tue, 9 Aug 2011 14:14:12 +0200
> Subject: [PATCH] urcu-qsbr: avoid useless futex wakeups and burning CPU for
>  long grace periods
> 
> I noticed that urcu makes exactly _one_ attempt at using futexes
> to avoid busy looping on synchronize_rcu.  The attached patch instead
> switches from busy waiting to futexes after RCU_QS_ACTIVE_ATTEMPTS.
> To limit the amount of system calls, reading threads remember whether
> they already had a quiescent state in this grace period; if so they were
> already removed from the list, and can avoid signaling the futex.
> 
> Performance measured with rcutorture (nreaders: 10, nupdaters: 1,
> duration: 10, median of nine runs):
> 
>      RCU_QS_ACTIVE_ATTEMPTS == 100, no patch         n_updates = 292
>      RCU_QS_ACTIVE_ATTEMPTS == 1, no patch           n_updates = 290
>      RCU_QS_ACTIVE_ATTEMPTS == 100, with patch       n_updates = 408
>      RCU_QS_ACTIVE_ATTEMPTS == 1, with patch         n_updates = 404
> 
> (the first two cases are obviously the same; the only change is
> when the futex is used, but over many calls there is no difference).
> ---
>  urcu-qsbr.c             |   17 ++++++++++++-----
>  urcu/static/urcu-qsbr.h |    7 ++++++-
>  2 files changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/urcu-qsbr.c b/urcu-qsbr.c
> index a239be0..8d8a9cf 100644
> --- a/urcu-qsbr.c
> +++ b/urcu-qsbr.c
> @@ -150,26 +150,33 @@ static void update_counter_and_wait(void)
>  	 */
>  	for (;;) {
>  		wait_loops++;
> -		if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> -			uatomic_dec(&gp_futex);
> +		if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
> +			uatomic_set(&gp_futex, -1);
> +			/*
> +			 * Write futex before write waiting (the other side
> +			 * reads them in the opposite order).
> +			 */
> +			cmm_smp_wmb();
> +			cds_list_for_each_entry(index, &registry, node) {
> +				_CMM_STORE_SHARED(index->waiting, 1);
> +			}
>  			/* Write futex before read reader_gp */
>  			cmm_smp_mb();
>  		}
> -
>  		cds_list_for_each_entry_safe(index, tmp, &registry, node) {
>  			if (!rcu_gp_ongoing(&index->ctr))
>  				cds_list_move(&index->node, &qsreaders);
>  		}
>  
>  		if (cds_list_empty(&registry)) {
> -			if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> +			if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
>  				/* Read reader_gp before write futex */
>  				cmm_smp_mb();
>  				uatomic_set(&gp_futex, 0);
>  			}
>  			break;
>  		} else {
> -			if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> +			if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
>  				wait_gp();
>  			} else {
>  #ifndef HAS_INCOHERENT_CACHES
> diff --git a/urcu/static/urcu-qsbr.h b/urcu/static/urcu-qsbr.h
> index 5b7adac..da2edf6 100644
> --- a/urcu/static/urcu-qsbr.h
> +++ b/urcu/static/urcu-qsbr.h
> @@ -124,6 +124,7 @@ struct rcu_reader {
>  	unsigned long ctr;
>  	/* Data used for registry */
>  	struct cds_list_head node __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
> +	int waiting;
>  	pthread_t tid;
>  };
>  
> @@ -136,7 +137,11 @@ extern int32_t gp_futex;
>   */
>  static inline void wake_up_gp(void)
>  {
> -	if (unlikely(uatomic_read(&gp_futex) == -1)) {
> +	if (unlikely(_CMM_LOAD_SHARED(rcu_reader.waiting))) {
> +		_CMM_STORE_SHARED(rcu_reader.waiting, 0);

Commenting this memory barrier would be helpful too.

Thanks,

Mathieu

> +		cmm_smp_mb();
> +		if (uatomic_read(&gp_futex) != -1)
> +			return;
>  		uatomic_set(&gp_futex, 0);
>  		futex_noasync(&gp_futex, FUTEX_WAKE, 1,
>  		      NULL, NULL, 0);
> -- 
> 1.7.6
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com