[lttng-dev] rculfstack bug

Lai Jiangshan laijs at cn.fujitsu.com
Wed Oct 10 21:31:01 EDT 2012


On 10/11/2012 03:50 AM, Paul E. McKenney wrote:
> On Wed, Oct 10, 2012 at 01:53:04PM -0400, Mathieu Desnoyers wrote:
>> * Mathieu Desnoyers (mathieu.desnoyers at efficios.com) wrote:
>>> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
>>>> On Wed, Oct 10, 2012 at 07:42:15AM -0400, Mathieu Desnoyers wrote:
>>>>> * Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
>>>>>> test code:
>>>>>> ./tests/test_urcu_lfs 100 10 10
>>>>>>
>>>>>> bug produce rate > 60%
>>>>>>
>>>>>> {{{
>>>>>> I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or "./tests/test_urcu_lfs 100 100 10"
>>>>>> But I just test it about 5 times
>>>>>> }}}
>>>>>>
>>>>>> 4cores*1threads: Intel(R) Core(TM) i5 CPU         760
>>>>>> RCU_MB (no time to test for other rcu type)
>>>>>> test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
>>>>>>
>>>>>> I didn't see any bug when "./tests/test_urcu_mb 10 100 10"
>>>>>>
>>>>>> Sorry, I tried, but I failed to find out the root cause currently.
>>>>>
>>>>> I think I managed to narrow down the issue:
>>>>>
>>>>> 1) the master branch does not reproduce it, but commit
>>>>>    768fba83676f49eb73fd1d8ad452016a84c5ec2a repdroduces it about 50% of the
>>>>>    time.
>>>>>
>>>>> 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
>>>>>    current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
>>>>>    moving to wfcqueue.
>>>>>
>>>>> 3) the bug always arise, for me, at the end of the 10 seconds.
>>>>>    However, it might be simply due to the fact that most of the memory
>>>>>    get freed at the end of program execution.
>>>>>
>>>>> 4) I've been able to get a backtrace, and it looks like we have some
>>>>>    call_rcu callback-invokation threads still working while
>>>>>    call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free()
>>>>>    is nicely waiting for the next thread to stop, and during that time,
>>>>>    two callback-invokation threads are invoking callbacks (and one of
>>>>>    them triggers the segfault).
>>>>
>>>> Do any of the callbacks reference __thread variables from some other
>>>> thread?  If so, those threads must refrain from exiting until after
>>>> such callbacks complete.
>>>
>>> The callback is a simple caa_container_of + free, usual stuff, nothing
>>> fancy.
>>
>> Here is the fix: the bug was in call rcu. It is not required for master,
>> because we fixed it while moving to wfcqueue.
>>
>> We were erroneously writing to the head field of the default
>> call_rcu_data rather than tail.
> 
> Ouch!!!  I have no idea why that would have passed my testing.  :-(

It's one of the reasons that I rewrite wfqueue and introduce delete_all()
(Mathieu uses splice instead) to replace open code of wfqueue in urcu-call-rcu-impl.h.

> 
>> I wonder if we should simply do a new release with call_rcu using
>> wfcqueue and tell people to upgrade, or if we should somehow create a
>> stable branch with this fix.
>>
>> Thoughts ?
> 
> Under what conditions does this bug appear?  It is necessary to not just
> use call_rcu(), but also to explicitly call call_rcu_data_free(), right?
> 
> My guess is that a stable branch would be good -- there will be other
> bugs, after all.  :-/
> 
> 							Thanx, Paul
> 
>> Thanks,
>>
>> Mathieu
>>
>> ---
>> diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
>> index 13b24ff..b205229 100644
>> --- a/urcu-call-rcu-impl.h
>> +++ b/urcu-call-rcu-impl.h
>> @@ -647,8 +647,9 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
>>  		/* Create default call rcu data if need be */
>>  		(void) get_default_call_rcu_data();
>>  		cbs_endprev = (struct cds_wfq_node **)
>> -			uatomic_xchg(&default_call_rcu_data, cbs_tail);
>> -		*cbs_endprev = cbs;
>> +			uatomic_xchg(&default_call_rcu_data->cbs.tail,
>> +					cbs_tail);
>> +		_CMM_STORE_SHARED(*cbs_endprev, cbs);
>>  		uatomic_add(&default_call_rcu_data->qlen,
>>  			    uatomic_read(&crdp->qlen));
>>  		wake_call_rcu_thread(default_call_rcu_data);
>>
>>
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>>
>>>> 							Thanx, Paul
>>>>
>>>>> So I expect that commit 
>>>>>
>>>>> commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
>>>>> Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
>>>>> Date:   Tue Sep 25 10:50:49 2012 -0500
>>>>>
>>>>>     call_rcu: use wfcqueue, eliminate false-sharing
>>>>>     
>>>>>     Eliminate false-sharing between call_rcu (enqueuer) and worker threads
>>>>>     on the queue head and tail.
>>>>>     

I think the changelog of this commit is too short.

Thanks,
Lai



More information about the lttng-dev mailing list