[lttng-dev] rculfstack bug
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Thu Oct 11 11:27:08 EDT 2012
* Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
> On 10/11/2012 03:50 AM, Paul E. McKenney wrote:
> > On Wed, Oct 10, 2012 at 01:53:04PM -0400, Mathieu Desnoyers wrote:
> >> * Mathieu Desnoyers (mathieu.desnoyers at efficios.com) wrote:
> >>> * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> >>>> On Wed, Oct 10, 2012 at 07:42:15AM -0400, Mathieu Desnoyers wrote:
> >>>>> * Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
> >>>>>> test code:
> >>>>>> ./tests/test_urcu_lfs 100 10 10
> >>>>>>
> >>>>>> bug produce rate > 60%
> >>>>>>
> >>>>>> {{{
> >>>>>> I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or "./tests/test_urcu_lfs 100 100 10"
> >>>>>> But I just test it about 5 times
> >>>>>> }}}
> >>>>>>
> >>>>>> 4cores*1threads: Intel(R) Core(TM) i5 CPU 760
> >>>>>> RCU_MB (no time to test for other rcu type)
> >>>>>> test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
> >>>>>>
> >>>>>> I didn't see any bug when "./tests/test_urcu_mb 10 100 10"
> >>>>>>
> >>>>>> Sorry, I tried, but I failed to find out the root cause currently.
> >>>>>
> >>>>> I think I managed to narrow down the issue:
> >>>>>
> >>>>> 1) the master branch does not reproduce it, but commit
> >>>>> 768fba83676f49eb73fd1d8ad452016a84c5ec2a repdroduces it about 50% of the
> >>>>> time.
> >>>>>
> >>>>> 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
> >>>>> current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
> >>>>> moving to wfcqueue.
> >>>>>
> >>>>> 3) the bug always arise, for me, at the end of the 10 seconds.
> >>>>> However, it might be simply due to the fact that most of the memory
> >>>>> get freed at the end of program execution.
> >>>>>
> >>>>> 4) I've been able to get a backtrace, and it looks like we have some
> >>>>> call_rcu callback-invokation threads still working while
> >>>>> call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free()
> >>>>> is nicely waiting for the next thread to stop, and during that time,
> >>>>> two callback-invokation threads are invoking callbacks (and one of
> >>>>> them triggers the segfault).
> >>>>
> >>>> Do any of the callbacks reference __thread variables from some other
> >>>> thread? If so, those threads must refrain from exiting until after
> >>>> such callbacks complete.
> >>>
> >>> The callback is a simple caa_container_of + free, usual stuff, nothing
> >>> fancy.
> >>
> >> Here is the fix: the bug was in call rcu. It is not required for master,
> >> because we fixed it while moving to wfcqueue.
> >>
> >> We were erroneously writing to the head field of the default
> >> call_rcu_data rather than tail.
> >
> > Ouch!!! I have no idea why that would have passed my testing. :-(
>
> It's one of the reasons that I rewrite wfqueue and introduce delete_all()
> (Mathieu uses splice instead) to replace open code of wfqueue in
> urcu-call-rcu-impl.h.
Yes, you did an excellent call on this one. We had the bug fixed in the
master branch before it was discovered, which is always nice :)
>
> >
> >> I wonder if we should simply do a new release with call_rcu using
> >> wfcqueue and tell people to upgrade, or if we should somehow create a
> >> stable branch with this fix.
> >>
> >> Thoughts ?
> >
> > Under what conditions does this bug appear? It is necessary to not just
> > use call_rcu(), but also to explicitly call call_rcu_data_free(), right?
> >
> > My guess is that a stable branch would be good -- there will be other
> > bugs, after all. :-/
> >
> > Thanx, Paul
> >
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> ---
> >> diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
> >> index 13b24ff..b205229 100644
> >> --- a/urcu-call-rcu-impl.h
> >> +++ b/urcu-call-rcu-impl.h
> >> @@ -647,8 +647,9 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
> >> /* Create default call rcu data if need be */
> >> (void) get_default_call_rcu_data();
> >> cbs_endprev = (struct cds_wfq_node **)
> >> - uatomic_xchg(&default_call_rcu_data, cbs_tail);
> >> - *cbs_endprev = cbs;
> >> + uatomic_xchg(&default_call_rcu_data->cbs.tail,
> >> + cbs_tail);
> >> + _CMM_STORE_SHARED(*cbs_endprev, cbs);
> >> uatomic_add(&default_call_rcu_data->qlen,
> >> uatomic_read(&crdp->qlen));
> >> wake_call_rcu_thread(default_call_rcu_data);
> >>
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Mathieu
> >>>
> >>>>
> >>>> Thanx, Paul
> >>>>
> >>>>> So I expect that commit
> >>>>>
> >>>>> commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
> >>>>> Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> >>>>> Date: Tue Sep 25 10:50:49 2012 -0500
> >>>>>
> >>>>> call_rcu: use wfcqueue, eliminate false-sharing
> >>>>>
> >>>>> Eliminate false-sharing between call_rcu (enqueuer) and worker threads
> >>>>> on the queue head and tail.
> >>>>>
>
> I think the changelog of this commit is too short.
Yes, unfortunately, it's already in the master branch, and rewriting
history is something mere mortals are not allowed to do. ;-)
There are a couple of patches that I have pending that did not receive
any negative feedback at this point. I will move things forwards
cautiously to ensure we have a consistent master branch, and create
stable branches for 0.6 and 0.7. Review of these commits will be welcome
before we do a 0.8 release (with the new wfcqueue). We'll be able to
proceed to further changes as separate commits.
Thanks!
Mathieu
>
> Thanks,
> Lai
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
More information about the lttng-dev
mailing list