[lttng-dev] Deadlock in call_rcu_thread when destroying an rculfhash node with a nested rculfhash

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Thu Jun 1 13:01:21 UTC 2017


----- On Oct 21, 2016, at 4:19 AM, Evgeniy Ivanov <i at eivanov.com> wrote: 

> On Wed, Oct 19, 2016 at 6:03 PM, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:

>> This is because we use call_rcu internally to trigger the hash table
>> resize.

>> In cds_lfht_destroy, we start by waiting for "in-flight" resize operations to
>> complete. Unfortunately, this requires that the call_rcu worker thread
>> progresses. If cds_lfht_destroy is called from the call_rcu worker thread, it
>> will wait forever.
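
To make this concrete, here is a minimal sketch of the problematic pattern. The
struct and function names below are made up for illustration (the report further
down only names free_Node), and it assumes the default single call_rcu worker
thread and a nested table that has already been emptied:

#include <urcu-qsbr.h>          /* QSBR flavour, as in the report below */
#include <urcu/rculfhash.h>
#include <urcu/compiler.h>      /* caa_container_of() */
#include <stdlib.h>

struct entry {
        struct cds_lfht *nested_ht;     /* nested hash table owned by this entry */
        struct cds_lfht_node ht_node;   /* linkage into the top-level table */
        struct rcu_head rcu;            /* for call_rcu() deferred reclaim */
};

/*
 * Runs in the call_rcu worker thread.  cds_lfht_destroy() first waits for
 * in-flight resizes of nested_ht to complete, but those resizes are queued
 * as call_rcu work on this very worker, which is busy running free_entry():
 * the worker ends up waiting on itself forever.
 */
static void free_entry(struct rcu_head *head)
{
        struct entry *e = caa_container_of(head, struct entry, rcu);

        (void) cds_lfht_destroy(e->nested_ht, NULL);    /* deadlock-prone */
        free(e);
}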

>> One alternative would be to implement our own worker thread scheme
>> for the rcu HT resize rather than use the call_rcu worker thread. This
>> would simplify cds_lfht_destroy requirements a lot.

>> Ideally I'd like to re-use all the call_rcu work dispatch/worker handling
>> scheme, just as a separate work queue.

>> Thoughts ?

> Thank you for explaining. Sounds like a plan: in our prod there is no issue with
> having an extra thread for table resizes. And nested tables are an important feature.
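
Until a library-side change is in place, one possible application-level
workaround is to hand the nested table off to a dedicated destroyer thread, so
that cds_lfht_destroy() is never invoked from the call_rcu worker. A rough
sketch only: the single-slot hand-off and the names destroyer/defer_destroy are
invented for illustration, and the tables handed off are assumed to have been
emptied already.

#include <pthread.h>
#include <urcu/rculfhash.h>

/* Single-slot hand-off, for brevity; a real application would queue onto a
 * list and also handle shutdown of the destroyer thread. */
static pthread_mutex_t destroy_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t destroy_cond = PTHREAD_COND_INITIALIZER;
static struct cds_lfht *pending_destroy;

/*
 * Dedicated destroyer thread: it may block in cds_lfht_destroy(), because it
 * is neither the call_rcu worker nor inside an RCU read-side critical section.
 */
static void *destroyer(void *arg)
{
        (void) arg;
        for (;;) {
                struct cds_lfht *ht;

                pthread_mutex_lock(&destroy_lock);
                while (!pending_destroy)
                        pthread_cond_wait(&destroy_cond, &destroy_lock);
                ht = pending_destroy;
                pending_destroy = NULL;
                pthread_cond_broadcast(&destroy_cond);  /* slot is free again */
                pthread_mutex_unlock(&destroy_lock);

                (void) cds_lfht_destroy(ht, NULL);
        }
        return NULL;
}

/* Called from the call_rcu callback instead of cds_lfht_destroy(). */
static void defer_destroy(struct cds_lfht *ht)
{
        pthread_mutex_lock(&destroy_lock);
        while (pending_destroy)                 /* wait for the slot to drain */
                pthread_cond_wait(&destroy_cond, &destroy_lock);
        pending_destroy = ht;
        pthread_cond_broadcast(&destroy_cond);
        pthread_mutex_unlock(&destroy_lock);
}

The destroyer thread would be started once with pthread_create() during
application initialization.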

I finally managed to find some time to implement a solution, feedback 
would be welcome! 

Here are the RFC patches: 

https://lists.lttng.org/pipermail/lttng-dev/2017-May/027183.html 
https://lists.lttng.org/pipermail/lttng-dev/2017-May/027184.html 

Thanks, 

Mathieu 

>> Thanks,

>> Mathieu

>> ----- On Oct 19, 2016, at 6:03 AM, Evgeniy Ivanov <i at eivanov.com> wrote:

>>> Sorry, I found a partial answer in the docs, which state that cds_lfht_destroy should
>>> not be called from a call_rcu thread context. Why does this limitation exist?

>>> On Wed, Oct 19, 2016 at 12:56 PM, Evgeniy Ivanov <i at eivanov.com> wrote:

>>>> Hi,

>>>> Each node of the top-level rculfhash has a nested rculfhash. Some thread clears the
>>>> top-level map and then uses rcu_barrier() to wait until everything is destroyed
>>>> (this is done to check for leaks). Recently it started to deadlock sometimes, with
>>>> the following stacks:

>>>> Thread1:

>>>> __poll
>>>> cds_lfht_destroy <---- nested map
>>>> ...
>>>> free_Node(rcu_head*) <----- node of top level map
>>>> call_rcu_thread

>>>> Thread2:

>>>> syscall
>>>> rcu_barrier_qsbr
>>>> destroy_all
>>>> main

>>>> Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
>>>> internal deadlock because of the nested maps?
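
For reference, the two stacks roughly map onto the sketch below. struct Node,
destroy_all() and the field names are guesses at the shape of the code (only
free_Node appears in the stacks), and destroy_all() is assumed to run in a
registered, online QSBR reader thread:

#include <urcu-qsbr.h>          /* matches rcu_barrier_qsbr in the Thread2 stack */
#include <urcu/rculfhash.h>
#include <urcu/compiler.h>
#include <stdlib.h>

struct Node {
        struct cds_lfht *nested;        /* per-node nested hash table */
        struct cds_lfht_node top_node;  /* linkage in the top-level table */
        struct rcu_head rcu;
};

/* Thread1: runs in the call_rcu worker; cds_lfht_destroy() polls for an
 * in-flight resize that only this same worker could execute, hence the
 * __poll frame in the stack. */
static void free_Node(struct rcu_head *head)
{
        struct Node *n = caa_container_of(head, struct Node, rcu);

        (void) cds_lfht_destroy(n->nested, NULL);       /* nested table assumed emptied */
        free(n);
}

/* Thread2: empty the top-level table, then wait for all deferred frees. */
static void destroy_all(struct cds_lfht *top)
{
        struct cds_lfht_iter iter;
        struct Node *n;

        rcu_read_lock();
        cds_lfht_for_each_entry(top, &iter, n, top_node) {
                if (!cds_lfht_del(top, &n->top_node))
                        call_rcu(&n->rcu, free_Node);
        }
        rcu_read_unlock();

        /*
         * rcu_barrier() returns only once every queued callback has run.
         * With free_Node() stuck above, it never returns, which is why
         * Thread2 also appears blocked.
         */
        rcu_barrier();
}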

>>>> --
>>>> Cheers,
>>>> Evgeniy

>>> --
>>> Cheers,
>>> Evgeniy


>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com


> --
> Cheers,
> Evgeniy

> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 

