[lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
Lai Jiangshan
laijs at cn.fujitsu.com
Thu Oct 23 21:45:59 EDT 2014
On 10/23/2014 09:06 PM, Mathieu Desnoyers wrote:
> ----- Original Message -----
>> From: "Lai Jiangshan" <laijs at cn.fujitsu.com>
>> To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
>> Cc: "Ben Maurer" <bmaurer at fb.com>, "lttng-dev" <lttng-dev at lists.lttng.org>, "Yannick Brosseau" <scientist at fb.com>,
>> "Paul E. McKenney" <paulmck at linux.vnet.ibm.com>, "Stephen Hemminger" <stephen at networkplumber.org>
>> Sent: Thursday, October 23, 2014 12:26:58 AM
>> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
>>
>> Hi, Mathieu
>
> Hi Lai,
>
>>
>> In the code, how is this problem solved?
>> "All the tasks may continue stealing work without making any progress."
>
> Good point. It is not. I see two possible solutions for this:
>
> 1) We break the circular chain of siblings at the list head.
> This has the downside that work could accumulate at the first
> worker of the list, or
> 2) Whenever we grab work (either from the global queue or from
> a sibling), we could dequeue one element and keep it to ourselves,
> out of reach of siblings. This would ensure progress, since a
> worker stealing work would always keep at least one work element
> for itself.
>
> I personally prefer (2), since it has fewer downsides.
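>
> Roughly, a sketch of (2) (the per-worker queue fields in struct
> urcu_worker here are illustrative, not the actual implementation):
>
> #include <urcu/wfcqueue.h>
>
> struct urcu_worker {
>         struct cds_wfcq_head head;      /* per-worker work queue */
>         struct cds_wfcq_tail tail;
> };
>
> /*
>  * Always dequeue one element for ourselves before splicing the
>  * rest, so a chain of stealing workers cannot livelock: every
>  * successful grab processes at least one work element.
>  */
> static struct cds_wfcq_node *
> urcu_grab_work(struct urcu_worker *worker,
>                 struct cds_wfcq_head *src_head,
>                 struct cds_wfcq_tail *src_tail)
> {
>         struct cds_wfcq_node *work;
>
>         /* Keep one element to ourselves, out of reach of siblings. */
>         work = cds_wfcq_dequeue_blocking(src_head, src_tail);
>         if (!work)
>                 return NULL;
>         /* Move the remainder to our queue; siblings may steal it. */
>         (void) cds_wfcq_splice_blocking(&worker->head, &worker->tail,
>                         src_head, src_tail);
>         return work;
> }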
>
>>
>> Why does ___urcu_steal_work() need cds_wfcq_dequeue_lock()?
>
> This lock is to protect operations on the sibling workqueue.
> Here are the various actors touching it:
>
> - The worker attempting to steal from the sibling (with splice-from),
> - The sibling itself, in urcu_dequeue_work(), with a dequeue
> operation.
>
> Looking at the wfcq synchronization table:
>
> * Synchronization table:
> *
> * External synchronization techniques described in the API below are
> * required between pairs marked with "X". No external synchronization
> * is required between pairs marked with "-".
> *
> * Legend:
> * [1] cds_wfcq_enqueue
> * [2] __cds_wfcq_splice (destination queue)
> * [3] __cds_wfcq_dequeue
> * [4] __cds_wfcq_splice (source queue)
> * [5] __cds_wfcq_first
> * [6] __cds_wfcq_next
> *
> *        [1] [2] [3] [4] [5] [6]
> *    [1]  -   -   -   -   -   -
> *    [2]  -   -   -   -   -   -
> *    [3]  -   -   X   X   X   X
> *    [4]  -   -   X   -   X   X
> *    [5]  -   -   X   X   -   -
> *    [6]  -   -   X   X   -   -
>
> We need a mutex between splice-from and dequeue since they
> are executed by different threads. I can replace this with
> a call to cds_wfcq_splice_blocking() instead, which takes the
> lock internally. That would be neater.
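>
> That is, something like (sketch; same illustrative struct
> urcu_worker as in the previous snippet):
>
> /*
>  * Steal all work from a sibling by splicing its queue into ours.
>  * cds_wfcq_splice_blocking() takes the dequeue lock of the source
>  * queue internally, so the explicit cds_wfcq_dequeue_lock() /
>  * cds_wfcq_dequeue_unlock() pair goes away.
>  */
> static enum cds_wfcq_ret
> ___urcu_steal_work(struct urcu_worker *self, struct urcu_worker *sibling)
> {
>         return cds_wfcq_splice_blocking(&self->head, &self->tail,
>                         &sibling->head, &sibling->tail);
> }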
>
>>
>> It is not a good idea for a worker to call synchronize_rcu() (in
>> urcu_accept_work()).
>> New work may arrive soon.
>
> The reason why the worker calls synchronize_rcu() there is to protect
> the waitqueue lfstack "pop" operation from ABA. It matches the RCU
> read-side critical sections encompassing urcu_dequeue_wake_single().
>
> Another way to achieve this would be to grab a mutex across calls to
> urcu_dequeue_wake_single(), but unfortunately this call needs to be
> done by the dispatcher thread, and I'm trying to keep a low-latency
> bias in favor of the dispatcher here. In this case, it is done at the
> expense of overhead in the worker when the worker is about to busy-wait
> and eventually block, so I thought the tradeoff was not so bad.
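>
> The pattern, sketched (the struct urcu_wait_node layout is
> illustrative; the RCU scheme is one of the synchronization
> techniques urcu/lfstack.h describes for __cds_lfs_pop, assuming
> the dispatcher is the only popper):
>
> #include <urcu.h>               /* rcu_read_lock(), synchronize_rcu() */
> #include <urcu/lfstack.h>
>
> struct urcu_wait_node {
>         struct cds_lfs_node node;
>         int32_t state;          /* illustrative wakeup state */
> };
>
> /*
>  * Dispatcher side: the RCU read-side critical section guarantees
>  * the popped node is not re-pushed while we may still hold a
>  * stale reference to it, protecting the lfstack pop against ABA.
>  */
> static struct urcu_wait_node *
> urcu_dequeue_wake_single(struct cds_lfs_stack *waitqueue)
> {
>         struct cds_lfs_node *snode;
>
>         rcu_read_lock();
>         snode = __cds_lfs_pop(waitqueue);
>         rcu_read_unlock();
>         if (!snode)
>                 return NULL;
>         return caa_container_of(snode, struct urcu_wait_node, node);
> }
>
> /*
>  * Worker side, in urcu_accept_work(): wait for a grace period
>  * before re-pushing our wait node, so no dispatcher still holds
>  * a reference to it from a previous pop.
>  */
> static void
> urcu_worker_queue_wait(struct cds_lfs_stack *waitqueue,
>                 struct urcu_wait_node *wait)
> {
>         synchronize_rcu();
>         cds_lfs_push(waitqueue, &wait->node);
> }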
"single worker pattern" does be a frequent use-case.
you can image the lag and jitter in this case.
How about a mutex&rcu protected idle-workers?
idle-worker is wokenup via rcu-protected scheme.
and dequeued&requeued (in urcu_accept_work() or urcu_worker_unregister()) via mutex-protected scheme.
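
For example, a rough sketch (all names are illustrative, not actual
liburcu code):

#include <pthread.h>
#include <urcu.h>
#include <urcu/rculist.h>
#include <urcu/uatomic.h>

static CDS_LIST_HEAD(idle_workers);
static pthread_mutex_t idle_workers_lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker reduced to the fields used in this sketch. */
struct urcu_worker {
        struct cds_list_head idle_node;
        int32_t futex;          /* illustrative wakeup state */
};

/*
 * Worker side (urcu_accept_work() / urcu_worker_unregister()):
 * enqueue/dequeue under the mutex only. No grace period on this
 * path, so a single worker does not pay a synchronize_rcu() delay
 * for every work item.
 */
static void urcu_worker_set_idle(struct urcu_worker *worker)
{
        pthread_mutex_lock(&idle_workers_lock);
        cds_list_add_rcu(&worker->idle_node, &idle_workers);
        pthread_mutex_unlock(&idle_workers_lock);
}

static void urcu_worker_clear_idle(struct urcu_worker *worker)
{
        pthread_mutex_lock(&idle_workers_lock);
        cds_list_del_rcu(&worker->idle_node);
        pthread_mutex_unlock(&idle_workers_lock);
}

/*
 * Dispatcher side: pick a worker to wake under an RCU read-side
 * critical section, without taking the mutex.
 */
static void urcu_wake_one_idle(void)
{
        struct urcu_worker *worker;

        rcu_read_lock();
        cds_list_for_each_entry_rcu(worker, &idle_workers, idle_node) {
                uatomic_set(&worker->futex, 1);
                /* then wake it (futex/cond var); details omitted */
                break;
        }
        rcu_read_unlock();
}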
>
> Thoughts ?
>
> Thanks for the feedback!
>
> Mathieu
>
>>
>> thanks,
>> Lai
>>
>