[lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing

Thu Oct 23 09:06:04 EDT 2014

----- Original Message -----
> From: "Lai Jiangshan" <laijs at cn.fujitsu.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> Cc: "Ben Maurer" <bmaurer at fb.com>, "lttng-dev" <lttng-dev at lists.lttng.org>, "Yannick Brosseau" <scientist at fb.com>,
> "Paul E. McKenney" <paulmck at linux.vnet.ibm.com>, "Stephen Hemminger" <stephen at networkplumber.org>
> Sent: Thursday, October 23, 2014 12:26:58 AM
> Subject: Re: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing
> 
> Hi, Mathieu

Hi Lai,

> 
> In the code, how this problem be solved?
> "All the tasks may continue stealing works without any progress."

Good point. It is not. I see two possible solutions for this:

1) We break the circular chain of siblings at the list head.
   It has the downside that work could accumulate at the first
   worker of the list, or
2) Whenever we grab work (either from the global queue or from
   a sibling), we could dequeue one element and keep it to ourself,
   out of reach of siblings. This would ensure progress, since
   worker stealing work would always keep one work element to itself,
   which would therefore ensure progress.

I personally prefer (2), since it has fewer downsides.

> 
> Why ___urcu_steal_work() needs cds_wfcq_dequeue_lock()?

This lock is to protect operations on the sibling workqueue.
Here are the various actors touching it:

- The worker attempting to steal from the sibling (with splice-from),
- The sibling itself, in urcu_dequeue_work(), with a dequeue
  operation,

Looking at the wfcq synchronization table:

 * Synchronization table:
 *
 * External synchronization techniques described in the API below is
 * required between pairs marked with "X". No external synchronization
 * required between pairs marked with "-".
 *
 * Legend:
 * [1] cds_wfcq_enqueue
 * [2] __cds_wfcq_splice (destination queue)
 * [3] __cds_wfcq_dequeue
 * [4] __cds_wfcq_splice (source queue)
 * [5] __cds_wfcq_first
 * [6] __cds_wfcq_next
 *
 *     [1] [2] [3] [4] [5] [6]
 * [1]  -   -   -   -   -   -
 * [2]  -   -   -   -   -   -
 * [3]  -   -   X   X   X   X
 * [4]  -   -   X   -   X   X
 * [5]  -   -   X   X   -   -
 * [6]  -   -   X   X   -   -

We need a mutex between splice-from and dequeue since they
are executed by different threads. I can replace this with
a call to cds_wfcq_splice_blocking() instead, which takes the
lock internally. That would be neater.

> 
> It is not a good idea for a worker to call synchronize_rcu() (in
> urcu_accept_work()).
> A new work may be coming soon.

The reason why the worker calls synchronize_rcu() there is to protect
the waitqueue lfstack "pop" operation from ABA. It matches the RCU
read-side critical sections encompassing urcu_dequeue_wake_single().

Another way to achieve this would be to grab a mutex across calls to
urcu_dequeue_wake_single(), but unfortunately this call needs to be
done by the dispatcher thread, and I'm trying to keep a low-latency
bias in favor of the dispatcher here. In this case, it is done at the
expense of overhead in the worker when the worker is about to busy-wait
and eventually block, so I thought the tradeoff was not so bad.

Thoughts ?

Thanks for the feedback!

Mathieu

> 
> thanks,
> Lai
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com