[lttng-dev] Userspace RCU: workqueue with batching, cheap wakeup, and work stealing

Ben Maurer bmaurer at fb.com
Sun Oct 19 15:15:28 EDT 2014


That's pretty cool! Do you have any thoughts on the LifoSem implementation in folly? What are the differences in design decisions between your workqueue and LifoSem?

One other folly abstraction you might want to look at is MPMCQueue -- this is our FIFO workqueue that we combine with LifoSem: https://github.com/facebook/folly/blob/master/folly/MPMCQueue.h.

One thing I'd point out about LIFO waiting that really applies to RCU is the ability to quiesce threads. When we use futexes to implement data structures like LifoSem, we use an abstraction called MemoryIdler (https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h). When MemoryIdler calls futex(), it passes a maximum timeout of 5 seconds. If it takes > 5 seconds to wake up the thread, MemoryIdler releases some of the local resources of that thread. For example, it calls madvise() to release unused portions of the thread's stack and tells jemalloc to free unused per-thread caches. Because we use LIFO waiting, the oldest waiting threads tend to sleep for a long time.

For RCU, I'm guessing that because you require thread registration, parts of the library iterate through all active threads. An abstraction like MemoryIdler could allow threads that sleep for a long time to unregister themselves, reducing the cost of iteration.

One question I've been asking myself: is there an API the kernel could provide to help us do a better job here? One of the things that annoys me about LifoSem is that we don't need exactly LIFO behavior. All we're trying to do is avoid having threads go to sleep if they could spin a bit and acquire new work. We also want inactive threads to sleep for a long time. But among threads that have been running recently, I wish we could give the kernel more freedom in choosing. For example, if the most recently run thread in the LIFO stack last ran on CPU 12 but a task is currently running on CPU 12, it'd be great if the kernel could choose a different thread -- one that had recently run on a CPU where it can be scheduled immediately.

Something to think about here: most applications I've seen don't really stress the performance of this queue. The bulk of the performance win we've gotten from LifoSem has been in the application itself -- LIFO wait queues increase application performance through better cache usage. I believe this is where we'd get further wins -- e.g., if the kernel could do a better job of selecting which thread to use.

From: Mathieu Desnoyers [mathieu.desnoyers at efficios.com]
Sent: Saturday, October 18, 2014 2:13 PM
To: Ben Maurer; lttng-dev
Cc: Yannick Brosseau; Paul E. McKenney; Lai Jiangshan; Stephen Hemminger
Subject: Userspace RCU: workqueue with batching, cheap wakeup, and work stealing


I have written an implementation of a workqueue based
on wfcqueue (a FIFO workqueue) and wfstack (a LIFO wait
queue), including a work-stealing scheme. I think this
could scale very naturally to large workloads, and it
skips the overhead of sys_futex when wakees are already
running. Since the wakeup scheme is very lightweight
when work is ongoing, we use it extensively.

It also makes extensive use of the wfcqueue "splice"
operation to move work batches in bulk between the queue
and worker threads. It is an O(1) operation which does
not need to touch the content of the queue.

RCU is used to look up the sibling worker threads for
work stealing, and to wake the siblings of a thread
busy handling work.

It's at:

branch: urcu/workqueue-wakeup

or gitweb of the top commit:

Note that this is work in progress, compile-tested only.
I also need to write extensive tests.

Feedback is welcome,


