[lttng-dev] [RFC URCU PATCH] wfqueue: ABI v1, simplify implementation, 2.3x to 2.6x performance boost
Lai Jiangshan
laijs at cn.fujitsu.com
Mon Aug 20 23:41:37 EDT 2012
On 08/20/2012 09:16 PM, Mathieu Desnoyers wrote:
> * Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
>> On 08/16/2012 05:31 AM, Mathieu Desnoyers wrote:
>>> This work is derived from the patch from Lai Jiangshan submitted as
>>> "urcu: new wfqueue implementation"
>>> (http://lists.lttng.org/pipermail/lttng-dev/2012-August/018379.html)
>>>
>>
>
> Hi Lai,
>
>> Hi, Mathieu
>>
>> please add this part to your patch.
>>
>> Changes (item 5 has an alternative):
>>
>> 1) Reorder the head and the tail in struct cds_wfq_queue
>> Reason: wfq favors the enqueue side (enqueue-preference)
>
> Is this a technical improvement over the original order ? Both cache
> lines are aligned, so it should not matter which one comes first. So I'm
> not sure why we should favor one order vs the other.
>
>> 2) Reorder the code of ___cds_wfq_next_blocking()
>> Reason: short code, better readability
>
> Yes, I agree with this change. Can you extract it into a separate patch ?
>
>>
>> 3) Add cds_wfq_dequeue_[un]lock
>> Reason: the fields of struct cds_wfq_queue are not exposed to users of the lib,
>> but some APIs need to have this lock held. Add these locking functions
>> to allow users to call such APIs.
>
> I agree with this change too. You can put it into the same patch as (2).
>
>>
>> 4) Add cds_wfq_node_sync_next(), cds_wfq_dequeue_all_blocking()
>> Reason: helper for for_each_*
>
> Points 4-5-6-7 will need more discussion.
>
> I don't like exposing "cds_wfq_node_sync_next()", which should remain an
> internal implementation detail not exposed to queue users. The
> "get_first"/"get_next" API, as well as the dequeue, are providing an
> abstraction that hides the node synchronization, which makes it harder
> for the user to make mistakes. Given that the fast-path of
> cds_wfq_node_sync_next is just a pointer load and test, I don't think
> that skipping this the second time we iterate on the same queue would
> bring any significant performance improvement, but exposing sync_next
> adds in API complexity, which I would like to avoid.
If the name "cds_wfq_node_sync_next()" is bad because it exposes the implementation,
we can rename it to "__cds_wfq_next_blocking_nocheck()" to hide the implementation:
"get_next_nocheck": get the next node of a queue; the caller guarantees that the next node exists.
"get_next": get the next node of a queue; it also checks whether the next node exists.
Or just remove "cds_wfq_node_sync_next()" and use "get_next" instead.
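For illustration, a dequeue-locked pass with the "get_first"/"get_next" pair could look
like the sketch below; the exact prototypes are not fixed in this thread, so the
signatures (and the visit() callback) are only assumptions:

	#include <urcu/wfqueue.h>	/* the patched header */

	/* Sketch only: signatures assumed, visit() is a placeholder. */
	static void visit_all(struct cds_wfq_queue *q,
			      void (*visit)(struct cds_wfq_node *node))
	{
		struct cds_wfq_node *node, *next;

		cds_wfq_dequeue_lock(q);
		for (node = __cds_wfq_first_blocking(q); node; node = next) {
			/*
			 * "get_next": may busy-wait until a concurrent enqueuer
			 * publishes the next pointer, and detects the end of queue.
			 */
			next = __cds_wfq_next_blocking(q, node);
			visit(node);
		}
		cds_wfq_dequeue_unlock(q);
	}

A "get_next_nocheck" variant would keep only the busy-wait part, for a second pass where
the caller already knows that each next node exists.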
>
> About cds_wfq_dequeue_all_blocking(): the "cds_wfq_splice()" I
> introduced performs the same action as "cds_wfq_dequeue_all_blocking()",
> but ensures that we can splice lists multiple times (rather than only
> once), which is a very neat side-effect. Let's imagine a use-case like
> Tree RCU, where each CPU produces a queue of call_rcu callbacks, and we
> need to merge pairs of queues together recursively until we reach the
> tree root: by doing splice() at each level of the tree, we can combine
> the queues into a single queue without ever needing to do an iteration
> on each node: no node synchronization is needed until we actually need
> to traverse the queue at the root node level.
>
> The cost of exposing dequeue_all is that we need to expose sync_next
> (more complexity for the user). As you exposed in your earlier email,
> the gain we get by exposing dequeue_all separately from splice is that
> dequeue_all is 1 xchg() operation, while splice is 2 xchg() operations.
> However, given that this operation is typically amortized over many
> nodes makes performance optimization a less compelling argument than
> simplicity of use.
Avoiding the big temporary struct is also part of my motivation.
---
Related to "internal", see the bottom.
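To make the "big temporary struct" point concrete, here is the shape of the two
dequeuer-side patterns as I understand them; argument order and exact signatures are
assumptions, not the final API:

	/* Sketch only, assumed signatures. */
	void drain_with_splice(struct cds_wfq_queue *q)
	{
		/* needs a full (two cache lines) temporary queue on the stack */
		struct cds_wfq_queue tmp;

		cds_wfq_init(&tmp);
		cds_wfq_splice_blocking(&tmp, q); /* 2 xchg; tmp can be spliced into again */
		/* ... iterate on tmp ... */
	}

	void drain_with_dequeue_all(struct cds_wfq_queue *q)
	{
		/* hands back the detached node chain directly: no temporary queue */
		struct cds_wfq_node *chain;

		chain = __cds_wfq_dequeue_all_blocking(q); /* 1 xchg */
		/* ... iterate on chain ... */
	}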
>
>>
>> 5) Add more for_each_*, whose default semantic is dequeuing
>> Reason: __cds_wfq_for_each_blocking_undequeue is enough for undequeuing loops;
>> no need to add more.
>> Looping and dequeuing are the most probable use-cases.
>> A dequeuing for_each_* makes the queue behave like a pipe, acting as a channel (Go language).
>> Alternative: rename these dequeuing __cds_wfq_for_each*() to __cds_wfq_dequeue_for_each*
>
> I notice, in the iterators you added, the presence of
> "cds_wfq_for_each_blocking_nobreak", and
> "cds_wfq_for_each_blocking_nobreak_internal", which document that this
> special flavor is needed for loops that do not have a "break" statement.
> That seems to be very much error-prone and counter-intuitive for users,
> and I would like to avoid that.
__cds_wfq_for_each_blocking_safe(remove-all) is also "no-break" (*); is it error-prone
and counter-intuitive too?
So this is not a good reason to reject cds_wfq_for_each_blocking_nobreak(). If users do
not know how to use it, they should use cds_wfq_for_each_blocking() (the version in my
patch, which is dequeue()-based and lets us break out of the loop body).
__cds_wfq_for_each_blocking_safe() and cds_wfq_for_each_blocking_nobreak() have exactly
the same semantics with respect to "remove-all" and "no-break", but
cds_wfq_for_each_blocking_nobreak() does NOT CORRUPT the queue.
So cds_wfq_for_each_blocking_nobreak() is yet another __cds_wfq_for_each_blocking_safe(),
except that it does not corrupt the queue; it really is safer than
__cds_wfq_for_each_blocking_safe().
footnote (*):
__cds_wfq_for_each_blocking_safe() should only be used for "remove-none" or "remove-all".
For "remove-none", we should use __cds_wfq_for_each_blocking_undequeue() instead of
__cds_wfq_for_each_blocking_safe().
>
> The "internal" flavor, along with "sync_next" are then used in the
> call_rcu code, since they are needed to get the best performance
> possible. I don't like to use special, internal APIs for call_rcu: the
> original motivation for refactoring the wfqueue code was testability,
> and I would really like to make sure that the APIs we use internally are
> also exposed to the outside world, mainly for testability, and also
> because if we need to directly access internal interfaces to have good
> performance, it means we did a poor job at exporting APIs that allow to
> do the same kind of highly-efficient use of the queue. Exposing an API
> with the "internal" keyword in it clearly discourages users from using
> it.
Related to "internal", see the bottom.
>
>
>>
>> 6) Remove __cds_wfq_for_each_blocking_safe
>> Reason: its semantics are unclear; it is hard to give it a well-defined semantic.
>> When we use it, we have to delete either none or all of the nodes, and even when we
>> delete all nodes, the struct cds_wfq_queue still becomes stale during or after the loop.
>
> The typical use-case I envision is:
>
> - enqueue many nodes to queue A
>
> - then, a dequeuer thread splices the content of queue A into its temporary
>   queue T (whose head is local on its stack).
>
> - now, the thread is free to iterate on queue T, as many times as it
> likes, and it does not need to change the structure of the queue while
> iterating on it (read-only operation, without side-effect). The last
> time it iterates on queue T, it needs to use the _safe iteration and
> free the nodes.
>
> - after the nodes have been deleted, the queue T can be simply discarded
> (if on stack, function return will do so; if dynamically allocated,
> can be simply reinitialized to an empty queue, or freed).
My for_each_*() can do exactly the same thing in this use-case, since we still
have splice() and cds_wfq_for_each_blocking_nobreak() (the safer **_safe()). Doesn't it?
The memory behind a struct cds_wfq_queue is big (two cache lines), something we should
avoid putting on the stack (to shorten the stack and avoid touching 2 or 3 additional cache lines).
----------- a use case
All these local variables are hot on the stack (and, for some reason, not in registers):
	int count;
	struct cds_wfq_queue tmp;
	struct cds_wfq_node *node, *next;
They will touch 4 cache lines. Should we always recommend that users use splice()+tmp?
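For comparison, a dequeue-based iterator needs no temporary queue on the stack and
tolerates break; a minimal sketch, assuming the macro from my patch takes (queue, node):

	/* Sketch only: cds_wfq_for_each_blocking() dequeues node by node. */
	int count_up_to(struct cds_wfq_queue *q, int limit)
	{
		struct cds_wfq_node *node;
		int count = 0;

		cds_wfq_for_each_blocking(q, node) {
			/* visited nodes are dequeued (owned by the loop body) */
			if (++count == limit)
				break;	/* the remaining nodes stay queued */
		}
		return count;
	}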
>
> As you certainly noticed, we can iterate both on a queue being
> concurrently enqueued to, and on a queue that has been spliced into.
> Being able to use iterators and dequeue on any queue (initial enqueue,
> or queue spliced to) is a feature: we can therefore postpone the
> "sync_next" synchronization to the moment where loop iteration is really
> needed (lazy synchronization), which keeps optimal locality of reference
> of synchronization vs node data access.
>
>>
>> 7) Rename old __cds_wfq_for_each_blocking() to __cds_wfq_for_each_blocking_undequeue()
>> Reason: the default semantic of the other for_each_* is dequeuing.
>
> I'm not convinced that "__cds_wfq_for_each_blocking_nobreak",
> "__cds_wfq_for_each_blocking_nobreak_internal", and sync_next APIs are
> improvements over the API I propose. For call_rcu, this turns
>
> cds_wfq_enqueue()
> cds_wfq_splice_blocking()
> __cds_wfq_for_each_blocking_safe() for iteration
>
> into:
>
> cds_wfq_enqueue()
> __cds_wfq_dequeue_all_blocking()
> __cds_wfq_for_each_blocking_nobreak_internal() for iteration
>
> Maybe we could do some documentation improvement to
> "__cds_wfq_for_each_blocking(_safe)", __cds_wfq_first_blocking(),
> __cds_wfq_next_blocking() API members I proposed. Maybe I did not
> document them well enough ?
>
Indeed, documentation is not enough for __cds_wfq_for_each_blocking_safe().
I *STRONGLY OPPOSE* __cds_wfq_for_each_blocking_safe() (although it was originally suggested by me).
If you oppose dequeue_all(), then I agree with this list:
- Remove the old __cds_wfq_for_each_blocking_safe().
- Use manual splice()+get_first()+get_next() for call_rcu_thread().
- Remove my proposed cds_wfq_for_each_blocking_nobreak().
- Remove my proposed sync_next() and dequeue_all().
- Keep the dequeue-based for_each() (alternative: introduce it in the future).
If you also agree with this list, could you please partially merge my patch into yours
and send it for review, to reduce the review cycles? (Do not forget the other minimal
fixes in my patch.)
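The call_rcu_thread() pattern from the list above would then look roughly like the
sketch below; the signatures and the invoke_callback() helper are assumptions, not the
actual call_rcu code:

	/* Sketch only: the dequeuer is the sole consumer of its own queue,
	 * so no dequeue lock is taken here. */
	static void process_callbacks(struct cds_wfq_queue *cbs)
	{
		struct cds_wfq_queue tmp;
		struct cds_wfq_node *node, *next;

		cds_wfq_init(&tmp);
		cds_wfq_splice_blocking(&tmp, cbs);	/* grab the pending callbacks */

		synchronize_rcu();	/* wait for a grace period before invoking them */

		for (node = __cds_wfq_first_blocking(&tmp); node; node = next) {
			next = __cds_wfq_next_blocking(&tmp, node);
			invoke_callback(node);	/* placeholder */
		}
	}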
==========
About internals:
I want to expose the for_each_* iterators, but I do not want to expose these helpers:
	cds_wfq_node_sync_next()
	cds_wfq_dequeue_all_blocking()
	__cds_wfq_dequeue_all_blocking()
	__cds_wfq_for_each_blocking_nobreak_internal()
(and I want to disallow users from calling these helpers directly).
However, the for_each_* iterators use these helpers, so I have to declare them in
urcu/wfqueue.h, and they become part of the ABI.
So these helpers become internal things; how do we handle internal things
in an exposed *.h?
I found that "__cds_list_del()" and "__tls_access_ ## name ()" are also
internals, and they are exposed. Are they exposed correctly?
In my view, exposing internals is OK; what we can do is disallow users from using
them directly and ensure their compatibility.
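For instance, urcu/list.h already follows this pattern; paraphrased from memory, so
treat the exact code below as approximate:

	/*
	 * "Internal" helper: it lives in the public header only because the
	 * public inline function below expands in user code and needs it;
	 * users are not expected to call it directly.
	 */
	static inline void __cds_list_del(struct cds_list_head *prev,
					  struct cds_list_head *next)
	{
		next->prev = prev;
		prev->next = next;
	}

	static inline void cds_list_del(struct cds_list_head *elem)
	{
		__cds_list_del(elem->prev, elem->next);
	}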
We also do not need to test exposed internals directly via test cases, because they are
internals, just as we do not need to test every unexposed internal: when we test the
exposed API, the internals get tested as well.
You mentioned that call_rcu_thread() uses the internals directly and therefore cannot be
tested. That is not entirely true: call_rcu_thread() needs to inject a "synchronize_rcu()"
into __cds_wfq_for_each_blocking_nobreak(), so I have to use
__cds_wfq_for_each_blocking_nobreak_internal(). Tests for
__cds_wfq_for_each_blocking_nobreak() are enough to cover call_rcu_thread().
In one word: I do not fully share your dislike of exposed internals, but I agree to
remove the ones introduced in my patch.
Thanks,
Lai