[lttng-dev] [RFC URCU PATCH] wfqueue: ABI v1, simplify implementation, 2.3x to 2.6x performance boost

Lai Jiangshan laijs at cn.fujitsu.com
Wed Aug 22 20:57:28 EDT 2012


ping

On 08/21/2012 11:41 AM, Lai Jiangshan wrote:
> On 08/20/2012 09:16 PM, Mathieu Desnoyers wrote:
>> * Lai Jiangshan (laijs at cn.fujitsu.com) wrote:
>>> On 08/16/2012 05:31 AM, Mathieu Desnoyers wrote:
>>>> This work is derived from the patch from Lai Jiangshan submitted as
>>>> "urcu: new wfqueue implementation"
>>>> (http://lists.lttng.org/pipermail/lttng-dev/2012-August/018379.html)
>>>>
>>>
>>
>> Hi Lai,
>>
>>> Hi, Mathieu
>>>
>>> please add this part to your patch.
>>>
>>> Changes (item 5 has an alternative):
>>>
>>> 1) Reorder the head and the tail in struct cds_wfq_queue
>>>    Reason: wfq is enqueue-preferred
>>
>> Is this a technical improvement over the original order ? Both cache
>> lines are aligned, so it should not matter which one comes first. So I'm
>> not sure why we should favor one order vs the other.
>>
>>> 2) Reorder the code of ___cds_wfq_next_blocking()
>>>    Reason: shorter code, better readability
>>
>> Yes, I agree with this change. Can you extract it into a separate patch ?
>>
>>>
>>> 3) Add cds_wfq_dequeue_[un]lock
>>>    Reason: the fields of struct cds_wfq_queue are not exposed to users of the lib,
>>> 	but some APIs need to be called with this lock held. Add these locking functions
>>> 	to allow users to call such APIs.
>>
>> I agree with this change too. You can put it into the same patch as (2).
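A sketch of the shape these accessors could take (hypothetical names and struct layout; the real cds_wfq_queue differs): the point is that callers can hold the queue's internal dequeue mutex around the unlocked __-prefixed APIs without ever touching the struct's fields directly.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model of the queue's locking accessors from point (3).
 * The struct layout is illustrative; callers never touch .lock directly. */
struct wfq_model {
	pthread_mutex_t lock;	/* protects dequeue-side operations */
	/* ... head / tail fields, opaque to users ... */
};

static void wfq_model_init(struct wfq_model *q)
{
	pthread_mutex_init(&q->lock, NULL);
}

static void wfq_dequeue_lock(struct wfq_model *q)
{
	pthread_mutex_lock(&q->lock);
}

static void wfq_dequeue_unlock(struct wfq_model *q)
{
	pthread_mutex_unlock(&q->lock);
}
```

Callers would then bracket the unlocked APIs: lock, call the __-prefixed first/next/dequeue operations, unlock.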
>>
>>>
>>> 4) Add cds_wfq_node_sync_next(), cds_wfq_dequeue_all_blocking()
>>>    Reason: helpers for for_each_*
>>
>> Points 4-5-6-7 will need more discussion.
>>
>> I don't like exposing "cds_wfq_node_sync_next()", which should remain an
>> internal implementation detail not exposed to queue users.  The
>> "get_first"/"get_next" API, as well as the dequeue, are providing an
>> abstraction that hide the node synchronization, which makes it harder
>> for the user to make mistakes. Given that the fast-path of
>> cds_wfq_node_sync_next is just a pointer load and test, I don't think
>> that skipping this check the second time we iterate on the same queue would
>> bring any significant performance improvement, but exposing sync_next
>> adds in API complexity, which I would like to avoid.
> 
> If the name "cds_wfq_node_sync_next()" is bad and exposes the implementation,
> we can rename it to "__cds_wfq_next_blocking_nocheck()" to hide the implementation.
> 
> "get_next_nocheck": get the next node of a queue; the caller ensures that the next node exists.
> "get_next": get the next node of a queue; it also checks whether the next node exists.
> 
> Or just remove "cds_wfq_node_sync_next()" and use "get_next" instead.
> 
>>
>> About cds_wfq_dequeue_all_blocking(): the "cds_wfq_splice()" I
>> introduced performs the same action as "cds_wfq_dequeue_all_blocking()",
>> but ensures that we can splice lists multiple times (rather than only
>> once), which is a very neat side-effect. Let's imagine a use-case like
>> Tree RCU, where each CPU produces a queue of call_rcu callbacks, and we
>> need to merge pairs of queues together recursively until we reach the
>> tree root: by doing splice() at each level of the tree, we can combine
>> the queues into a single queue without ever needing to do an iteration
>> on each node: no node synchronization is needed until we actually need
>> to traverse the queue at the root node level.
>>
>> The cost of exposing dequeue_all is that we need to expose sync_next
>> (more complexity for the user). As you pointed out in your earlier email,
>> the gain we get by exposing dequeue_all separately from splice is that
>> dequeue_all is 1 xchg() operation, while splice is 2 xchg() operations.
>> However, the fact that this operation is typically amortized over many
>> nodes makes performance optimization a less compelling argument than
>> simplicity of use.
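The xchg-count trade-off described above can be sketched with a simplified, single-threaded model of the queue (names like q_splice and q_dequeue_all are illustrative, not the real liburcu API; the atomic exchanges mark where the real implementation pays its xchg operations):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of a wait-free enqueue queue with a dummy head node.
 * Single-threaded sketch: the atomics mark where the real implementation
 * would pay its xchg operations. */
struct node { struct node *next; };
struct queue { struct node head; struct node *tail; };

static void q_init(struct queue *q)
{
	q->head.next = NULL;
	q->tail = &q->head;
}

static void q_enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;
	/* 1 xchg per enqueue */
	struct node *old = __atomic_exchange_n(&q->tail, n, __ATOMIC_SEQ_CST);
	old->next = n;
}

/* dequeue_all: detach the whole chain with a single xchg on the tail. */
static struct node *q_dequeue_all(struct queue *q)
{
	struct node *first = q->head.next;

	if (!first)
		return NULL;
	q->head.next = NULL;
	/* 1 xchg: reset the tail back to the dummy head */
	(void) __atomic_exchange_n(&q->tail, &q->head, __ATOMIC_SEQ_CST);
	return first;
}

/* splice: move all of src into dst; costs 2 xchg (detach + reattach),
 * but dst remains a live queue that can be spliced into again. */
static void q_splice(struct queue *dst, struct queue *src)
{
	/* xchg #1: detach src's chain */
	struct node *tail = __atomic_exchange_n(&src->tail, &src->head,
						__ATOMIC_SEQ_CST);

	if (tail == &src->head)
		return;	/* src was empty */
	struct node *first = src->head.next;
	src->head.next = NULL;
	/* xchg #2: attach the chain at dst's tail */
	struct node *old = __atomic_exchange_n(&dst->tail, tail,
					       __ATOMIC_SEQ_CST);
	old->next = first;
}
```

In the Tree RCU style merging described above, q_splice() can be applied repeatedly up the tree without traversing a single node; q_dequeue_all() saves one xchg but consumes the queue in one shot.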
> 
> Avoiding the big temporary struct is also part of my motivation.
> 
> ---
> Related to "internal", see the bottom.
> 
>>
>>>
>>> 5) Add more for_each_*, whose default semantics are dequeuing
>>>    Reason: __cds_wfq_for_each_blocking_undequeue is enough for non-dequeuing loops;
>>> 	we don't need to add more of those.
>>> 	Looping combined with dequeuing is the most probable use-case.
>>> 	A dequeuing for_each_* makes the queue behave like a pipe, acting as a channel (as in the Go language).
>>>    Alternative: rename these dequeuing __cds_wfq_for_each*() to __cds_wfq_dequeue_for_each*
>>
>> I notice, in the iterators you added, the presence of
>> "cds_wfq_for_each_blocking_nobreak", and
>> "cds_wfq_for_each_blocking_nobreak_internal", which document that this
>> special flavor is needed for loops that do not have a "break" statement.
>> That seems to be very much error-prone and counter-intuitive for users,
>> and I would like to avoid that.
> 
> __cds_wfq_for_each_blocking_safe(remove-all) is also "no-break" (*); is it also error-prone and counter-intuitive?
> 
> So this is not a good reason to reject cds_wfq_for_each_blocking_nobreak(): if users
> don't know how to use it, they should use cds_wfq_for_each_blocking() (the version in my
> patch, which is dequeue()-based and allows breaking out of the loop body).
> 
> __cds_wfq_for_each_blocking_safe() and cds_wfq_for_each_blocking_nobreak() have exactly the
> same semantics with respect to "remove-all" and "no-break", but cds_wfq_for_each_blocking_nobreak()
> does NOT corrupt the queue.
> 
> So cds_wfq_for_each_blocking_nobreak() is effectively another __cds_wfq_for_each_blocking_safe()
> that does not corrupt the queue; it is genuinely safer than __cds_wfq_for_each_blocking_safe().
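The dequeue-based iteration argued for here can be sketched with a hypothetical macro over a simplified single-threaded queue model (q_dequeue_for_each is an illustrative name, not the proposed API): each iteration removes one node, so breaking out of the loop leaves the remaining nodes in a valid queue.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified single-threaded queue model (illustrative only). */
struct node { struct node *next; };
struct queue { struct node head; struct node *tail; };

static void q_init(struct queue *q)
{
	q->head.next = NULL;
	q->tail = &q->head;
}

static void q_enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;
	q->tail->next = n;
	q->tail = n;
}

static struct node *q_dequeue(struct queue *q)
{
	struct node *n = q->head.next;

	if (!n)
		return NULL;
	q->head.next = n->next;
	if (!n->next)
		q->tail = &q->head;	/* dequeued the last node */
	return n;
}

/* Hypothetical dequeue-based for_each: every iteration removes one node,
 * so a "break" leaves the remaining nodes in a valid, usable queue. */
#define q_dequeue_for_each(node, q) \
	for ((node) = q_dequeue(q); (node) != NULL; (node) = q_dequeue(q))
```

In the real concurrent queue, the single-node dequeue (and the tail reset on the last node) would of course need the appropriate synchronization, which this single-threaded sketch omits.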
> 
> """"footnote: (*)
> __cds_wfq_for_each_blocking_safe() should be used for "remove-none" or "remove-all"
> We should use __cds_wfq_for_each_blocking_undequeue() instead of __cds_wfq_for_each_blocking_safe(remove-none)
> """
> 
>>
>> The "internal" flavor, along with "sync_next" are then used in the
>> call_rcu code, since they are needed to get the best performance
>> possible. I don't like to use special, internal APIs for call_rcu: the
>> original motivation for refactoring the wfqueue code was testability,
>> and I would really like to make sure that the APIs we use internally are
>> also exposed to the outside world, mainly for testability, and also
>> because if we need to directly access internal interfaces to have good
>> performance, it means we did a poor job at exporting APIs that allow
>> the same kind of highly-efficient use of the queue. Exposing an API
>> with the "internal" keyword in it clearly discourages users from using
>> it.
> 
> Related to "internal" see the bottom.
> 
>>
>>
>>>
>>> 6) Remove __cds_wfq_for_each_blocking_safe
>>>    Reason: its semantics are not clear; it is hard to give it a well-defined meaning.
>>> 	When we use it, we have to delete either no nodes or all nodes, and even when we delete all nodes,
>>> 	the struct cds_wfq_queue still becomes stale during or after the loop.
>>
>> The typical use-case I envision is:
>>
>> - enqueue many nodes to queue A
>>
>> - then, a dequeuer thread splices the content of queue A into its temporary
>>   queue T (whose head is local on its stack).
>>
>> - now, the thread is free to iterate on queue T, as many times as it
>>   likes, and it does not need to change the structure of the queue while
>>   iterating on it (read-only operation, without side-effect). The last
>>   time it iterates on queue T, it needs to use the _safe iteration and
>>   free the nodes.
>>
>> - after the nodes have been deleted, the queue T can be simply discarded
>>   (if on stack, function return will do so; if dynamically allocated,
>>   can be simply reinitialized to an empty queue, or freed).
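The use-case above can be sketched as follows with a simplified single-threaded model (q_splice, q_sum, and q_free_all are hypothetical names, not the liburcu API), showing the splice-to-local-queue, repeated read-only iteration, and final safe-free pattern:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified single-threaded queue model (illustrative only). */
struct node { struct node *next; int data; };
struct queue { struct node head; struct node *tail; };

static void q_init(struct queue *q)
{
	q->head.next = NULL;
	q->tail = &q->head;
}

static void q_enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;
	q->tail->next = n;
	q->tail = n;
}

/* Splice all of src into the local queue dst; src becomes empty. */
static void q_splice(struct queue *dst, struct queue *src)
{
	if (!src->head.next)
		return;
	dst->tail->next = src->head.next;
	dst->tail = src->tail;
	q_init(src);
}

/* Read-only traversal: no side effect, can be repeated freely. */
static int q_sum(const struct queue *q)
{
	int sum = 0;

	for (const struct node *p = q->head.next; p; p = p->next)
		sum += p->data;
	return sum;
}

/* "Safe" last pass: fetch next before freeing the current node. */
static void q_free_all(struct queue *q)
{
	struct node *next;

	for (struct node *p = q->head.next; p; p = next) {
		next = p->next;
		free(p);
	}
	q_init(q);	/* queue can now be reused or discarded */
}
```

After q_free_all(), the queue head can simply go out of scope (on stack) or be reused, matching the last step of the use-case.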
> 
> My for_each_*() can do exactly the same thing in this use-case, since we still
> have splice() and cds_wfq_for_each_blocking_nobreak() (a safer **_safe()). Can't it?
> 
> The memory representing a struct queue is big (two cache lines), something we need to
> avoid putting on the stack (to shorten the stack size, or to avoid the cost of 2 or 3 additional cache lines).
> 
> ----------- a use case
> All these local fields are hot on the stack (and, for some reason, not in registers):
> 	int count;
> 	struct cds_wfq_queue tmp;
> 	struct cds_wfq_node *node, *next;
> 
> They will use 4 cache lines. Should we always recommend that users use splice()+tmp?
> 
>>
>> As you certainly noticed, we can iterate both on a queue being
>> concurrently enqueued to, and on a queue that has been spliced into.
>> Being able to use iterators and dequeue on any queue (initial enqueue,
>> or queue spliced to) is a feature: we can therefore postpone the
>> "sync_next" synchronization to the moment where loop iteration is really
>> needed (lazy synchronization), which keeps optimal locality of reference
>> of synchronization vs node data access.
>>
>>>
>>> 7) Rename old __cds_wfq_for_each_blocking() to __cds_wfq_for_each_blocking_undequeue()
>>>    Reason: the default semantics of the other for_each_* are dequeuing.
>>
>> I'm not convinced that "__cds_wfq_for_each_blocking_nobreak",
>> "__cds_wfq_for_each_blocking_nobreak_internal", and sync_next APIs are
>> improvements over the API I propose. For call_rcu, this turns
>>
>> cds_wfq_enqueue()
>> cds_wfq_splice_blocking()
>> __cds_wfq_for_each_blocking_safe() for iteration
>>
>> into:
>>
>> cds_wfq_enqueue()
>> __cds_wfq_dequeue_all_blocking()
>> __cds_wfq_for_each_blocking_nobreak_internal() for iteration
>>
>> Maybe we could do some documentation improvement to
>> "__cds_wfq_for_each_blocking(_safe)", __cds_wfq_first_blocking(),
>> __cds_wfq_next_blocking() API members I proposed. Maybe I did not
>> document them well enough?
>>
> 
> It is indeed not enough for __cds_wfq_for_each_blocking_safe().
> 
> I *STRONGLY OPPOSE* __cds_wfq_for_each_blocking_safe() (although it was originally suggested by me).
> 
> If you oppose dequeue_all(), I can agree to this list:
> 	Remove the old __cds_wfq_for_each_blocking_safe().
> 	Use manual splice()+get_first()+get_next() for call_rcu_thread().
> 	Remove my proposed cds_wfq_for_each_blocking_nobreak().
> 	Remove my proposed sync_next() and dequeue_all().
> 	Keep the dequeue-based for_each() (alternative: introduce it in the future).
> 
> If you also agree to this list, could you please partially merge my patch into your patch
> and send it out for review, to reduce the review cycle? (Don't forget the other minimal fixes in my patch.)
> 
> ==========
> About internals:
> 
> I want to expose the for_each_*, but I don't want to expose these helpers:
> 	cds_wfq_node_sync_next()
> 	cds_wfq_dequeue_all_blocking()
> 	__cds_wfq_dequeue_all_blocking()
> 	__cds_wfq_for_each_blocking_nobreak_internal()
> (and I want to disallow users from using these helpers directly),
> 
> but the for_each_* use these helpers, so I have to declare them in urcu/wfqueue.h,
> and they become part of the ABI.
> 
> So these helpers become internal things; how do we handle internal things
> in an exposed *.h?
> 
> I found that "__cds_list_del()" and "__tls_access_ ## name ()" are also
> internals, and they are exposed. Are they exposed correctly?
> 
> In my view, exposed internals are OK; what we can do is disallow
> users from using them directly and ensure compatibility.
> 
> And we don't need to test exposed internals directly via test cases, because they are
> internals, just as we don't need to test so many unexposed internals.
> When we test the exposed API, the internals are also tested.
> 
> You mentioned that call_rcu_thread() uses the internals directly and can't be tested.
> That is not entirely true. call_rcu_thread() needs to inject a "synchronize_rcu()"
> into __cds_wfq_for_each_blocking_nobreak(), so I have to use
> __cds_wfq_for_each_blocking_nobreak_internal().
> Tests for __cds_wfq_for_each_blocking_nobreak() are enough for call_rcu_thread().
> 
> In a word, I don't fully share your dislike of exposed internals,
> but I agree to remove the ones introduced in my patch.
> 
> Thanks,
> Lai
> 
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
