[lttng-dev] [RFC PATCH] wfqueue: expand API, simplify implementation, small performance boost
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Wed Aug 15 08:52:10 EDT 2012
* Lai Jiangshan (eag0628 at gmail.com) wrote:
> On Wed, Aug 15, 2012 at 6:27 AM, Mathieu Desnoyers
> <mathieu.desnoyers at efficios.com> wrote:
> > * Lai Jiangshan (eag0628 at gmail.com) wrote:
> >> On Tue, Aug 14, 2012 at 4:12 AM, Mathieu Desnoyers
> >> <mathieu.desnoyers at efficios.com> wrote:
> > [...]
> >> >> > /*
> >> >> > + * ___cds_wfq_first_blocking: get first node of a queue, without dequeuing.
> >> >> > + *
> >> >> > + * Mutual exclusion with "dequeue" and "splice" operations must be ensured
> >> >> > + * by the caller.
> >> >> > + */
> >> >> > +static inline struct cds_wfq_node *
> >> >> > +___cds_wfq_first_blocking(struct cds_wfq_queue *q)
> >> >> > +{
> >> >> > + if (_cds_wfq_empty(q))
> >> >> > + return NULL;
> >> >> > + return ___cds_wfq_node_sync_next(&q->head);
> >> >> > +}
> >> >> > +
> >> >> > +/*
> >> >> > + * ___cds_wfq_next_blocking: get next node of a queue, without dequeuing.
> >> >> > + *
> >> >> > + * Mutual exclusion with "dequeue" and "splice" operations must be ensured
> >> >> > + * by the caller.
> >> >> > + */
> >> >> > +static inline struct cds_wfq_node *
> >> >> > +___cds_wfq_next_blocking(struct cds_wfq_queue *q, struct cds_wfq_node *node)
> >> >> > +{
> >> >> > + if (CMM_LOAD_SHARED(q->tail) == node)
> >> >> > + return NULL;
> >> >> > + return ___cds_wfq_node_sync_next(node);
> >> >> > +}
> >> >>
> >> >>
> >> >> The same BUG as you told me.
> >> >> If q has only one node just enqueued by other thread.
> >> >> but if q->head.next is seen, ___cds_wfq_first_blocking() returns a node,
> >> >> And the update of q->tail is not seen, it is still &q->head,
> >> >> ___cds_wfq_node_sync_next(node) will be loop for every if there is no
> >> >> other enqueue.
> >> >
> >> > Good catch ! :-)
> >> >
> >> >>
> >> >>
> >> >>
> >> >> static inline struct cds_wfq_node *
> >> >> ___cds_wfq_first_blocking(struct cds_wfq_queue *q)
> >> >> {
> >> >> + struct cds_wfq_node *ret.
> >> >> if (_cds_wfq_empty(q))
> >> >> return NULL;
> >> >> ret = ___cds_wfq_node_sync_next(&q->head);
> >> >> + cmm_smp_rmb();
> >> >> + return ret;
> >> >> }
> >> >
> >> > However, I think we should add the rmb at the beginning of
> >> > ___cds_wfq_next_blocking(), so it applies at each "next" call.
> >> > Otherwise, I think we could end up in a situation where we wait for a
> >> > NULL next forever in the second of two consecutive
> >> > ___cds_wfq_next_blocking() calls. I therefore propose:
> >>
> >> How this can happen(wait for a NULL next forever in the second ...)?
> >>
> >> A rmb is enough to leave the state(simgle node && q->tail == &q->head).
> >> "CMM_LOAD_SHARED(q->tail) == node" will be true in the loop except this state.
> >
> > You are indeed right, but I now realise there is a small thing we need
> > to change. I tried a couple of scenarios to convince myself of it, and
> > here are my conclusions:
> >
> > - In all cases (except for the first node), comparing q->tail with node
> > ensures that we never go beyond the tail. Except for the first node
> > (q->head), we always perform this comparison, so at no point can we
> > reach the nodes through "next" pointers that would be beyond the
> > q->tail node.
> >
> > - In the case of the first node, we do the comparison with
> >
> > CMM_LOAD_SHARED(q->head.next) == NULL
> > && CMM_LOAD_SHARED(q->tail) == &q->head;
> >
> > in no particular order (checking if empty) before we move to the
> > first node with ___cds_wfq_node_sync_next(&q->head).
> > The reason why the first node is different from all the others is
> > because we don't use "CMM_LOAD_SHARED(q->tail) == &q->head" alone
> > to distinguish between an empty queue and a queue that contains
> > nodes: we first check for NULL q->head.next, AFAIU for performance
> > reasons: whenever possible, we don't want to read the q->tail pointer
> > because it is being heavily updated by the enqueuer.
>
> Is it false sharing?
> Access to q->head.next and access to q->tail have the same performance
> because they are in the same cache line.
Yes! you are right! And a quick benchmark confirms it:
with head and tail on same cache line:
SUMMARY /home/compudj/doc/userspace-rcu/tests/.libs/lt-test_urcu_wfq testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues 100833595 nr_dequeues 88647134 successful enqueues 100833595 successful dequeues 88646898 end_dequeues 12186697 nr_ops 189480729
with a 256 bytes padding between head and tail, keeping the mutex on the
"head" cache line:
SUMMARY /home/compudj/doc/userspace-rcu/tests/.libs/lt-test_urcu_wfq testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues 228992829 nr_dequeues 228921791 successful enqueues 228992829 successful dequeues 228921367 end_dequeues 71462 nr_ops 457914620
enqueue: 127% speedup
dequeue: 158% speedup
That is indeed a _really_ huge difference. However, to get this, we
would have to increase the size of struct cds_wfq_queue beyond its
current size, which would break API compatibility. Any idea on how to
best do this without causing incompatibility would be welcome.
Thanks!
Mathieu
>
> >
> > So assuming the goal beyond the CMM_LOAD_SHARED(q->head.next) == NULL
> > check in the empty queue check is for performance of dequeue vs enqueue,
> > I think it would be appropriate to do a similar optimisation for the
> > ___cds_wfq_next_blocking() case: first check the next pointer of the
> > current node to see if it is NULL. If not, then we know we can fetch the
> > next pointer without comparing our position to the tail (performance
> > optimisation). However, if we do that, we'll need a cmm_smp_rmb() at
> > each ___cds_wfq_next_blocking() invocation.
> >
> > Here is the resulting code:
> >
> > /*
> > * ___cds_wfq_first_blocking: get first node of a queue, without dequeuing.
> > *
> > * Mutual exclusion with "dequeue" and "splice" operations must be ensured
> > * by the caller.
> > */
> > static inline struct cds_wfq_node *
> > ___cds_wfq_first_blocking(struct cds_wfq_queue *q)
> > {
> > if (_cds_wfq_empty(q))
> > return NULL;
> > return ___cds_wfq_node_sync_next(&q->head);
> > }
> >
> > /*
> > * ___cds_wfq_next_blocking: get next node of a queue, without dequeuing.
> > *
> > * Mutual exclusion with "dequeue" and "splice" operations must be ensured
> > * by the caller.
> > */
> > static inline struct cds_wfq_node *
> > ___cds_wfq_next_blocking(struct cds_wfq_queue *q, struct cds_wfq_node *node)
> > {
> > struct cds_wfq_node *next;
> >
> > if ((next = CMM_LOAD_SHARED(node->next)) != NULL)
> > return next;
> > /* Load node->next before q->tail */
> > cmm_smp_rmb();
> > if (CMM_LOAD_SHARED(q->tail) == node)
> > return NULL;
> > return ___cds_wfq_node_sync_next(node);
> > }
> >
> > Is my understanding correct ?
> >
> > Thanks,
> >
> > Mathieu
> >
> > --
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
More information about the lttng-dev
mailing list