[lttng-dev] Re: Re: urcu workqueue thread uses 99% of cpu while workqueue is empty
Minlan Wang
wangminlan at szsandstone.com
Fri Jun 24 02:21:35 EDT 2022
On Thu, Jun 23, 2022 at 10:09:55AM -0400, Mathieu Desnoyers wrote:
>
> By looking at the trace, here is one problematic scenario which ends up emitting "workqueue:futex_wait_again":
>
> [23:33:48.060581417] (+0.000006985) localhost.localdomain workqueue:futex_no_wait: { cpu_id = 4 }, { vpid = 19495, vtid = 21054 }, { futex = 0x5652A1035C40, val = 0 }
> [23:33:48.060581747] (+0.000000330) localhost.localdomain workqueue:futex_wake_up_success: { cpu_id = 6 }, { vpid = 19495, vtid = 20649 }, { futex = 0x5652A1035C40, val = -1 }
> [23:33:48.060581926] (+0.000000179) localhost.localdomain workqueue:futex_dec_success: { cpu_id = 4 }, { vpid = 19495, vtid = 21054 }, { futex = 0x5652A1035C40, old = 0, new = -1 }
> [23:33:48.060582826] (+0.000000900) localhost.localdomain syscall_entry_futex: { cpu_id = 6 }, { vpid = 19495, vtid = 20649, pid = 19495, tid = 20649 }, { uaddr = 94912888659008, op = 1, val = 1, utime = 0, uaddr2 = 0, val3 = 0 }
> [23:33:48.060582855] (+0.000000029) localhost.localdomain syscall_entry_futex: { cpu_id = 4 }, { vpid = 19495, vtid = 21054, pid = 19495, tid = 21054 }, { uaddr = 94912888659008, op = 0, val = 4294967295, utime = 0, uaddr2 = 0, val3 = 0 }
> [23:33:48.060584722] (+0.000001867) localhost.localdomain sched_stat_runtime: { cpu_id = 4 }, { vpid = 19495, vtid = 21054, pid = 19495, tid = 21054 }, { comm = "bcache_writebac", tid = 21054, runtime = 16033, vruntime = 96940983054 }
> [23:33:48.060585840] (+0.000001118) localhost.localdomain sched_switch: { cpu_id = 4 }, { vpid = 19495, vtid = 21054, pid = 19495, tid = 21054 }, { prev_comm = "bcache_writebac", prev_tid = 21054, prev_prio = 20, prev_state = 1, next_comm = "swapper/4", next_tid = 0, next_prio = 20 }
> [23:33:48.060587680] (+0.000001840) localhost.localdomain sched_stat_sleep: { cpu_id = 6 }, { vpid = 19495, vtid = 20649, pid = 19495, tid = 20649 }, { comm = "bcache_writebac", tid = 21054, delay = 2815 }
> [23:33:48.060588560] (+0.000000880) localhost.localdomain sched_wakeup: { cpu_id = 6 }, { vpid = 19495, vtid = 20649, pid = 19495, tid = 20649 }, { comm = "bcache_writebac", tid = 21054, prio = 20, success = 1, target_cpu = 4 }
> [23:33:48.060590437] (+0.000001877) localhost.localdomain syscall_exit_futex: { cpu_id = 4 }, { vpid = 19495, vtid = 21054, pid = 19495, tid = 21054 }, { ret = 0, uaddr = 94912888659008, uaddr2 = 0 }
> [23:33:48.060591337] (+0.000000900) localhost.localdomain syscall_exit_futex: { cpu_id = 6 }, { vpid = 19495, vtid = 20649, pid = 19495, tid = 20649 }, { ret = 1, uaddr = 94912888659008, uaddr2 = 0 }
> [23:33:48.060591385] (+0.000000048) localhost.localdomain workqueue:futex_wait_return: { cpu_id = 4 }, { vpid = 19495, vtid = 21054 }, { futex = 0x5652A1035C40, val = -1, ret = 0 }
> [23:33:48.060592205] (+0.000000820) localhost.localdomain workqueue:futex_wait_again: { cpu_id = 4 }, { vpid = 19495, vtid = 21054 }, { futex = 0x5652A1035C40, val = -1 }
>
> Where the wake_up happens right before the dec_success on the waiter, which
> leads to the sched_wakeup awakening the waiter when the state is 0.
Yes, this makes things clear: urcu_workqueue_queue_work did wake up waiters on
the futex while the futex value was -1.
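For reference, here is a minimal sketch of how I read the wake-side logic
(illustrative only, not verbatim liburcu source; it assumes the usual urcu
futex convention of 0 = worker running, -1 = worker may be sleeping, and an
abridged struct definition):

	#include <errno.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <urcu/arch.h>		/* cmm_smp_mb() */
	#include <urcu/uatomic.h>	/* uatomic_read(), uatomic_set() */
	#include <urcu/futex.h>		/* futex_async(), FUTEX_WAKE */

	struct urcu_workqueue {
		int32_t futex;
		/* ... queue head/tail and qlen omitted ... */
	};

	static void wake_worker_thread(struct urcu_workqueue *workqueue)
	{
		/* Order the enqueue [a] before reading the futex. */
		cmm_smp_mb();
		/* Only wake if a waiter announced itself with -1. */
		if (uatomic_read(&workqueue->futex) != -1)
			return;
		uatomic_set(&workqueue->futex, 0);
		/* This FUTEX_WAKE is the one that lands late in the trace. */
		if (futex_async(&workqueue->futex, FUTEX_WAKE, 1,
				NULL, NULL, 0) < 0)
			abort();	/* liburcu calls urcu_die() here */
	}

With that convention, a FUTEX_WAKE that is executed after the waiter has
already moved the value back to -1 is exactly the stale wakeup the trace
shows.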
>
> If we want to dig more into why this scenario can happen, we could also add tracepoints
> in urcu_workqueue_queue_work() just before/after cds_wfcq_enqueue(), and in
> workqueue_thread() around each call to cds_wfcq_empty(), and around __cds_wfcq_splice_blocking().
>
> I suspect we end up in a situation where:
>
> * waker:
>
> cds_wfcq_enqueue(&workqueue->cbs_head, &workqueue->cbs_tail, &work->next); [a]
> uatomic_inc(&workqueue->qlen);
> wake_worker_thread(workqueue); [b]
>
> vs
>
> * waiter:
> splice_ret = __cds_wfcq_splice_blocking(&cbs_tmp_head, [c]
> &cbs_tmp_tail, &workqueue->cbs_head, &workqueue->cbs_tail);
> [...]
> if (cds_wfcq_empty(&workqueue->cbs_head, [d]
> &workqueue->cbs_tail)) {
> futex_wait(&workqueue->futex); [e]
> uatomic_dec(&workqueue->futex); [f]
>
> happen in an order such that
>
> [a] enqueues an item. Then [c] splices the item out of the queue because it was already awakened
> by a prior wakeup. So the queue is observed as empty by [c]. Then [b] sets the futex to 0
> (in userspace), which causes [e] to skip waiting on the futex, and [f] brings the value back
> to -1. However, in the next loop, the queue is still observed as empty by [d], thus calling
> [e] again. This time, the value is -1, so it calls sys_futex FUTEX_WAIT. Unfortunately, we
> still have the call to sys_futex FUTEX_WAKE that will eventually be executed, thus awakening
> the futex while the futex state is still -1, which is unexpected.
Yes, I agree with you. We'll apply the patch you mentioned to our code to fix this.
Thanks for your help.
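For the record, here is a sketch of how I understand the direction of the
fix (illustrative only, not necessarily the exact patch): make futex_wait()
loop and re-validate the futex value in user space, so a wakeup that arrives
while the value is still -1 is treated as spurious:

	#include <errno.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <urcu/arch.h>		/* cmm_smp_mb() */
	#include <urcu/uatomic.h>	/* uatomic_read() */
	#include <urcu/futex.h>		/* futex_async(), FUTEX_WAIT */

	static void futex_wait(int32_t *futex)
	{
		/* Read the queue state before reading the futex. */
		cmm_smp_mb();
		while (uatomic_read(futex) == -1) {
			if (!futex_async(futex, FUTEX_WAIT, -1,
					NULL, NULL, 0)) {
				/*
				 * Woken while the value is still -1: this
				 * is the stale FUTEX_WAKE from the traced
				 * scenario. Re-validate instead of
				 * returning. Sleeping again is safe because
				 * a real waker sets the value to 0 before
				 * calling FUTEX_WAKE.
				 */
				continue;
			}
			switch (errno) {
			case EWOULDBLOCK:
				return;		/* Value already changed. */
			case EINTR:
				break;		/* Retry on signal. */
			default:
				abort();	/* liburcu would urcu_die() */
			}
		}
	}

That way the leftover FUTEX_WAKE from the traced scenario costs one extra
loop iteration instead of leaving the worker spinning.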
Since cds_wfcq_enqueue/__cds_wfcq_splice_blocking do nothing to
workqueue->futex, adding debug info to check whether the workqueue is empty
does not seem necessary to me.
Let me know if you need anything else on this.
Thanks, again.
Minlan