[lttng-dev] ThreadSanitizer: data race between urcu_mb_synchronize_rcu and urcu_adaptative_wake_up

Wed Mar 22 07:01:02 EDT 2023

> On 22. 3. 2023, at 9:02, Ondřej Surý via lttng-dev <lttng-dev at lists.lttng.org> wrote:
> 
> That's rather odd, because the "Write" happens on a stack-local variable,
> while the "Previous write" happens after the futex call, which led me to conclude
> that ThreadSanitizer doesn't intercept futex. We can, however, annotate the futexes:
> 
> https://groups.google.com/g/thread-sanitizer/c/T0G_NyyZ3s4

For the record, neither annotating the futex with __tsan_acquire(addr) and
__tsan_release(addr) nor falling back to compat_futex_async() for ThreadSanitizer
has helped.

It seems to me that TSAN still doesn't understand the synchronization between
RCU read-side critical sections and call_rcu()/synchronize_rcu(), as I am also
getting reports like the following:

  Write of size 8 at 0x7b54000009c0 by thread T102:
    #0 __tsan_memset <null> (badcache_test+0x49257d) (BuildId: a7c1595d61e3ee411276cf89a536a8daefa959a3)
    #1 mem_put /home/ondrej/Projects/bind9/lib/isc/mem.c:324:3 (libisc-9.19.12-dev.so+0x7d136) (BuildId: a33cd26e483b73684928b4782627f1278c001605)
    #2 isc__mem_put /home/ondrej/Projects/bind9/lib/isc/mem.c:684:2 (libisc-9.19.12-dev.so+0x7e0c3) (BuildId: a33cd26e483b73684928b4782627f1278c001605)
    #3 bcentry_destroy_rcu /home/ondrej/Projects/bind9/lib/dns/badcache.c:163:2 (libdns-9.19.12-dev.so+0x4e071) (BuildId: 8a550b795003cd1075ff29590734c806d84e76e6)
    #4 call_rcu_thread /home/ondrej/Projects/userspace-rcu/src/../src/urcu-call-rcu-impl.h:389:5 (liburcu-mb.so.8+0x9d6b) (BuildId: d4f5ea9d96625c7b7d2b2efb590b208f7b83cb6f)

  Previous atomic write of size 8 at 0x7b54000009c0 by main thread (mutexes: write M0):
    #0 ___cds_wfcq_append /home/ondrej/Projects/userspace-rcu/src/../include/urcu/static/wfcqueue.h:202:2 (liburcu-mb.so.8+0xa8ae) (BuildId: d4f5ea9d96625c7b7d2b2efb590b208f7b83cb6f)
    #1 _cds_wfcq_enqueue /home/ondrej/Projects/userspace-rcu/src/../include/urcu/static/wfcqueue.h:223:9 (liburcu-mb.so.8+0xac09) (BuildId: d4f5ea9d96625c7b7d2b2efb590b208f7b83cb6f)
    #2 _call_rcu /home/ondrej/Projects/userspace-rcu/src/../src/urcu-call-rcu-impl.h:719:2 (liburcu-mb.so.8+0x604f) (BuildId: d4f5ea9d96625c7b7d2b2efb590b208f7b83cb6f)
    #3 urcu_mb_barrier /home/ondrej/Projects/userspace-rcu/src/../src/urcu-call-rcu-impl.h:932:3 (liburcu-mb.so.8+0x4d1b) (BuildId: d4f5ea9d96625c7b7d2b2efb590b208f7b83cb6f)
    #4 badcache_flush /home/ondrej/Projects/bind9/lib/dns/badcache.c:329:2 (libdns-9.19.12-dev.so+0x4d8b3) (BuildId: 8a550b795003cd1075ff29590734c806d84e76e6)
    [...]

That is, ThreadSanitizer reports a race between the point where bcentry->rcu_head
is added to the call_rcu() queue and the point where the call_rcu callbacks are
invoked.  Annotating the bcentry with acquire/release helps with this particular
data race, but adding annotations at this level does not feel right to me.

The code there is not very complicated:

static void
bcentry_destroy_rcu(struct rcu_head *rcu_head) {
        dns_bcentry_t *bad = caa_container_of(rcu_head, dns_bcentry_t,
                                              rcu_head);
        /* __tsan_release(bad); <-- this helps */
        dns_badcache_t *bc = bad->bc;

        isc_mem_put(bc->mctx, bad, sizeof(*bad));

        dns_badcache_detach(&bc);
}

static void
bcentry_evict(struct cds_lfht *ht, dns_bcentry_t *bad) {
        /* There can be multiple deleters now */
        if (cds_lfht_del(ht, &bad->ht_node) == 0) {
                /* __tsan_acquire(bad); <- this helps */
                call_rcu(&bad->rcu_head, bcentry_destroy_rcu);
        }
}

static void
badcache_flush(dns_badcache_t *bc, struct cds_lfht *ht) {
        struct cds_lfht *oldht = rcu_xchg_pointer(&bc->ht, ht);

        synchronize_rcu();

        rcu_read_lock();
        dns_bcentry_t *bad = NULL;
        struct cds_lfht_iter iter;
        cds_lfht_for_each_entry (oldht, &iter, bad, ht_node) {
                bcentry_evict(oldht, bad);
        }
        rcu_read_unlock();
        rcu_barrier();
        RUNTIME_CHECK(cds_lfht_destroy(oldht, NULL) == 0);
}
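
For comparison, here is a minimal sketch of how such a hand-off is usually
annotated: release on the side that queues the object, acquire in the callback
before the object is touched (the commented-out annotations above sit the other
way round). Everything prefixed with sketch_ or SKETCH_ is hypothetical and is
not part of liburcu or BIND 9. The deferral list is a plain single-threaded
stand-in for the call_rcu() queue, and the sketch assumes the toolchain ships
<sanitizer/tsan_interface.h> declaring __tsan_acquire() and __tsan_release().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Detect ThreadSanitizer under GCC (__SANITIZE_THREAD__) or Clang (__has_feature). */
#if defined(__SANITIZE_THREAD__)
#define SKETCH_TSAN 1
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define SKETCH_TSAN 1
#endif
#endif

#ifdef SKETCH_TSAN
#include <sanitizer/tsan_interface.h>
#define SKETCH_RELEASE(addr) __tsan_release((void *)(addr))
#define SKETCH_ACQUIRE(addr) __tsan_acquire((void *)(addr))
#else
/* Without TSAN the annotations compile to nothing. */
#define SKETCH_RELEASE(addr) ((void)(addr))
#define SKETCH_ACQUIRE(addr) ((void)(addr))
#endif

/* Hypothetical stand-in for an object reclaimed through a deferred callback. */
struct sketch_entry {
        struct sketch_entry *next;
        int payload;
};

/* Hypothetical deferral list; NOT thread-safe, it only illustrates ordering. */
static struct sketch_entry *sketch_pending;

/* Producer side: finish writing the entry, release, then publish it. */
static void
sketch_defer(struct sketch_entry *e) {
        assert(e != NULL);
        e->next = sketch_pending;
        SKETCH_RELEASE(e);
        sketch_pending = e;
}

/* Consumer side: acquire before touching the entry, then reclaim it. */
static int
sketch_run_deferred(void) {
        int reclaimed = 0;

        while (sketch_pending != NULL) {
                struct sketch_entry *e = sketch_pending;
                sketch_pending = e->next;
                SKETCH_ACQUIRE(e);
                free(e);
                reclaimed++;
        }
        return reclaimed;
}
```

In the code above, this ordering would correspond to a release just before
call_rcu() in bcentry_evict() and an acquire at the top of
bcentry_destroy_rcu(). Whether that silences the report any better than the
placement shown above is untested here.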

Any ideas?

Ondrej
--
Ondřej Surý (He/Him)
ondrej at sury.org