[lttng-dev] urcu/rculist.h clarifications - for implementing LRU

Mon Mar 13 11:30:21 EDT 2023

Hi Matthieu,

I spent some more time with the userspace-rcu on Friday and over weekend and
now I am in much better place.

> On 13. 3. 2023, at 15:29, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:
> 
> On 2023-03-11 01:04, Ondřej Surý via lttng-dev wrote:
>> Hey,
>> so, we are integrating userspace-rcu to BIND 9 (yay!) and as experiment,
>> I am rewriting the internal address database (keeps the infrastructure
>> information about names and addresses).
> 
> That's indeed very interesting !

Thanks for the userspace-rcu! It saves a lot of time - while my colleague Tony Finch
already wrote our internal QSBR implementation from scratch, it would be waste of
time to try to reimplement the CDS part of the library.

This is part of larger work to replace the internal BIND 9 database that's currently
implemented as rwlocked RBT with qptrie, if you are interested Tony has good
summary here: https://dotat.at/@/2023-02-28-qp-bind.html

>> There's a hashtable to keep the entries and there's associated LRU list.
>> For the hashtable the cds_lfht seems to work well, but I am kind of struggling
>> with cds_list (the urcu/rculist.h variant).
>> The names and entries works in pretty much similar way, so I am going to
>> describe just one.
>> The workhorse is get_attached_and_locked_name() function (I am going to
>> skip the parts where we create keys, checks if LRU needs to be updated, etc.)
> 
> It would help if you could share a git branch of your prototype. In order to reason
> about RCU, we typically need to look at both the update-side and the read-side.
> For instance I don't see the read-side of the LRU linked-list in the code snippets
> below. We also need to have a complete picture of the object lifetime, from allocation
> to reclaim/reuse. I don't see where the grace periods (either synchronize_rcu or
> call_rcu) are before reclaim or reuse in the code snippets below.

Sure, happy to, but I didn't really wanted to ask you to look directly at the code, and
I don't *expect* direct help (although any help is surely appreciated).

So, please bear in mind this is far from complete and the commits are utter mess
so far: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/7680

>> So this is part with the hashtable lookup which seems to work well:
>>         rcu_read_lock();
>>         struct cds_lfht_iter iter;
>>         struct cds_lfht_node *ht_node;
>>         cds_lfht_lookup(adb->names_ht, hashval, names_match, &key, &iter);
>>         ht_node = cds_lfht_iter_get_node(&iter);
>>         bool unlink = false;
>>         if (ht_node == NULL) {
>>                 /* Allocate a new name and add it to the hash table. */
>>                 adbname = new_adbname(adb, name, start_at_zone);
>>                 ht_node = cds_lfht_add_unique(adb->names_ht, hashval,
>>                                               names_match, &key,
>>                                               &adbname->ht_node);
>>                 if (ht_node != &adbname->ht_node) {
>>                         /* ISC_R_EXISTS */
>>                         destroy_adbname(adbname);
>>                         adbname = NULL;
>>                 }
>>         }
>>         if (adbname == NULL) {
>>                 INSIST(ht_node != NULL);
>>                 adbname = caa_container_of(ht_node, dns_adbname_t, ht_node);
>>                 unlink = true;
>>         }
>>         dns_adbname_ref(adbname);
> 
> What is this dns_adbname_ref() supposed to do ? And is there a reference to adbname
> that is still used after rcu_read_unlock() ? What guarantees the existence of the
> adbname after rcu_read_unlock() ?

This is part of the internal reference counting - there's a macro that expects `isc_refcount_t references;`
member on the struct and it creates _ref, _unref, _attach and _detach functions for each struct.

The last _detach/_unref calls a destroy function.

>>         rcu_read_unlock();
>> and here's the part where LRU gets updated:
>>         LOCK(&adbname->lock); /* Must be unlocked by the caller */
> 
> I suspect you use a scheme where you hold the RCU read-side to perform the lookup, and
> then you use the object with an internal lock held. But expecting the object to still
> exist after rcu read unlock is incorrect, unless some other reference counting scheme
> is used.

Yeah, I was trying to minimize the sections where we hold the rcu_read locks, but I gave
up and now there's rcu_read lock held for longer periods of time.

>>         if (NAME_DEAD(adbname)) {
>>                 UNLOCK(&adbname->lock);
>>                 dns_adbname_detach(&adbname);
>>                 goto again;
>>         }
>>         if (adbname->last_used + ADB_CACHE_MINIMUM <= last_update) {
>>                 adbname->last_used = now;
>>                 LOCK(&adb->names_lru_lock);
>>                 if (unlink) {
>>                         cds_list_del_rcu(&adbname->list_node);
>>                 }
> 
> This looks odd. I don't see the code implementing traversal of this list, but
> I would expect a grace period between unlink of the node from a list and insertion
> into another list, otherwise if there are RCU readers traversing the list
> concurrently, they can observe an inconsistent state.

That's probably the part I am getting wrong. This all is fairly new to me and my brain
is still adjusting to the new paradigms.

>>                 cds_list_add_tail_rcu(&adbname->list_node, &adb->names_lru);
>>                 UNLOCK(&adb->names_lru_lock);
>>         }
>> The NAME_DEAD gets updated under the adbname->lock in expire_name():
>>         if (!NAME_DEAD(adbname)) {
>>                 adbname->flags |= NAME_IS_DEAD;
>>                 /* Remove the adbname from the hashtable... */
>>                 (void)cds_lfht_del(adb->names_ht, &adbname->ht_node);
> 
> I don't have the full context here, but AFAIR cds_lfht_del() allows two removals
> of the same ht_node to be done concurrently, and only one will succeed (which is
> probably what happens here). cds_list_del_rcu() however does not allow concurrent
> removals of a list_node. So if you somehow get two RCU lookups to find the same
> node in expire_name, one will likely do an extra unexpected cds_list_del_rcu().

Ah, that might explain the behaviour.  However the current branch doesn't manifest
this anymore.

>>                 /* ... and LRU list */
>>                 LOCK(&adb->names_lru_lock);
>>                 cds_list_del_rcu(&adbname->list_node);
>>                 UNLOCK(&adb->names_lru_lock);
>>         }
>> So, now the problem is that sometimes I get a crash under load:
>> (gdb) bt
>> #0  0x00007fae87a34c96 in cds_list_del_rcu (elem=0x7fae37e78880) at /usr/include/x86_64-linux-gnu/urcu/rculist.h:71
>> #1  get_attached_and_locked_name (adb=adb at entry=0x7fae830142a0, name=name at entry=0x7fae804fc9b0, start_at_zone=true, now=<optimized out>) at adb.c:1446
>> #2  0x00007fae87a392bf in dns_adb_createfind (adb=0x7fae830142a0, loop=0x7fae842c3a20, cb=cb at entry=0x7fae87b28d9f <fctx_finddone>, cbarg=0x7fae7c679000, name=name at entry=0x7fae804fc9b0, qname=0x7fae7c679010, qtype=1, options=63, now=<optimized out>, target=0x0, port=53, depth=1, qc=0x7fae7c651060, findp=0x7fae804fc698) at adb.c:2149
>> (gdb) frame 0
>> #0  0x00007fae87a34c96 in cds_list_del_rcu (elem=0x7fae37e78880) at /usr/include/x86_64-linux-gnu/urcu/rculist.h:71
>> 71 elem->next->prev = elem->prev;
>> (gdb) print elem->next
>> $1 = (struct cds_list_head *) 0x0
>> (gdb) print elem
>> $2 = (struct cds_list_head *) 0x7fae37e78880
>> So, I suspect, I am doing something wrong when updating the position of the the name in the LRU list.
>> There are couple of places where we iterate through the LRU list (overmem cleaning can kick-in, the user initiated cleaning can start, shutdown can be happening...)
> 
> It gets me to wonder whether you really need RCU for the LRU list ? Are those lookups
> very frequent ? And do they typically end up needing to grab a lock to protect against
> concurrent list modifications ?

These are lookups that happen frequently when the cache is cold - it keeps the delegation
mapping (e.g. **name** is server by **nameserver** (adbname) that has these **IP addresses**
(adbentries)).

There are two more places in the resolver hot path that are using hashtables now which I intend
to replace with rculfhash:

1. "finds" (this is for coalescing client requests, e.g. if 100 clients asks for google.com, the should be
   only one outgoing query)

2. resolver counter - these are used as bandaid protection against random-subdomain attacks, but
   the logic is similar - make a lookup for parent domain and if the counter exceeds <limit>, drop the
   extra query.

These four are in the cold cache resolver hotpath and are/were originally protected by RWLOCK
(using non-glibc RWLOCK helped a bit, but RCU is better).

>> Is there perhaps already some LRU implementation using Userspace-RCU that I can take look at?
> 
> I don't have an example implementing an LRU with a linked list specifically, but this is not
> different from other linked-list uses.

Thank you,
Ondrej
--
Ondřej Surý (He/Him)
ondrej at sury.org