[lttng-dev] urcu/rculist.h clarifications - for implementing LRU

Ondřej Surý ondrej at sury.org
Tue Mar 14 07:28:08 EDT 2023

Hi Mathieu,

> On 13. 3. 2023, at 21:02, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:
> On 2023-03-13 11:30, Ondřej Surý wrote:
>> Hi Mathieu,
>> I spent some more time with the userspace-rcu on Friday and over weekend and
>> now I am in much better place.
>>> On 13. 3. 2023, at 15:29, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:
>>> On 2023-03-11 01:04, Ondřej Surý via lttng-dev wrote:
>>>> Hey,
>>>> so, we are integrating userspace-rcu to BIND 9 (yay!) and as experiment,
>>>> I am rewriting the internal address database (keeps the infrastructure
>>>> information about names and addresses).
>>> That's indeed very interesting !
>> Thanks for the userspace-rcu! It saves a lot of time - while my colleague Tony Finch
>> already wrote our internal QSBR implementation from scratch, it would be a waste of
>> time to try to reimplement the CDS part of the library.
>> This is part of a larger work to replace the internal BIND 9 database that's currently
>> implemented as a rwlocked RBT with a qp-trie; if you are interested, Tony has a good
>> summary here: https://dotat.at/@/2023-02-28-qp-bind.html
> Speaking of tries, I have implemented RCU Judy arrays in liburcu feature branches a
> while back. Those never made it to the liburcu master branch because I had no real-life
> use for those so far, and I did not want to expose a public API that would bitrot without
> real-life user feedback.
> The lookups and ordered traversals (next/prev) are entirely RCU, and updates are
> either single-threaded, or use a strategy where locking is distributed within
> the trie, so updates to spatially discontinuous data do not contend with each other.
> My original implementation supported integer keys as well as variable-length string keys.
> The advantage of Judy arrays is that they minimize the number of cache lines touched
> on lookup traversal. Let me know if this would be useful for your use-cases, and if
> so I can provide links to prototype branches.

I don't think we are going to use this right away, but I would be happy to look at the branch.

I've looked at the Judy arrays before, but the normal implementation alone made my brain
hurt, so RCU Judy arrays seem like something I would like to look at even if we are not
going to use Judy arrays right now or ever.

>>> I suspect you use a scheme where you hold the RCU read-side to perform the lookup, and
>>> then you use the object with an internal lock held. But expecting the object to still
>>> exist after rcu read unlock is incorrect, unless some other reference counting scheme
>>> is used.
>> Yeah, I was trying to minimize the sections where we hold the rcu_read locks, but I gave
>> up and now the rcu_read lock is held for longer periods of time.
> We've used that kind of scheme in LTTng lttng-relayd, where we use RCU for short-term
> existence guarantee, and reference counting for longer-term existence guarantee. An
> example can be found here:
> https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-relayd/viewer-stream.cpp
> viewer_stream_get_by_id() attempts a lookup from the hash table, and re-validates that the
> object exists with viewer_stream_get(), which checks whether the refcount is already 0 as it
> tries to increment it with urcu_ref_get_unless_zero(). If it is zero, it behaves as if the
> object was not found. I recommend this kind of scheme if you intend to use both RCU and
> reference counting.
> Then you can place a mutex within the object, and use that mutex to provide mutual
> exclusion between concurrent accesses to the object that need to be serialized.
> In the destroy handler (called when the reference count reaches 0), you will typically
> want to unlink your object from the various data structures holding references to it
> (hash tables, lists), and then use call_rcu() to invoke reclaim of the object after a
> grace period.

Thanks, the description looks very similar to what we do. The only difference is that a special
flag is used to mark the name/entry as DEAD - it's possible to force-expire an entry from the
cache, so reference counting alone would not be sufficient.

I'll take a look at the lttng-tools code. Thank you.
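For context, the get-unless-zero pattern can be sketched with plain C11 atomics. This is a
stand-alone illustration only - the `ref` type and `ref_get_unless_zero()` are made-up names,
not the BIND 9 or liburcu code; liburcu provides the real thing as urcu_ref_get_unless_zero()
in <urcu/ref.h>:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative refcount embedded in an RCU-protected object. */
struct ref {
	atomic_long refcount;
};

/*
 * Take a reference only if the count is not already zero, i.e. the
 * object is not already mid-teardown.  Returns false if the object
 * should be treated as "not found".
 */
static bool ref_get_unless_zero(struct ref *ref)
{
	long old = atomic_load(&ref->refcount);

	while (old != 0) {
		/* On failure, the CAS reloads 'old' and we retry. */
		if (atomic_compare_exchange_weak(&ref->refcount,
						 &old, old + 1))
			return true;
	}
	return false;
}
```

In the lookup path this would run under rcu_read_lock(): RCU guarantees the memory is still
valid to inspect, and a failed get means the object is being destroyed, so the lookup treats
it as a miss.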

>>>>         if (NAME_DEAD(adbname)) {
>>>>                 UNLOCK(&adbname->lock);
>>>>                 dns_adbname_detach(&adbname);
>>>>                 goto again;
>>>>         }
>>>>         if (adbname->last_used + ADB_CACHE_MINIMUM <= last_update) {
>>>>                 adbname->last_used = now;
>>>>                 LOCK(&adb->names_lru_lock);
>>>>                 if (unlink) {
>>>>                         cds_list_del_rcu(&adbname->list_node);
>>>>                 }
>>> This looks odd. I don't see the code implementing traversal of this list, but
>>> I would expect a grace period between unlink of the node from a list and insertion
>>> into another list, otherwise if there are RCU readers traversing the list
>>> concurrently, they can observe an inconsistent state.
>> That's probably the part I am getting wrong. This all is fairly new to me and my brain
>> is still adjusting to the new paradigms.
> One way would be to invoke synchronize_rcu() between del_rcu and add_tail_rcu,
> but it would probably be too costly performance-wise.

Aha! So, this was this piece I was missing.

> Another way that would be a better fit to the RCU mindset would be to create a
> copy of the adbname object, add _that_ object to the tail, and use call_rcu
> to free the old object that has been removed. There is of course extra overhead
> associated with this copy.

I was thinking about adding one layer of indirection:

struct {
  adbname_t *adbname;
  struct cds_list_head node;
};

And then when re-enqueueing, only this small node would have to be removed and re-added, not the
whole adbname, which holds quite a lot of data.

>>> It gets me to wonder whether you really need RCU for the LRU list ? Are those lookups
>>> very frequent ? And do they typically end up needing to grab a lock to protect against
>>> concurrent list modifications ?
>> These are lookups that happen frequently when the cache is cold - it keeps the delegation
>> mapping (e.g. **name** is served by **nameserver** (adbname) that has these **IP addresses**
>> (adbentries)).
> OK.
> So the characteristics of this LRU, AFAIU, are:
> - frequent enqueue (most recent),
> - frequent "pick" item within the list to re-enqueue it as most recent,

This happens in get_attached_and_locked_{name,entry}:

1. look up the name
2. if not found, add the name (with failover in case it was added elsewhere in the meantime) and enqueue it
3. if found, check whether it's still alive and re-enqueue it on the LRU

> - frequent dequeue of the last N oldest items.

These would be purge_stale_names() and purge_stale_entries().

> However I don't see that iteration on the list is inherently needed, at least
> not frequently, except as a means to implement the operations above, am I correct ?

That's correct - there are a couple of infrequent operations:
- dumping the cache to a file descriptor (that needs iteration, but I could iterate the hashtable, not the list)
- removing all the elements from the cache (it uses an iterator right now, but I guess I could swap the cache and call_rcu() the old one)
- removing a specific entry from the cache (this uses a hash lookup)
- removing a subtree (in the DNS sense) - this uses the iterator to walk all the entries in the cache and force-expires every name that is a subdomain
- and then of course the cleanup at shutdown

Ondřej Surý (He/Him)
ondrej at sury.org
