[Userspace RCU] Introducing a new data structure: Fractal Trie

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Wed Apr 1 13:35:09 EDT 2026


Hi!

I've implemented a new RCU Trie data structure, and would welcome
your feedback while it's still in development.

It currently sits in a development branch of my personal
userspace-rcu repository. It targets userspace applications, but
could be ported to the Linux kernel if there is a use-case for it.

https://github.com/compudj/userspace-rcu-dev/tree/fractal-trie-dev

I started working on this project around 2012, and have recently
found time to pursue it again. (Many thanks to ISC.org for funding
this work!)

Here are a few major use-cases I see as a good fit:

- DNS resolver,
- IP routing tables,
- real-time application data-structure lookups with strict
   real-time bounds (wait-free, guaranteed bounded number of
   cache-line loads per key byte).

I'm pasting the API header top comment here as a summary of the
feature set:

/*
  * urcu/fractal-trie.h
  *
  * Userspace RCU library - Fractal Trie
  *
  * A concurrent, RCU-protected ordered trie mapping variable or
  * fixed-length byte keys to user-defined nodes. Keys are opaque
  * byte sequences with no reserved or sentinel values; unlike
  * tries that rely on NUL or other terminal characters, any byte
  * value may appear at any position in a key. Lookups and
  * traversals are wait-free under the RCU read-side lock and can
  * proceed concurrently with mutations. The internal structure is
  * acyclic by construction: no sequence of concurrent updates can
  * introduce a cycle, ensuring that all lookups and traversals
  * complete in bounded time. Supports exact lookup, partial
  * (prefix) match, ordered iteration, duplicate key chains, and
  * range queries (<=, >=, <, >). A longest-match lookup reports
  * the deepest trie position matching a prefix of the input key,
  * even if that position has no external node.
  *
  * Performance characteristics:
  *
  * - Lookup: At most 2 cache-line accesses per key byte.
  * - Ordered traversal: At most 3 cache-line accesses per node.
  *
  * Internal node configurations:
  *
  * Internal nodes self-adapt to the key population using several
  * configurations (linear, 1D pool, 2D pool, pigeon), each with
  * a different indexing strategy suited to its child density.
  * The appropriate configuration is chosen automatically based
  * on the number of children. Node sizes are powers of 2 between
  * 16 bytes and 2048 bytes on 64-bit architectures (1024 bytes
  * on 32-bit). Mutations use in-place updates when possible to
  * minimize node recompaction, and hysteresis at size thresholds
  * prevents repeated recompaction when the child count oscillates
  * near a boundary.
  *
  * The 1D and 2D pool configurations partition the 8-bit key
  * byte space using one or two bit positions respectively,
  * distributing children across sub-nodes. This provides a
  * range of intermediate node sizes between the compact linear
  * configuration and the full 256-entry pigeon configuration,
  * allowing memory-efficient representation of medium-density
  * populations without requiring the full pigeon footprint.
  * The bit positions are selected to minimize the maximum
  * sub-node population, using a minimax criterion. The
  * transition thresholds and worst-case sub-node sizes were
  * empirically validated by brute-force enumeration of optimal
  * bit selections across millions of random populations.
  * Alternative approaches such as Judy use a population bitmap
  * with a dense child array, which requires recompacting the
  * array on every insertion or removal. This is incompatible
  * with wait-free RCU lookups, since a reader could observe a
  * partially recompacted array. The pool approach avoids this
  * by using fixed index positions derived from key bits,
  * allowing children to be added or removed with single-pointer
  * updates visible atomically to concurrent readers.
  *
  * Node type and configuration are encoded in the low bits of
  * child pointers (tagged pointers), so determining a node's
  * layout during lookup requires no extra memory access.
  *
  * The 2D pool and pigeon configurations use a 32-byte bitmap
  * to locate populated children via bit scanning, reducing the
  * number of cache-line accesses needed for ordered traversal.
  *
  * Memory layout:
  *
  * Per-node metadata is stored in a separate page via a strided
  * allocator and reached by pointer offset from the node address,
  * keeping metadata out of the node's cache lines. This avoids
  * padding overhead within nodes and ensures that lookups and
  * traversals only touch the data they need. Internal nodes are
  * reclaimed via RCU (call_rcu), so readers never access freed
  * memory even for the trie's own internal structure.
  *
  * This provides memory efficiency comparable to adaptive radix
  * tree schemes without requiring user tuning or configuration.
  *
  * Graft, graft-swap, and detach:
  *
  * The graft, graft-swap, and detach operations move entire
  * sub-tries between trie instances within the same group. Each
  * operation is a single pointer store visible atomically to
  * concurrent RCU readers: a reader sees either the complete
  * sub-trie or nothing, never a partial state.
  *
  * - Graft (cds_ft_graft) attaches the content of a source trie
  *   at a key position in a destination trie. The destination
  *   must have no content at or below the graft point. On
  *   success the source trie becomes empty and can be reused or
  *   destroyed.
  *
  * - Graft-swap (cds_ft_graft_swap) exchanges the content at a
  *   key position in a destination trie with the content of
  *   another trie. Unlike plain graft, the destination does not
  *   need to be empty at the graft point: any pre-existing
  *   content is moved into the swap trie. Concurrent readers
  *   see either the old content or the new content, never an
  *   empty intermediate state.
  *
  * - Detach (cds_ft_detach) removes the sub-structure rooted at
  *   a key position and returns it as a new, independent trie
  *   instance whose key lengths are relative to the detach
  *   point.
  *
  * Root-level operations (key_len 0) work with both fixed-length
  * and variable-length key groups. Non-root operations (key_len
  * > 0) require a variable-length key group
  * (CDS_FT_LEN_VARIABLE): in a fixed-length group every key
  * must equal the group's fixed length, so a shorter prefix
  * needed to address an interior graft or detach point cannot be
  * expressed, and the call returns
  * CDS_FT_STATUS_INVALID_ARGUMENT_ERROR.
  *
  * Together, these operations serve as the transplant, exchange,
  * and split primitives for bulk operations. A typical bulk-load
  * pattern populates a staging trie offline — with no RCU or
  * locking overhead — and then grafts it into the live trie in
  * a single O(1) step, keeping the writer critical section
  * minimal regardless of the number of nodes being moved. A
  * complementary bulk-removal pattern detaches (or graft-swaps)
  * a sub-trie in O(1), waits for a single grace period, and
  * then drains the detached trie locally, reducing the cost
  * from one grace period per node to one grace period total.
  *
  * Iterator Lifecycle and RCU Locking:
  *
  * The `cds_ft_iter` object can operate in two path modes, selected via
  * cds_ft_iter_set_path_mode():
  *
  * CDS_FT_ITER_PATH_CACHED (default):
  *
  * The iterator caches the traversal path (RCU protected pointers) to
  * accelerate subsequent sequential operations like cds_ft_next(),
  * cds_ft_prev(), or cds_ft_remove().
  *
  * Because of this caching, the RCU read-side lock must be held
  * CONTINUOUSLY between the operation that populates the iterator
  * (e.g., cds_ft_lookup) and the operation that consumes it. If the
  * RCU read-side lock is dropped, the cached path may point to freed
  * memory.
  *
  * If you need to drop the RCU read-side lock between operations, you
  * MUST invalidate the iterator's cached path before reusing it. This
  * is done by calling:
  * - cds_ft_iter_invalidate_path()
  *
  * Calling this function clears the internal path state while keeping
  * your key intact. The next operation (e.g., cds_ft_remove) will
  * automatically fall back to a safe, fresh top-down traversal.
  *
  * Example A (Continuous Lock - Fast):
  * rcu_read_lock();
  * cds_ft_lookup(ft, iter);
  * cds_ft_remove(ft, iter, node); // Uses cached path O(1)
  * rcu_read_unlock();
  *
  * Example B (Dropped Lock - CDS_FT_ITER_PATH_CACHED only):
  * rcu_read_lock();
  * cds_ft_lookup(ft, iter);
  * rcu_read_unlock();
  * // ... time passes, lock is dropped ...
  *
  * lock(&writer_mutex);
  * cds_ft_iter_invalidate_path(iter); // Flush the stale RCU cache
  * cds_ft_remove(ft, iter, node);     // Safe: Fresh traversal protected by writer lock
  * unlock(&writer_mutex);
  *
  * Note: CDS_FT_ITER_PATH_UNCACHED iterators do not require
  * cds_ft_iter_invalidate_path() — the path is discarded
  * automatically after each operation.
  *
  * Debug validation (URCU_FRACTAL_TRIE_DEBUG_PATH):
  *
  * Building with URCU_FRACTAL_TRIE_DEBUG_PATH defined enables run-time
  * detection of stale cached paths.  Each path population records an RCU
  * grace-period snapshot (via the flavor's
  * update_start_poll_synchronize_rcu); each path consumption polls it
  * (via update_poll_state_synchronize_rcu).  If a full grace period has
  * elapsed since the path was populated, the cached pointers may
  * reference freed memory — the program aborts with a diagnostic.
  * Since struct cds_ft_iter is opaque, this option does not affect the
  * application ABI — only the library needs to be rebuilt.
  *
  * CDS_FT_ITER_PATH_UNCACHED:
  *
  * The iterator automatically discards the traversal path after each
  * operation returns. Every subsequent operation performs a fresh
  * top-down traversal from the current key. The RCU read-side lock
  * only needs to be held during each individual operation and while
  * accessing the returned node — it may be dropped between operations.
  *
  * This mode is suited for iteration patterns where the caller needs
  * to drop the RCU read-side lock between steps (e.g. to perform
  * blocking work or acquire other locks).
  *
  * Example C (Uncached mode, per-step locking with refcount):
  * cds_ft_iter_set_path_mode(iter, CDS_FT_ITER_PATH_UNCACHED);
  * rcu_read_lock();
  * cds_ft_for_each_rcu(ft, iter) {
  *         struct my_entry *e = cds_ft_entry(
  *                 cds_ft_iter_node(iter),
  *                 struct my_entry, ft_node);
  *         refcount_inc(&e->refcount);
  *         rcu_read_unlock();
  *
  *         do_blocking_work(e);
  *         refcount_dec(&e->refcount);
  *
  *         rcu_read_lock();
  * }
  * rcu_read_unlock();
  */
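
For reference, the bulk-load pattern from the graft section looks roughly
like this. This is a non-compilable sketch in the style of the examples
above: the prototypes are assumed for illustration, and `cds_ft_insert()`
in particular is a hypothetical name; see urcu/fractal-trie.h in the branch
for the real API.

```c
/* 1) Populate a staging trie offline: no readers, no RCU or
 *    locking overhead. */
for (i = 0; i < nr_entries; i++)
        cds_ft_insert(staging, entries[i].key, entries[i].key_len,
                      &entries[i].ft_node);

/* 2) Graft the staging trie into the live trie under the writer
 *    lock: a single pointer store, visible atomically to
 *    concurrent RCU readers. */
lock(&writer_mutex);
ret = cds_ft_graft(live, prefix, prefix_len, staging);
unlock(&writer_mutex);

/* 3) On success, 'staging' is empty and can be reused or
 *    destroyed. */
```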

Your feedback is welcome!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



More information about the lttng-dev mailing list