from Hacker News

The concurrency trap: How an atomic counter stalled a pipeline

by delifue on 6/7/25, 11:25 AM with 43 comments

  • by hinkley on 6/10/25, 6:12 PM

    > The big difference between a concurrent hash map and ArcSwap is that ArcSwap requires swapping out the entirety of the underlying data with every write but trades this off with very cheap reads. Writers don’t even have to wait for all readers to finish since a new epoch is created with the new version of data.

    This is where tree data structures win out. Git, Subversion, and pure functional languages don’t replace the entire data structure. They make a new element and attach it to the tree by creating a new version of the intended parent, and a new version of its parent, and so on up to the root element.

    So somewhere out there is space for a hybrid of ArcSwap and DashMap that uses sharded copy on write. These guys probably don’t need it given their low write rate, but shards should help with write contention for other use cases.
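
    A rough sketch of what such a sharded copy-on-write hybrid could look like in Rust (hypothetical type and field names; it leans on arc_swap's rcu so concurrent writers to the same shard retry instead of losing updates):

      use std::collections::HashMap;
      use std::collections::hash_map::DefaultHasher;
      use std::hash::{Hash, Hasher};
      use std::sync::Arc;
      use arc_swap::ArcSwap;

      // Hypothetical hybrid: readers do one atomic load on their shard,
      // writers clone and swap only that shard instead of the whole map.
      struct ShardedCowMap {
          shards: Vec<ArcSwap<HashMap<String, u64>>>,
      }

      impl ShardedCowMap {
          fn new(n: usize) -> Self {
              // n must be > 0
              Self { shards: (0..n).map(|_| ArcSwap::from_pointee(HashMap::new())).collect() }
          }

          fn shard(&self, key: &str) -> &ArcSwap<HashMap<String, u64>> {
              let mut h = DefaultHasher::new();
              key.hash(&mut h);
              &self.shards[(h.finish() as usize) % self.shards.len()]
          }

          // Read path: one atomic load, no locks, no per-read refcount traffic.
          fn get(&self, key: &str) -> Option<u64> {
              self.shard(key).load().get(key).copied()
          }

          // Write path: copy-on-write confined to one shard, so the clone cost
          // and the write contention scale with the shard, not the whole map.
          fn insert(&self, key: String, value: u64) {
              self.shard(&key).rcu(|cur| {
                  let mut next = HashMap::clone(cur);
                  next.insert(key.clone(), value);
                  Arc::new(next)
              });
          }
      }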

  • by nick_ on 6/10/25, 6:34 PM

    It's unclear whether the author understands false sharing. The excessive "ping-pong"ing of cache lines between cores suggests that the cache line the counter sits on is shared with other memory that is also being accessed under contention.

    Making sure the counter variable is sitting alone on its own cache line would completely remove the excessive ping-pong behaviour. The cache line with the counter would, of course, still ping-pong as it's mutated, but not excessively.
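
    To illustrate the fix (hypothetical field names; crossbeam_utils::CachePadded does the same thing generically):

      use std::sync::atomic::{AtomicU64, Ordering};

      // Force the hot counter onto its own cache line so writes to neighbouring
      // fields can no longer invalidate the line it lives on (false sharing).
      #[repr(align(64))] // common x86 cache-line size; some ARM cores use 128
      struct PaddedCounter(AtomicU64);

      struct Stats {
          // Without the alignment wrapper, `processed` could share a cache line
          // with `errors`, and updates to either would bounce the line around.
          processed: PaddedCounter,
          errors: AtomicU64,
      }

      fn record(stats: &Stats) {
          stats.processed.0.fetch_add(1, Ordering::Relaxed);
      }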

  • by hinkley on 6/10/25, 5:55 PM

    > However, further analysis in the perf environment revealed that while there was a spike in the number of active sessions, those gradually dropped off while the DAG processing time continued to stay high.

    Thundering herds will do that. Queuing theory. Once the queue builds up it takes far longer to clear than intuition suggests.
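
    To put rough numbers on it (illustrative figures): a service that can handle 1,000 requests/s with requests still arriving at 950 requests/s drains its backlog at only 1,000 - 950 = 50 requests/s, so a 10,000-request queue that built up during a brief spike takes 10,000 / 50 = 200 seconds to clear, even though the system is "only" at 95% utilization.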

  • by cogman10 on 6/14/25, 12:16 AM

    Not a bad tradeoff and one I've made in the past.

    The tricky part (which the author solved using rcu) is that it's REALLY easy to introduce a race condition here.

    Imagine thread A and thread B want to add a new value to the hashmap. They arrive at the same time. arc_swap used naively would ultimately have one of those threads win, and the value from the other thread would disappear. Not great.

    But another thing to consider with this approach is that it's expensive under any sort of write contention. As soon as writes start to spike, the number of allocations increases in a potentially really bad way. If you have three threads that want to write at the same time, the first will create one full copy, the second will create two full copies (its first attempt loses the race and gets retried), and the third will create three. If your underlying data structure has any sort of memory allocated to it, this becomes a pretty bad trade-off.

    Funnily enough, one easy way to solve this problem is to introduce locks again, but only for the writer. That's possibly a good idea if you are spending more than ~1000 cycles creating the data structure.
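
    A minimal sketch of that writer-side lock on top of arc_swap (illustrative names; readers stay lock-free, writers serialize their clone-and-store so no update gets lost):

      use std::collections::HashMap;
      use std::sync::{Arc, Mutex};
      use arc_swap::ArcSwap;

      struct CowMap {
          map: ArcSwap<HashMap<String, u64>>,
          write_lock: Mutex<()>, // only writers ever touch this
      }

      impl CowMap {
          fn new() -> Self {
              Self { map: ArcSwap::from_pointee(HashMap::new()), write_lock: Mutex::new(()) }
          }

          // Readers: a single atomic load, exactly as before.
          fn get(&self, key: &str) -> Option<u64> {
              self.map.load().get(key).copied()
          }

          // Writers: lock, clone the current map, mutate, swap in the new one.
          // With the lock held there is only one clone+store in flight, so a
          // concurrent writer can no longer overwrite another writer's insert.
          fn insert(&self, key: String, value: u64) {
              let _guard = self.write_lock.lock().unwrap();
              let mut next = HashMap::clone(&self.map.load());
              next.insert(key, value);
              self.map.store(Arc::new(next));
          }
      }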

  • by athrowaway3z on 6/10/25, 8:12 PM

    I like the idea of arc_swap, but I tried skimming the code to see if it fits my mental model and found it far more complex than I expected.

  • by nasretdinov on 6/11/25, 5:57 AM

    Hate to be that person, but Go developers hit it first and then implemented a (really nice, in my opinion) concurrent hash map that does something similar to what is described in the article, but without copying the entire hash table each time: https://pkg.go.dev/sync#Map

  • by conradludgate on 6/14/25, 1:24 AM

    For read-heavy workloads, https://docs.rs/papaya/latest/papaya/ should be pretty good. It's a proper concurrent hashmap, with incremental resizing. It even has some support for rcu operations per field.

  • by kccqzy on 6/10/25, 7:55 PM

    I haven't seen a pure user-space implementation of RCU that works well without kernel support. The RCU methodology is easy to understand, but how does a user-space implementation know when it is safe to dispose of a piece of the data structure? Don't you need a primitive to run some code on each CPU?

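    (For what it's worth, user-space schemes like epoch-based reclamation get by without kernel help: each reader pins an epoch, and deferred frees only run once every thread has moved past the epoch in which the old pointer was still visible. A rough sketch with crossbeam-epoch, illustrative values only:)

      use std::sync::atomic::Ordering;
      use crossbeam_epoch::{self as epoch, Atomic, Owned};

      fn demo() {
          let slot = Atomic::new(String::from("v1")); // shared, swappable value

          // Reader: pin the epoch, load, use. No locks, no per-read refcount.
          {
              let guard = epoch::pin();
              let shared = slot.load(Ordering::Acquire, &guard);
              if let Some(s) = unsafe { shared.as_ref() } {
                  println!("read {s}");
              }
          } // guard dropped: this thread no longer holds the old epoch

          // Writer: swap in a new value and defer freeing the old one. The
          // deferred destruction runs only after all pinned readers unpin.
          {
              let guard = epoch::pin();
              let old = slot.swap(Owned::new(String::from("v2")), Ordering::AcqRel, &guard);
              unsafe { guard.defer_destroy(old) };
          }

          // Reclaim the final value when tearing the structure down.
          drop(unsafe { slot.into_owned() });
      }
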
  • by cryptonector on 6/10/25, 10:31 PM

    TFA is about concurrent hashmaps that croak under heavy read load because their synchronization mechanisms cause cache ping-pong; the fix was to use RCU with a copy-on-write data structure for the hashmap. Each write requires cloning the previous hashmap, but because the readers don't perform any atomic operations other than one read (of the current read-only hashmap), it all works out: the cache lines don't have to ping-pong, since the readers are not writing.

    The stuff about "debt" is about the design of ArcSwap, which uses "debt" instead of "hazard pointers". The idea is that every thread keeps track of the last version of the hashmap it has seen, and this is "debt" (references that need to be released eventually). This debt is a thread-local object that is linked with all the other threads' debt, which essentially allows garbage collection of "debt" (debt collection?), or something like that. See https://github.com/vorner/arc-swap/blob/master/src/docs/inte...

    I've implemented something similar myself. Instead of implementing user-land RCU (which is perhaps what I should have done), I implemented a "thread-safe variable" where reads are lock-less and wait-free, and writers pay all the costs. My thread-safe variable API is really simple and intuitive: `create(value_destructor) -> thread_safe_var`, `set(thread_safe_var, value)`, `get(thread_safe_var) -> value`, `release(thread_safe_var)`, and `destroy(thread_safe_var)` (presumably only during an atexit() handler or shared object destructor call, if at all). I have two implementations: one with a linked list of per-thread hazard pointers, and the other with a tuple of {next/current, current/previous} "cells" and a protocol that gives readers time to read the cells, figure out which is "current", and then dereference and incref a refcount.

    Regardless of how one implements this data structure, the key concept is that you have to atomically compose two distinct atomic operations: a read, and a dereference to atomically increment a reference count -- this is remarkably tricky!

    When you have a GC it's really easy: just read -- there is no reference count to increment.

    With a linked list of per-thread hazard pointers, all you have to do is read the thing, then write it into your thread-local hazard pointer, then read again, looping until the second read matches the first. Having the value safely copied into the reader's thread-local hazard pointer then allows for notional garbage collection: any thread that would release an old version of the value has to check all the hazard pointers to make sure it's not in use, which amounts to a small garbage collection.
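
    In code, that read/publish/re-check loop looks roughly like this (single hazard slot shown, illustrative names):

      use std::sync::atomic::{AtomicPtr, Ordering};

      // Illustrative: `global` is the shared pointer, `my_hazard` is this
      // thread's slot in the linked list of per-thread hazard pointers.
      fn protect<T>(global: &AtomicPtr<T>, my_hazard: &AtomicPtr<T>) -> *mut T {
          loop {
              let seen = global.load(Ordering::Acquire);   // first read
              my_hazard.store(seen, Ordering::SeqCst);     // announce we use it
              if global.load(Ordering::Acquire) == seen {  // second read
                  // The value can't be freed while it sits in my_hazard: any
                  // thread releasing an old version scans all hazard slots first.
                  return seen;
              }
              // A writer swapped the pointer between the two reads; retry.
          }
      }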

    I especially like the hazard pointer approach.

    See also MVars (Haskell), transactional memory, RCU, etc.

  • by vrosas on 6/10/25, 6:59 PM

    > While we knew where to look, this investigation had already taken weeks and things took a turn for the worse when we hit the issue again on February 23rd.

    It blows me away that an issue like this could take weeks to track down. If I were in any leadership position at this company, heads would be rolling over the lack of telemetry and domain knowledge for these systems.