by jeremysalwen on 3/4/22, 2:19 AM with 8 comments
by jorangreef on 3/4/22, 8:28 AM
Especially slide 80 about the CPU's Line Fill Buffer:
    the bandwidth to a single core is limited by LFB entries
    and is much lower than the memory bandwidth itself:

    bandwidth = transfer (line) size × LFB entries / latency
If I understand correctly, this is the same reason why most single-threaded tasks tend to max out at a magical number of around 6 GiB/s: that's exactly what the formula gives for a typical Intel core, i.e. 64 bytes × 10 LFB entries / 100 ns of memory latency per cache miss, which is the per-core memory bandwidth.

For example, we had a case in point with TigerBeetle [1], where we were benchmarking io_uring and kqueue and were getting surprising results on Linux vs macOS. Eventually, it was this slide that helped us realize we were probably just benchmarking memory bandwidth across different CPUs with different LFB limits, and thus hitting 6 GiB/s and 20 GiB/s respectively, thanks to the M1's increased parallelism in the LFB plus its increased cache line size of 128 bytes.
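As a back-of-envelope check of that arithmetic (a sketch in C using the 64-byte / 10-entry / 100 ns figures above; bytes per nanosecond is numerically equal to GB/s):

    #include <stdio.h>

    int main(void) {
        /* Per-core bandwidth ceiling = line size * LFB entries / miss latency. */
        double line_size_bytes = 64.0; /* one cache line per LFB entry */
        double lfb_entries = 10.0;     /* outstanding misses in flight */
        double latency_ns = 100.0;     /* memory latency per miss */

        printf("%.1f GB/s\n", line_size_bytes * lfb_entries / latency_ns);
        /* prints: 6.4 GB/s, i.e. the ~6 GiB/s ceiling described above */
        return 0;
    }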
Another example: we've started moving away from designing some data structures to optimize for cache line size. Instead, we're optimizing for "effective cache line size", because that takes not only the cache line but also the LFB into account. So where it makes sense, instead of targeting 64 bytes we might rather target 64 × 8 = 512 bytes, because that's probably more friendly to the prefetcher. We've used this "effective cache line size" technique in practice to reduce the size of our block free list bitmap indexes [2] (if you feel like reading yourself some Zig!).
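To make the idea concrete, here's a minimal sketch in C (the layout and names are hypothetical, not our actual Zig code): a two-level free list bitmap where each summary bit covers one 512-byte chunk of the detail bitmap rather than one 64-byte line, shrinking the summary index 8x while scans stay within prefetcher-friendly 512-byte units.

    #include <stddef.h>
    #include <stdint.h>

    #define EFFECTIVE_CACHE_LINE 512 /* 64-byte line * 8-line prefetch window */
    #define WORDS_PER_CHUNK (EFFECTIVE_CACHE_LINE / sizeof(uint64_t))

    struct free_list {
        uint64_t *summary; /* bit i set => chunk i contains at least one free block */
        uint64_t *detail;  /* bit j set => block j is free */
    };

    /* Scan one 512-byte chunk of the detail bitmap for the first free block.
       Returns the bit index within the chunk, or -1 if the chunk is full.
       (__builtin_ctzll is a GCC/Clang builtin.) */
    static int64_t scan_chunk(const uint64_t *chunk) {
        for (size_t w = 0; w < WORDS_PER_CHUNK; w++) {
            if (chunk[w] != 0)
                return (int64_t)(w * 64 + (size_t)__builtin_ctzll(chunk[w]));
        }
        return -1;
    }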
[1] https://github.com/coilhq/tigerbeetle
[2] https://github.com/coilhq/tigerbeetle/blob/lsm-trees-and-mor...
by bob1029 on 3/4/22, 10:38 AM
We're not talking a 10x difference. More like 1000x.
One bad access pattern can be the difference between millions of transactions per second and mere thousands. Often, this bad access pattern manifests in something as simple as using a class instead of a struct, or using a lock when a CAS would do the trick.
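For the lock-vs-CAS point, a minimal sketch in C11 atomics (illustrative, not a benchmark): both functions bump a shared counter, but the CAS version stays on a lock-free fast path instead of paying for a mutex round trip on every call.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t counter;
    static uint64_t locked_counter;
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

    /* Lock-free: one atomic read-modify-write, no kernel involvement. */
    void increment_cas(void) {
        uint64_t old = atomic_load_explicit(&counter, memory_order_relaxed);
        /* On failure the CAS refreshes `old`, so we just retry. */
        while (!atomic_compare_exchange_weak_explicit(
                &counter, &old, old + 1,
                memory_order_release, memory_order_relaxed)) {}
    }

    /* Same logical work, but a contended mutex can park the thread. */
    void increment_locked(void) {
        pthread_mutex_lock(&mu);
        locked_counter++;
        pthread_mutex_unlock(&mu);
    }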
The most popular siren song in all of modern technology is the obsession with making all the cores share the work. For the vast majority of problems that operate on shared state in a serialized fashion (e.g. anything a bank would do), consolidating processing onto one thread is always going to give you the most throughput.
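A sketch of what that consolidation can look like in C, with hypothetical types: one I/O thread enqueues transactions into a single-producer/single-consumer ring, and one state-machine thread owns all the balances and applies transactions serially, so the hot path takes no locks on the state itself.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define QUEUE_CAP 4096

    struct tx { uint64_t from, to, amount; };

    static struct tx queue[QUEUE_CAP];
    static _Atomic size_t head, tail; /* single producer, single consumer */

    /* Producer (I/O thread): publish one transaction. */
    bool enqueue(struct tx t) {
        size_t tl = atomic_load_explicit(&tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&head, memory_order_acquire);
        if (tl - h == QUEUE_CAP) return false; /* ring is full */
        queue[tl % QUEUE_CAP] = t;
        atomic_store_explicit(&tail, tl + 1, memory_order_release);
        return true;
    }

    /* Consumer (state-machine thread): the only writer of balances,
       so applying each transaction needs no locking at all. */
    void drain(uint64_t *balances) {
        size_t h = atomic_load_explicit(&head, memory_order_relaxed);
        size_t tl = atomic_load_explicit(&tail, memory_order_acquire);
        while (h != tl) {
            struct tx *t = &queue[h % QUEUE_CAP];
            balances[t->from] -= t->amount;
            balances[t->to]   += t->amount;
            h++;
        }
        atomic_store_explicit(&head, h, memory_order_release);
    }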
by uo21tp5hoyg on 3/4/22, 5:53 AM
by hun3 on 3/4/22, 11:05 AM
by rramadass on 3/5/22, 3:11 AM
How I wish it were more detailed :-)