from Hacker News

Thread-Per-Core Buffer Management for a modern storage system

by arjunnarayan on 12/12/20, 6:34 PM with 36 comments

  • by bob1029 on 12/12/20, 7:29 PM

    More threads (i.e., more shared state) are a huge mistake if you are trying to maintain a storage subsystem with synchronous access semantics.

    I am starting to think you can handle all storage requests for a single logical node on just one core/thread. I have been pushing 5~10 million JSON-serialized entities to disk per second with a single managed thread in .NET Core (using a Samsung 970 Pro for testing). This includes indexing and sequential integer key assignment. This testing completely saturates the drive (over 1 gigabyte per second steady-state). Even incrementing a shared 64-bit integer more than a million times per second across an arbitrary number of threads is a big ask (see the first sketch below). This is the kind of difference you see when you double down on single-threaded ideology for this type of problem domain.

    The technical trick to my success is to run all of the database operations in micro-batches (10~1000 microseconds each). I use the LMAX Disruptor, so the batches form naturally based on throughput conditions (see the second sketch below). Selecting data structures and algorithms that work well in this type of setup is critical. Append-only is a must with flash and makes an orders-of-magnitude difference in performance. Everything else (B-tree algorithms, etc.) follows from this realization.

    Put another way, if you find yourself reaching for Task or async/await primitives when talking to something as fast as NVMe flash, you need to rethink your approach. The overhead of multiple threads, task-parallel abstractions, and the like will cripple any notion of high throughput in a synchronous storage domain.
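
    A rough Java sketch of the shared-counter point: this is not a rigorous benchmark (arbitrary thread and iteration counts, no warmup), and it is Java rather than the commenter's .NET, but the single-writer loop typically finishes far ahead of the contended one.

        import java.util.concurrent.atomic.AtomicLong;

        public class IncrementContention {
            static final long TOTAL = 20_000_000L; // arbitrary iteration count, for illustration only
            static final int THREADS = 4;          // arbitrary thread count

            public static void main(String[] args) throws InterruptedException {
                // Single writer: one thread owns the counter, no synchronization at all.
                long key = 0;
                long t0 = System.nanoTime();
                for (long i = 0; i < TOTAL; i++) {
                    key++;
                }
                long singleNs = System.nanoTime() - t0;

                // Shared counter: several threads hammer the same cache line.
                AtomicLong shared = new AtomicLong();
                Thread[] workers = new Thread[THREADS];
                long t1 = System.nanoTime();
                for (int t = 0; t < THREADS; t++) {
                    workers[t] = new Thread(() -> {
                        for (long i = 0; i < TOTAL / THREADS; i++) {
                            shared.incrementAndGet();
                        }
                    });
                    workers[t].start();
                }
                for (Thread w : workers) {
                    w.join();
                }
                long sharedNs = System.nanoTime() - t1;

                System.out.printf("single-writer: %d increments in %d ms%n", key, singleNs / 1_000_000);
                System.out.printf("contended:     %d increments in %d ms%n", shared.get(), sharedNs / 1_000_000);
            }
        }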
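
    And a minimal sketch of the micro-batching pattern, using the Java LMAX Disruptor (the commenter's stack is .NET; the file name, record layout, ring size, and flush policy here are illustrative assumptions, not the commenter's actual code). The Disruptor's endOfBatch flag is what lets batches form naturally: a single consumer thread appends every published entity, assigns sequential keys without atomics, and issues one flush per batch.

        import com.lmax.disruptor.EventHandler;
        import com.lmax.disruptor.RingBuffer;
        import com.lmax.disruptor.dsl.Disruptor;
        import com.lmax.disruptor.util.DaemonThreadFactory;

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        public class MicroBatchLog {

            // One ring-buffer slot: holds a single serialized entity.
            static final class WriteEvent {
                byte[] payload;
            }

            // The single consumer thread: appends each record and flushes once per batch.
            static final class AppendHandler implements EventHandler<WriteEvent> {
                private final FileChannel log;
                private long nextKey = 0; // sequential key assignment, no atomics needed

                AppendHandler(FileChannel log) {
                    this.log = log;
                }

                @Override
                public void onEvent(WriteEvent event, long sequence, boolean endOfBatch) throws IOException {
                    ByteBuffer record = ByteBuffer.allocate(8 + event.payload.length);
                    record.putLong(nextKey++).put(event.payload).flip();
                    log.write(record);        // append-only write
                    if (endOfBatch) {
                        log.force(false);     // one flush per micro-batch, not per record
                    }
                }
            }

            public static void main(String[] args) throws Exception {
                FileChannel log = FileChannel.open(Path.of("entities.log"),
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);

                Disruptor<WriteEvent> disruptor = new Disruptor<>(
                        WriteEvent::new, 1 << 16, DaemonThreadFactory.INSTANCE);
                disruptor.handleEventsWith(new AppendHandler(log));
                RingBuffer<WriteEvent> ring = disruptor.start();

                // Producers publish serialized entities; batch sizes adapt to load on their own.
                for (int i = 0; i < 1_000_000; i++) {
                    byte[] json = ("{\"id\":" + i + "}").getBytes(StandardCharsets.UTF_8);
                    ring.publishEvent((event, seq, payload) -> event.payload = payload, json);
                }
                disruptor.shutdown(); // drains the ring buffer before returning
                log.close();
            }
        }

    Because the handler is the only writer, the key counter (and any index structures it maintains) never needs locks, and flushing once per batch keeps per-record overhead low while bounding latency to roughly the batch window.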

  • by zinclozenge on 12/12/20, 7:24 PM

    If anybody's interested, there's a Seastar-inspired library for Rust under development: https://github.com/DataDog/glommio

  • by lrossi on 12/13/20, 9:32 AM

    Disappointed to see that you spent 25% of the article describing in detail all the ways in which computer hardware has gotten faster, then promised to show how your project takes advantage of this, but you don't show any performance measurements at all. Just a very fancy architecture.

    Correct me if I'm wrong, but the only number I can find is a guarantee that you do not exceed 500 µs of latency when handling a request. And it's not clear whether this is a guarantee at all, since you only say that the system will emit a traceback in case of latency spikes.

    I would have liked to see how latency varies under load, how much throughput you can achieve, what the latency long tail looks like under a long-running production load, and comparisons with reasonably tuned off-the-shelf systems.

  • by dotnwat on 12/12/20, 7:34 PM

    Noah here, developer at Vectorized. Happy to answer any questions.

  • by eis on 12/13/20, 5:50 AM

    I'd be interested in the write amplification, since Redpanda went pretty low-level in the IO layer. How do you guarantee atomic writes when virtually no disk provides guarantees beyond the page level? A failed write to a page could, at least in theory, destroy data that was already written to it, so one has to resort to writing the data multiple times.

  • by mirekrusin on 12/12/20, 9:02 PM

    What is the point of talking about thread-per-core performance if Raft sits in front of it, i.e., only one core will be doing the work at any given time anyway?

  • by matthewtovbin on 12/13/20, 7:36 AM

    @arjunnarayan Have you evaluated the performance against vanilla Kafka / Confluent Cloud? Where can I see the results?