from Hacker News

IO Devices and Latency

by milar on 3/13/25, 4:46 PM with 153 comments

  • by bddicken on 3/13/25, 5:19 PM

    Author of the blog here. I had a great time writing this. By far the most complex article I've ever put together, with literally thousands of lines of js to build out these interactive visuals. I hope everyone enjoys.
  • by bob1029 on 3/13/25, 6:05 PM

    I've been advocating for SQLite+NVMe for a while now. For me it is a new kind of pattern you can apply to get much further into trouble than usual. In some cases, you might actually make it out to the other side without needing to scale horizontally.

    Latency is king in all performance matters, especially those where items must be processed serially. Running SQLite on NVMe provides a latency advantage that no other provider can offer. I don't think running in memory is even a substantial uplift over NVMe persistence for most real-world use cases.
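
    A minimal sketch of the pattern, assuming a hypothetical /mnt/nvme mount and illustrative pragmas rather than any particular production setup:

      import sqlite3

      # Open the database on a local NVMe mount (path is a made-up example).
      conn = sqlite3.connect("/mnt/nvme/app.db")
      conn.execute("PRAGMA journal_mode=WAL")      # readers don't block the writer
      conn.execute("PRAGMA synchronous=NORMAL")    # fsync at checkpoints, not on every commit
      conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")

      # Serial commits like this are exactly where low device latency pays off.
      with conn:
          conn.execute("INSERT INTO events (payload) VALUES (?)", ("hello",))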

  • by magicmicah85 on 3/13/25, 6:23 PM

    Can I just say that I love how informative this was? I completely forgot it was promoting a product. Excellent visuals and interactivity.
  • by robotguy on 3/13/25, 7:52 PM

    Seeing the disk IO animation reminded me of Melvin Kaye[0]:

      Mel never wrote time-delay loops, either, even when the balky Flexowriter
      required a delay between output characters to work right.
      He just located instructions on the drum
      so each successive one was just past the read head when it was needed;
      the drum had to execute another complete revolution to find the next instruction.
      
    
    [0] https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The...
  • by jhgg on 3/13/25, 6:07 PM

    Metal looks super cool. However, at my last job, when we tried using instance-local SSDs on GCP, there were serious reliability issues (e.g. blocks on the device losing data). Has this situation changed? What machine types are you using?

    Our workaround was this: https://discord.com/blog/how-discord-supercharges-network-di...

  • by gz09 on 3/13/25, 7:07 PM

    Nice blog. There is also the problem that cloud storage is generally "just unusually slow" (this has been noted by others before, but here is a nice summary of the problem: http://databasearchitects.blogspot.com/2024/02/ssds-have-bec...)

    Having recently added support for storing our incremental indexes in https://github.com/feldera/feldera on S3/object storage (we had NVMe for longer due to the obvious performance advantages mentioned in the previous article), we'd be happy for someone to disrupt this space with a better offering ;).

  • by __turbobrew__ on 3/13/25, 7:47 PM

    I think there are some things about distributed storage that are not appreciated in this article:

    1. Some systems do not support replication out of the box. Sure, your Cassandra cluster and MySQL can do master-slave replication, but lots of systems cannot.

    2. Your life becomes much harder with NVMe storage in the cloud because you need to respect maintenance intervals and cloud-initiated drains. If you do not hook into those systems and drain your data to a different node, the data goes poof. Separating storage from compute lets the cloud operator drain and move compute around as needed: since the data is independent of the compute, and the operator manages that data system and its draining as well, the operator can manage workload placements without the customer needing to be involved.
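
    As a rough sketch of what hooking into those signals can look like on GCP (the URL below is an assumption based on GCE's documented maintenance-event metadata path, and drain_to_peer is a hypothetical stand-in for whatever hand-off your system does):

      import time
      import urllib.request

      URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/maintenance-event")

      def maintenance_event() -> str:
          # The metadata server requires this header; it reports e.g. NONE
          # or MIGRATE_ON_HOST_MAINTENANCE ahead of a host event.
          req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
          with urllib.request.urlopen(req, timeout=5) as resp:
              return resp.read().decode().strip()

      def drain_to_peer() -> None:
          # Hypothetical: replicate/copy local NVMe data to another node.
          ...

      while True:
          if maintenance_event() != "NONE":
              drain_to_peer()
              break
          time.sleep(30)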

  • by tonyhb on 3/13/25, 5:47 PM

    This is really cool, and PlanetScale Metal looks really solid, too. Always a huge sucker for seeing huge latency drops on releases: https://planetscale.com/blog/upgrading-query-insights-to-met....
  • by CSDude on 3/14/25, 6:02 AM

    For years, I just didn't get why replicated databases always stick with EBS and deal with its latency. Like, replication is already there, so why not be brave and just go with local disks? At my previous orgs, where we ran Elasticsearch for temporary logs/metrics storage, I proposed we do exactly that, since we didn't even have major reliability requirements. But I couldn't convince them back then, and we ended up with the even worse AWS Elasticsearch.

    I get that local disks are finite, yeah, but I think the core/memory/disk ratio would be good enough for most use cases, no? There are plenty of local-disk instance types with different ratios as well, so I think a good balance could be found. You could even use the ones with 20TB+ local hard disks to implement hot/cold storage.

    Big kudos to the PlanetScale team, they're like, finally doing what makes sense. I mean, even AWS themselves don't run Elasticsearch on local disks! Imagine running ClickHouse, Cassandra, all of that on local disks.

  • by ucarion on 3/13/25, 6:34 PM

    Really, really great article. The visualization of random writes is very nicely done.

    On:

    > Another issue with network-attached storage in the cloud comes in the form of limiting IOPS. Many cloud providers that use this model, including AWS and Google Cloud, limit the amount of IO operations you can send over the wire. [...]

    > If instead you have your storage attached directly to your compute instance, there are no artificial limits placed on IO operations. You can read and write as fast as the hardware will allow for.

    I feel like this might be a dumb series of questions, but:

    1. The ratelimit on "IOPS" is precisely a ratelimit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

    2. Does this save me money? And if yes, is it from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?

    I can see pretty clearly that putting storage and compute on the same machine is strictly a latency win, because you structurally have one less hop every time. But is it a throughput-per-dollar win too?

  • by myflash13 on 3/14/25, 8:15 AM

    If this is true, then how do "serverless" database providers like Neon advertise "low latency" access? They use object storage like S3, which I imagine is an order of magnitude worse than networked storage for latency.

    edit: apparently they built a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.

  • by vessenes on 3/13/25, 5:48 PM

    Great nerdbaiting ad. I read all the way to the bottom of it, and bookmarked it to send to my kids if I feel they are not understanding storage architectures properly. :)
  • by pjdesno on 3/13/25, 9:01 PM

    I love the visuals, and if it's ok with you I will probably link them from my class material on block devices in a week or so.

    One small nit:

    > A typical random read can be performed in 1-3 milliseconds.

    Um, no. A 7200 RPM platter completes a rotation in 8.33 milliseconds, so rotational delay for a random read is uniformly distributed between 0 and 8.33ms, i.e. mean 4.16ms.
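
    The arithmetic, for anyone who wants to check it:

      rpm = 7200
      full_revolution_ms = 60_000 / rpm                    # 8.33 ms per rotation
      mean_rotational_delay_ms = full_revolution_ms / 2    # 4.16 ms average wait
      print(full_revolution_ms, mean_rotational_delay_ms)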

    > a single disk will often have well over 100,000 tracks

    By my calculations a Seagate IronWolf 18TB has about 615K tracks per surface given that it has 9 platters and 18 surfaces, and an outer diameter read speed of about 260MB/s. (or 557K tracks/inch given typical inner and outer track diameters)

    For more than you ever wanted to know about hard drive performance and the mechanical/geometrical considerations that go into it, see https://www.msstconference.org/MSST-history/2024/Papers/msst...

  • by jgalt212 on 3/13/25, 9:01 PM

    Disk latency, and one's aversion to it, is IMHO the only way Hetzner costs can run up on you. You want to keep the database on local disk, not on their very slow attached Volumes (Hetzner's EBS). In short, you can have relatively light workloads that end up on somewhat expensive VMs because you need 500GB or more of local disk. 1TB of local disk is the biggest VM they offer in the US, at 300 EUR a month.
  • by rsanheim on 3/13/25, 8:04 PM

    That great infographic at the top illustrates one big reason why 'dev instances in the cloud' is a bad idea.
  • by cmurf on 3/13/25, 5:04 PM

    Plenty of text but also many cool animations. I'm a sucker for visual aids. It's a good balance.
  • by carderne on 3/14/25, 12:24 PM

    I'm always curious about latency for all these new DB offerings like PlanetScale/Neon/Supabase.

    It seems like they don't emphasise strongly enough: _make sure you colocate your server in the same cloud/az/region/dc as our db_. I suspect a large fraction of their users don't realise this, and have loads of server-db traffic happening very slowly over the public internet. It won't take many slow db reads (get session, get a thing, get one more) to trash your server's response latency.
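
    A back-of-the-envelope version of that effect, with placeholder round-trip times rather than measurements:

      # Three sequential queries per request: get session, get a thing, get one more.
      queries_per_request = 3
      rtt_same_az_ms = 0.5          # assumed: app and db colocated in one AZ
      rtt_public_internet_ms = 40   # assumed: cross-region hop over the public internet

      print(queries_per_request * rtt_same_az_ms)          # ~1.5 ms spent waiting on the db
      print(queries_per_request * rtt_public_internet_ms)  # ~120 ms spent waiting on the db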

  • by cynicalsecurity on 3/13/25, 8:17 PM

    That was a cool advertisement, I must give them that.
  • by anonymousDan on 3/14/25, 12:03 AM

    Nice article, but the replicated approach isn't exactly comparing like with like. To achieve the same semantics you'd need to block for a response from the remote backup servers, which would end up with the same latency as the other cloud providers...
  • by bloopernova on 3/13/25, 6:20 PM

    Fantastic article, well explained and beautiful diagrams. Thank you bddicken for writing this!
  • by SAI_Peregrinus on 3/14/25, 2:54 PM

    > The next major breakthrough in storage technology was the hard disk drive.

    There were a few storage methods in between tape & HDDs, notably core memory & magnetic drum memory.

  • by samwho on 3/13/25, 7:54 PM

    Gosh, this is beautiful. Fantastic work, Ben. <3
  • by gozzoo on 3/13/25, 6:25 PM

    Can someone share their experience in creating such diagrams? What libraries and tools are useful for such interactive diagrams?
  • by aftbit on 3/13/25, 6:29 PM

    Hrm, "unlimited IOPS"? I suppose contrasted against the abysmal IOPS available to cloud block devices. A good modern enterprise NVMe drive is specced for (order of magnitude) 10^6 to 10^7 IOPS. If you can saturate that from database code, then you've got some interesting problems, but it's definitely not unlimited.
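
    For a rough single-threaded sense of what a local device can do, something like this O_DIRECT probe works on Linux (the file path is a made-up example, and a real benchmark would use fio with deep queues and many jobs):

      import mmap, os, random, time

      path = "/mnt/nvme/testfile"   # hypothetical large pre-created file on local NVMe
      block = 4096
      fd = os.open(path, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache
      size = os.fstat(fd).st_size
      buf = mmap.mmap(-1, block)    # page-aligned buffer, as O_DIRECT requires

      ops, seconds = 0, 5
      deadline = time.time() + seconds
      while time.time() < deadline:
          offset = random.randrange(size // block) * block   # block-aligned offset
          os.preadv(fd, [buf], offset)
          ops += 1

      os.close(fd)
      print(f"{ops / seconds:.0f} single-threaded 4k random-read IOPS")
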
  • by TheAnkurTyagi on 3/14/25, 11:17 AM

    Very nice animations.
  • by r3tr0 on 3/13/25, 7:50 PM

    We are working on a platform that lets you measure this stuff with pretty high precision in real time.

    You can check out our sandbox here:

    https://yeet.cx/play

  • by liweixin on 3/14/25, 10:11 AM

    Amazing! The visualizations are so great!
  • by dangoodmanUT on 3/13/25, 9:44 PM

    What local NVMe is getting 20us? Nitro?