from Hacker News

The real failure rate of EBS

by QuinnyPig on 3/18/25, 2:24 PM with 30 comments

by mstaoru on 3/18/25, 3:31 PM
"What makes PlanetScale Metal performance so much better? With your storage and compute on the same server, you avoid the I/O network hops that traditional cloud databases require [...] Every PlanetScale database requires at least 2 replicas in addition to the primary. Semi-synchronous replication is always enabled. This ensures every write has reached stable storage in two availability zones before it’s acknowledged to the client."
Isn't there a contradiction between these two statements?
My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kinds of DB workloads on them, using node local storage instead. If there are durability constrains, I would spin up Longhorn or Rook on top of local storage. I can see replicas degrade from time to time, but overall systems work (nothing too large, maybe ~50K QPS).
by samlambert on 3/18/25, 2:48 PM
we have a lot more content like this on the way. if anyone has feedback or questions let us know.
by reedf1 on 3/18/25, 2:55 PM
If you can detect EBS failure better than Amazon - I'd be selling this to them tomorrow.
by QuinnyPig on 3/18/25, 3:32 PM
I'm a sucker for deep dive cloud nerd content like this.
by c4wrd on 3/18/25, 6:11 PM
> When attached to an EBS–optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year. This means a volume is expected to experience under 90% of its provisioned performance 1% of the time. That’s 14 minutes of every day or 86 hours out of the year of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD. > This is not a secret, it's from the documentation. AWS doesn’t describe how failure is distributed for gp3 volumes, but in our experience it tends to last 1-10 minutes at a time. This is likely the time needed for a failover in a network or compute component. Let's assume the following: Each degradation event is random, meaning the level of reduced performance is somewhere between 1% and 89% of provisioned, and your application is designed to withstand losing 50% of its expected throughput before erroring. If each individual failure event lasts 10 minutes, every volume would experience about 43 events per month, with at least 21 of them causing downtime!
These are some seriously heavy-handed assumptions being made, completely disregarding the data they collect. First, the author assumes that these failure events are distributed randomly and expected to happen on a daily basis, ignoring Amazon's failure rate statement throughout a year ("99% of the time annually"). Second, they argue that in practice, they see failures lasting between 1 and 10 minutes. However, they assert that we should assume each failure will last 10 minutes, completely ignoring the severity range they introduced.
Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver knocks precisely 14 minutes late every day -- and each delay is 10 minutes exactly, no exceptions!". It completely ignores reality: sometimes your pizza is delivered a minute late, sometimes 10 minutes late, sometimes exactly on time for four months.
As a company with useful real-world data, I expect them not to make arguments based on exaggerations but rather show cold, hard data to back up their claims. For transparency, my organization has seen 51 degraded EBS volume events in the past 3 years across ~10,000 EBS volumes. Of those events, 41 had a duration of less than one minute, nine had a duration of two minutes, and one had a duration of three minutes.
by jewel on 3/18/25, 3:31 PM
I wonder if you could work around this problem by having two EBS volumes on each host, and write to them both. You'd have the OS report the write was successful as soon as either drive reported success. With reads you could alternate between drives for double the read performance during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.
We could call this RAID -1.
You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.
by semi-extrinsic on 3/18/25, 3:32 PM
Funny to see the plots with "No unit" on the y-axis label and then the actual units in parentheses in the title.