by QuinnyPig on 3/18/25, 2:24 PM with 30 comments
by mstaoru on 3/18/25, 3:31 PM
Isn't there a contradiction between these two statements?
My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kind of DB workload on them and use node-local storage instead. If there are durability constraints, I would spin up Longhorn or Rook on top of local storage. I do see replicas degrade from time to time, but overall the systems work (nothing too large, maybe ~50K QPS).
by samlambert on 3/18/25, 2:48 PM
by reedf1 on 3/18/25, 2:55 PM
by QuinnyPig on 3/18/25, 3:32 PM
by c4wrd on 3/18/25, 6:11 PM
These are some seriously heavy-handed assumptions that completely disregard the data they collect. First, the author assumes these failure events are distributed randomly and expected to happen on a daily basis, ignoring that Amazon's failure-rate statement is scoped to a year ("99% of the time annually"). Second, they say that in practice they see failures lasting between 1 and 10 minutes, yet they assume every failure lasts the full 10 minutes, completely ignoring the severity range they just introduced.
Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver knocks precisely 14 minutes late every day -- and each delay is 10 minutes exactly, no exceptions!" It completely ignores reality: sometimes your pizza is a minute late, sometimes 10 minutes late, and sometimes it's exactly on time for four months straight.
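To spell out the arithmetic behind that "14 minutes late" line (a rough sketch in Python; the even-spreading-across-days step is exactly the exaggeration I'm objecting to):

    # Back-of-the-envelope for AWS's "99% of the time annually" figure,
    # read the way the article reads it: spread the 1% budget evenly over days.
    MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600
    annual_budget = 0.01 * MINUTES_PER_YEAR     # ~5,256 degraded minutes allowed per year
    daily_budget = annual_budget / 365          # ~14.4 minutes/day -- the "14 minutes late" figure

    print(f"annual budget: {annual_budget:.0f} min, daily: {daily_budget:.1f} min")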
Since they are a company sitting on useful real-world data, I expect them to back their claims with cold, hard data rather than arguments built on exaggerations. For transparency, my organization has seen 51 degraded-EBS-volume events in the past 3 years across ~10,000 EBS volumes. Of those events, 41 lasted less than one minute, 9 lasted two minutes, and 1 lasted three minutes.
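For scale, here is that fleet data as a fraction of total volume-time (a rough calculation; I'm rounding every sub-minute event up to a full minute):

    # Observed EBS degradation vs. the 1% annual budget.
    volumes = 10_000
    years = 3
    degraded_minutes = 41 * 1 + 9 * 2 + 1 * 3           # 62 minutes total, fleet-wide
    volume_minutes = volumes * years * 365 * 24 * 60    # total observed volume-minutes

    observed_fraction = degraded_minutes / volume_minutes
    print(f"observed degradation: {observed_fraction:.2e}")   # ~3.93e-09
    print(f"AWS 1% budget:        {0.01:.2e}")                # 1.00e-02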
by jewel on 3/18/25, 3:31 PM
We could call this RAID -1.
You'd need some accounting to ensure the two drives are eventually consistent, but based on the graphs of the issue, it looks like you could keep the queue of pending writes in RAM for the duration of the slowdown.
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also, it doesn't seem worth paying double for this.
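Here's a minimal sketch of the idea, with hypothetical write_a/write_b callables standing in for the real block-device writes; each write is enqueued on both volumes, the caller unblocks on the first ack, and the slow volume's backlog stays in RAM:

    import queue
    import threading

    class RaidMinusOne:
        """Mirror writes to two volumes; ack the caller on the first completion.

        write_a / write_b are hypothetical callables (block, data) -> None
        standing in for real writes to the two underlying EBS volumes.
        """

        def __init__(self, write_a, write_b):
            self.queues = [queue.Queue(), queue.Queue()]   # per-volume pending writes, held in RAM
            for q, write in zip(self.queues, (write_a, write_b)):
                threading.Thread(target=self._drain, args=(q, write), daemon=True).start()

        def _drain(self, q, write):
            # Apply pending writes in order; a slow volume just builds a backlog.
            while True:
                block, data, first_ack = q.get()
                write(block, data)
                first_ack.set()        # no-op if the other volume already acked

        def write(self, block, data):
            # Enqueue on both volumes and return once either one finishes.
            first_ack = threading.Event()
            for q in self.queues:
                q.put((block, data, first_ack))
            first_ack.wait()

    if __name__ == "__main__":
        import time
        raid = RaidMinusOne(
            write_a=lambda blk, d: time.sleep(0.001),   # healthy volume
            write_b=lambda blk, d: time.sleep(0.5),     # degraded volume
        )
        raid.write(0, b"hello")   # returns after ~1 ms; the slow volume catches up later

Reads would also have to prefer whichever volume has an empty backlog, and a crash loses the RAM queue, which is part of why it's minus one.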
by semi-extrinsic on 3/18/25, 3:32 PM