by usrme on 3/27/24, 8:47 AM with 39 comments
by ot on 3/28/24, 3:37 PM
Usually the codes used for erasure coding are in systematic form: there are k "preferential" parts out of M that are just literal fragments of the original blob, so if you get those you can just concatenate them to get the original data. If you get any other k-subset, you need to perform expensive reconstruction.
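A minimal illustration of the systematic property, assuming the simplest possible construction: k literal data fragments plus a single XOR parity, so M = k + 1. The fast path just concatenates the k data fragments; any other k-subset has to reconstruct the missing one (a real Reed-Solomon code does more expensive arithmetic at that step). All names below are made up for illustration.

    from functools import reduce
    from operator import xor

    def encode(blob: bytes, k: int) -> list[bytes]:
        frag_len = -(-len(blob) // k)                      # ceil(len / k)
        padded = blob.ljust(frag_len * k, b"\0")
        frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
        return frags + [bytes(reduce(xor, col) for col in zip(*frags))]   # append XOR parity

    def decode(frags: dict[int, bytes], k: int, blob_len: int) -> bytes:
        if all(i in frags for i in range(k)):
            # Fast path: the first k fragments are literal slices of the blob.
            return b"".join(frags[i] for i in range(k))[:blob_len]
        # Reconstruction path: with a single parity, XOR the k fragments we do
        # have to rebuild the one missing data fragment.
        missing = next(i for i in range(k) if i not in frags)
        frags[missing] = bytes(reduce(xor, col) for col in zip(*frags.values()))
        return b"".join(frags[i] for i in range(k))[:blob_len]

    blob = b"the original data blob"
    frags = dict(enumerate(encode(blob, k=4)))
    frags.pop(2)                                           # lose one data fragment
    assert decode(frags, k=4, blob_len=len(blob)) == blob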
by dmw_ng on 3/28/24, 3:28 PM
Say one chunk lives in each of Germany, Ireland, and the US. The client races GETs to all 3 regions and cancels the request to the slowest to respond (which may also be down); see the sketch after this comment. Final client latency is equivalent to that of the 2nd-slowest region, with substantially better availability due to the ability to tolerate any single region being down.
Still wouldn't recommend using E2 for anything important, but ^ was one potential approach to dealing with its terribleness. It still doesn't address the reality that when E2 regions go down, it is often for days and reportedly sometimes weeks at a time, so reliable writes in this scenario would necessitate some kind of queue with capacity for weeks of storage.
There are variants of this scheme where you could potentially balance the horribly unreliable storage with some expensive reliable storage as part of the same system, but I never got that far in thinking about how it would work.
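A rough asyncio sketch of the cross-region race described above, assuming a 2-of-3 code with one fragment per region: fan out all three GETs, return as soon as any two fragments arrive, and cancel the straggler. Region names, latencies, and the fetch stub are illustrative stand-ins, not a real storage client API.

    import asyncio, random

    async def fetch_fragment(region: str, key: str) -> bytes:
        # Stand-in for the real GET: random latency per region.
        await asyncio.sleep(random.uniform(0.01, 0.3))
        return f"<fragment of {key} from {region}>".encode()

    async def racing_get(key: str, regions=("eu-de", "eu-ie", "us-east-1")) -> list[bytes]:
        pending = {asyncio.create_task(fetch_fragment(r, key)) for r in regions}
        fragments: list[bytes] = []
        while len(fragments) < 2 and pending:              # any 2 of the 3 fragments suffice
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            fragments += [t.result() for t in done if not t.exception()]
        for task in pending:                               # cancel the slowest (or down) region
            task.cancel()
        return fragments[:2]                               # hand these to the erasure decoder

    print(asyncio.run(racing_get("user:123")))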
by sujayakar on 3/28/24, 4:00 PM
one followup I was thinking of is whether this can generalize to queries other than key-value point lookups. if I'm understanding correctly, the article is suggesting to take a key-value store and, for every `(key, value)` in the system, split `value` into fragments that are stored on different shards with some `k`-of-`M` code. then at query time, we can split a query for `key` into `k` subqueries that we send to the relevant shards and reassemble the query results into `value` (a toy sketch of that point-lookup path follows below).
so, if we were to do the same business for an ordered map with range queries, we'd need to find a way to turn a query for `interval: [start, end]` into some number of subqueries that we could send to the different shards and reassemble into the final result. any ideas?
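For the point-lookup case summarized above, a toy single-process sketch (not the article's implementation): a systematic K-of-(K+1) code built from K literal slices plus one XOR parity, an in-memory dict standing in for each shard, and a made-up placement rule that puts fragment i of a key on shard (crc32(key) + i) mod N.

    from functools import reduce
    from operator import xor
    from zlib import crc32

    K, N_SHARDS = 3, 8
    shards = [dict() for _ in range(N_SHARDS)]             # in-memory stand-ins for shard servers

    def placement(key: str) -> list[int]:
        # K + 1 distinct shard ids: K data fragments plus one parity fragment.
        return [(crc32(key.encode()) + i) % N_SHARDS for i in range(K + 1)]

    def put(key: str, value: bytes) -> None:
        frag_len = -(-len(value) // K)                     # ceil(len / K)
        padded = value.ljust(frag_len * K, b"\0")
        frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(K)]
        frags.append(bytes(reduce(xor, col) for col in zip(*frags)))      # XOR parity
        for i, shard in enumerate(placement(key)):         # K + 1 subwrites, one per shard
            shards[shard][key, i] = (frags[i], len(value))

    def get(key: str) -> bytes:
        # Fan out subqueries; any K of the K + 1 fragments are enough.
        got = {}
        for i, shard in enumerate(placement(key)):
            if (key, i) in shards[shard]:
                got[i] = shards[shard][key, i]
            if len(got) == K:
                break
        length = next(iter(got.values()))[1]
        frags = {i: f for i, (f, _) in got.items()}
        missing = [i for i in range(K) if i not in frags]
        if missing:                                        # rebuild the one missing data slice
            frags[missing[0]] = bytes(reduce(xor, col) for col in zip(*frags.values()))
        return b"".join(frags[i] for i in range(K))[:length]

    put("user:123", b"some value bytes")
    del shards[placement("user:123")[1]][("user:123", 1)]  # simulate losing one shard's fragment
    assert get("user:123") == b"some value bytes"

Each put issues K + 1 subwrites and each get needs any K subreads, which matches the k-of-M fan-out described in the comment above.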
by loeg on 3/28/24, 3:49 PM
by benlivengood on 3/28/24, 5:43 PM
by siscia on 3/28/24, 3:23 PM
AWS is such a big place that even after a bit of tenure you still have places to look to find interesting technical approaches, and when I was introduced to this scheme for Lambda storage I was surprised.
As Marc mentions, it is such a simple and powerful idea, yet it is definitely not mentioned enough.
by ghusbands on 3/29/24, 1:31 PM
by jeffbee on 3/28/24, 4:22 PM