from Hacker News

Ask HN: Good examples of fault-tolerant Erlang code?

by roeles on 12/28/23, 11:53 AM with 45 comments

Hi HN! I'm trying to learn more about Erlang and how it achieves fault-tolerance. I am pretty up-to-date on the talks that Joe Armstrong gave, and I've read his thesis. This is already quite informative, but I wonder if there are any codebases out there which are good examples of fault-tolerance.

I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.

Any suggestions?

by toast0 on 12/28/23, 9:26 PM
If you follow OTP design principles, you end up with a supervision tree, and a lot of code like...
```
   %% Any return value other than ok raises badmatch and crashes the process.
   ok = do_something_that_might_fail()
```
If it returns ok: great, it worked and you move on. If it doesn't return ok, the process crashes, you get a crash report, and the supervisor restarts it, if that's how the supervisor is configured. Presumably it starts properly and deals with future requests.
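For reference, a minimal sketch of what such a supervisor might look like. The worker module `demo_worker` (assumed to export `start_link/0`), the module name `demo_sup`, and the restart numbers are all made up for illustration, not taken from any codebase mentioned in this thread:

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: restart only the child that crashed.
    %% intensity/period: if more than 5 restarts happen within 10 seconds,
    %% the supervisor itself gives up and exits, escalating to its parent.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    %% demo_worker is a hypothetical module exporting start_link/0.
    Worker = #{id => demo_worker,
               start => {demo_worker, start_link, []},
               restart => permanent,
               shutdown => 5000,
               type => worker},
    {ok, {SupFlags, [Worker]}}.
```

With `restart => permanent` the worker is restarted whenever it exits; `transient` would restart it only after an abnormal exit.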
There are two issues you might rapidly encounter.
1) If a supervised process restarts too many times in an interval, the supervisor will stop (and presumably restart), and that cascades up to potentially your node stopping. This is by design, and has good reasons, but might not be expected and might not be a good fit for larger nodes running many things.
2) If your process crashes, its message queue (mailbox) is discarded, and if you were sending to a process registered by name or process group (pg), the name is now unregistered. This means a service process crashing will discard several requests: the one in progress, which is probably fine (it crashed, after all), but also others that could have been serviced. In my experience, you end up wanting to catch errors in service processes, log them, and move on to the next request, so you don't lose unrelated requests. Depending on your application, a restart might be better, or you might run each request in a fresh process for isolation... Lots of ways to manage this.
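A rough sketch of the "catch, log, move on" approach from point 2, assuming requests arrive as zero-arity funs. The module name `safe_server` and the `{run, Fun}` message shape are invented for the example; a real service would dispatch on its own request terms:

```erlang
-module(safe_server).
-behaviour(gen_server).
-export([start_link/0, request/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Run Fun inside the server; returns {ok, Result} or {error, {Class, Reason}}.
request(Pid, Fun) when is_function(Fun, 0) ->
    gen_server:call(Pid, {run, Fun}).

init([]) ->
    {ok, #{}}.

handle_call({run, Fun}, _From, State) ->
    Reply = try
                {ok, Fun()}
            catch
                Class:Reason:Stacktrace ->
                    %% Log the failure and keep serving; the mailbox with
                    %% the other pending requests survives.
                    logger:error("request failed: ~p:~p~n~p",
                                 [Class, Reason, Stacktrace]),
                    {error, {Class, Reason}}
            end,
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The trade-off is that the state is not reset after a failure, so this only makes sense when a failed request cannot leave the state corrupted.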
by ramchip on 12/29/23, 4:52 PM
RabbitMQ is IMHO probably the best open source example tackling a large, complicated real world problem with graceful degradation (e.g. if a queue keeps crashing).
Elixir has a lot of smaller but very high quality libraries to learn from. You may be interested in how Ecto & Postgrex manage DB connections, in particular how connection sockets are “borrowed” so data doesn’t get repeatedly messaged (read: copied) between processes. Bandit / Thousand Island also make interesting decisions for process structure in HTTP1.1 vs HTTP2.
I think a common mistake is to create processes mimicking classic OOP structure, like an OrderProcessor, ShippingManager, etc. Processes in Erlang are a unit of fault tolerance, not code organization. This means you’ll more usually have one process per request, potentially calling code from many different modules, since requests are the things you want to fail separately from each other.
In RabbitMQ’s case, for instance, connections and queues are processes, but exchanges are not. It would feel natural to model the problem as three processes with messages going Connection -> Exchange -> Queue, but in reality an exchange is a set of routing rules that can be applied by a connection directly, which avoids a lot of complexity and overhead.
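To make the idea concrete, here is a toy sketch of "exchange as data, not a process". This is not RabbitMQ's actual code or data model; the module name, the `{BindingKey, QueuePid}` binding list, and the `{deliver, Msg}` message are all assumptions for illustration:

```erlang
-module(routing_sketch).
-export([route/2, publish/3]).

%% The "exchange" is plain data: a list of {BindingKey, QueuePid} pairs.
%% Routing is a pure function over that data.
route(Bindings, RoutingKey) ->
    [Queue || {Key, Queue} <- Bindings, Key =:= RoutingKey].

%% Meant to be called from inside the connection process: the message goes
%% straight to the matching queue processes with no exchange process between.
%% Returns the number of queues the message was sent to.
publish(Bindings, RoutingKey, Msg) ->
    Queues = route(Bindings, RoutingKey),
    lists:foreach(fun(Q) -> Q ! {deliver, Msg} end, Queues),
    length(Queues).
```

Because `route/2` has no state of its own, there is nothing to supervise or restart for it, and the message is copied once (connection to queue) instead of twice.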
Last thing I’d note is supervision trees etc. are really about handling _unexpected_ errors (Joe uses the terms faults and errors with different meanings iirc). If you want a web request to be retried a few times with a delay, don’t use a supervisor for that, just loop with a sleep. Same for things like validating inputs from a form, usually you’d want to give the user a hint and not just crash.
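The "just loop with a sleep" approach could look something like this. It is a minimal sketch that assumes the operation reports expected failures as `{error, Reason}` rather than crashing; the module and function names are made up:

```erlang
-module(retry).
-export([with_retries/3]).

%% Call Fun up to Attempts times, sleeping DelayMs between tries.
%% Fun must return {ok, Value} or {error, Reason}; the last error is
%% returned if every attempt fails.
with_retries(Fun, Attempts, DelayMs)
  when is_function(Fun, 0), is_integer(Attempts), Attempts >= 1 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, _} = Error when Attempts =:= 1 ->
            Error;
        {error, _} ->
            timer:sleep(DelayMs),
            with_retries(Fun, Attempts - 1, DelayMs)
    end.
```

Anything that is not an `{ok, _}` or `{error, _}` tuple still crashes the caller with a `case_clause` error, which keeps unexpected failures visible to the supervisor.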
Some other useful links:
- https://aosabook.org/en/v1/riak.html (bit old, but another large codebase)
- https://ferd.ca/the-zen-of-erlang.html
- https://www.theerlangelist.com/article/spawn_or_not
by octacat on 12/28/23, 3:42 PM
The simple answer: supervision trees. And fault-tolerance usually means that the failing process would be restarted. It would not handle stuff like netsplits or a node going down though.
Check the code of cowboy, ejabberd, MongooseIM, RabbitMQ for examples. Many factors go into deciding when to make a new process: data locality, the pattern of interaction with other processes, performance considerations. A good idea is to have one process per TCP connection, but not one process per routed message. And be careful with blocking gen_server calls - these could block or fail.
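On the last point: `gen_server:call/3` exits the caller when the server is too slow or not running. One way to guard against that is sketched below, assuming you prefer an error tuple to a crash in the caller; the wrapper module name is hypothetical:

```erlang
-module(safe_call).
-export([call/3]).

%% Wrap gen_server:call/3 so a slow or missing server yields an error
%% tuple instead of an exit in the calling process.
call(Server, Request, TimeoutMs) ->
    try
        {ok, gen_server:call(Server, Request, TimeoutMs)}
    catch
        exit:{timeout, _} -> {error, timeout};
        exit:{noproc, _}  -> {error, noproc};
        exit:{Reason, _}  -> {error, {server_down, Reason}}
    end.
```

Whether to catch at all is a judgment call: letting the caller crash is often the right default, and a wrapper like this mainly helps when the caller holds state or other requests you don't want to lose.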
by al2o3cr on 12/28/23, 2:06 PM
Step zero is definitely the OTP Design Principles doc (part of the OTP distribution):
https://www.erlang.org/doc/design_principles/users_guide
There are some good texts that have more examples:
Erlang & OTP in Action - https://www.manning.com/books/erlang-and-otp-in-action
Designing for Scalability with Erlang/OTP - https://www.oreilly.com/library/view/designing-for-scalabili...
One big example of distributed Erlang is Riak:
https://github.com/basho/riak
by chadd on 12/29/23, 6:10 PM
I have written a lot of Erlang code over the years, including an Erlang Redis clone which had some interesting performance characteristics[1] ... though not recently, I went too far down the engineering management track at Snap and elsewhere... but I worked closely with Fernando "El Brujo"[2] when he was CTO of my consultancy. If you want to see beautiful, canonical Erlang code, he's still slinging it out. Dig through his repos on GitHub, or better yet, ask him to provide his suggestions.
[1] https://github.com/cbd/edis [2] https://github.com/elbrujohalcon
by thibaut_barrere on 12/29/23, 12:42 PM
If you were to consolidate all the info (including links published by Joe) into an informative blog post, it could become the “2024 reference bookmark” for a lot of people.
I have thought of writing this! It would be quite useful to a lot of people.
by ihuk on 12/29/23, 12:04 PM
You don't achieve fault tolerance solely by using Erlang. Erlang does not inherently 'achieve fault tolerance.' Instead, you make your system fault-tolerant through deliberate engineering. While Erlang provides tools and design guidelines, the responsibility for achieving fault tolerance ultimately lies with you. Source: I implemented and operated a large Erlang system for approximately 3 years.
by asa400 on 12/29/23, 9:39 PM
> I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.
Processes are failure and concurrency barriers.
Failure: one process crashing does not crash another process, unless you explicitly want it to (e.g., via Erlang's `link` functionality). So, if you have multiple operations that must not interfere with each other in the case of one of them misbehaving (e.g., your application makes multiple HTTP requests in parallel), you want them in separate processes.
Concurrency: processes are independently and preemptively scheduled by the VM. If you have multiple operations that are not necessarily sequentially ordered, and you want to run them at the same time, you put each of them in a process. One example problem where this applies would be the handling of incoming TCP messages, where each message is not related to the previous or subsequent messages, and you want to be able to process multiple messages at the same time.
If you handle each new message in its own process, the VM will schedule the processing of those messages such that the processing of one message will not interfere with the processing of another. It accomplishes this by tracking a rough proxy of CPU time each process uses (called "reductions" in Erlang) and descheduling processes that consume too many resources and giving other processes a chance to run for a bit. (Note that this is just one example and ignores any performance considerations. There are other approaches but I am omitting them for simplicity's sake)
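A small sketch of the failure-barrier point, using `spawn_monitor` so that each operation runs in its own process and a crash in one is reported without affecting the others. The module and function names are made up, and a real version would also want a timeout in `collect/1`:

```erlang
-module(isolate).
-export([run_all/1]).

%% Run each fun in its own monitored process and collect per-task results
%% in order. A crash in one task shows up as {error, Reason}; the other
%% tasks and the caller keep running.
run_all(Funs) ->
    Parent = self(),
    Tasks = [begin
                 Tag = make_ref(),
                 {_Pid, MRef} =
                     spawn_monitor(fun() -> Parent ! {Tag, F()} end),
                 {Tag, MRef}
             end || F <- Funs],
    [collect(Task) || Task <- Tasks].

collect({Tag, MRef}) ->
    receive
        {Tag, Result} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, _Pid, Reason} ->
            {error, Reason}
    end.
```

For example, `isolate:run_all([fun() -> 1 end, fun() -> exit(boom) end, fun() -> 3 end])` returns `[{ok,1},{error,boom},{ok,3}]`. Using a monitor rather than a link is what keeps the failure from propagating to the caller.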
There are a number of good libraries to look at for these in practice. I'd personally go look at Cowboy and/or Ranch as they deal with lots of IO. Oban is an Elixir job queue library that is fantastic and has very high code quality. Another good one would be Poolboy, which is a worker pool library.
by brudgers on 12/28/23, 3:22 PM
> when to split off a new process
Always?
Times some factor so you have several instances of the same thing in case one fails.
Good luck.
by jmnicolas on 12/29/23, 3:10 PM
I don't know Erlang, so take it with a grain of salt, but maybe take a look at the CouchDB code base?
It's a NoSQL DB written in Erlang. I looked at it a few years ago, and its master-to-master replication seemed cool.
by rramadass on 12/30/23, 12:12 PM
Not code but an excellent presentation of design/architecture of a Real-World fault-tolerant and distributed System in Erlang/Elixir - https://www.youtube.com/watch?v=pQ0CvjAJXz4
by vladimirralev on 12/29/23, 8:29 PM
My advice is don't take Erlang's fault-tolerance promises too seriously. It's just a little framework that helps in some cases and gets in the way in other cases.
I've seen many Erlang systems fail in funny ways, including some of the big examples given here. Supervision trees are cool but it's clearly nonsense to hardcode restart strategy and timing numbers for workers as if all failure modes are the same and deployed in the same network/capacity/resource/conditions with any number of workers. The strategy and schedule for recovering 10 crashed resource workers will clearly be different when you have 1M workers. The strategy will be different if you are timing out on network or if you are getting a resource error and have better things to do than restarting workers.
Focus on fault-tolerance outside Erlang: have standby capacity in isolation and load-balance properly, and shard the system into isolated pieces as much as you can.