by roeles on 12/28/23, 11:53 AM with 45 comments
I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.
Any suggestions?
by toast0 on 12/28/23, 9:26 PM
ok = do_something_that_might_fail()
If it returns ok: great, it worked and you move on. If it doesn't return ok, the process crashes, you get a crash report, and the supervisor restarts it, if that's how the supervisor is configured. Presumably it starts properly and deals with future requests.
There are two issues you might rapidly encounter.
1) if a supervised process restarts too many times in an interval, the supervisor will stop (and presumably restart), and that cascades up to potentially your node stopping. This is by design, and has good reasons, but might not be expected and might not be a good fit for larger nodes running many things.
2) if your process crashes, its message queue (mailbox) is discarded, and if you were sending to a process registered by name or process group (pg), the name is now unregistered. This means a service process crashing will discard several requests; the one in progress which is probably fine (it crashed after all), but also others that could have been serviced. In my experience, you end up wanting to catch errors in service processes, log them, and move on to the next request, so you don't lose unrelated requests. Depending on your application, a restart might be better, or you might run each request in a fresh process for isolation... Lots of ways to manage this.
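A minimal sketch of the "catch errors, log them, and move on" approach from point 2, assuming a hand-rolled receive loop rather than a gen_server. The module name `safe_service` and the toy "divide 100 by N" request are hypothetical, chosen only so that one request can fail:

```erlang
-module(safe_service).
-export([start/0, request/2, loop/0]).

%% Start the (hypothetical) service process.
start() ->
    spawn(?MODULE, loop, []).

%% Send one request and wait up to a second for the reply.
request(Pid, N) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, N},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        {error, timeout}
    end.

loop() ->
    receive
        {request, From, Ref, N} ->
            Reply =
                try
                    {ok, 100 div N}
                catch
                    Class:Reason ->
                        %% Log and carry on: the pid and the rest of the
                        %% mailbox survive the bad request.
                        logger:error("request failed: ~p:~p", [Class, Reason]),
                        {error, Reason}
                end,
            From ! {Ref, Reply},
            loop();
        stop ->
            ok
    end.
```

A request with `N = 0` gets `{error, badarith}` back, and requests queued behind it are still served by the same process. The trade-off is that any state corrupted by the failed request stays around, which is exactly what a restart would have cleared.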
by ramchip on 12/29/23, 4:52 PM
Elixir has a lot of smaller but very high quality libraries to learn from. You may be interested in how Ecto & Postgrex manage DB connections, in particular how connection sockets are “borrowed” so data doesn’t get repeatedly messaged (read: copied) between processes. Bandit / Thousand Island also make interesting decisions for process structure in HTTP1.1 vs HTTP2.
I think a common mistake is to create processes mimicking classic OOP structure, like an OrderProcessor, ShippingManager, etc. Processes in Erlang are a unit of fault tolerance, not code organization. This means more usually you’ll have one process per request, potentially calling code from many different modules; since requests are the things you want to fail separately from each other.
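As a rough illustration of "one process per request", here is a sketch that runs each request in a fresh monitored process, so a crash in one request is reported to the caller instead of taking it down. The module name `per_request` and the 5 second timeout are assumptions made for the example:

```erlang
-module(per_request).
-export([handle/2]).

%% Run Fun(Request) in its own process and wait for the outcome.
%% The worker reports its result through its exit reason, so a single
%% 'DOWN' message covers both success and failure.
handle(Fun, Request) ->
    {Pid, MRef} = spawn_monitor(fun() -> exit({done, Fun(Request)}) end),
    receive
        {'DOWN', MRef, process, Pid, {done, Result}} ->
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    after 5000 ->
        %% Give up on a stuck request and clean up the monitor.
        exit(Pid, kill),
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

`Fun` can call into as many modules as it likes; the process boundary follows the request, not the code layout.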
In RabbitMQ’s case for instance connections and queues are processes, but exchanges are not. It would feel natural to model the problem as three processes with messages going Connection -> Exchange -> Queue, but in reality an exchange is a set of routing rules that can be applied by a connection directly, which avoids a lot of complexity and overhead.
Last thing I’d note is supervision trees etc. are really about handling _unexpected_ errors (Joe uses the terms faults and errors with different meanings iirc). If you want a web request to be retried a few times with a delay, don’t use a supervisor for that, just loop with a sleep. Same for things like validating inputs from a form, usually you’d want to give the user a hint and not just crash.
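The "just loop with a sleep" advice could look something like this sketch. The module name `retry` and the convention that the function returns `{ok, Value}` or `{error, Reason}` are assumptions for the example:

```erlang
-module(retry).
-export([with_retries/3]).

%% Call Fun up to Attempts times, sleeping DelayMs between tries.
%% Fun is expected to return {ok, Value} or {error, Reason}.
with_retries(Fun, Attempts, DelayMs) when Attempts > 0 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, _} = Error when Attempts =:= 1 ->
            %% Out of attempts: hand the last error back to the caller.
            Error;
        {error, _} ->
            timer:sleep(DelayMs),
            with_retries(Fun, Attempts - 1, DelayMs)
    end.
```

Expected failures stay ordinary return values here; anything else Fun does (raising, returning an unexpected shape) still crashes the caller and is left to the supervisor.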
Some other useful links:
- https://aosabook.org/en/v1/riak.html (bit old, but another large codebase)
by octacat on 12/28/23, 3:42 PM
Check the code of cowboy, ejabberd, MongooseIM, and RabbitMQ for examples. Many factors go into the decision of when to make a new process: data locality, the pattern of interaction with other processes, performance considerations. A good idea is to have one process per TCP connection, but not one process per routed message. And be careful with blocking gen_server calls - these could block or fail.
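On the last point, a small sketch of guarding a blocking call: `gen_server:call/3` exits in the caller when the server is missing or does not answer in time, so a wrapper can turn those exits into return values. The module name `safe_call` is hypothetical:

```erlang
-module(safe_call).
-export([call/3]).

%% Wrap gen_server:call/3 so that a slow or missing server produces
%% an error tuple instead of crashing the calling process.
call(Server, Request, TimeoutMs) ->
    try
        {ok, gen_server:call(Server, Request, TimeoutMs)}
    catch
        exit:{timeout, _} -> {error, timeout};
        exit:{noproc, _}  -> {error, noproc};
        exit:Reason       -> {error, Reason}
    end.
```

Whether to catch at all is a design choice: letting the caller crash is often fine. Note also that on older OTP releases a reply that arrives after the timeout can still end up in the caller's mailbox.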
by al2o3cr on 12/28/23, 2:06 PM
https://www.erlang.org/doc/design_principles/users_guide
There are some good texts that have more examples:
Erlang & OTP in Action - https://www.manning.com/books/erlang-and-otp-in-action
Designing for Scalability with Erlang/OTP - https://www.oreilly.com/library/view/designing-for-scalabili...
One big example of distributed Erlang is Riak:
by chadd on 12/29/23, 6:10 PM
[1] https://github.com/cbd/edis [2] https://github.com/elbrujohalcon
by thibaut_barrere on 12/29/23, 12:42 PM
I have thought of writing this! It would be quite useful to a lot of people.
by ihuk on 12/29/23, 12:04 PM
by asa400 on 12/29/23, 9:39 PM
Processes are failure and concurrency barriers.
Failure: one process crashing does not crash another process, unless you explicitly want it to (e.g., via Erlang's `link` functionality). So, if you have multiple operations that must not interfere with each other in the case of one of them misbehaving (e.g., your application makes multiple HTTP requests in parallel), you want them in separate processes.
Concurrency: processes are independently and preemptively scheduled by the VM. If you have multiple operations that are not necessarily sequentially ordered, and you want to run them at the same time, you put each of them in a process. One example problem where this applies would be the handling of incoming TCP messages, where each message is not related to the previous or subsequent messages, and you want to be able to process multiple messages at the same time.
If you handle each new message in its own process, the VM will schedule the processing of those messages such that the processing of one message will not interfere with the processing of another. It accomplishes this by tracking a rough proxy of CPU time each process uses (called "reductions" in Erlang) and descheduling processes that consume too many resources and giving other processes a chance to run for a bit. (Note that this is just one example and ignores any performance considerations. There are other approaches but I am omitting them for simplicity's sake)
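A minimal sketch of the "each message in its own process" idea, written as a parallel map. The module name `pmap` is hypothetical, and like the description above it ignores performance concerns such as bounding the number of processes:

```erlang
-module(pmap).
-export([map/2]).

%% Apply Fun to every item, each in its own process, and collect the
%% results in the original order.
map(Fun, Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                %% Linked on purpose: if a worker crashes, the caller
                %% crashes too instead of waiting forever for a reply.
                spawn_link(fun() -> Parent ! {Ref, Fun(Item)} end),
                Ref
            end || Item <- Items],
    %% Selective receive on each unique reference keeps the order stable
    %% even though workers finish in any order.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

The VM schedules the workers independently, so one slow item does not hold up the computation of the others; it only delays collecting the final list.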
There are a number of good libraries to look at for these in practice. I'd personally go look at Cowboy and/or Ranch as they deal with lots of IO. Oban is an Elixir job queue library that is fantastic and has very high code quality. Another good one would be Poolboy, which is a worker pool library.
by brudgers on 12/28/23, 3:22 PM
Always?
Times some factor so you have several instances of the same thing in case one fails.
Good luck.
by jmnicolas on 12/29/23, 3:10 PM
It's a NoSQL DB written in Erlang. I looked at it a few years ago, its master to master replication seemed cool.
by rramadass on 12/30/23, 12:12 PM
by vladimirralev on 12/29/23, 8:29 PM
I've seen many Erlang systems fail in funny ways, including some of the big examples given here. Supervision trees are cool but it's clearly nonsense to hardcode restart strategy and timing numbers for workers as if all failure modes are the same and deployed in the same network/capacity/resource/conditions with any number of workers. The strategy and schedule for recovering 10 crashed resource workers will clearly be different when you have 1M workers. The strategy will be different if you are timing out on network or if you are getting a resource error and have better things to do than restarting workers.
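For reference, the restart limits in question are the `intensity` and `period` supervisor flags. One way to avoid baking them into the code is to read them from the application environment, as in this sketch. The application name `my_app` and the keys `max_restarts` and `restart_period_s` are hypothetical, as are the defaults:

```erlang
-module(tunable_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Hypothetical config keys; each deployment can set its own values.
    Intensity = application:get_env(my_app, max_restarts, 3),
    Period = application:get_env(my_app, restart_period_s, 10),
    %% More than Intensity restarts within Period seconds makes this
    %% supervisor give up and terminate.
    SupFlags = #{strategy => one_for_one,
                 intensity => Intensity,
                 period => Period},
    %% Child specs are left out of this sketch.
    {ok, {SupFlags, []}}.
```

This only moves the numbers into configuration. It does not make the strategy depend on the kind of failure, which is the harder problem described above.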
Focus on fault-tolerance outside Erlang - have standby capacity in isolation and load-balance properly, shard the system into isolated pieces as much as you can.