by darthShadow on 2/2/23, 4:30 PM with 66 comments
by schmichael on 2/2/23, 5:51 PM
I hope Nomad covers cases like scaling-from-zero better in the future, but to do that within the latency requirements of a single HTTP request is quite the feat of design and implementation. There's a lot of batching Nomad does for scale and throughput that conflicts with the desire for minimal placement+startup latency, and it remains to be seen whether "having our cake and eating it too" is physically possible, much less whether we can package it up in a way that lets operators understand what tradeoffs they're choosing.
I've had the pleasure of chatting with mrkurt in the past, and I definitely intend to follow fly.io closely even if they're no longer a Nomad user! Thanks again for yet another fantastic post, and I wish fly.io all the best.
by jallmann on 2/2/23, 9:34 PM
The key piece is to have a registry [2] with a (somewhat) up-to-date view of worker resources. This could actually be a completely static list that gets refreshed once in a while. Whenever a client has a new job, it can look at its latest copy of the registry, select workers that seem suitable, submit jobs to workers directly, handle retries, etc.
One neat thing about this "direct-to-worker" architecture is that it allows for backpressure from the workers themselves. Workers can respond to a shifting load profile almost instantaneously without having to centrally deallocate resources, or wait for healthchecks to pick up the latest state. Workers can tell incoming jobs, "hey sorry but I'm busy atm" and the client will try elsewhere.
This also allows for rich worker selection strategies on the client itself; eg it can preemptively request the same job on multiple workers, keep the first one that is accepted, and cancel the rest, or favor workers that respond fastest, and so forth.
[1] We were more of a "decentralized job queue" than a "distributed VM scheduler", with the corresponding differences, eg shorter-lived jobs with fluctuating load profiles, and our clients could be thicker. But many of the core ideas are shared; even our workers were called "orchestrators", which in turn could use similar ideas to manage jobs on the GPUs attached to them... schedulers all the way down!
[2] Here the registry seems to be constructed via the Corrosion gossip protocol; we used the blockchain with regular healthcheck probes.
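As an illustration of the direct-to-worker flow described above, here's a minimal Go sketch (mine, not from the article or the original system): the Worker and Registry types and the /jobs endpoint are invented for illustration, and the registry snapshot could come from Corrosion, a static list, or anything else that's roughly fresh.

    package dispatch

    import (
    	"bytes"
    	"context"
    	"errors"
    	"net/http"
    	"time"
    )

    // Worker is one entry in the (possibly stale) registry snapshot.
    // Both types are hypothetical placeholders for whatever the real
    // registry hands back.
    type Worker struct {
    	Addr    string // e.g. "http://worker-7:8080"
    	FreeCPU int
    	FreeMem int
    }

    type Registry interface {
    	// Snapshot returns the client's latest local view of worker resources.
    	Snapshot() []Worker
    }

    // Dispatch submits the job directly to workers that look suitable,
    // treating a 503 as backpressure ("I'm busy atm") and falling through
    // to the next candidate.
    func Dispatch(ctx context.Context, reg Registry, job []byte, cpu, mem int) (string, error) {
    	client := &http.Client{Timeout: 2 * time.Second}
    	for _, w := range reg.Snapshot() {
    		if w.FreeCPU < cpu || w.FreeMem < mem {
    			continue // looks unsuitable in our (stale) view; skip it
    		}
    		req, err := http.NewRequestWithContext(ctx, http.MethodPost, w.Addr+"/jobs", bytes.NewReader(job))
    		if err != nil {
    			return "", err
    		}
    		resp, err := client.Do(req)
    		if err != nil {
    			continue // worker unreachable; retry elsewhere
    		}
    		resp.Body.Close()
    		switch resp.StatusCode {
    		case http.StatusOK, http.StatusCreated:
    			return w.Addr, nil // placed
    		case http.StatusServiceUnavailable:
    			continue // backpressure from the worker itself; try the next one
    		default:
    			continue
    		}
    	}
    	return "", errors.New("no worker accepted the job")
    }

Racing the same request against two or three candidates and cancelling the losers (the hedged-request strategy from the parent comment) falls out naturally by running the loop body in goroutines that share a cancellable context and keeping the first acceptance.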
by tptacek on 2/2/23, 5:49 PM
by kalev on 2/2/23, 6:31 PM
by coredog64 on 2/2/23, 6:38 PM
AFS solves for Colossus, with packages being distributed into AFS.
by filereaper on 2/2/23, 8:53 PM
Aurora, Marathon, etc. would add the flavor of orchestration that's needed; Mesos provided the resources requested.
by mochomocha on 2/2/23, 8:28 PM
IMO it's one of the better parts of k8s. The core scheduler is pretty well written and extensible through scheduling plugins, so you can implement whatever policies your heart desires (which we make extensive use of at Netflix).
The main issue I have with it is the lack of built-in observability, which makes it non-trivial to A/B test scheduling policies in large-scale deployments, because you want to be able to log the various subscores of your plugins. But it's so extensible through NodeAffinity and PodAffinity plugins that you can even delegate part (or all!) of the scheduling decisions outside of it if you want.
Besides observability, one issue we've had to overcome with k8s scheduling is the inheritance of Borg's design decisions around pod shape immutability, which makes it harder to implement things like oversubscription in a "native" way.
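For a sense of what "extensible through scheduling plugins" looks like in practice, here's a hedged sketch of an out-of-tree Score plugin against the kube-scheduler framework. The signatures follow the ~1.26-era framework and have shifted slightly in newer releases; the LeastPodsScore name and its heuristic are invented for illustration, not anything Netflix or the article describes.

    package main

    import (
    	"context"
    	"os"

    	v1 "k8s.io/api/core/v1"
    	apiruntime "k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )

    // LeastPodsScore is a made-up Score plugin: it prefers emptier nodes and
    // logs its subscore, which is the kind of hook the parent comment wants
    // for A/B-testing scheduling policies.
    type LeastPodsScore struct {
    	handle framework.Handle
    }

    const Name = "LeastPodsScore"

    var _ framework.ScorePlugin = &LeastPodsScore{}

    func (p *LeastPodsScore) Name() string { return Name }

    // Score runs once per candidate node during the scoring phase.
    func (p *LeastPodsScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(err)
    	}
    	// Toy heuristic: fewer pods already on the node => higher score.
    	score := framework.MaxNodeScore - int64(len(nodeInfo.Pods))
    	if score < 0 {
    		score = 0
    	}
    	// Emitting the subscore (log line, metric, trace) is the observability
    	// hook the comment above is asking for.
    	klog.InfoS("leastpods subscore", "pod", klog.KObj(pod), "node", nodeName, "score", score)
    	return score, nil
    }

    func (p *LeastPodsScore) ScoreExtensions() framework.ScoreExtensions { return nil }

    // New is the factory kube-scheduler calls when the plugin is enabled in a
    // scheduling profile. (Newer releases add a context parameter here.)
    func New(_ apiruntime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &LeastPodsScore{handle: h}, nil
    }

    func main() {
    	// Build a kube-scheduler binary that also registers the custom plugin.
    	command := app.NewSchedulerCommand(app.WithPlugin(Name, New))
    	if err := command.Execute(); err != nil {
    		os.Exit(1)
    	}
    }

Enabling it is then a matter of listing the plugin by name under a scheduling profile's score plugins in the KubeSchedulerConfiguration.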
by yencabulator on 2/3/23, 8:50 PM
> Our WireGuard mesh sees IPv6 addresses that look like fdaa:host:host::/48 but the rest of our system sees fdaa:net:net::/48.
That was probably meant to be net:host -> host:net.
by dijksterhuis on 2/3/23, 12:34 PM
Will be referencing and sharing this fairly frequently, I reckon, so massive thank you @tptacek
by anyas on 2/3/23, 4:48 AM
by nitwit005 on 2/3/23, 9:03 AM
by plaidfuji on 2/2/23, 9:49 PM
by korijn on 2/2/23, 8:20 PM
Anyway, they did get it done and made it work, so whatever.