by darthShadow on 2/2/23, 4:30 PM with 66 comments
by schmichael on 2/2/23, 5:51 PM
I hope Nomad covers cases like scaling-from-zero better in the future, but to do that within the latency requirements of a single HTTP request is quite the feat of design and implementation. There's a lot of batching Nomad does for scale and throughput that conflicts with the desire for minimal placement+startup latency, and it remains to be seen whether "having our cake and eating it too" is physically possible, much less whether we can package it up in a way that lets operators understand what tradeoffs they're choosing.
I've had the pleasure of chatting with mrkurt in the past, and I definitely intend to follow fly.io closely even if they're no longer a Nomad user! Thanks again for yet another fantastic post, and I wish fly.io all the best.
by jallmann on 2/2/23, 9:34 PM
The key piece is to have a registry [2] with a (somewhat) up-to-date view of worker resources. This could actually be a completely static list that gets refreshed once in a while. Whenever a client has a new job, it can look at its latest copy of the registry, select workers that seem suitable, submit jobs to workers directly, handle retries, etc.
One neat thing about this "direct-to-worker" architecture is that it allows for backpressure from the workers themselves. Workers can respond to a shifting load profile almost instantaneously without having to centrally deallocate resources, or wait for healthchecks to pick up the latest state. Workers can tell incoming jobs, "hey sorry but I'm busy atm" and the client will try elsewhere.
This also allows for rich worker selection strategies on the client itself; eg it can preemptively request the same job on multiple workers, keep the first one that is accepted, and cancel the rest, or favor workers that respond fastest, and so forth.
[1] We were more of a "decentralized job queue" than a "distributed VM scheduler", with the corresponding differences, eg shorter-lived jobs with fluctuating load profiles, and our clients could be thicker. But many of the core ideas are shared; even our workers were called "orchestrators", which in turn could use similar ideas to manage jobs on the GPUs attached to them... schedulers all the way down!
[2] Here the registry seems to be constructed via the Corrosion gossip protocol; we used the blockchain with regular healthcheck probes.
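As an illustration of the direct-to-worker flow described above, here's a minimal Go sketch (mine, not from the article or the original system): the Worker and Registry types and the /jobs endpoint are invented for illustration, and the registry snapshot could come from Corrosion, a static list, or anything else that's roughly fresh.

    package dispatch

    import (
    	"bytes"
    	"context"
    	"errors"
    	"net/http"
    	"time"
    )

    // Worker is one entry in the (possibly stale) registry snapshot.
    // Both types are hypothetical placeholders for whatever the real
    // registry hands back.
    type Worker struct {
    	Addr    string // e.g. "http://worker-7:8080"
    	FreeCPU int
    	FreeMem int
    }

    type Registry interface {
    	// Snapshot returns the client's latest local view of worker resources.
    	Snapshot() []Worker
    }

    // Dispatch submits the job directly to workers that look suitable,
    // treating a 503 as backpressure ("I'm busy atm") and falling through
    // to the next candidate.
    func Dispatch(ctx context.Context, reg Registry, job []byte, cpu, mem int) (string, error) {
    	client := &http.Client{Timeout: 2 * time.Second}
    	for _, w := range reg.Snapshot() {
    		if w.FreeCPU < cpu || w.FreeMem < mem {
    			continue // looks unsuitable in our (stale) view; skip it
    		}
    		req, err := http.NewRequestWithContext(ctx, http.MethodPost, w.Addr+"/jobs", bytes.NewReader(job))
    		if err != nil {
    			return "", err
    		}
    		resp, err := client.Do(req)
    		if err != nil {
    			continue // worker unreachable; retry elsewhere
    		}
    		resp.Body.Close()
    		switch resp.StatusCode {
    		case http.StatusOK, http.StatusCreated:
    			return w.Addr, nil // placed
    		case http.StatusServiceUnavailable:
    			continue // backpressure from the worker itself; try the next one
    		default:
    			continue
    		}
    	}
    	return "", errors.New("no worker accepted the job")
    }

Racing the same request against two or three candidates and cancelling the losers (the hedged-request strategy from the parent comment) falls out naturally by running the loop body in goroutines that share a cancellable context and keeping the first acceptance.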
by tptacek on 2/2/23, 5:49 PM
by kalev on 2/2/23, 6:31 PM
by coredog64 on 2/2/23, 6:38 PM
AFS solves for Colossus, with packages being distributed into AFS.
by filereaper on 2/2/23, 8:53 PM
Aurora, Marathon, etc. would add the flavor of orchestration that's needed; Mesos provided the resources requested.
by mochomocha on 2/2/23, 8:28 PM
IMO it's one of the better parts of k8s. The core scheduler is pretty well written and extensible through scheduling plugins, so you can implement whatever policies your heart desires (which we make extensive use of at Netflix).
The main issue I have with it is the lack of built-in observability, which makes it non-trivial to A/B test scheduling policies in large-scale deployments, because you want to be able to log the various subscores of your plugins. But it's so extensible through NodeAffinity and PodAffinity plugins that you can even delegate part (or all!) of the scheduling decisions outside of it if you want.
Besides observability, one issue we've had to overcome with k8s scheduling is the inheritance of Borg's design decisions around pod shape immutability, which makes it harder to implement things like oversubscription in a "native" way.
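For a sense of what "extensible through scheduling plugins" looks like in practice, here's a hedged sketch of an out-of-tree Score plugin against the kube-scheduler framework. The signatures follow the ~1.26-era framework and have shifted slightly in newer releases; the LeastPodsScore name and its heuristic are invented for illustration, not anything Netflix or the article describes.

    package main

    import (
    	"context"
    	"os"

    	v1 "k8s.io/api/core/v1"
    	apiruntime "k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )

    // LeastPodsScore is a made-up Score plugin: it prefers emptier nodes and
    // logs its subscore, which is the kind of hook the parent comment wants
    // for A/B-testing scheduling policies.
    type LeastPodsScore struct {
    	handle framework.Handle
    }

    const Name = "LeastPodsScore"

    var _ framework.ScorePlugin = &LeastPodsScore{}

    func (p *LeastPodsScore) Name() string { return Name }

    // Score runs once per candidate node during the scoring phase.
    func (p *LeastPodsScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(err)
    	}
    	// Toy heuristic: fewer pods already on the node => higher score.
    	score := framework.MaxNodeScore - int64(len(nodeInfo.Pods))
    	if score < 0 {
    		score = 0
    	}
    	// Emitting the subscore (log line, metric, trace) is the observability
    	// hook the comment above is asking for.
    	klog.InfoS("leastpods subscore", "pod", klog.KObj(pod), "node", nodeName, "score", score)
    	return score, nil
    }

    func (p *LeastPodsScore) ScoreExtensions() framework.ScoreExtensions { return nil }

    // New is the factory kube-scheduler calls when the plugin is enabled in a
    // scheduling profile. (Newer releases add a context parameter here.)
    func New(_ apiruntime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &LeastPodsScore{handle: h}, nil
    }

    func main() {
    	// Build a kube-scheduler binary that also registers the custom plugin.
    	command := app.NewSchedulerCommand(app.WithPlugin(Name, New))
    	if err := command.Execute(); err != nil {
    		os.Exit(1)
    	}
    }

Enabling it is then a matter of listing the plugin by name under a scheduling profile's score plugins in the KubeSchedulerConfiguration.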
by yencabulator on 2/3/23, 8:50 PM
> Our WireGuard mesh sees IPv6 addresses that look like fdaa:host:host::/48 but the rest of our system sees fdaa:net:net::/48.
That was probably meant to be net:host -> host:net.
by dijksterhuis on 2/3/23, 12:34 PM
Will be referencing and sharing this fairly frequently, I reckon, so massive thank you @tptacek
by anyas on 2/3/23, 4:48 AM
by nitwit005 on 2/3/23, 9:03 AM
by plaidfuji on 2/2/23, 9:49 PM
by korijn on 2/2/23, 8:20 PM
Anyway, they did get it done and made it work, so whatever.