from Hacker News

Scaling Kubernetes to 7,500 nodes (2021)

by izwasm on 3/15/23, 9:01 PM with 37 comments

  • by sciurus on 3/15/23, 10:44 PM

    This is from 2021 and was discussed then at https://news.ycombinator.com/item?id=25907312

    I'm curious what they're doing now.

  • by antonchekhov on 3/20/23, 5:34 PM

    To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project ( https://armadaproject.io/ ). Armada is a multi-Kubernetes cluster batch job scheduler, and is designed to address the following issues:

    A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.

    Achieving very high throughput against the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster, using a specialized storage layer.

    Armada is designed primarily for ML, AI, and data analytics workloads, and to:

    - Manage compute clusters composed of tens of thousands of nodes in total.
    - Schedule a thousand or more pods per second, on average.
    - Enqueue tens of thousands of jobs over a few seconds.
    - Divide resources fairly between users.
    - Provide visibility for users and admins.
    - Ensure near-constant uptime.

    Armada is written in Go and uses Apache Pulsar for eventing, along with PostgreSQL and Redis. A web-based front-end (named "Lookout") gives end users an easy view of enqueued, running, and failed jobs. A Kubernetes Operator for quick installation and deployment of Armada is in development.
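
    To make that concrete, here is a toy sketch in Go of the two ideas above (illustrative only, not Armada's actual code or API; the users, clusters, and the min-usage fair-share rule are invented for the example): jobs sit in per-user queues outside any single cluster, and the scheduler repeatedly dispatches for the user furthest below their share to whichever member cluster has the most free capacity.

        package main

        import "fmt"

        type Job struct {
            User string
            CPUs int
        }

        type Cluster struct {
            Name     string
            FreeCPUs int
        }

        // pickUser returns the queued user with the least resource currently in use,
        // a crude stand-in for a fair-share policy.
        func pickUser(queues map[string][]Job, usage map[string]int) string {
            best, bestUsage := "", int(^uint(0)>>1) // max int sentinel
            for user, q := range queues {
                if len(q) > 0 && usage[user] < bestUsage {
                    best, bestUsage = user, usage[user]
                }
            }
            return best
        }

        func main() {
            // Per-user job queues held outside any single cluster (the
            // out-of-cluster storage layer described above).
            queues := map[string][]Job{
                "alice": {{User: "alice", CPUs: 16}},
                "bob":   {{User: "bob", CPUs: 8}, {User: "bob", CPUs: 8}},
            }
            usage := map[string]int{"alice": 64, "bob": 0} // CPUs each user already holds
            clusters := []*Cluster{{"cluster-a", 128}, {"cluster-b", 32}}

            for {
                user := pickUser(queues, usage)
                if user == "" {
                    break // nothing left to schedule
                }
                job := queues[user][0]

                // Dispatch to the member cluster with the most free capacity.
                target := clusters[0]
                for _, c := range clusters[1:] {
                    if c.FreeCPUs > target.FreeCPUs {
                        target = c
                    }
                }
                if target.FreeCPUs < job.CPUs {
                    break // nothing fits right now; the job stays queued
                }
                queues[user] = queues[user][1:]
                target.FreeCPUs -= job.CPUs
                usage[user] += job.CPUs
                fmt.Printf("dispatched %s's %d-CPU job to %s\n", user, job.CPUs, target.Name)
            }
        }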

    Source code is available at https://github.com/armadaproject/armada - we welcome contributors and user reports!

  • by vvladymyrov on 3/16/23, 1:29 AM

    They also use Ray.io from Anyscale: https://archive.ph/ZlMi5

  • by mrits on 3/15/23, 11:41 PM

    I'm not a huge fan of Kubernetes. However, I think there are some great use cases and undeniably some super intelligent people pushing it to amazing limits.

    That said, after reading this over I see some serious red flags. I wonder whether this team even understands the alternatives for scheduling at this scale, or the real trade-offs. It seems like an average choice at best, and if I were paying the light bill I'd definitely object to going this route.

  • by osigurdson on 3/16/23, 1:30 PM

    >> Pods communicate directly with one another on their pod IP addresses with MPI via SSH

    It would be nice if someone could solve this problem in a more Kubernetes-native way. I.e., here is a container, run it on N nodes using MPI, optimizing for the right NUMA node / GPU configuration.

    Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes, for example?
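
    For the discovery half of that, a minimal client-go sketch (my own illustration, not from the article; it assumes the workers are pods labeled app=mpi-worker in the default namespace, and the slots-per-host value is made up) that turns pod IPs into an MPI hostfile:

        package main

        import (
            "context"
            "fmt"
            "os"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/rest"
        )

        func main() {
            // Assumes this runs inside the cluster with RBAC permission to list pods.
            cfg, err := rest.InClusterConfig()
            if err != nil {
                panic(err)
            }
            clientset, err := kubernetes.NewForConfig(cfg)
            if err != nil {
                panic(err)
            }

            // Hypothetical label identifying the MPI worker pods of one job.
            pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
                metav1.ListOptions{LabelSelector: "app=mpi-worker"})
            if err != nil {
                panic(err)
            }

            // One hostfile line per pod IP; "slots" would normally match the GPUs
            // or NUMA-local cores available on that node.
            f, err := os.Create("hostfile")
            if err != nil {
                panic(err)
            }
            defer f.Close()
            for _, p := range pods.Items {
                if p.Status.PodIP != "" {
                    fmt.Fprintf(f, "%s slots=8\n", p.Status.PodIP)
                }
            }
            // mpirun --hostfile hostfile ... would then still launch ranks over SSH,
            // which is exactly the part that wants a more native replacement.
        }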

  • by rmorey on 3/15/23, 10:47 PM

    Good read. Should probably get a [2021] tag.

  • by bbarnett on 3/16/23, 1:42 AM

    Success! Meanwhile, all 7,500 nodes are computationally replaced by a 96-core, $10k server in a dude's basement.

    With power to spare.

  • by satvikpendem on 3/16/23, 12:31 AM

    Is Kubernetes simply BEAM but not on Erlang?