from Hacker News

Zero-Downtime Kubernetes Deployments on AWS with EKS

by pmig on 3/10/25, 12:48 PM with 115 comments

  • by paranoidrobot on 3/10/25, 9:44 PM

    We had to figure this out the hard way, and ended up with this approach (approximately).

    K8s provides two (well, three now) types of health checks: liveness, readiness, and more recently startup probes.

    How these interact with the ALB is quite important.

    Liveness should always return 200 OK unless you have hit some fatal condition where your container considers itself dead and wants to be restarted.

    Readiness should only return 200 OK if you are ready to serve traffic.

    We configure the ALB to only point to the readiness check.

    So our application lifecycle looks like this:

    * Container starts

    * Application loads

    * Liveness begins serving 200

    * Some internal health checks run and set readiness state to True

    * Readiness checks now return 200

    * ALB checks begin passing and so pod is added to the target group

    * Pod starts getting traffic.

    Time passes. Eventually, for some reason, the pod needs to shut down.

    * Kube calls the preStop hook

    * PreStop sends SIGUSR1 to app and waits for N seconds.

    * App handler for SIGUSR1 tells readiness hook to start failing.

    * ALB health checks begin failing, and no new requests should be sent.

    * ALB takes the pod out of the target group.

    * PreStop hook finishes waiting and returns

    * Kube sends SIGTERM

    * App wraps up any remaining in-flight requests and shuts down.

    This allows the app to shut down gracefully, and ensures the ALB doesn't send traffic to a pod that knows it is being shut down.
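
    For reference, here's a minimal sketch of how that lifecycle might be wired into the pod spec. This is not the poster's actual config: the paths, port, image, durations and the "SIGUSR1, then sleep" convention are all illustrative assumptions.

    ```yaml
    spec:
      terminationGracePeriodSeconds: 60         # must cover the preStop wait plus draining in-flight requests
      containers:
        - name: app                             # hypothetical container name
          image: registry.example.com/app:1.2.3 # hypothetical image
          ports:
            - containerPort: 8080
          livenessProbe:                        # 200 unless the app considers itself dead and wants a restart
            httpGet:
              path: /healthz/live               # illustrative path
              port: 8080
          readinessProbe:                       # 200 only when ready to serve; the ALB health check targets this
            httpGet:
              path: /healthz/ready              # illustrative path
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                # Ask the app to start failing readiness, then wait for the ALB to
                # drain the target before Kubernetes sends SIGTERM (assumes the app
                # runs as PID 1 and a shell exists in the image).
                command: ["sh", "-c", "kill -USR1 1 && sleep 20"]
    ```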

    Oh, and on the Readiness check - your app can use this to (temporarily) signal that it is too busy to serve more traffic. Handy as another signal you can monitor for scaling.
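
    On the app side, that readiness state can be a simple atomic flag that the SIGUSR1 handler flips and that the readiness endpoint (and any "too busy" back-pressure logic) consults. A minimal Go sketch under those assumptions - the endpoint paths, port and signal choice are illustrative, not the poster's code:

    ```go
    package main

    import (
        "log"
        "net/http"
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
    )

    // ready backs the readiness probe; it could also be set to false
    // temporarily when the app decides it is too busy for more traffic.
    var ready atomic.Bool

    func main() {
        ready.Store(true) // flip to true once internal startup checks pass

        // SIGUSR1 from the preStop hook: start failing readiness so the ALB drains us.
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGUSR1)
        go func() {
            <-sig
            ready.Store(false)
        }()

        http.HandleFunc("/healthz/live", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK) // only fail this if the process should be restarted
        })
        http.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
            if ready.Load() {
                w.WriteHeader(http.StatusOK)
                return
            }
            w.WriteHeader(http.StatusServiceUnavailable)
        })

        log.Fatal(http.ListenAndServe(":8080", nil))
    }
    ```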

    e: Formatting was slightly broken.

  • by bradleyy on 3/10/25, 7:32 PM

    I know this won't be helpful to folks committed to EKS, but AWS ECS (i.e. running Docker containers with AWS handling the orchestration) does a really great job on this. We've been running ECS for years (at multiple companies) with basically no hiccups.

    One of my former co-workers went to a K8S shop, and longs for the simplicity of ECS.

    No software is a panacea, but ECS seems to be one of those "it just works" technologies.

  • by _bare_metal on 3/10/25, 5:43 PM

    I run https://BareMetalSavings.com.

    The number of companies that use K8s with no business or technological justification for it is staggering. It's the number one blocker when moving to bare metal/on-prem once costs become too high.

    Yes, on-prem has its gotchas, just like the EKS deployment described in the post, but everything is so much simpler and more straightforward that the on-prem side of things is much easier to grasp.

  • by paol on 3/10/25, 6:02 PM

    I'm not sure why they state "although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime."

    The AWS Load Balancer Controller uses readiness gates by default, exactly as described in the article. Am I missing something?

    Edit: Ah, it's not by default, it requires a label in the namespace. I'd forgotten about this. To be fair though, the AWS docs tell you to add this label.
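
    For reference, the label in question goes on the namespace, roughly like this (the namespace name is illustrative, and the exact label key should be checked against the AWS Load Balancer Controller docs):

    ```yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-app                                        # hypothetical namespace
      labels:
        elbv2.k8s.aws/pod-readiness-gate-inject: enabled  # opts Pods in this namespace into readiness gate injection
    ```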

  • by cassianoleal on 3/11/25, 12:11 AM

    A few years ago, while helping build a platform on Google Cloud & GKE for a client, we found the same issues.

    At that point we already had a CRD used by most of our tenant apps, which deployed an opinionated (but generally flexible enough) full app stack: Deployment, Service, PodMonitor, many sane defaults for affinity/anti-affinity, and other things, much of it configurable.

    Because we didn't have an opinion on what tenant apps would use in their containers, we needed a way to make the pre-stop sleep small but OS-agnostic.

    We ended up with a 1-LOC (plus headers) C app that compiled to a tiny static binary. This was put in a ConfigMap, which the controller mounted into the Pod, from where it could be executed natively.

    Perhaps not the most elegant solution, but a simple enough one that got the job done and was left alone with zero required maintenance for years - it might still be there to this day. It was quite fun to watch the reaction of new platform engineers the first time they'd come across it in the codebase. :D
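
    A rough sketch of what that wiring might look like - the names and mount path are illustrative, not the actual platform's manifests, and the binary itself lives in the ConfigMap's binaryData:

    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prestop-sleep                       # hypothetical name
    binaryData:
      sleep: "<base64 of the tiny static binary>"
    ---
    # Relevant fragment of the generated pod spec (not a complete manifest):
    volumes:
      - name: prestop-sleep
        configMap:
          name: prestop-sleep
          defaultMode: 0555                     # mount the file as executable
    containers:
      - name: app                               # hypothetical container name
        volumeMounts:
          - name: prestop-sleep
            mountPath: /opt/prestop
        lifecycle:
          preStop:
            exec:
              command: ["/opt/prestop/sleep"]   # runs natively, no shell needed in the image
    ```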

  • by NightMKoder on 3/10/25, 9:04 PM

    This is actually a fascinatingly complex problem. Some notes about the article:

    * The 20s delay before shutdown is called “lame duck mode.” As implemented it’s close to good, but not perfect.

    * When in lame duck mode you should fail the pod’s health check. That way you don’t rely on the ALB controller to remove your pod. Your pod is still serving other requests, but gracefully asking everyone to forget about it.

    * Make an effort to close http keep-alive connections. This is more important if you’re running another proxy that won’t listen to the health checks above (eg AWS -> Node -> kube-proxy -> pod). Note that you can only do that when a request comes in - but it’s as simple as a Connection: close header on the response. (See the sketch after this list.)

    * On a fun note, the new-ish kubernetes graceful node shutdown feature won’t remove your pod readiness when shutting down.
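
    The keep-alive point can be handled with a small piece of middleware that adds Connection: close to responses once the server enters lame duck mode. A minimal Go sketch, assuming a draining flag that the shutdown signal handling flips (names are illustrative):

    ```go
    package main

    import (
        "log"
        "net/http"
        "sync/atomic"
    )

    // draining is flipped to true when lame duck mode starts (e.g. by a signal handler).
    var draining atomic.Bool

    // closeOnDrain asks clients and keep-alive-respecting proxies to drop their
    // connection after this response, so the next request goes to another pod.
    func closeOnDrain(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if draining.Load() {
                w.Header().Set("Connection", "close")
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            _, _ = w.Write([]byte("ok"))
        })
        // Elsewhere, the SIGTERM/preStop handling would call draining.Store(true).
        log.Fatal(http.ListenAndServe(":8080", closeOnDrain(mux)))
    }
    ```
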
  • by happyweasel on 3/11/25, 6:23 AM

    > The truth is that although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime.

    Twenty years ago we used simple bash scripts with curl to make REST calls that took one host out of our load balancers, then scp'd to the host, shut the app down gracefully, updated it using scp again, and put the host back into the load balancer after testing it on its own. We had 4 or 5 scripts max - straightforward stuff.

    They charge $$$ and you get downtime in this simple scenario?

  • by glenjamin on 3/10/25, 6:24 PM

    The fact that the state-of-the-art container orchestration system requires you to run a sleep command in order to not drop traffic on the floor is a travesty of system design.

    We had perfectly good rolling deploys before k8s came on the scene, but k8s's insistence on a single-phase deployment process means we end up with this silly workaround.

    I yelled into the void about this once and was told it was inevitable because it's an eventually consistent distributed system. I'm pretty sure it could still have had a two-phase pod shutdown by encoding a timeout on the first stage. Sure, it would have made some internals require more complex state - but isn't that the point of k8s? Instead everyone has to rediscover the sleep hack over and over again.

  • by js2 on 3/11/25, 4:30 AM

    Nit: "How we archived" subheading should be "How we achieved".
  • by strangelove026 on 3/11/25, 1:42 AM

    We’re using Argo Rollouts without issue. It's a superset of a Deployment, with configuration-based blue-green or canary deploys. Works great for us and lets us get around the problem laid out in this article.
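
    For anyone unfamiliar with it, a blue-green Rollout is essentially a Deployment spec plus a strategy block pointing at two Services. A minimal sketch with illustrative names (not the commenter's config):

    ```yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app                              # hypothetical name
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:                                 # ordinary pod template, as in a Deployment
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: app
              image: registry.example.com/app:1.2.3   # hypothetical image
      strategy:
        blueGreen:
          activeService: my-app-active          # Service receiving live traffic
          previewService: my-app-preview        # Service for verifying the new version
          autoPromotionEnabled: false           # promote manually after checks pass
    ```
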
  • by jayd16 on 3/11/25, 2:10 PM

    Does this, or any of the strategies listed in the comments, properly handle long-lived client connections? Waiting for the LB to stop sending traffic is sufficient when connections last hundreds of milliseconds or less, but when connections are minutes or even hours long it doesn't work out well.

    Is there a slick strategy for this? Is it possible to have minutes long pre-stop hooks? Is the only option to give client connections an abandon ship message and kick them out hopefully fast enough?

  • by gurrone on 3/11/25, 1:20 PM

    Might be noteworthy that in recent enough k8s, lifecycle.preStop.sleep.seconds is implemented (https://github.com/kubernetes/enhancements/blob/master/keps/...), so there's no longer any need to run an external sleep command.
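
    With that feature, the workaround reduces to declaring the sleep directly in the container spec (duration illustrative; on older clusters the field sits behind a feature gate):

    ```yaml
    lifecycle:
      preStop:
        sleep:
          seconds: 20   # no shell or sleep binary needed in the image
    ```
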
  • by yosefmihretie on 3/10/25, 8:38 PM

    Highly recommend Porter if you're a startup that doesn't want to think about things like this.

  • by evacchi on 3/10/25, 5:54 PM

    Somewhat related: https://architect.run/

    > Seamless Migrations with Zero Downtime

    (I don't work for them but they are friends ;))