by codesparkle on 11/28/20, 7:51 AM with 146 comments
by joneholland on 11/28/20, 2:55 PM
I’m also surprised at the general architecture of Kinesis. What appears to be their own hand-rolled gossip protocol (which looks clearly worse than Raft or Paxos: a thread per cluster member? Everyone talking to everyone? An hour to reach consensus?) and the fact that the front-end servers are stateful, period, go against a lot of good design principles.
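To make the scaling concrete, here is a minimal Python sketch (names hypothetical, not AWS code) of a thread-per-peer, everyone-talks-to-everyone layout: each server's thread count grows linearly with fleet size, and the fleet-wide connection count grows quadratically.

    import threading
    import time

    class FrontEndServer:
        """Hypothetical sketch of a thread-per-peer fleet member: one
        dedicated thread per other server, so per-server thread count is
        fleet size minus one and fleet-wide links grow as O(n^2)."""

        def __init__(self, name, peers):
            self.name = name
            self.peer_threads = [
                threading.Thread(target=self._gossip_with, args=(p,), daemon=True)
                for p in peers if p != name
            ]

        def _gossip_with(self, peer):
            # Stand-in for a long-lived per-peer gossip/health-check loop.
            while True:
                time.sleep(1)

        def start(self):
            for t in self.peer_threads:
                t.start()

    fleet = ["fe-%d" % i for i in range(500)]   # hypothetical fleet size
    server = FrontEndServer("fe-0", fleet)
    server.start()
    print("%s runs %d peer threads; fleet-wide links ~ %d"
          % (server.name, len(server.peer_threads),
             len(fleet) * (len(fleet) - 1) // 2))

Every server added to the fleet adds one more thread to every existing server, which matches the failure described in the post-mortem: new capacity pushed every front-end server past an OS thread limit at once.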
The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.
by lytigas on 11/28/20, 11:57 AM
Poetry.
Then, to be fair:
> We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues.
I'm curious if anyone here actually got one of these.
by codesparkle on 11/28/20, 8:14 AM
> At 9:39 AM PST, we were able to confirm a root cause [...] the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.
by ignoramous on 11/28/20, 2:11 PM
...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [number of threads spawned is directly proportional to number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
fixes:
...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.
...making a number of changes to radically improve the cold-start time for the front-end fleet.
...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.
...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.
...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end.
[0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...
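The post-mortem doesn't name the exact operating system setting that was exceeded. On Linux, the usual caps on thread creation are the per-user RLIMIT_NPROC (ulimit -u), kernel.threads-max, kernel.pid_max and, inside containers, the cgroup pids.max; a rough sketch of the kind of pre-flight check an operator might run before growing such a fleet (the fleet-size numbers here are made up):

    import resource

    def read_int(path):
        """Return an integer sysctl/cgroup value, or None if unreadable."""
        try:
            with open(path) as f:
                return int(f.read().split()[0])
        except (OSError, ValueError, IndexError):
            return None

    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)  # ulimit -u
    limits = {
        "RLIMIT_NPROC soft": soft,
        "RLIMIT_NPROC hard": hard,
        "kernel.threads-max": read_int("/proc/sys/kernel/threads-max"),
        "kernel.pid_max": read_int("/proc/sys/kernel/pid_max"),
        "cgroup pids.max": read_int("/sys/fs/cgroup/pids.max"),  # cgroup v2; path varies
    }

    fleet_size = 5000        # hypothetical front-end fleet size after the capacity add
    threads_per_peer = 1     # one OS thread per other fleet member, per the post-mortem
    needed = fleet_size * threads_per_peer

    for name, value in limits.items():
        print("%-20s %s" % (name, value))
    print("peer threads needed per server at fleet size %d: ~%d" % (fleet_size, needed))

Raising a limit like this only buys headroom; it doesn't change the O(n) threads-per-server design, which is why the listed fixes focus on fewer, larger servers and on moving the shard-map cache off the front-end fleet.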
by terom on 11/29/20, 9:48 AM
> Amazon Cognito uses Kinesis Data Streams [...] this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers.
> And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates.
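Both excerpts describe local buffers that stopped being best-effort once the outage ran long: Cognito's web servers blocked on a backlogged buffer, and Lambda's metric buffering grew until it caused memory contention on the invocation hosts. A minimal sketch (class and sizes hypothetical, not the actual AWS implementation) of a bounded, drop-oldest buffer that sheds telemetry instead of blocking publishers or growing without limit while the downstream stream is unavailable:

    import collections
    import threading

    class BestEffortBuffer:
        """Hypothetical bounded buffer for telemetry that must never block
        the request path: when the downstream sink is unavailable, the
        oldest records are dropped rather than letting memory grow or
        letting publishers stall."""

        def __init__(self, max_records=10000):
            self._records = collections.deque(maxlen=max_records)  # drop-oldest
            self._lock = threading.Lock()
            self.dropped = 0

        def publish(self, record):
            # Never blocks; memory is capped at max_records.
            with self._lock:
                if len(self._records) == self._records.maxlen:
                    self.dropped += 1
                self._records.append(record)

        def drain(self, send):
            # Called from a background flusher thread, never the request path.
            with self._lock:
                batch = list(self._records)
                self._records.clear()
            if not batch:
                return
            try:
                send(batch)
            except Exception:
                # Downstream still down: this data is best effort, so shed
                # the batch and count it instead of blocking or re-buffering.
                with self._lock:
                    self.dropped += len(batch)

    def flaky_send(batch):
        raise ConnectionError("stream unavailable")

    buf = BestEffortBuffer(max_records=1000)
    buf.publish(b'{"metric": "invocations", "value": 1}')
    buf.drain(flaky_send)
    print(buf.dropped)  # 1: data shed, publisher never blocked

The trade-off is deliberate data loss under a prolonged outage, which is what "best effort" is supposed to mean.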
by londons_explore on 11/28/20, 7:00 PM
Restarting the whole fleet from cold should be tested at least quarterly (but preferably automatically with every build).
If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...
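A rough sketch of what that could look like as a recurring check, assuming a hypothetical launcher script and health endpoint (neither is a real AWS interface): boot the service from a completely cold state and fail the build if it doesn't report healthy within a budget.

    import subprocess
    import time
    import urllib.request

    COLD_START_BUDGET_SECONDS = 600                # hypothetical budget; tune per service
    HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical health endpoint

    def test_cold_start_within_budget():
        """Start the service with empty caches and no warm peers, then
        assert it reports healthy before the budget expires."""
        proc = subprocess.Popen(["./run-service", "--cold-start"])  # hypothetical launcher
        deadline = time.monotonic() + COLD_START_BUDGET_SECONDS
        try:
            while time.monotonic() < deadline:
                try:
                    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                        if resp.status == 200:
                            return   # healthy within budget
                except OSError:
                    pass             # not up yet; keep polling
                time.sleep(10)
            raise AssertionError("service not healthy within %ds of a cold start"
                                 % COLD_START_BUDGET_SECONDS)
        finally:
            proc.terminate()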
by steelframe on 11/28/20, 7:10 PM
Translation: The eng team knew that they had accumulated tech debt by cutting a corner here in order to meet one of Amazon's typical and insane "just get the feature out the door" timelines. Eng warned management about it, and management decided to take the risk and lean on on-call to pull heroics to just fix any issues as they come up. Most of the time yanking a team out of bed in the middle of the night works, so that's the modus operandi at Amazon. This time, the actual problem was more fundamental and wasn't effectively addressable with middle-of-the-night heroics.
Management rolled the "just page everyone and hope they can fix it" dice yet again, as they usually do, and this time they got snake eyes.
I guarantee you that the “cellularization” of the front-end fleet wasn’t actually under way; the teams were instead completely consumed with whatever the next typical and insane “just get the feature out the door” thing was at AWS. The eng team was never going to get around to cellularizing the front-end fleet because they were given no time or incentive to do so by management. During/after this incident, I wouldn’t be surprised if management yelled at the eng team, “Wait, you KNEW this was a problem, and you’re not done yet?!?” without recognizing that THEY are the ones actually culpable for failing to prioritize paying down tech debt over “new shiny” feature work, which is typical of Amazon product development culture.
I’ve worked with enough former AWS engineers to know what goes on there, and there’s a really good reason why anybody who CAN move on from AWS will happily walk away from their 3rd- and 4th-year stock vesting (when the majority of the sign-on RSUs you were promised actually starts to vest) to flee to a company that fosters a healthy product development and engineering culture.
(Not to mention that, this time, a whole bunch of people’s Thanksgiving plans were preempted by the demand to get a full investigation and post-mortem written up, including the public post, ASAP. Was that really necessary? Couldn’t it have waited until next Wednesday or something?)
by zxcvbn4038 on 11/29/20, 4:19 AM
I’m wondering how many people Amazon fired over this incident - that seems to be their go-to answer to everything.
by pps43 on 11/28/20, 1:20 PM
Is it because operating system configuration is managed by a different team within the organization?
by jaikant77 on 11/29/20, 8:17 AM
An auto scaling irony for AWS! We seem to be back to the late 1990s :)
by metaedge on 11/28/20, 1:28 PM
> First of all, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
Then they move on to explain...