by codesparkle on 11/28/20, 7:51 AM with 146 comments
by joneholland on 11/28/20, 2:55 PM
I’m also surprised at the general architecture of Kinesis. What appears to be their own hand-rolled gossip protocol (which looks clearly worse than Raft or Paxos: a thread per cluster member? Everyone talking to everyone? An hour to reach consensus?) and the fact that the front-end servers are stateful, period, go against a lot of good design principles.
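To make the scaling concrete, here is a minimal Python sketch (names hypothetical, not AWS code) of a thread-per-peer, everyone-talks-to-everyone layout: each server's thread count grows linearly with fleet size, and the fleet-wide connection count grows quadratically.

    import threading
    import time

    class FrontEndServer:
        """Hypothetical sketch of a thread-per-peer fleet member: one
        dedicated thread per other server, so per-server thread count is
        fleet size minus one and fleet-wide links grow as O(n^2)."""

        def __init__(self, name, peers):
            self.name = name
            self.peer_threads = [
                threading.Thread(target=self._gossip_with, args=(p,), daemon=True)
                for p in peers if p != name
            ]

        def _gossip_with(self, peer):
            # Stand-in for a long-lived per-peer gossip/health-check loop.
            while True:
                time.sleep(1)

        def start(self):
            for t in self.peer_threads:
                t.start()

    fleet = ["fe-%d" % i for i in range(500)]   # hypothetical fleet size
    server = FrontEndServer("fe-0", fleet)
    server.start()
    print("%s runs %d peer threads; fleet-wide links ~ %d"
          % (server.name, len(server.peer_threads),
             len(fleet) * (len(fleet) - 1) // 2))

Every server added to the fleet adds one more thread to every existing server, which matches the failure described in the post-mortem: new capacity pushed every front-end server past an OS thread limit at once.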
The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.
by lytigas on 11/28/20, 11:57 AM
Poetry.
Then, to be fair:
> We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues.
I'm curious if anyone here actually got one of these.
by codesparkle on 11/28/20, 8:14 AM
> At 9:39 AM PST, we were able to confirm a root cause [...] the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.
by ignoramous on 11/28/20, 2:11 PM
...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [number of threads spawned is directly proportional to number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
fixes:
...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.
...making a number of changes to radically improve the cold-start time for the front-end fleet.
...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.
...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.
...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end.
[0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...
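The post-mortem doesn't name the exact operating system setting that was exceeded. On Linux, the usual caps on thread creation are the per-user RLIMIT_NPROC (ulimit -u), kernel.threads-max, kernel.pid_max and, inside containers, the cgroup pids.max; a rough sketch of the kind of pre-flight check an operator might run before growing such a fleet (the fleet-size numbers here are made up):

    import resource

    def read_int(path):
        """Return an integer sysctl/cgroup value, or None if unreadable."""
        try:
            with open(path) as f:
                return int(f.read().split()[0])
        except (OSError, ValueError, IndexError):
            return None

    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)  # ulimit -u
    limits = {
        "RLIMIT_NPROC soft": soft,
        "RLIMIT_NPROC hard": hard,
        "kernel.threads-max": read_int("/proc/sys/kernel/threads-max"),
        "kernel.pid_max": read_int("/proc/sys/kernel/pid_max"),
        "cgroup pids.max": read_int("/sys/fs/cgroup/pids.max"),  # cgroup v2; path varies
    }

    fleet_size = 5000        # hypothetical front-end fleet size after the capacity add
    threads_per_peer = 1     # one OS thread per other fleet member, per the post-mortem
    needed = fleet_size * threads_per_peer

    for name, value in limits.items():
        print("%-20s %s" % (name, value))
    print("peer threads needed per server at fleet size %d: ~%d" % (fleet_size, needed))

Raising a limit like this only buys headroom; it doesn't change the O(n) threads-per-server design, which is why the listed fixes focus on fewer, larger servers and on moving the shard-map cache off the front-end fleet.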
by terom on 11/29/20, 9:48 AM
> Amazon Cognito uses Kinesis Data Streams [...] this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers.
> And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates.
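Both excerpts describe local buffers that stopped being best-effort once the outage ran long: Cognito's web servers blocked on a backlogged buffer, and Lambda's metric buffering grew until it caused memory contention on the invocation hosts. A minimal sketch (class and sizes hypothetical, not the actual AWS implementation) of a bounded, drop-oldest buffer that sheds telemetry instead of blocking publishers or growing without limit while the downstream stream is unavailable:

    import collections
    import threading

    class BestEffortBuffer:
        """Hypothetical bounded buffer for telemetry that must never block
        the request path: when the downstream sink is unavailable, the
        oldest records are dropped rather than letting memory grow or
        letting publishers stall."""

        def __init__(self, max_records=10000):
            self._records = collections.deque(maxlen=max_records)  # drop-oldest
            self._lock = threading.Lock()
            self.dropped = 0

        def publish(self, record):
            # Never blocks; memory is capped at max_records.
            with self._lock:
                if len(self._records) == self._records.maxlen:
                    self.dropped += 1
                self._records.append(record)

        def drain(self, send):
            # Called from a background flusher thread, never the request path.
            with self._lock:
                batch = list(self._records)
                self._records.clear()
            if not batch:
                return
            try:
                send(batch)
            except Exception:
                # Downstream still down: this data is best effort, so shed
                # the batch and count it instead of blocking or re-buffering.
                with self._lock:
                    self.dropped += len(batch)

    def flaky_send(batch):
        raise ConnectionError("stream unavailable")

    buf = BestEffortBuffer(max_records=1000)
    buf.publish(b'{"metric": "invocations", "value": 1}')
    buf.drain(flaky_send)
    print(buf.dropped)  # 1: data shed, publisher never blocked

The trade-off is deliberate data loss under a prolonged outage, which is what "best effort" is supposed to mean.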
by londons_explore on 11/28/20, 7:00 PM
Restarting the whole fleet from cold should be tested at least quarterly (but preferably automatically with every build).
If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...
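A rough sketch of what that could look like as a recurring check, assuming a hypothetical launcher script and health endpoint (neither is a real AWS interface): boot the service from a completely cold state and fail the build if it doesn't report healthy within a budget.

    import subprocess
    import time
    import urllib.request

    COLD_START_BUDGET_SECONDS = 600                # hypothetical budget; tune per service
    HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical health endpoint

    def test_cold_start_within_budget():
        """Start the service with empty caches and no warm peers, then
        assert it reports healthy before the budget expires."""
        proc = subprocess.Popen(["./run-service", "--cold-start"])  # hypothetical launcher
        deadline = time.monotonic() + COLD_START_BUDGET_SECONDS
        try:
            while time.monotonic() < deadline:
                try:
                    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                        if resp.status == 200:
                            return   # healthy within budget
                except OSError:
                    pass             # not up yet; keep polling
                time.sleep(10)
            raise AssertionError("service not healthy within %ds of a cold start"
                                 % COLD_START_BUDGET_SECONDS)
        finally:
            proc.terminate()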
by steelframe on 11/28/20, 7:10 PM
Translation: The eng team knew that they had accumulated tech debt by cutting a corner here in order to meet one of Amazon's typical and insane "just get the feature out the door" timelines. Eng warned management about it, and management decided to take the risk and lean on on-call to pull heroics to just fix any issues as they come up. Most of the time yanking a team out of bed in the middle of the night works, so that's the modus operandi at Amazon. This time, the actual problem was more fundamental and wasn't effectively addressable with middle-of-the-night heroics.
Management rolled the "just page everyone and hope they can fix it" dice yet again, as they usually do, and this time they got snake eyes.
I guarantee you that the “cellularization” of the front-end fleet wasn’t actually under way; the teams were instead completely consumed with whatever the next typical and insane “just get the feature out the door” thing was at AWS. The eng team was never going to get around to cellularizing the front-end fleet because they were given no time or incentive to do so by management. During/after this incident, I wouldn’t be surprised if management yelled at the eng team, “Wait, you KNEW this was a problem, and you’re not done yet?!?” without recognizing that THEY are the ones actually culpable for failing to prioritize paying down tech debt over “new shiny” feature work, which is typical of Amazon product development culture.
I’ve worked with enough former AWS engineers to know what goes on there, and there’s a really good reason why anybody who CAN move on from AWS will happily walk away from their 3rd- and 4th-year stock vesting (when the majority of the sign-on RSUs you were promised actually starts to vest) to flee to a company that fosters a healthy product development and engineering culture.
(Not to mention that, this time, a whole bunch of people’s Thanksgiving plans were preempted by the demand to get a full investigation and post-mortem written up, including the public post, ASAP. Was that really necessary? Couldn’t it have waited until next Wednesday or something?)
by zxcvbn4038 on 11/29/20, 4:19 AM
I’m wondering how many people Amazon fired over this incident - that seems to be their go-to answer to everything.
by pps43 on 11/28/20, 1:20 PM
Is it because operating system configuration is managed by a different team within the organization?
by jaikant77 on 11/29/20, 8:17 AM
An auto scaling irony for AWS! We seem to be back to the late 1990s :)
by metaedge on 11/28/20, 1:28 PM
> First of all, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
Then they move on to explain...