from Hacker News

Google Cloud Incident Report – 2025-06-13

by denysvitali on 6/14/25, 6:13 AM with 218 comments

  • by throwaway250612 on 6/15/25, 12:31 AM

    I am an insider, hence the throwaway account.

    The root cause of this incident was leadership driving velocity by cutting corners. It has been going on for years, and it eventually went over the cliff.

    This specific failure mode is known as a query of death: a query triggers an existing bug that causes the server to crash. It is inevitable for C++ servers.
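
    For illustration, a minimal sketch of the pattern in C++ (hypothetical types, nothing to do with the actual Service Control code): a single unchecked field in one request or policy row is enough to take the whole process down.

        #include <iostream>
        #include <string>

        // Hypothetical types for illustration only; not the actual Service Control code.
        struct QuotaPolicy {
          const std::string* limit_name = nullptr;  // null stands in for the "blank field" in the stored policy
        };

        // The query-of-death pattern: one bad input crashes the process for every caller.
        int HandleCheck(const QuotaPolicy& policy) {
          // BUG: no null check, so a policy row with a blank field dereferences a null pointer.
          return static_cast<int>(policy.limit_name->size());
        }

        int main() {
          std::string name = "reads_per_minute";
          QuotaPolicy good{&name};
          std::cout << HandleCheck(good) << "\n";  // fine

          QuotaPolicy bad{};                       // the blank-field case from the report
          std::cout << HandleCheck(bad) << "\n";   // null dereference: every task serving this policy crash-loops
        }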

    Service Control is in C++. It uses a comprehensive set of engineering guidelines to minimize and tolerate query of death and other failure modes. Before this incident, it had no major incident in the previous decade.

    This incident is related to a new global quota policy. It was built quickly under leadership pressure, cutting corners. Such features should be built in a secondary service, or at least following the established engineering guidelines.

    Regarding the action items mentioned in the report, the established engineering guidelines far exceed them. The team has been keeping up with their standard as much as they can.

  • by blibble on 6/14/25, 12:05 PM

    this is really amateur level stuff: NPEs, no error handling, no exponential backoff, no test coverage, no testing in staging, no gradual rollout, fail deadly

    I read their SRE books, all of this stuff is in there: https://sre.google/sre-book/table-of-contents/ https://google.github.io/building-secure-and-reliable-system...

    have standards slipped? or was the book just marketing?

  • by esprehn on 6/14/25, 1:29 PM

    I work on Cloud, but not this service. In general:

    - All the code has unit tests and integration tests

    - Binary and config file changes roll out slowly job by job, region by region, typically over several days. Canary analysis verifies these slow rollouts.

    - Even panic rollbacks are done relatively slowly to avoid making the situation worse, for example by globally overloading databases with job restarts. A 40m outage is better than a 4 hour outage.

    I have no insider knowledge of this incident, but my read of the postmortem is: the code was tested, but not this edge case. The quota policy config is not rolled out as a config file, but by updating a database. The database was configured for replication, which meant the change appeared in all the databases globally within seconds instead of applying job by job, region by region, like a binary or config file change.

    I agree on the frustration with null pointers, though if this was a situation the engineers thought was impossible, it could just as easily have been an assert() in another language making all the requests fail policy checks as well.

    Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded, that all quota policy checks fail open, and that db changes roll out slowly region by region.
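
    To make that concrete, a rough sketch of "flag guarded and fail open" for a quota check, in C++ (all names here are made up; this is not how Service Control is actually structured):

        #include <iostream>
        #include <optional>
        #include <string>

        enum class Decision { kAllow, kDeny };

        // Stand-in for a dynamic flag service with gradual, region-by-region rollout.
        bool IsFlagEnabled(const std::string& flag) { return false; }  // new paths default to off

        // Hypothetical new quota-policy check; returns nullopt on unexpected/blank policy data.
        std::optional<Decision> CheckNewQuotaPolicy(const std::string& request) {
          return std::nullopt;  // pretend the new logic hit bad data
        }

        Decision CheckRequest(const std::string& request) {
          if (IsFlagEnabled("enable_new_quota_policy_check")) {  // flag guard
            if (auto d = CheckNewQuotaPolicy(request)) return *d;
            // Fail open: if the quota check itself breaks, serve the request rather
            // than taking the API down (availability over strict enforcement).
            return Decision::kAllow;
          }
          return Decision::kAllow;  // legacy path elided
        }

        int main() {
          std::cout << (CheckRequest("ListInstances") == Decision::kAllow ? "allow" : "deny") << "\n";
        }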

    Disclaimer: this is all unofficial and my personal opinions.

  • by ofrzeta on 6/15/25, 9:37 AM

    The incident report is interesting. Fast reaction time by the SRE team (2 minutes), then the "red button" rollout. But then "Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load."

    In my experience this happens more often than not: in an exceptional situation like a recovery of many nodes, quotas that make sense in regular operations get exceeded quickly, and you run into another failure scenario. As long as the underlying infrastructure can cope with it, it's good if you can disable quotas temporarily and quickly, or throttle the recovery operations, which naturally take longer in that case.
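
    For reference, the randomized exponential backoff the report says was missing is only a handful of lines; a generic C++ sketch (not tied to any particular Google library):

        #include <algorithm>
        #include <chrono>
        #include <cmath>
        #include <iostream>
        #include <random>
        #include <thread>

        // Retry with capped, randomized ("full jitter") exponential backoff so that
        // thousands of restarting tasks don't hit the backing store in lockstep.
        template <typename Op>
        bool RetryWithBackoff(Op op, int max_attempts = 8) {
          std::mt19937 rng{std::random_device{}()};
          const double base_ms = 100.0, cap_ms = 30'000.0;
          for (int attempt = 0; attempt < max_attempts; ++attempt) {
            if (op()) return true;
            const double window = std::min(cap_ms, base_ms * std::pow(2.0, attempt));
            std::uniform_real_distribution<double> jitter(0.0, window);
            std::this_thread::sleep_for(std::chrono::duration<double, std::milli>(jitter(rng)));
          }
          return false;
        }

        int main() {
          int calls = 0;
          bool ok = RetryWithBackoff([&] { return ++calls >= 3; });  // pretend the 3rd attempt succeeds
          std::cout << (ok ? "recovered" : "gave up") << " after " << calls << " calls\n";
        }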

  • by SageThrowaway on 6/14/25, 3:25 PM

    (Throwaway since I was part of a related team a while back)

    Service Control (Chemist) is a somewhat old service that's been around for about a decade, and it's critical for a lot of GCP APIs for authn, authz, auditing, quota, etc. It's almost mandated in Cloud.

    There's a proxy in the path of most GCP APIs that calls Chemist before forwarding requests to the backend. (Hence I don't think the fail-open mitigation mentioned in the post-mortem will work.)

    Both Chemist and the proxy are written in C++, and have picked up a ton of legacy cruft over the years.

    The teams have extensive static analysis & testing, gradual rollouts, feature flags, red buttons and strong monitoring/alerting systems in place. The SREs in particular are pretty amazing.

    Since Chemist handles a lot of policy checks like IAM, quotas, etc., other teams involved in those areas have contributed to the codebase. Over time, shortcuts have been taken so those teams don’t have to go through Chemist's approval for every change.

    However, in the past few years, the organization’s seen a lot of churn and a lot of offshoring too, which has led to a bigger focus on flashy new projects led by L8/L9s to justify headcount instead of prioritizing quality, maintenance, and reliability. This shift has contributed to a drop in quality standards and increased pressure to ship things out faster (and was one of the reasons I ended up leaving Cloud).

    Also, many of the server/service best practices common at Google are not so common here.

    That said, in this specific case, it seems like the issue is more about lackluster code and code review. (iirc code was merged despite some failures). And pushing config changes instantly through Spanner made it worse.

  • by Philpax on 6/14/25, 11:21 AM

    > Without the appropriate error handling, the null pointer caused the binary to crash.

    We must be at the trillion dollar mistake by now, right?

  • by jitl on 6/14/25, 1:04 PM

    > This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop.

    Another example of Hoare’s “billion-dollar mistake” in multiple Google systems:

    - Why is it possible to insert unintended “blank fields” (nulls)? The configuration should have a schema type that doesn’t allow unintended nulls. Unfortunately Spanner itself is SQL-like, so fields must be declared NOT NULL explicitly; the default is nullable fields.

    - Even so, the program that manages these policies will have its own type system and possibly an application level schema language for the configuration. This is another opportunity to make invalid states unrepresentable.

    - Then in Service Control, there’s an opportunity to enforce “schema on read” as you deserialize policies from the data store into application objects; again, either a programming language type or an application-level schema could be used to validate that policy rows have the expected shape before they leave the data layer. Perhaps the null pointer error occurred in this layer, but since this issue occurred in a new code path, it sounds more likely that the invalid data escaped the data layer into application code.

    - Finally, the Service Control application is written in a language that allows null pointers to be dereferenced.

    If I were a maintainer of this system, the minimally invasive change I would be thinking about is how to introduce an application-level schema to the policy writer and the policy reader that uses a “tagged enum type” or “union type” or “sum type” to represent policies in a way that cannot express null. Ideally each new kind of policy could be expressed as a new variant added to the union type. You can add this in app code without rewriting the whole program in a safe language. Unfortunately it seems proto3, Google’s usual schema language, doesn’t have this constraint.

    Example of one that does: https://github.com/stepchowfun/typical
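
    In C++ terms (Service Control is reportedly C++), the same idea can be approximated with std::variant at the data layer, so a half-populated policy never reaches application code. This is purely illustrative, with made-up policy kinds:

        #include <iostream>
        #include <optional>
        #include <string>
        #include <variant>

        // Illustrative only: each policy kind is a variant alternative whose fields
        // are all required, so a "blank" policy simply cannot be constructed.
        struct RateLimitPolicy { std::string metric; long limit_per_minute; };
        struct RegionAllowPolicy { std::string region; };
        using Policy = std::variant<RateLimitPolicy, RegionAllowPolicy>;

        // Hypothetical raw row as it might come back from the datastore (nullable columns).
        struct RawPolicyRow {
          std::optional<std::string> kind, metric, region;
          std::optional<long> limit_per_minute;
        };

        // "Schema on read": validate at the data layer and reject bad rows instead of
        // letting them escape into application code as null pointers.
        std::optional<Policy> ParsePolicy(const RawPolicyRow& row) {
          if (row.kind == "rate_limit" && row.metric && row.limit_per_minute)
            return RateLimitPolicy{*row.metric, *row.limit_per_minute};
          if (row.kind == "region_allow" && row.region)
            return RegionAllowPolicy{*row.region};
          return std::nullopt;  // unintended blank fields end up here, not in a crash loop
        }

        int main() {
          RawPolicyRow blank_row{};  // the "unintended blank fields" case
          std::cout << (ParsePolicy(blank_row) ? "ok" : "rejected malformed policy row") << "\n";
        }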

  • by asim on 6/14/25, 12:03 PM

    Google post-mortems never cease to amaze me, from seeing them inside the company to outside. The level of detail is amazing. The thing is, they will never make the same mistake again: they learn from it, put in the correct protocols and error handling, and then create an even more robust system. At the scale of Google there is always something going wrong; the point is how it is handled so it doesn't affect the customer/user and other systems. Honestly it's an ongoing thing you don't see unless you're inside, and even then, on a per-team basis you might see things no one else is seeing. It is probably the closest we're going to come to the most complex systems of the universe, because we as humans will never do better than this. Maybe AGI does, but we won't.

  • by dehrmann on 6/14/25, 5:46 PM

    I recently started as a GCP SRE. I don't have insider knowledge about this, and my views on it are my own.

    The most important thing to look at is how much had to go wrong for this to surface. It had to be a bug without test coverage that wasn't covered by staged rollouts or guarded by a feature flag. That essentially means a config-in-db change. Detection was fast, but rolling out the fix was slow out of fear of making things worse.

    The NPE aspect is less interesting. It could have been any number of similar "this can't happen" errors. It could have been mutually exclusive fields being present in a JSON object and the handling logic doing funny things. Validation during mutation makes sense, but the rollout strategy is more important, since it can catch and mitigate things you haven't thought of.

  • by mkl95 on 6/14/25, 12:39 PM

    > The issue with this change was that it did not have appropriate error handling nor was it feature flag protected.

    I've been there. The product guy needs the new feature enabled for everyone, and he needs it yesterday. Suggestions of feature flagging are ignored outright. The feature is then shipped for every user, and fun ensues.

  • by mplanchard on 6/14/25, 3:22 PM

    > Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues.

    This reads to me like someone finally won an argument they’d been having for some time.

  • by darkwater on 6/14/25, 12:22 PM

    Usually Google and FAANG outages in general are due to things that happen only at Google scale, but this incident seems like something from a generic small/medium company with 30 engineers at most.

  • by reassess_blind on 6/15/25, 11:14 AM

    Everyone loves to criticise downtime when it happens to others, saying these are “junior level mistakes” and so on. Until it happens to them, and then there’s a convenient excuse as to why it was either unavoidable or unforeseeable. Truth is humans make mistakes and the expectations are too high.

    When a brick-and-mortar business has to shut unexpectedly, they’ll put a sign on the door, apologise, and that’s that. Only in tech do we stress so much about a few hours per year. I wish everyone would relax a bit.

  • by cddotdotslash on 6/15/25, 11:22 AM

    It’s interesting that multi-region is often touted as a mechanism for resilience and availability, but for the most part, large cloud providers seem hopelessly intertwined across regions during outages like these.

  • by gdenning on 6/14/25, 2:02 PM

    Why did it take so long for Google to update their status page at https://www.google.com/appsstatus/dashboard/? According to this report, the issue started at 10:49 am PT, but when I checked the Google status page at 11:35 am PT, everything was still green. I think this is something else they need to investigate.

  • by Xenoamorphous on 6/14/25, 12:06 PM

    > We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.

    If I understood it correctly, this service checks proper authorisation among other things, so isn’t failing open a security risk?

  • by master_crab on 6/14/25, 11:52 AM

    The big issue here (other than the feature rollout) is the lack of throttling. Exponential backoff is a fairly standard integration for scaled applications. Most cloud services use it. I’m surprised it wasn’t implemented for something as fundamental as Service Control.

  • by softveda on 6/14/25, 7:32 AM

    So code that was untested (the code path that failed was never exercised; perhaps there is no test environment) and not even peer reviewed (it did not have appropriate error handling, nor was it feature flag protected) was pushed to production. What a surprise!!

  • by flaminHotSpeedo on 6/17/25, 4:31 AM

    I think the most damning thing here is the "what's our approach moving forward" section:

    Every single bullet point there is basic stuff you expect a cloud provider to do correctly damn near every time. If you're not paying GCP for that expertise, or if that expertise is built on layers of cruft where that expertise was not applied... what are you paying them for?

  • by paulddraper on 6/14/25, 12:30 PM

    > We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage.

    No one wondered hm maybe this isn’t a good idea?

  • by HardCodedBias on 6/15/25, 3:58 PM

    This was a shockingly simple coding error. It should have been caught by the two code reviewers.

    Turns out asking an LLM for a code review finds the error, and the LLM suggests the correct fix.

    Rarely do you see a major outage caused by such a glaring error. I suspect that policy changes will be required at GCP.

    The SRE team was very fast, the logs were inspected quickly and the relevant check that was being failed was identified within minutes. That's impressive.

    But the coding error, that was shocking.

  • by rvnx on 6/15/25, 9:51 AM

    Such absurdity that, as customers, we (HN) knew more than the official support did.

  • by junon on 6/15/25, 9:22 AM

    Null pointers strike again.

  • by charcircuit on 6/14/25, 11:41 AM

    >We will enforce all changes to critical binaries to be feature flag protected and disabled by default.

    Not only would this kill developer productivity if it were true, but how can it even be done? When the compiler version bumps, that's a lot of changes that need to be gated. Every team that works on a dependency will have to add feature flags for every change and bug fix they make.

  • by jeffrallen on 6/14/25, 12:08 PM

    Whatever this "red-button" technology is, is pants. If you know you want to turn something off at incident + 10 mins, it should be off within a minute. Not "Preparing a change to trigger the red-button", but "the stop flag was set by an operator in a minute and was synched globally within seconds".

    I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.

    At $WORK we use Consul for this job.
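
    For what it's worth, a sketch of that shape of "red button" in C++: a process-wide kill switch kept in sync by a watcher against whatever config store you use (Consul, per the above), checked on every request. The polling mechanism and all names are made up.

        #include <atomic>
        #include <chrono>
        #include <iostream>
        #include <thread>

        // Process-wide kill switch: an operator flips one key in the config store and
        // every task observes it within a polling interval, with no new rollout needed.
        std::atomic<bool> g_policy_path_disabled{false};

        // Stand-in for reading the key from Consul / etcd / your config store of choice.
        bool FetchKillSwitchFromStore() { return false; }

        void WatchKillSwitch() {
          for (;;) {
            g_policy_path_disabled.store(FetchKillSwitchFromStore(), std::memory_order_relaxed);
            std::this_thread::sleep_for(std::chrono::seconds(1));
          }
        }

        bool ServeRequest() {
          if (g_policy_path_disabled.load(std::memory_order_relaxed)) {
            return true;  // red button pressed: skip the risky policy path entirely
          }
          // ... normal policy-serving path (the one that was crash-looping) ...
          return true;
        }

        int main() {
          std::thread(WatchKillSwitch).detach();
          std::cout << (ServeRequest() ? "served" : "rejected") << "\n";
        }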

  • by rvz on 6/14/25, 12:11 PM

    > We will improve our static analysis and testing practices to correctly handle errors and if need be fail open.

    > Without the appropriate error handling, the null pointer caused the binary to crash.

    Even worse if this was AI-generated C or C++ code. Wasn't this tested before deployment?

    This is why you write tests before the actual code, and why vibe-coding is a scam as well. This would also never have happened if it had been written in Rust.

    I expect far better than this from Google, and yet we are still dealing with null pointer crashes to this day.

  • by welder on 6/17/25, 6:47 AM

    Go could learn from this and implement Swift's optional feature: https://wakatime.com/blog/48-go-desperately-needs-nil-safe-t...

  • by secondcoming on 6/15/25, 10:20 AM

    > If this had been flag protected, the issue would have been caught in staging.

    I’m a bit confused by this: it seems the new code was enabled by default, and so it should have been caught in staging.

  • by QuinnyPig on 6/14/25, 2:34 PM

    I saw a lot of third party services (e.g. CloudFlare) go down, but did any non-Cloud Google properties see an impact?

    It’d say something that core Google products don’t or won’t take a dependency on Google Cloud…

  • by JaggerFoo on 6/14/25, 6:28 PM

    I would be interested in seeing the elapsed time to recovery for each location up to us-central-1.

    Is this information available anywhere?

  • by Kye on 6/14/25, 12:30 PM

    No error handling, empty fields no one noticed. Was this change carelessly vibe coded?

  • by sunrunner on 6/15/25, 12:39 PM

    Maybe it's just my lack of reading comprehension but some of the wording in this report feels off:

    > this code change came with a red-button to turn off that particular policy serving path.

    > the root cause was identified and the red-button (to disable the serving path) was being put in place

    So the red-button was or wasn't in place on May 29th? The first sentence implies it was ready to be used but the second implies it had to be added. A red-button sounds like a thing that's already in place and can be triggered immediately, but this sounds like an additional change had to be deployed?

    > Without the appropriate error handling, the null pointer caused the binary to crash

    This is the first mention of a null pointer (and _the_ null pointer too, not just _a_ null pointer), which implies the specific null pointer that would have caused a problem was known at this point? And this wasn't an issue earlier?

    I don't mean to play armchair architect and genuinely want to understand this from a blameless post-mortem point-of-view given the scale of the incident, but the wording in this report doesn't quite add up.

    (Edit for formatting)

  • by owenthejumper on 6/14/25, 12:40 PM

    If this wasn’t vibe coded I’ll eat a frog or something.

  • by spacemadness on 6/14/25, 4:05 PM

    Guess all that leet code screening only goes so far, huh?

  • by Sytten on 6/14/25, 12:03 PM

    TLDR a dev forgot an if err != nil { return 0, err } in some critical service

  • by montebicyclelo on 6/14/25, 12:18 PM

    TLDR, unexpected blank fields

    > policy change was inserted into the regional Spanner tables

    > This policy data contained unintended blank fields

    > Service Control... pulled in blank fields... hit null pointer causing the binaries to go into a crash loop

  • by bananapub on 6/14/25, 11:31 AM

    lol at whoever approved the report not catching the fuckup of “red-button” instead of “big red button”.

  • by koakuma-chan on 6/14/25, 11:41 AM

    Still not rewriting in Rust?