from Hacker News

Google Kubernetes Engine's third consecutive day of service disruption

by rlancer on 11/11/18, 8:47 PM with 407 comments

  • by shareometry on 11/12/18, 12:51 AM

    I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

    1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

    2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

    3) The sum total of information about this incident can be found as a few one- or two-sentence blurbs on Google's blog. No explanation or outline of scope for affected regions and services has been provided.

    4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

    5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

    6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

    I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.

  • by usmannk on 11/11/18, 11:58 PM

    We had an issue a few weeks ago where the Google front-end servers were mangling responses from Pub/Sub and returning 502 responses, making the service completely unusable and knocking over a number of things we have running in production. Despite paying for enterprise support and having put in a P1 ticket, we had to spend Friday to Sunday gathering evidence to prove to the support staff that there was indeed a problem, because their monitoring wasn't detecting it. Right now I'm doing something similar (and since Friday!) but for TLS issues they're having. Again, because their support reps don't believe there's a problem. There are so many more problems than they ever show on their status page...

  • by Jedi72 on 11/11/18, 10:03 PM

    "The data says engagement is down 46%, I think it's time we drop the product."

    - Someone at Google right now, probably.

  • by justinsb on 11/11/18, 10:15 PM

    Hi - I work at Google on GKE - sorry about the problems you're experiencing. There are a lot of people inside Google looking into this right now!

    It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.

  • by hacknat on 11/11/18, 10:38 PM

    Question to Google employees:

    Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature-rich cloud (particularly your networking products), but downtime like this is unacceptable.

  • by scarface74 on 11/12/18, 2:25 AM

    Say I were a CTO (I’m nowhere near it), why would I choose GCP over AWS or Azure? Even if, after doing a technical assessment, I thought that GCP was technically slightly better, if something happened, the first question I would be asked is “why did you choose GCP over AWS?”

    No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.

    Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for their enterprise support.

    From a developer/architect standpoint, I’ve been focused the last year on learning everything I could about AWS and chose a company that fully embraced it. AWS experience is much more marketable than GCP. It’s more popular than Azure too, but there are plenty of MS shops around that are using Azure.

  • by AlexB138 on 11/11/18, 9:47 PM

    This has been going on longer than three days. We have been dealing with this exact issue since at least Monday (11/5) morning in us-central1.

  • by marcinzm on 11/11/18, 10:04 PM

    >Nov 09, 2018 05:59

    >We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

    Wait, did the people tasked with fixing this just take the weekend off?

  • by rlancer on 11/11/18, 8:53 PM

    Status page is inaccurate, as the issues don't only affect the web UI; the same operations are not functioning via the CLI.

  • by scarface74 on 11/11/18, 10:10 PM

    A generic question: Our company is completely dependent on AWS. Sure we have taken all of the standard precautions for redundancy, but what happened here could just as easily happen with AWS - a needed resource is down globally.

    What would a small business do as a contingency plan?

  • by rlancer on 11/12/18, 1:26 AM

    UPDATE: Got some clarity: these issues are caused by "resource exhaustion", meaning there are no resources left to be allocated.

  • by 7ewis on 11/11/18, 11:15 PM

    I honestly don't mind if providers have outages - we can't expect 100.00% accuracy, I know the systems I manage certainly don't achieve that.

    One thing I do care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.

    (I'm not affected by the GKE outage so opinions may differ right now!)

  • by locusm on 11/12/18, 2:21 AM

    Do not use GCP without paying for support. We have had resource allocation errors for weeks, as have a lot of other people. Check out the posts in their forum where folk on basic support get zero love. https://groups.google.com/forum/?utm_medium=email&utm_source...

  • by thwy12321 on 11/11/18, 9:50 PM

    Been trying to spin up VM instances all day, had to try every single zone just to get one up. Not only is this incredibly harmful to a technology business dependent on this infra, it wasn't obvious to me what the issue was until I tried creating instances. Nothing says, hey resources are constrained here, try this one. Just about ready to bite the bullet and move to AWS.
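
    A minimal sketch of that zone-by-zone fallback, assuming a hypothetical `create_instance` callable and a hypothetical `ZoneExhaustedError`. Neither is a real GCP API, and the fake provisioning function exists only so the example runs on its own; the zone names are ones mentioned in this thread.

```python
from typing import Callable, Iterable, Optional


class ZoneExhaustedError(Exception):
    """Hypothetical error raised when a zone has no capacity left."""


def create_in_first_available_zone(
    zones: Iterable[str],
    create_instance: Callable[[str], str],
) -> Optional[str]:
    """Try each zone in order; return the first instance ID created, or None."""
    for zone in zones:
        try:
            return create_instance(zone)
        except ZoneExhaustedError:
            # This zone is out of capacity; move on to the next candidate.
            continue
    return None


def fake_create_instance(zone: str) -> str:
    """Stand-in for a real provisioning call; only us-west2-c has capacity."""
    if zone != "us-west2-c":
        raise ZoneExhaustedError(f"The zone {zone} does not have enough resources")
    return f"instance-in-{zone}"


candidate_zones = ["us-central1-a", "northamerica-northeast1-b", "us-west2-c"]
instance_id = create_in_first_available_zone(candidate_zones, fake_create_instance)
```

    In practice the callable would wrap whatever provisioning client is in use, and a None result is the signal that every candidate zone was exhausted.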

  • by sladey on 11/11/18, 10:10 PM

    Seems to be some weird underlying issue going on at GCP at the moment. Had cloud build webhooks returning a 500 error. Noticed we were at 255 images and deleting some fixed the issue. Created a P2 ticket about the issue before we managed to solve it and haven't had a response in 40+ hours.

    The timeline of this disruption matches when we started experiencing cloud build errors.

  • by ernsheong on 11/12/18, 7:45 AM

    "third consecutive day of service disruption" is not an accurate statement? Latest update was Nov 11, saying things were resolved on Nov 9.

    https://status.cloud.google.com/incident/container-engine/18...

  • by 013a on 11/12/18, 1:16 AM

    Cloud providers have all of the potential in the world to make each region truly isolated. I shouldn't have to architect my application to be multi-cloud, at least for stability reasons.

    Yet, somehow every major cloud provider experiences global outages.

    That old AWS S3 outage in us-east-1 was an interesting one; when it went down, many services which rely on S3 also went down in regions besides us-east-1, because they were using us-east-1 buckets. I have a feeling this is more common than you'd think: globally redundant services which rely on some single point of geographical failure for some small part.

  • by spiderPig on 11/12/18, 12:14 AM

    Our company is dependent on this as well and the way customer service has been handling this has been abysmal thus far.

  • by qaq on 11/12/18, 12:07 AM

    There is no magic: public clouds have incredibly complex control planes and, marketing fluff aside, you would very likely experience much better uptime at a single top-tier DC than at a cloud provider.

  • by arunoda on 11/12/18, 7:15 AM

    This is not only GKE, but GCE as well. I cannot create instances in almost all zones. I tried both preemptible and normal instances.

    It always says the resource is not available. My account is a pretty new account.

    In contrast, one of my friends has a pretty old account which is very active. He has no such issue.

    So I think due to this issue, Google has enabled some resource limitation for new accounts.

    But they should properly communicate this issue.

  • by gigatexal on 11/11/18, 10:15 PM

    Oh man, must be a tough time to be an SRE at Google Cloud. But... they’re Google. They have been doing internal cloud for years and years. Borg, which K8s is a reimplementation of, has been the heart of Google for so long now you’d think they’d be able to architect their systems to have no outages whatsoever. I mean nobody is perfect but this looks bad.

  • by closeparen on 11/11/18, 9:27 PM

    Doesn’t GKE “just” run an independent Kubernetes cluster on customer VMs? How is a widespread outage like this possible?

  • by fizzledbits on 11/12/18, 6:32 PM

    As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request"

    An instance in us-central1-a has refused to start since last Thursday or Friday.

    I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

    On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

    And yet, the status page says all services are available.

    Is this typical of others' experiences?

  • by wijowa on 11/12/18, 3:48 PM

    Right now we're experiencing an issue where a small percentage of end users on our GKE site are getting super slow speeds. The issue is ISP related, as they can switch to a 4G hot spot in the same location and get normal speeds... and inside our system the timing looks normal. So there's a slowdown either TO the load balancer or FROM the load balancer. Took a week to convince Google's support contractor to even believe it wasn't an issue with our site, and their advice is generally along the lines of "turn it off and turn it back on again" (which might actually fix the problem), though that's easier said than done in GCP.

  • by nielsole on 11/11/18, 9:57 PM

    I use preemptible machines in autoscaling and for the first time did not have any machines available for multiple hours yesterday. I am wondering whether this falls under the normal preemptible behaviour or this service degradation.

  • by wb3tech on 11/12/18, 2:12 PM

    If anyone is interested, here is my documented experience with this issue. I freaking love GCP and GKE, although I now have no production environment, as it was an HA cluster in us-central1. Working on federation now.

    https://stackoverflow.com/questions/53244471/gke-cluster-won...

  • by regnerba on 11/11/18, 9:24 PM

    Is this just about creating new pools? I haven't noticed an issue with our existing pools scaling.

  • by _wmd on 11/11/18, 10:48 PM

    When guerilla marketing backfires

  • by bdibs on 11/12/18, 6:55 AM

    As someone currently trying to decide between GCP and AWS for a project, is this a regular occurrence?

    And for those who have used both, which would you go with today?

  • by franky_g on 11/11/18, 10:38 PM

    Has it affected all regions or just some?

    Is there another status page, Google? Coz the last update I'm looking at... is dated the 9th.

  • by fulafel on 11/12/18, 6:34 AM

    Offtopic but are there some documented exceptions to the "keep the original title" rule?

  • by whatshisface on 11/11/18, 11:30 PM

    Why do cloud providers have more global outages than major flagship websites like google.com?

  • by fergie on 11/12/18, 6:46 AM

    Things break after everybody has gone home on a Friday? 3-day disruption.

  • by thomasfl on 11/12/18, 11:36 AM

    I'd like to upvote, but 666 points seemed relevant.

  • by haosdent on 11/12/18, 9:12 AM

    Time to use Mesos.

  • by shiftnight on 11/11/18, 10:57 PM

    I have a question. At what point does k8s make sense?

    I have a feeling that a microservice architecture is overkill for 99% of businesses. You can serve a lot of customers on a single node with the hardware available today. Oftentimes, sharding on customers is rather trivial as well.

    Monolith for the win! Opinions?

  • by aaaaaaaaaab on 11/12/18, 9:35 AM

    Daily reminder that there's no "cloud", just other people's computers. ( ͡° ͜ʖ ͡°)

  • by spullara on 11/12/18, 1:06 AM

    If a hosting service is down and nobody uses it, is there really any disruption?