from Hacker News

Reddit Releases Post Mortem for Its 3 Hour Outage Last Week

by Rebles on 3/22/23, 12:34 AM with 29 comments

  • by mwint on 3/22/23, 4:22 AM

    > In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.

    Wow, so the word police brought down Reddit. Why on earth did someone think it a good idea to screw with existing names in running clusters in a cluster management tool?

  • by post_break on 3/22/23, 1:47 PM

    Imagine if Cisco or Juniper decided to swap master or remove slave from their code. Core routers going down because of word police terminology and an admin who missed it in the change log.
  • by __turbobrew__ on 3/22/23, 3:03 AM

    This can be one reason to run the control plane not on k8s itself. When the control plane runs on k8s you can get these weird states where the control plane is borked and the system cannot recover.
  • by gundmc on 3/22/23, 3:28 AM

    I appreciate the transparency and detail in publishing this. With that said, the narrative style and wordy,casual language makes it harder to get to the meat (the five whys) than a typical postmortem.
  • by ethicalsmacker on 3/22/23, 3:35 PM

    This is a pretty funny "bug". Bring down those Nazi Kubernetes nodes. There's some humor in there somewhere... making a change to be inclusive results in Reddit going offline... mmmm.

    I'm still waiting for people to rename "white paper".

  • by hoseja on 3/22/23, 11:12 AM

    314 minutes is not three hours.