by fauria on 12/21/23, 11:20 AM with 73 comments
by rconti on 12/22/23, 4:13 PM
In this case, there were NUMEROUS suboptimal settings or outright misconfigurations of DNS, but none of them mattered until the volume reached a tipping point, and, suddenly, ALL of them came into play. Fixing one overflowed into the next, which overflowed into the next.
by rdoherty on 12/23/23, 12:58 AM
* Centralize - 1 tracking doc that describes the issue, timeline, what's been tested, who owns the incident. Have 1 group chat, 1 'team' (virtual or in person). Get an incident commander to drive the group.
* Create a list of hypotheses and work through them one at a time.
* Use data, not assumptions, to prove or disprove your hypotheses.
* Gather as much data as you can, but don't let a particular suspicious graph lead you into a rabbit hole. Keep gathering data.
If you don't do the above, you are guaranteed to have a mess, repeat yourself over and over, and waste time.
by debarshri on 12/22/23, 3:28 PM
And yes, in the k8s world, DNS fails more often than you think.
by mad_vill on 12/22/23, 3:12 PM
I hate CoreDNS. Everything running inside of a Kubernetes cluster should just be querying the Kubernetes Endpoints API for these IPs directly and using the node's DNS servers for external hosts.
by Scubabear68 on 12/23/23, 1:07 AM
When they looked, they saw that not all apps were the same; only a few kinds of apps were affected.
When a big incident hits, you need people drilling down, not just across; and hopefully people who know the actual apps in question.
Maybe this was DevOps people too far into the ops side and not enough on the dev side?
by codetrotter on 12/22/23, 4:49 PM
by StianOvrevage on 12/22/23, 11:56 PM
by gunapologist99 on 12/22/23, 11:53 PM
But it wasn't DNS. DNS didn't break. The protocol didn't break. Not even issues with the CoreDNS or dnsmasq implementations.
The culprit was ndots (why did Kubernetes arbitrarily choose five dots?) and the general way that Kubernetes (ab)uses DNS.
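To make the ndots complaint concrete: a glibc-style resolver only tries a name as-is first when it contains at least `ndots` dots; otherwise it walks the search list first. A typical Kubernetes pod ships `ndots:5` plus three `cluster.local` search suffixes, so almost every external lookup fans out into several doomed queries. This is a minimal sketch of that query-ordering logic, assuming the usual pod `resolv.conf` values (the search domains below are the common defaults, not taken from the incident):

```python
# Sketch of a glibc-style resolver's query ordering under ndots:5,
# using the search domains a typical Kubernetes pod gets.
NDOTS = 5
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

def query_order(name: str) -> list[str]:
    """Return the sequence of fully-qualified names actually queried."""
    if name.endswith("."):
        return [name]  # absolute name: no search-list expansion at all
    absolute = name + "."
    if name.count(".") >= NDOTS:
        # "Enough" dots: try the name as-is first, search list after.
        return [absolute] + [f"{name}.{d}." for d in SEARCH]
    # Too few dots: walk the search list first, absolute name last.
    return [f"{name}.{d}." for d in SEARCH] + [absolute]

# An external name like api.example.com has only 2 dots, so ndots:5
# forces three futile cluster-suffix lookups before the real query.
for q in query_order("api.example.com"):
    print(q)
```

Multiply that 4x query amplification by every external hostname every pod resolves, and it is easy to see how raw volume, rather than any single broken component, becomes the tipping point. Appending a trailing dot (`api.example.com.`) skips the expansion entirely.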
by Sohcahtoa82 on 12/22/23, 6:05 PM
I'm struggling with a problem where a VM is supposed to get an IP address from the host, but it takes forever to do so. The host tells me it has assigned an IP, but the VM says it hasn't. It can take anywhere from 10 to 60 minutes for the VM to actually get the IP that the host has assigned.
by vinay_ys on 12/22/23, 5:24 PM
by mike503 on 12/22/23, 4:46 PM
99.9% of the time.
by octacat on 12/22/23, 4:36 PM
I am feeling that caching all DNS responses for 30 seconds is not always the solution for all kinds of usage patterns... Ah, generic solutions are for generic problems (which are usually not your problems).
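The failure mode behind that complaint is easy to sketch: a cache that pins every answer for a fixed 30 seconds ignores records whose upstream TTL (or actual rate of change) is shorter, so a failover can go unnoticed for the rest of the window. This is a minimal illustration, not any particular resolver's implementation; the name `db.internal` and the 30-second cap are hypothetical:

```python
import time

# Minimal sketch of a resolver cache that pins every answer for a fixed
# 30 seconds (hypothetical cap), ignoring the record's own TTL --
# fine for stable names, stale for fast-moving ones.
CACHE_TTL = 30.0
_cache: dict[str, tuple[str, float]] = {}

def resolve(name: str, upstream, now=None) -> str:
    """Return a cached IP if it is under CACHE_TTL old, else ask upstream."""
    t = time.monotonic() if now is None else now
    hit = _cache.get(name)
    if hit and t - hit[1] < CACHE_TTL:
        return hit[0]  # served from cache; may be stale vs. upstream
    ip = upstream(name)
    _cache[name] = (ip, t)
    return ip

# A backend that fails over at t=5 keeps serving the dead IP until t=30.
ips = iter(["10.0.0.1", "10.0.0.2"])
backend = lambda name: next(ips)
print(resolve("db.internal", backend, now=0.0))   # 10.0.0.1 (fresh)
print(resolve("db.internal", backend, now=5.0))   # still 10.0.0.1 (cached)
print(resolve("db.internal", backend, now=31.0))  # 10.0.0.2 (cache expired)
```

Whether 30 seconds of potential staleness is acceptable depends entirely on how fast the addresses behind your names actually move, which is exactly the "generic solution for a generic problem" point.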
by pphysch on 12/22/23, 6:39 PM
I barked up the wrong tree for a while and then a more senior guy immediately found the issue. Anyways, now I grok this headline and have a new prank in my kit.
by renewiltord on 12/22/23, 5:30 PM
by zshrc on 12/22/23, 8:55 PM
by hitpointdrew on 12/22/23, 6:41 PM
Might have been a quicker, easier "fix".
by zeroxfe on 12/22/23, 8:26 PM
It's not always the Firewall -- unless it is :-)
by JohnMakin on 12/22/23, 7:12 PM
by AlecSchueler on 12/22/23, 3:17 PM