by kdp747 on 2/10/24, 8:10 AM with 57 comments
by eszed on 2/11/24, 6:19 AM
> "I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
is an approach every one of us should internalize.
by m3047 on 2/11/24, 4:50 PM
More datacenter stakeholders kept joining the call, most of whom had nothing to do with our data product. Many times I heard people ask "have they found the problem yet?", as though... what? We were the best tech support they had for an entire data center going dark? After an hour somebody noticed that the clocks on the servers in the datacenter didn't match their laptop; shortly after that I was able to extricate myself from the call. I kept watching the logs, and their downloads started working again a short while later.
by dancemethis on 2/11/24, 3:01 PM
by denton-scratch on 2/11/24, 4:44 PM
A time server with a defective clock seems to be a serious problem. Zimmie says the time server was an appliance; so someone is selling as an appliance a time server that can't tell the time.
by arter4 on 2/11/24, 7:55 PM
by macintux on 2/11/24, 4:58 PM
It didn't take me long to figure out that the computers that weren't working had their clocks set well into the 21st century. The shell couldn't even display the year properly; I assumed a Y2K incompatibility, but after so many years I can't remember exactly what I saw.
Anyway, easy fix, but I never did find out what caused such a weird glitch in their environment. It's small wonder that many people aren't fluent with computers: they misbehave in such a wide variety of ways.
by vdaea on 2/11/24, 8:25 PM
by cesarb on 2/11/24, 9:29 PM
by 8organicbits on 2/11/24, 2:57 PM
by bdw5204 on 2/11/24, 3:30 PM
by raffraffraff on 2/12/24, 7:23 AM
Every VM ran CentOS, and every one of them hit the default CentOS NTP pool servers, which are run by volunteers. The pool is generally good quality, but using it the way we did was extremely stupid.
Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication, and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the on-call from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.
Before even diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us; at >4%, go critical. (Only critical alerts would page on-call outside sociable hours.)
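A minimal sketch of that aggregation idea, not the commenter's actual Sensu check: the host list, the ssh/`chronyc tracking` offset source, and the exact thresholds are assumptions, but it shows the warn-at-2%/critical-at-4% shape described above, using the usual Nagios/Sensu exit codes.

```python
#!/usr/bin/env python3
# Hypothetical fleet-wide skew check: count hosts drifting beyond a
# threshold and map the percentage to OK / WARNING / CRITICAL.
import subprocess
import sys

HOSTS = ["vm-001.internal", "vm-002.internal"]  # placeholder fleet list
MAX_OFFSET_SECONDS = 5.0                        # "a few seconds"
WARN_PCT, CRIT_PCT = 2.0, 4.0

def offset_seconds(host: str) -> float:
    """Absolute clock offset reported by chronyd on `host` (via ssh)."""
    out = subprocess.check_output(["ssh", host, "chronyc", "tracking"], text=True)
    for line in out.splitlines():
        if line.startswith("System time"):
            # e.g. "System time : 0.000123 seconds fast of NTP time"
            return abs(float(line.split(":", 1)[1].split()[0]))
    raise RuntimeError(f"no offset found for {host}")

skewed = sum(1 for h in HOSTS if offset_seconds(h) > MAX_OFFSET_SECONDS)
pct = 100.0 * skewed / len(HOSTS)

if pct > CRIT_PCT:
    print(f"CRITICAL: {pct:.1f}% of fleet skewed")
    sys.exit(2)
elif pct > WARN_PCT:
    print(f"WARNING: {pct:.1f}% of fleet skewed")
    sys.exit(1)
print(f"OK: {pct:.1f}% of fleet skewed")
sys.exit(0)
```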
Long story short, we switched to chrony because, unlike ntpd, we could convince it to "just fix the damn time": ntpd would refuse to correct the time if the jump was too big, and would just drift forever until manually fixed. (No amount of config hacking or reading 'man ntpd' got around this.) We also chose two bare-metal servers in each data center to act as internal NTP servers, reducing the chance of DoSing the volunteer pool servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of time skew across the fleet.
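For reference, the chrony behaviour being described is its `makestep` directive, which steps the clock outright when the offset exceeds a threshold instead of refusing large corrections. A sketch of a client-side config under that setup (server names are placeholders, not the commenter's actual hosts):

```
# Hypothetical /etc/chrony.conf for the client VMs: sync only to the two
# internal bare-metal NTP servers, and allow chronyd to step the clock
# whenever the offset exceeds one second, at any time (-1 = no limit on
# how many updates may be stepped).
server ntp1.internal.example iburst
server ntp2.internal.example iburst
makestep 1.0 -1
driftfile /var/lib/chrony/drift
```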
by frereubu on 2/11/24, 5:13 PM
(Just turn off JavaScript to read it if you hit a paywall).
by pxeger1 on 2/11/24, 4:38 PM
What? So there are no CRLs between 900B and 51KB, and the first one larger than 51KB just happened to be the median one??