by kdp747 on 2/10/24, 8:10 AM with 57 comments
by eszed on 2/11/24, 6:19 AM
> "I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
is an approach every one of us should internalize.
by m3047 on 2/11/24, 4:50 PM
More datacenter stakeholders kept joining the call, most of whom had nothing to do with our data product. Many times I heard people ask "have they found the problem yet?", as though... what? We were the best tech support they had for an entire data center going dark? After an hour somebody noticed that the clocks on the servers in the datacenter didn't match their laptop; shortly after that I was able to extricate myself from the call. I kept watching the logs, and their downloads started working again a short while later.
by dancemethis on 2/11/24, 3:01 PM
by denton-scratch on 2/11/24, 4:44 PM
A time server with a defective clock seems to be a serious problem. Zimmie says the time server was an appliance; so someone is selling as an appliance a time server that can't tell the time.
by arter4 on 2/11/24, 7:55 PM
by macintux on 2/11/24, 4:58 PM
It didn't take me long to figure out that the computers that weren't working had their clocks set well into the 21st century. The shell couldn't even display the year properly; I assumed a Y2K incompatibility, but after so many years I can't remember exactly what I saw.
Anyway, easy fix, but I never did find out what caused such a weird glitch in their environment. It's small wonder that many people aren't fluent with computers: they misbehave in such a wide variety of ways.
by vdaea on 2/11/24, 8:25 PM
by cesarb on 2/11/24, 9:29 PM
by 8organicbits on 2/11/24, 2:57 PM
by bdw5204 on 2/11/24, 3:30 PM
by raffraffraff on 2/12/24, 7:23 AM
Every VM ran CentOS, and every one of them hit the default CentOS NTP pool servers, which are run by volunteers. The pool is generally good quality, but using it the way we did was extremely stupid.
Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication, and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the on-call from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.
Before even diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us; at >4%, go critical. (Only critical alerts would page on-call outside sociable hours.)
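A minimal sketch of that aggregation idea, not the commenter's actual Sensu check: the host list, the ssh/`chronyc tracking` offset source, and the exact thresholds are assumptions, but it shows the warn-at-2%/critical-at-4% shape described above, using the usual Nagios/Sensu exit codes.

```python
#!/usr/bin/env python3
# Hypothetical fleet-wide skew check: count hosts drifting beyond a
# threshold and map the percentage to OK / WARNING / CRITICAL.
import subprocess
import sys

HOSTS = ["vm-001.internal", "vm-002.internal"]  # placeholder fleet list
MAX_OFFSET_SECONDS = 5.0                        # "a few seconds"
WARN_PCT, CRIT_PCT = 2.0, 4.0

def offset_seconds(host: str) -> float:
    """Absolute clock offset reported by chronyd on `host` (via ssh)."""
    out = subprocess.check_output(["ssh", host, "chronyc", "tracking"], text=True)
    for line in out.splitlines():
        if line.startswith("System time"):
            # e.g. "System time : 0.000123 seconds fast of NTP time"
            return abs(float(line.split(":", 1)[1].split()[0]))
    raise RuntimeError(f"no offset found for {host}")

skewed = sum(1 for h in HOSTS if offset_seconds(h) > MAX_OFFSET_SECONDS)
pct = 100.0 * skewed / len(HOSTS)

if pct > CRIT_PCT:
    print(f"CRITICAL: {pct:.1f}% of fleet skewed")
    sys.exit(2)
elif pct > WARN_PCT:
    print(f"WARNING: {pct:.1f}% of fleet skewed")
    sys.exit(1)
print(f"OK: {pct:.1f}% of fleet skewed")
sys.exit(0)
```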
Long story short, we switched to chrony because, unlike ntpd, we could convince it to "just fix the damn time": ntpd would refuse to correct the time if the jump was too big, and would just drift forever until manually fixed. (No amount of config hacking or reading 'man ntpd' got around this.) We also chose two bare-metal servers in each data center to act as internal NTP servers, reducing the chance of DoSing the volunteer pool servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of time skew across the fleet.
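For reference, the chrony behaviour being described is its `makestep` directive, which steps the clock outright when the offset exceeds a threshold instead of refusing large corrections. A sketch of a client-side config under that setup (server names are placeholders, not the commenter's actual hosts):

```
# Hypothetical /etc/chrony.conf for the client VMs: sync only to the two
# internal bare-metal NTP servers, and allow chronyd to step the clock
# whenever the offset exceeds one second, at any time (-1 = no limit on
# how many updates may be stepped).
server ntp1.internal.example iburst
server ntp2.internal.example iburst
makestep 1.0 -1
driftfile /var/lib/chrony/drift
```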
by frereubu on 2/11/24, 5:13 PM
(Just turn off JavaScript to read it if you hit a paywall).
by pxeger1 on 2/11/24, 4:38 PM
What? So there are no CRLs between 900B and 51KB, and the first one larger than 51KB just happened to be the median one??