by mfincham on 10/3/18, 1:42 AM with 3 comments
It seems popular now though for large sites (e.g. twitter.com, github.com, ycombinator.com, stackoverflow.com just to name a few) to use relatively short DNS TTLs, between 1 and 5 minutes, presumably to make failover easier.
Has popular opinion around short TTLs being "OK" changed? Are these sites doing something special to make this viable?
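(For anyone who wants to spot-check this themselves, here's a rough sketch using the third-party dnspython package, which is an assumption on my part; "dig +noall +answer" shows the same thing. Note this prints the remaining TTL in your resolver's cache, which can be lower than the value the site actually publishes.)

    # Sketch only: print the TTL your resolver currently reports for a few sites.
    import dns.resolver

    for site in ("twitter.com", "github.com", "ycombinator.com", "stackoverflow.com"):
        answer = dns.resolver.resolve(site, "A")
        print(site, answer.rrset.ttl, "seconds")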
by wahern on 10/3/18, 5:19 AM
Otherwise, it depends. Even with one DNS query per HTTP request, DNS will only represent a fraction of your network load. It's difficult to make any DNS server break a sweat. More likely the network link will saturate beforehand, causing lots of dropped packets. But it's trivial to advertise and use multiple DNS servers; far easier than scaling out a web application stack. DNS was built for high availability almost from day 1. This is also why you shouldn't worry too much about low TTLs exacerbating network faults: there's no excuse for not using geographically dispersed authoritative name servers.
For example, depending on the site I'll often host the domain on my own primary name server so I can control records without fscking with a web GUI or REST API, but the advertised authoritative name servers are EasyDNS servers which behave as secondaries mirroring my primary.
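A minimal sketch of that hidden-primary arrangement in BIND's named.conf; the addresses and hostnames are placeholders, not EasyDNS's real transfer endpoints:

    // Hidden primary: not listed in the zone's NS records, it only feeds the
    // advertised secondaries. Addresses below are RFC 5737 placeholders.
    zone "example.com" {
        type master;
        file "/etc/bind/zones/example.com.db";
        allow-transfer { 203.0.113.10; 203.0.113.11; };  // the secondaries
        also-notify    { 203.0.113.10; 203.0.113.11; };  // push changes on update
    };

The zone itself then advertises only the secondaries as authoritative (two or more NS records pointing at them), so resolvers never depend on the primary being reachable.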
The real issue isn't load but latency, which is a more complex problem. If you're not using anycast then your site is probably not big enough or important enough for a few milliseconds of upfront latency on the occasional page load to matter. Also, many caching resolvers these days will preemptively refresh records upon TTL expiration, subject to usage patterns, which means if you're seeing moderate, repeat traffic then users may not experience any additional latency at all. (Similarly, caching resolvers will often remember failing servers and try them last, regardless of ordering in a response.)
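A rough way to see how little of that latency repeat visitors pay, again sketched with the third-party dnspython package (the domain is a placeholder and the numbers depend entirely on your local resolver):

    # First lookup may require full recursion; repeats should be answered from
    # the upstream recursive resolver's cache until the (short) TTL runs out.
    import time
    import dns.resolver

    def timed_lookup(name):
        start = time.perf_counter()
        dns.resolver.resolve(name, "A")
        return (time.perf_counter() - start) * 1000.0  # milliseconds

    name = "example.com"  # placeholder
    print("cold lookup:   %.1f ms" % timed_lookup(name))
    for _ in range(3):
        print("repeat lookup: %.1f ms" % timed_lookup(name))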
As for how painful erroneous DNS changes are, low and high TTLs cut both ways. If it really matters you should be monitoring this stuff 24/7 (e.g. Pingdom), which means record errors should be quickly identified and reported. If you're set up to respond quickly (which you should be for a serious commercial operation), that argues in favor of low TTLs.
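The kind of check such a monitor runs is simple enough to sketch; the domain and addresses below are placeholders, a hosted service just does the same thing from many vantage points, and dnspython is assumed once more:

    # Resolve each name and compare against the records we expect to be
    # published; a nonzero exit is what the alerting hooks watch for.
    import sys
    import dns.resolver

    EXPECTED = {"www.example.com": {"192.0.2.10", "192.0.2.11"}}  # placeholders

    def matches(name, expected_addrs):
        answer = dns.resolver.resolve(name, "A")
        return {rr.address for rr in answer} == expected_addrs

    ok = all(matches(name, addrs) for name, addrs in EXPECTED.items())
    sys.exit(0 if ok else 1)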
by bigiain on 10/3/18, 2:12 AM
"Back in the day", Internet Explorer was a problem with TTLs, from memory IE6 was when they stopped caching all dns lookups for 24hrs no matter what the til was, and IE6 still coached for 4hrs. (This was a drama for me back in the early 2000's when I was trying to do dns based load balancing...)
My opinion these days is don't try to go much below 1 minute if you want other people's resolvers or software to honour your TTLs. That said, I do see people using 1 sec TTLs occasionally, so presumably if your application doesn't mind too much that not everybody honours your TTL, it's still worth doing for some people...
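For reference, that floor is just the zone's default TTL (or a per-record TTL); here's a placeholder zone excerpt with a 60-second default and longer TTLs on the records that rarely change:

    ; Placeholder names and addresses throughout.
    $TTL 60
    example.com.         IN  SOA  ns1.example.net. hostmaster.example.com. (
                             2018100301  ; serial
                             3600        ; refresh
                             600         ; retry
                             604800      ; expire
                             300 )       ; negative-answer TTL
    example.com.  86400   IN  NS   ns1.example.net.
    example.com.  86400   IN  NS   ns2.example.net.
    www           60      IN  A    192.0.2.10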
by mfincham on 10/3/18, 1:51 AM