from Hacker News

Tarsnap outage postmortem

by anderiv on 7/27/23, 4:33 AM with 319 comments

  • by cperciva on 7/27/23, 5:20 AM

    blinks

    Ok, I really wasn't expecting this to land at the top of HN. I'd love to stick around to answer any questions people have, but it's 10PM and my toddler decided to go to bed at 5PM... so if I'm lucky I can get about 4 hours of sleep before she decides that it's time to get up. I'll check in and answer questions in the morning.

  • by deathanatos on 7/27/23, 4:26 PM

    > Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, on 2023-07-13 (after some dust settled and I caught up on some sleep) I credited everyone's Tarsnap accounts with 50% of a month's storage costs.

    This speaks volumes to me about what kind of person Percival is; that credit would appear to be generously on the "make customer whole" side of the fence, and unlike the major cloud providers, he didn't make each customer come and individually grovel for it. And a clearly written, technical, detailed PM, too. This is how it ought to be done, and done everywhere. Thanks for being a beacon of light in the dark.

  • by hightrees2023 on 7/27/23, 8:50 AM

    The downtime could have been much shorter if you had properly set up and _tested_ disaster recovery steps. Create a full-fledged separate staging system that you can bring down and recreate, periodically test various failure modes, and document all the detailed steps of a system restore.

    I would also suggest thinking about the business long term and seeing whether you can increase revenue enough to hire a part-timer who could be a great help if a similar event happens.

    We are also a small cloud solution provider (we focus on ML APIs), and over the years it has become clear to us that when you use cloud hardware (either dedicated or virtual), outages happen from time to time. RAM, HDDs, or other hardware can malfunction at any time. So this is something that absolutely needs to be taken into account when running any high-availability online service over the long term.

  • by idlewords on 7/27/23, 3:18 PM

    Hats off to you for an honest postmortem and your capable handling of a difficult situation. The only remark I would offer is with respect to sleep deprivation—when you're the only person who can fix a problem, there's no shame in trading some additional outage time for a fresh mind. Though it feels weird to go nap when all the klaxons are blaring, problems are too easy to compound under the combination of adrenaline and inadequate sleep.

  • by zokier on 7/27/23, 6:06 AM

    Based on the description it sounds like it should be relatively easy to test this recovery process on a regular basis, to catch any lingering bugs and evaluate the recovery time. As they say, the only backups are the ones you have tested.

  • by mplewis on 7/27/23, 5:17 AM

    I always appreciate seeing a professional, courteous, and honest postmortem like this one.

  • by verytrivial on 7/27/23, 7:37 AM

    (caveat: I may be running on old tarsnap company info but) I must say, the ONLY thing that has ever made me shy away from seriously using tarsnap was the prospect of an unexpected Colin Percival outage. i.e. key person risk. I'm guessing I'm not alone in this.

  • by abiro on 7/27/23, 8:43 AM

    > The second step failed almost immediately, with an error telling me that a replayed log entry was recording data belonging to a machine which didn't exist. This provoked some head-scratching until I realized that this was introduced by some code I wrote in 2014: Occasionally Tarsnap users need to move a machine between accounts, and I handle this by storing a new "machine registration" log entry and deleting the previous one

    Recommend writing a TLA+ model to catch stuff like this
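
    The failure quoted above can be reproduced in a few lines. This is not TLA+ and not Tarsnap's code: it is a plain Python toy, assuming (one plausible reading of the quote) that a "move" deletes the old registration entry and appends a new one with a later sequence number, and that replay rejects data for an unregistered machine. `Entry`, `replay`, `move_machine`, and `ReplayError` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    seq: int       # position in the log
    kind: str      # "register" or "data"
    machine: str   # machine the entry belongs to


class ReplayError(Exception):
    """Raised when a data entry refers to a machine that is not registered."""


def replay(entries):
    """Apply entries in sequence order and count data entries per machine."""
    registered = set()
    data_count = {}
    for e in sorted(entries, key=lambda e: e.seq):
        if e.kind == "register":
            registered.add(e.machine)
        elif e.kind == "data":
            if e.machine not in registered:
                raise ReplayError(
                    f"seq {e.seq}: data for unknown machine {e.machine!r}"
                )
            data_count[e.machine] = data_count.get(e.machine, 0) + 1
    return data_count


def move_machine(entries, machine, new_seq):
    """Assumed 'move': drop the old registration entry, append a new one."""
    kept = [
        e for e in entries
        if not (e.kind == "register" and e.machine == machine)
    ]
    return kept + [Entry(new_seq, "register", machine)]


log = [Entry(1, "register", "m1"), Entry(2, "data", "m1")]
moved = move_machine(log, "m1", new_seq=3)
```

    Replaying `log` succeeds, while replaying `moved` raises `ReplayError`, because the data entry at seq 2 now precedes the only registration. A model checker would explore interleavings like this one exhaustively instead of relying on someone to think of the case.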

  • by colonwqbang on 7/27/23, 1:27 PM

    What would be the benefit of tarsnap over using something like restic+backblaze at order(s) of magnitude lower cost? What specific need would motivate you to pay $3000 per TB-year?

  • by mherrmann on 7/28/23, 12:42 PM

    It sounds like most of the 26h downtime was spent restoring backups. Incidentally, this is exactly why Tarsnap is unusable for me in production environments. Backup restoration (as a user) is excruciatingly slow. When my systems are offline, I have no patience to wait hours for my backup service. Maybe things are better now; the last time I tried was a few years ago, when Tarsnap took on the order of an hour to restore a backup of a few GB.

  • by akashshah87 on 7/27/23, 9:39 PM

    Unfortunately, looks like https://www.tarsnap.com/infrastructure.html will have to be updated.

    > So far such an outage has never occurred; but over time Tarsnap will become more tolerant of failures in order to minimize the probability that such an outage occurs in the future.

  • by Tachyooon on 7/27/23, 11:56 AM

    Unrelated to the outage, but I'm curious nonetheless: would it be possible to hook up Tarsnap's encryption software to a Dropbox folder? I'm not sure if it even makes sense to use Tarsnap for this, but I'd love to have an easy setup that allows me to use Dropbox's servers but only let them see encrypted data so they can't snoop.

  • by aborsy on 7/27/23, 7:36 AM

    Tarsnap is undoubtedly expensive, but it also donates to various efforts!

    Neglecting the pricing, does Tarsnap have any advantage over Restic?

    Restic also deduplicates, using little data.

  • by RockRobotRock on 7/27/23, 6:03 AM

    Aren't these storage prices absurd? Please let me know if I'm misunderstanding.

  • by switch007 on 7/27/23, 5:20 AM

    Not to be that guy, but it’s unreadable on iOS, whether zoomed in or in reader mode, in portrait or landscape.

    Colin, could the website be updated to the 2010s? :P

  • by zetalyrae on 7/27/23, 9:27 AM

    > The process of recovering the EC2 instance state consists of two steps: First, reading all of the metadata headers from S3; and second, "replaying" all of those operations locally. (These cannot be performed at the same time, since the use of log-structured storage means that log entries are "rewritten" to free up storage when data is deleted; log entries contain sequence numbers to allow them to be replayed in the correct order, but they must be sorted into the correct order after being retrieved before they can be replayed.)

    Far be it from me to tell anyone how to write software, but why build a database on top of S3 when you can just chuck the metadata into RDS with however much replication you want?

    The backups themselves should be in S3, but using S3 as a NoSQL append-only database seems unwise.

    This would benefit from being further from the metal.
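
    For readers unfamiliar with the pattern in the quoted passage, here is a minimal sketch of a two-phase "fetch everything, sort by sequence number, then replay" recovery. It is an illustration only, not Tarsnap's implementation: the `LogEntry` layout, the put/delete semantics, and the function names are all assumptions, and a plain Python list stands in for S3.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    seq: int                 # sequence number assigned when the entry was written
    key: str
    value: Optional[bytes]   # None marks a deletion


def fetch_all(store: Iterable[LogEntry]) -> List[LogEntry]:
    """Phase 1: read every entry; retrieval order carries no meaning."""
    return list(store)


def replay(entries: List[LogEntry]) -> Dict[str, bytes]:
    """Phase 2: sort by sequence number, then apply each entry in order."""
    state: Dict[str, bytes] = {}
    for e in sorted(entries, key=lambda e: e.seq):
        if e.value is None:
            state.pop(e.key, None)
        else:
            state[e.key] = e.value
    return state


# After log entries are rewritten, an older entry can be retrieved after a
# newer one, so replay cannot start until everything has been fetched.
unordered = [
    LogEntry(3, "a", None),
    LogEntry(1, "a", b"x"),
    LogEntry(2, "b", b"y"),
]
recovered = replay(fetch_all(unordered))
```

    Applying `unordered` in retrieval order would leave key "a" present, since its deletion at seq 3 would run before the put at seq 1. That is why the sort has to sit between the two phases.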