by mbrain on 2/10/17, 11:04 PM with 257 comments
by ky738 on 2/11/17, 12:33 AM
by illumin8 on 2/11/17, 12:15 AM
Also, RDS gives you a synchronously replicated standby database and automates failover, including updating the DNS CNAME that clients connect to (so a failover is seamless to clients, other than requiring a reconnect) and ensuring that you don't lose a single transaction (the magic of synchronous replication over a low-latency link between datacenters).
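For illustration, a rough sketch of what that endpoint indirection looks like from the client's side (the hostname below is made up; a real endpoint name comes from the RDS console or API):

    # Hypothetical endpoint name; the real one comes from the RDS console/API.
    ENDPOINT=mydb.abc123xyz.us-east-1.rds.amazonaws.com

    # Watch which underlying host the endpoint name currently resolves to.
    # During a Multi-AZ failover this target changes, while the name the
    # application connects to stays the same, so clients only reconnect.
    while true; do
        date
        dig +short "$ENDPOINT"
        sleep 5
    done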
For a company like Gitlab, which is public about wanting to exit the cloud, I feel like they could have really benefited from a fully managed relational database service. This entire tragic situation could never have happened if they had been willing to acknowledge the obvious: managing relational databases is hard, and let someone with better operational automation, like AWS, do it for them.
by KayEss on 2/11/17, 6:17 AM
They should have spun up a new server to act as secondary the moment replication failed. This new server is the one you run all of these commands on, and if you make a mistake you spin up a new one.
Only when the replication is back in good order do you go through and kill the servers you no longer need.
The procedure for setting up these new servers should be based on the same scripts that spin up new UAT servers for each release. You spin up a server that is a near copy of production and then do the upgrade to the new software on that. Only when you've got a successful deployment do you kill the old UAT server. This way all of these processes are tested time and time again, you know exactly how long they'll take, and you iron out problems in the automation.
by meowface on 2/11/17, 12:10 AM
I could feel the sweat drops just from reading this.
I'd bet every one of us has experienced the panicked Ctrl+C of Death at some point or another.
by atmosx on 2/11/17, 10:25 AM
My 2 cents... I might be the only one, but I don't like the way GL handled this case. I understand transparency as a core value and all, but they've gone a bit too far.
IMHO this level of exposure has far-reaching privacy implications for the people who work there. Implications that cannot be assessed now.
The engineer in question might not have suffered PTSD, but some other engineer might have. Who knows how a bad public experience might play out? It's a fairly small circle; I'm not sure I would like to be part of a company that would expose me in a similar fashion if I happened to screw up.
On the corporate side of things there is a saying in Greek: "Τα εν οίκω μη εν δήμω", meaning don't wash your dirty linen in public. Although they're getting praised by bloggers and other small startups, at the end of the day exposing your 6-layer broken backup policy and other internal flaws in between, while being funded to the tune of $25.62M over 4 rounds, does not look good.
by gr2020 on 2/10/17, 11:48 PM
I'm glad it all worked out in the end!
by greenrd on 2/10/17, 11:57 PM
by ancarda on 2/11/17, 11:56 AM
At my dayjob, we gradually stopped using email for almost all alerts; instead we have several Slack channels, like #database-log, where MySQL errors go. Any cron jobs that fail post in #general-log. Uptime monitoring tools post in #status. And so on...
Email has so much anti-spam machinery, like DMARC, that makes it less certain your mail will actually be delivered. Something failing, like a backup or database query, is too important to risk it never reaching someone who can make sure it gets fixed.
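Something along these lines (a sketch; the webhook URL and the job in the usage line are placeholders), wrapping a cron job and posting failures straight to a Slack incoming webhook:

    #!/usr/bin/env bash
    # Wrap a cron job and post to a Slack incoming webhook if it fails,
    # instead of relying on a failure email getting through.
    # Usage: notify-wrap.sh pg_dump -Fc mydb -f /backups/mydb.dump
    set -u

    WEBHOOK="https://hooks.slack.com/services/T000/B000/XXXX"   # placeholder URL

    if ! "$@"; then
        curl -s -X POST -H 'Content-type: application/json' \
            --data "{\"text\": \"cron job failed on $(hostname): $*\"}" \
            "$WEBHOOK" > /dev/null
    fi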
My 2 cents.
by matt_wulfeck on 2/11/17, 6:24 AM
I can only imagine this engineer's poor old heart after the realization of having removed that directory on the master. A sinking, awful feeling of dread.
I've had a few close calls in my career. Each time it's made me pause and thank my luck it wasn't prod.
by nowarninglabel on 2/11/17, 12:17 AM
by aabajian on 2/11/17, 11:59 AM
>>The standby (secondary) is only used for failover purposes.
>>One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup.
IMO, secondaries should be treated exactly like their primaries. No operation should be done on a secondary unless you'd be OK doing that same operation on the primary. You can always create another instance for these operations.
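For example (a sketch with made-up hostnames and paths, using 9.x-era pg_basebackup flags), you'd seed a brand-new instance instead of wiping the existing standby's data directory:

    # On a freshly provisioned host (NOT the existing standby), seed a new
    # replica from the primary. Hostnames and paths here are made up.
    sudo -u postgres pg_basebackup \
        -h db1.example.com -U replication \
        -D /var/lib/postgresql/9.6/main \
        -X stream -P -R

    # Only once this new standby is confirmed to be streaming do you retire
    # the old one; a mistake here costs a throwaway box, not the last copy
    # of the data.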
by voidlogic on 2/11/17, 2:45 AM
Yikes. One common practice that would have avoided this is using the just-taken backup to populate staging. If the restore fails, pages go out. If the integration tests that run after a successful restore/populate fail, pages go out.
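Something along these lines (a sketch; the paths, database name, table and paging hook are all placeholders for whatever your environment actually uses):

    #!/usr/bin/env bash
    # Restore last night's dump into a staging database and sanity-check it.
    # Paths, database names and the paging hook are placeholders.
    set -euo pipefail

    DUMP=/backups/latest.dump
    STAGE_DB=app_staging

    page_oncall() {   # stand-in for PagerDuty/VictorOps/whatever you use
        echo "ALERT: $*" | mail -s "backup restore failed" oncall@example.com
    }

    dropdb --if-exists "$STAGE_DB"
    createdb "$STAGE_DB"

    if ! pg_restore --no-owner --dbname="$STAGE_DB" "$DUMP"; then
        page_oncall "pg_restore of $DUMP failed"
        exit 1
    fi

    # Cheap integration check: the restored data must not be empty or stale.
    ROWS=$(psql -At -d "$STAGE_DB" -c "SELECT count(*) FROM users;")
    if [ "$ROWS" -lt 1 ]; then
        page_oncall "restored database looks empty"
        exit 1
    fi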
Live and learn I guess.
by _Marak_ on 2/11/17, 1:57 AM
It's unfortunate they had this technical issue, but it's good to see others (besides Github) operating in this space. I should give Gitlab a try sometime.
by pradeepchhetri on 2/11/17, 6:41 AM
by jsperson on 2/11/17, 12:14 AM
This is a great attitude. Too often opportunity cost isn't considered when making rules to protect folks from doing something stupid.
by yarper on 2/11/17, 12:16 AM
It's really simple to point the finger and try to find a single cause of failure - but it's a fool's errand - comparable to finding the single source behind a great success.
by isoos on 2/11/17, 12:09 AM
by samat on 2/10/17, 11:54 PM
by XorNot on 2/11/17, 3:55 PM
How do you reliably check if something didn't happen? Is the backup server alive? Did the script work? Did the backup work? Is the email server working? Is the dashboard working? Is the user checking their emails (think: wildcard mail sorting rule dumping a slight change in failure messages to the wrong folder).
And the converse answer isn't much better: send a success notification... but if it mostly succeeds, how do you keep people paying attention so they notice when it doesn't (i.e. no failure message, but also no success message)?
The best answer I've got, personally, is to use positive notifications combined with visibility - dashboard your really important tasks with big, distinctive colors - use time-based detection and put a clock on your dashboard (because dashboards which mostly don't change might hang without anyone noticing).
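One scriptable form of that time-based detection (a sketch: the heartbeat path and the 26-hour threshold are made up, and the backup job is assumed to touch the heartbeat file on success somewhere the monitoring side can read it):

    #!/usr/bin/env bash
    # Dead-man's-switch check: alert purely on the age of the heartbeat file,
    # so a dead backup host, a broken script and a lost email all collapse
    # into the same "heartbeat is stale" signal.
    set -u

    HEARTBEAT=/var/monitoring/backup.last-success
    MAX_AGE=$((26 * 3600))   # a daily backup more than 26h old is a problem

    now=$(date +%s)
    last=$(stat -c %Y "$HEARTBEAT" 2>/dev/null || echo 0)

    if [ $((now - last)) -gt "$MAX_AGE" ]; then
        echo "CRITICAL: no successful backup heartbeat in over 26 hours"
        exit 2   # Nagios-style critical; feed into the pager of your choice
    fi
    echo "OK: last successful backup at $(date -d "@$last")"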
by nodesocket on 2/11/17, 3:56 AM
>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.
Is this a bug/defect in PostgreSQL then? An incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of the Postgres primary removing the WAL segments?
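For context, in 9.x-era PostgreSQL the primary only retains WAL for a standby via wal_keep_segments or a replication slot, so a busy primary with a small keep window will recycle segments the standby still needs. A sketch of the two usual guards (the values and slot name are illustrative, not GitLab's actual settings):

    # Option 1: keep a larger window of WAL on the primary. Segments are 16MB,
    # so 1000 segments is roughly 16GB; size this to your write rate and disk.
    psql -c "ALTER SYSTEM SET wal_keep_segments = 1000;"
    psql -c "SELECT pg_reload_conf();"

    # Option 2: a physical replication slot, which makes the primary retain WAL
    # until this standby has consumed it (monitor it, or a dead standby will
    # eventually fill the primary's disk instead).
    psql -c "SELECT pg_create_physical_replication_slot('standby_db2');"
    # ...then point the standby's recovery.conf at it:
    #   primary_slot_name = 'standby_db2'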
by dancryer on 2/13/17, 9:44 AM
Is that correct? http://monitor.gitlab.net/dashboard/db/backups?from=14859419...
by nierman on 2/11/17, 12:02 AM
Definitely monitor your replication lag (or at least disk usage on the master) with this approach, in case WAL starts piling up there.
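A sketch of both checks against a 9.x-era primary (the data directory path is an example; PostgreSQL 10+ renames these functions and the pg_xlog directory to their pg_wal equivalents):

    # How far behind is each standby, in bytes of WAL?
    psql -x -c "
      SELECT application_name,
             state,
             pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
               AS replay_lag_bytes
      FROM pg_stat_replication;"

    # And how full is the WAL partition on the master, in case segments pile up?
    df -h /var/lib/postgresql/9.6/main/pg_xlog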
by nstj on 2/11/17, 3:50 AM
by Achshar on 2/11/17, 11:20 AM
by jsingleton on 2/11/17, 1:34 PM
I moved from AWS to Azure years ago. Mainly because I run mostly .NET workloads and the support is better. I've recently done some .NET stuff on AWS again and am remembering why I switched.
by AlexCoventry on 2/11/17, 10:27 PM
Are any organizational changes planned in response to the development friction which led to the outage? It seems to have arisen from long-standing operational issues, and an analysis of how prior attempts to address those issues got bogged down would be very interesting.
by oli5679 on 2/10/17, 11:48 PM
http://serverfault.com/questions/587102/monday-morning-mista...
by tschellenbach on 2/11/17, 1:15 AM
by khazhou on 2/11/17, 10:02 AM
DON'T PANIC
by grhmc on 2/11/17, 12:13 AM
by encoderer on 2/11/17, 7:33 AM
by dustinmoris on 2/11/17, 11:19 AM
It's good to be humble, to know that mistakes can happen to anyone, to learn from them, etc., but when, in 2017, you are still making the same stupid mistakes that people have made a million times since 1990, when it's all well documented and there are systems built to avoid these same basic mistakes, and you still make them today, then I just think it cannot be described as anything other than absolute stupidity and incompetence.
I know they have many fans who look past every mistake, no matter how bad, just because they are open about it, but come on, this is just taking the piss now, no?
by cookiecaper on 2/11/17, 10:46 AM
1. Notifications go through regular email. Email should be only one of the channels used to dispatch notifications of infrastructure events. Tools like VictorOps or PagerDuty should be employed as notification brokers/coordinators, and notifications should go to email, team chat, and phone/SMS if severity warrants, with an attached escalation policy so that it doesn't all hinge on one guy's phone not being dead.
2. There was a single database, whose performance problems had impacted production multiple times before (the post lists 4 incidents). One such performance problem was contributing to breakage at this very moment. I understand that was the thing being fixed here, but what process allowed this to cause 4 outages over the preceding year without moving to the top of the list of things to address? Wouldn't it be wise to tweak the PgSQL configuration and/or upgrade the server before trying to integrate the hot standby to serve some read-only queries? And since a hot standby can only service reads (and afaik this is not a well-supported option in PgSQL), wouldn't most of the performance issues, which appear write-related, remain? The process seriously needs to be reviewed here.
And am I reading this right, the one and only production DB server was restarted to change a configuration value in order to try to make pg_basebackup work? What impact did that have on the people trying to use the site a) while the database was restarting, and b) while the kernel settings were tweaked to accommodate the too-high max_connections value? Is it normal for GitLab to cause intermittent, few-minute downtimes like that? Or did that occur while the site was already down?
3. Spam reports can cause mass hard deletion of user data? Has this happened to other users? The target in this instance was a GitLab employee. Who else has been trolled this way in cases where performance wasn't impacted? What's the remedy for wrongly-targeted persons? It's clear that backups of this data are not available. And is the GitLab employee's data gone now too? How could something so insufficient have been released to the public, and how can you disclose this apparently-unresolved vulnerability? By so doing, you're challenging the public to come and try to empty your database. Good thing you're surely taking good backups now! (We're going to gloss over the fact that GitLab just told everyone its logical DB backups are 3 days behind and that we shouldn't worry because LVM snapshots now occur hourly, and that it only takes 16 hours to transfer LVM snapshots between environments :) )
4. The PgSQL master deleted its WALs within 4 hours of the replica "beginning to lag" (<interrobang here>). That really needs to be fixed. Again, you probably need a serious upgrade to your PgSQL server, because it apparently doesn't have enough space to hold more than a couple of hours of WALs (unless this was just a naive misconfiguration of the [min|max]_wal_size parameter, like the max_connections parameter?). I understand that transaction logs can get very large, but the disk needs to accommodate them (usually a second disk array is used for WALs to ease write impact), and replication lag needs to be monitored and alarmed on.
There were a few other things (including someone else downthread who pointed out that your CEO re-revealed your DB's hostnames in this write-up, and that they're resolvable via public DNS and have running sshds on port 22), but these are the big standouts for me.
P.S. bonus point, just speculative:
Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds like a stretch. Some data may have been recoverable with some disk forensics. Especially if your Postgres server was running at the time of the deletion, some data and file descriptors also likely could have been extracted from system memory. Linux doesn't actually delete files if another process is holding their handle open; you can go into the /proc virtual filesystem and grab the file descriptor again to redump the files to live disk locations. Since your database was 400GB and too big to keep 100% in RAM, this probably wouldn't have been a full recovery, but it may have been able to provide a partial one.
The theoretically best thing to do in such a situation would probably be to unplug the machine ASAP after ^C (without going through formal shutdown processes that may try to "clean up" unfinished disk work), remove the disk, attach it to a machine with a write blocker, and take a full-disk image for forensics purposes. This would maximize the ability to extract any data that the system was unable to eat/destroy.
In theory, I believe pulling the plug while a process kept the file descriptor open should keep you in reasonably good shape, as far as that goes after you've accidentally deleted 3/4 of your production database. The process never closes and the disk stops and the contents remain on disk, just pending unlink when the OS stops the process (this is one reason why it'd be important to block writes to the disk/be extremely careful while mounting; if the journal plays back, it may destroy these files on the next boot anyway). But someone more familiar with the FS internals would have to say definitively if it works that way or not.
I recognize that such speculative/experimental recovery measures may have been intentionally forgone since they're labor intensive, may have delayed the overall recovery, and very possibly wouldn't have returned useful data anyway. Mentioning it mainly as an option to remain aware of.
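For anyone curious, a sketch of the /proc trick described above, assuming a postgres process that had the files open is still running (the pid and fd numbers are examples, and whether anything useful comes back is pure luck):

    # Find postgres processes that still hold deleted files open.
    for pid in $(pgrep postgres); do
        ls -l /proc/"$pid"/fd 2>/dev/null | grep '(deleted)'
    done

    # Copy a still-open-but-unlinked file back out via its file descriptor.
    # The pid/fd numbers are examples; write the copy to a DIFFERENT disk than
    # the one you want to preserve for forensics.
    cp /proc/12345/fd/27 /mnt/recovery/recovered_relation_file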
by NPegasus on 2/11/17, 1:45 AM
> Root Cause Analysis
> [...]
> [List of technical problems]
No, the root cause is that you have no senior engineers who have been through this before. A collection of distributed remote employees, none of whom has enough experience to know any of the items on the list of "Basic Knowledge Needed to Run a Website at Scale" that you cite as the root causes. $30 million in funding and still running the company like a hobby project among college roommates. Mark my words, the board members from the VC firms will be removed by the VC partners for letting the kids run the show. Then the VC firms will put an experienced CEO and CTO in place to clean up the mess and get the company on track. Unfortunately they will probably have wasted a couple of years and be down to their last million dollars before they take action.
by EnFinlay on 2/11/17, 1:30 AM