by dmundhra on 12/15/21, 8:07 AM with 78 comments
The common way I know is that access is given to a small number of trusted people who make the updates. Are there any other ways or best practices, so that this small set of people doesn't become a bottleneck, apart from the potential risk of abuse of trust?
by quilombodigital on 12/15/21, 1:12 PM
by aggrrrh on 12/15/21, 10:02 AM
In my opinion such queries should be subject to an established development process:
1. Commit your query to a repository
2. Send it to code review
3. After a successful review and merge, it should be deployed automatically by CI/CD
It may be necessary to run a query directly via the console in some cases, though. But such a query should be committed to the repository later anyway.
And of course you should use proper tools, as other comments suggest.
by cabraca on 12/15/21, 9:23 AM
by ungawatkt on 12/15/21, 10:23 AM
- reproduce the issue and fix locally, with prod data copied over if needed (better than guessing)
- save (back up) all data that is being changed somewhere, even if it's on your local machine temporarily (again, better than not having it)
- if your datastore has transactions, run the change and a confirmation query in a transaction/rollback several times before committing (a sketch follows this comment)
- don't hide what you're doing, tell everyone who will listen that you're mucking about in prod
- if someone will hop on a screen share with you, let them
Main theme here, when making data changes, is test as much as you can and make sure folks know it's happening. If you're making data changes, the code has already failed, so don't feel bad doing what you need to do, but mitigate loss as much as you can. Shit happens, deal with it, but don't make it worse if you can.
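To make the transaction point concrete, here's a minimal sketch of the dry-run pattern (Postgres syntax; the table and column names are made up):

BEGIN;

UPDATE orders
SET status = 'refunded'
WHERE status = 'refund_pending'
  AND created_at < '2021-12-01';

-- confirmation query: eyeball the counts before deciding
SELECT status, count(*) FROM orders GROUP BY status;

ROLLBACK; -- rehearse with ROLLBACK as many times as needed; switch to COMMIT only once satisfied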
by KptMarchewa on 12/15/21, 12:40 PM
In the other case, you write a small job called fixtool, which goes through the normal code review process, then gets deployed once, runs, and gets deleted after it fixes the situation.
by gerardnico on 12/15/21, 8:57 AM
You can also use a specialized tool. https://datacadamia.com/data/database/migration#tool
Make a branch, test your code and deploy it.
by gwbas1c on 12/15/21, 2:09 PM
Is this about processes as a small team grows into a larger company? At a certain point your day-to-day software engineers will lose access to production systems. You will need to take away "ops" from the software engineers and have a dedicated "ops" team. Some places call the "ops" team "devops" for political reasons, especially if some founder has a chip on their shoulder about not having dedicated "ops." (Software engineers not touching production systems is an industry-standard security practice.)
Is this about how to do the database migration safely? That really depends on your stack, business type, and scalability needs. Assuming you aren't running a hyper-scale product, and you're on a normal database, the easiest thing to do is to have "planned downtime" at a time when your load is lowest. Your business logic layer should return some kind of 5xx error during this period, and your clients / UI should be smart enough to retry after 1-2 minutes. If it's a minor update (plenty of good advice in this thread), the downtime should only be 1-2 minutes tops. (The only reason to plan and notify your customers is in case someone is trying to do something business critical and complains.) One thing you can do is "let it be known" that you have 5 minutes of planned downtime Monday night and Thursday night, and that's your window for anything that isn't an emergency.
Is this about the frequency of updates? This is a quality problem, and the only thing you can do is improve your testing and release process to catch these bugs sooner. This is the "growing up" that all small tech companies go through. As you grow, make sure to bring in some people who are mid-late career who've been through these changes. In short, you will need to introduce processes that catch bugs sooner, like automated testing and code coverage. You may find that your "test engineers" write nearly as much test code as your software engineers put into the shipping product.
by unklefolk on 12/15/21, 1:05 PM
by pingsl on 12/15/21, 12:07 PM
1. Please don’t issue I/U/D SQL statements directly. You’d better write some programs acting as the “revert transactions” to do the data modification. In this way, you don’t need to grant I/U/D privileges on the production tables to any user IDs. Instead, you give access to those programs. This helps remove human error as much as possible, and the behavior of the programs is more predictable and consistent.
2. You should have a fixed change window for these kinds of data changes. You should not execute those “revert transactions” whenever you want.
3. Then you give the execution access of those programs to the people who need to do the work only during the change window. That is you grant the execute access to the user ID beforehand and revoke the access afterward. Since this is grant/revoke between user ID and programs, it’s much safer. If you have to grant/revoke between user ID and tables, there might be cascade effects.
4. Before the change, capture the “before” data and get ready the fallback script.
5. Don’t forget to turn on the audit options/tools during the change window.
6. If you guys work in a physical office, you can think about binding those revert transactions to a dedicated desktop.
I know these rules are complicated and tedious, but they could protect the team and the data as well :)
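To illustrate points 1 and 3, here's a rough Postgres-flavored sketch of the grant/revoke around a change window (the function and user names are hypothetical):

-- the reviewed fix lives in a program, not in ad-hoc I/U/D statements
REVOKE EXECUTE ON FUNCTION fix_orphan_addresses() FROM PUBLIC;

-- start of the change window:
GRANT EXECUTE ON FUNCTION fix_orphan_addresses() TO oncall_user;

-- the operator only runs the pre-approved program:
SELECT fix_orphan_addresses();

-- end of the change window:
REVOKE EXECUTE ON FUNCTION fix_orphan_addresses() FROM oncall_user;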
by baash05 on 12/15/21, 8:55 AM
by jacobwilliamroy on 12/15/21, 10:08 AM
by jkingsbery on 12/15/21, 3:02 PM
I think this was a common way of doing it 10+ years ago, but what I've seen work the best is to have your CI/CD process apply the updates. Assuming the migration scripts follow the same process as all your other code (code reviewed, tested, and are runnable), automatically applying these scripts is much less error prone than having a person apply them. Humans are pretty terrible at following a script, which is one of the things computers are great at.
Some objections I've seen to relying on automatic deployments:
- "But we should have a person run them to make sure they do they right thing!" Agreed, you should, but that should happen at test time, or maybe code review time, not right before you are about to apply something into production the Shift Left [1] principle applies to database changes as well as code changes
- "What if something goes wrong?" Well, what if something goes wrong with any of your deployments? Usually, you can roll back (either literally rolling back, or rolling forward, or rolling in some other direction that leaves your system in an ok state), look at the logs to see what went wrong, and try again.
- "But data can get messed up!" Data can get messed up by code changes. You should have a backup/disaster recovery process.
by nickjj on 12/15/21, 1:01 PM
In development I create the migration which ends up being a file that alters the schema. I run the migration command and make sure things work as planned and all tests still pass.
Then I push up the feature branch and it gets code reviewed. This review could either be by yourself or a team member depending on team size.
After everything passes on CI it gets merged and deployed. At this point it depends on how you deploy things, but that could be to a staging environment where it runs first, which gives you extra confidence that it works before rolling it out to production, or you could go straight to production if you don't have a staging environment.
As for how it gets run, usually it happens after the new version of your code is available to be pulled but before your web app gets restarted. This way it can run while the old version of your app is running; then, once it completes, you can restart your web app to pick up the new code base that was part of the deploy.
If you're careful about how you migrate things you can do a lot of migrations without hard downtime, but it's usually a balancing act between how complicated you want the migration process to be (multiple steps / deploys) vs how much downtime you can live with (1 step / deploy).
Basically migrations are code that get treated like any other code deploy with a proper review. If it's an especially important migration it wouldn't hurt to get it reviewed by more than 1 person and also perform a production DB backup right before you deploy the migration.
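As a hypothetical example of the multi-step trade-off mentioned above, a column change can be split across two deploys so neither one needs downtime:

-- deploy 1: runs while the old code is still live (nullable, so nothing breaks)
ALTER TABLE users ADD COLUMN email_normalized text;
UPDATE users SET email_normalized = lower(email);  -- backfill; batch this on large tables

-- deploy 2: only after every app server runs code that writes the new column
ALTER TABLE users ALTER COLUMN email_normalized SET NOT NULL;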
by lucastech on 12/15/21, 8:27 PM
The steps I took were:
1. backup production before making any changes
2. import the production backup locally
3. test the script locally, QA the result to ensure desired outcome
4. repeat 2-3 until you feel confident that you're able to consistently get what you need from the script
5. wait until a low volume time for your platform (if possible), run a backup, import and test 1 more time locally, apply the change in production
However, after doing this, a coworker noticed something we had missed when QAing the data. As a result I had to revert the changes; I had a rollback query ready to restore the modified records.
As part of the planning for this, I added a new column to the table in question to track if/when a fix was made to the record. This allowed me to identify the records which were modified to roll them back. While this isn't always practical, it made sense long term for our situation.
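Roughly, the tracking-column idea looks like this (the table, columns, and backup table are invented for the example):

ALTER TABLE invoices ADD COLUMN fixed_at timestamptz;

-- the fix marks every row it touches
UPDATE invoices
SET total = subtotal + tax,
    fixed_at = now()
WHERE total <> subtotal + tax;

-- if the fix turns out to be wrong, the marked rows are easy to find and
-- restore from the pre-change backup
UPDATE invoices i
SET total = b.total
FROM invoices_backup b
WHERE i.id = b.id
  AND i.fixed_at IS NOT NULL;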
A secondary option I've used in the past is exporting the table to be updated and then writing a simple script with the modifications necessary to generate the SQL to update each record individually. I've then used the same script and export to build a revert update query as well just in case something unexpected comes up.
At the end of the day, make sure you're backing up before any changes, plan ahead in case of unexpected errors so that you can revert quickly to avoid downtime instead of scrambling when it goes wrong. Test it a few times locally (change + revert) to try and sort out the issues prior to applying them in production.
edit: formatting
by cable2600 on 12/15/21, 9:14 AM
by zerocount on 12/15/21, 9:57 AM
I only had 2 of these in the 5 years I worked there, but here's an example. We had an internal purchase request system used for purchasing a keyboard or whatever you needed for work. Of course the purchase request went through a chain of approvals starting with your manager and ending with the CTO. The CTO threw a fit about having to approve keyboards and other small items, so it was deemed an 'emergency' to fix it right away. I had to immediately patch the code so he wouldn't see trivial requests. The 'fire call' allowed me to submit the code directly into production without going through the change control procedures, which only happened once per week.
And you better be damn sure that your changes are correct, crap rolls down hill very quickly when it involves very senior people.
by xchaotic on 12/15/21, 9:09 AM
by lmilcin on 12/15/21, 2:13 PM
What you do is write any operations that are to be run on production as code (for example, a service that can be triggered).
You then get the code reviewed (second pair of eyes), after which it is automatically compiled and delivered to pre-production. On pre-production you have a chance to test it against the full environment to see if it breaks anything. If the test results are accepted, it can be promoted to prod.
We actually have special service that is only meant to run one-off database operations. It is a webservice where you can trigger jobs.
Jobs do not take any parameters at all -- you just give the name of the job and that's it. This allows the entire operation to be reviewed in full.
Jobs can't run in parallel -- there is a queue and you can schedule a job to run. The job can also be saving progress (and it will restore from the saved progress in case it fails or is cancelled mid-operation).
by tailspin2019 on 12/16/21, 1:37 AM
https://octopus.com/blog/database-deployment-automation-adho...
by zv on 12/15/21, 9:01 AM
Either way, you need to designate someone knowledgeable to oversee the process. Automated deployment just makes this process way faster (thus "reducing the bottleneck"), where this person only needs to review the code to check that nothing suspicious is in there and approve the deployment.
Manual deployments are prone to human error, especially under stress/time pressure. If the manual deployment "package" (a set of SQL scripts) is poorly written, there's a huge incentive to "contact directly", which again can lead to manual errors.
The biggest drawback is culture, which is the hardest thing to change ("we've done it this way for n years and we don't want to change").
by tristor on 12/15/21, 9:30 AM
In all cases though you should have a change management process where the schema change is documented, reviewed, and approved and only a small number of people have the necessary access to run these types of changes in production. Change management is your friend.
1: https://www.percona.com/doc/percona-toolkit/3.0/pt-online-sc...
2: https://www.percona.com/software/database-tools/percona-tool...
by PeterisP on 12/15/21, 11:25 AM
by brightball on 12/15/21, 1:07 PM
Migrations are a good place to store code for an update. If it’s an update that will take a while I might use the migration to enqueue a job.
by darkr on 12/15/21, 11:37 AM
I’m a big fan of sqitch[1] but many migration tools will handle data updates as well as DDL.
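Without getting into sqitch specifics, the general shape is that a data change ships as a deploy script plus a matching revert script, both plain SQL (hypothetical example):

-- deploy: normalize legacy country codes
BEGIN;
UPDATE customers SET country = 'GB' WHERE country = 'UK';
COMMIT;

-- revert: restore the previous values (assumes no rows were already 'GB' before the deploy)
BEGIN;
UPDATE customers SET country = 'UK' WHERE country = 'GB';
COMMIT;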
by winrid on 12/15/21, 2:31 PM
A former job eventually had a request system - you would request access for your user for 24 hours.
At the job before that, only managers had prod db access. Do not recommend if you can avoid it.
by JensRantil on 12/16/21, 10:44 AM
[1] https://kevin.burke.dev/kevin/reddits-database-has-two-table...
by thiht on 12/15/21, 11:05 PM
Whenever I need to patch data on a Postgres, the first thing I type is `begin;` to start my transaction. After that I just run my updates safely and check that the impacted data matches what I expected. Then I just need to commit or rollback the changes.
2 things to do before that:
- have the queries reviewed by members of the team knowledgeable about the db
- try the query on a dev environment (ideally a preprod with a replica of the data)
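For example (hypothetical table; a RETURNING clause makes the check easier):

begin;

update accounts
set plan = 'pro'
where id in (42, 43)
returning id, plan;  -- shows exactly which rows changed and their new values

-- if the returned rows (and the row count) match expectations:
commit;
-- otherwise:
-- rollback;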
by rognjen on 12/15/21, 3:14 PM
If you mean one-off changes to existing data what has served me well so far is implementing an audit log from the start coupled with making sure that records are fudgeable by a select few admins.
This means that a few people can make changes to nearly everything from the interface rather than having to log into the DB directly. At the same time there is a record of changes done and the ability to comment on and revert them.
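For the database-level variant, a minimal Postgres sketch of an audit table plus trigger could look like this (all names are made up; an application-level audit log works just as well):

CREATE TABLE audit_log (
    id          bigserial PRIMARY KEY,
    table_name  text NOT NULL,
    row_id      bigint NOT NULL,
    changed_by  text NOT NULL DEFAULT current_user,
    changed_at  timestamptz NOT NULL DEFAULT now(),
    old_row     jsonb,
    new_row     jsonb
);

CREATE FUNCTION audit_customers() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (table_name, row_id, old_row, new_row)
    VALUES (TG_TABLE_NAME, NEW.id, to_jsonb(OLD), to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_audit
AFTER UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION audit_customers();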
by fer on 12/15/21, 9:09 AM
by preetamjinka on 12/15/21, 8:11 AM
We have the same approach. A very small number of people have write access to the production databases. If things can’t wait for a schema change release (15-30 min) and can’t be done through the back office API, we do it manually. It’s very rare.
by Notanothertoo on 12/16/21, 4:23 AM
by plasma on 12/15/21, 8:27 AM
It should be a rare exception not a normal thing, so a few people is fine.
Best practice is to also have your change reviewed beforehand, and run in a transaction where you also validate expected results before committing, etc.
by Mister_Snuggles on 12/15/21, 1:36 PM
* User reports the issue to a business analyst via a ticket.
* Business analyst confirms the issue, adds any relevant details, and opens a ticket with the technical team.
* The technical team creates a script to perform the update and runs it in the 'support' database (which is refreshed nightly from Production) (a defensive sketch of such a script follows this comment)
* Business analyst validates results
* User validates the results
* Technical team logs a Change Request, it is discussed at the next Change Advisory Board meeting, then the script is run during the next change window (weekly). If it's an emergency, we can run it the same day.
* The Database Administration team runs the script in Production
It sounds like a heavyweight process, but with multiple people involved and actions logged at every step of the way it's a very safe and auditable process.
TL;DR: A small number of trusted people have the required access, there is a fair amount of business process built around using this access.
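One extra safety net for the script itself (a sketch, assuming Postgres and an invented table): have it abort unless it touches exactly the number of rows that was validated in the support database:

BEGIN;

DO $$
DECLARE
    affected bigint;
BEGIN
    UPDATE tickets SET assignee = 'support-queue' WHERE assignee IS NULL;
    GET DIAGNOSTICS affected = ROW_COUNT;
    IF affected <> 37 THEN  -- the count confirmed in the support database
        RAISE EXCEPTION 'expected 37 rows, got %', affected;
    END IF;
END $$;

COMMIT;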
by wizwit999 on 12/15/21, 9:43 PM
by natoliniak on 12/15/21, 9:01 AM
by NicoJuicy on 12/15/21, 1:18 PM
by ItsBob on 12/15/21, 11:29 AM
I've never had to work on a 24/7 critical system so I dunno how to do that.
by tbrownaw on 12/15/21, 1:34 PM
by seancoleman on 12/15/21, 2:10 PM
by sgt101 on 12/15/21, 11:34 AM
In addition I would build a validation service that checks and tests the update data (the list of changes) before they are submitted to the change service.
I would not permit any ad hoc changes to the prod database. The REST service should be the only way for a mutation to be done, short of a complete new release.
by gizmore on 12/15/21, 3:34 PM
by mamcx on 12/15/21, 3:54 PM
Other commenters say a lot about PREVENTING an issue. That is good, but you can get stuck in a situation (very common, sadly) where you are buried under a pile of "stuff" that hides a simple fact:
You CAN ruin and revert nearly everything* in an RDBMS and still come out on top.
* as long as you are inside SQL/transactions and not messing with the binary data
---
The point here is that you can model the database in ways that leave you stuck with limited options, or instead in ways that make it easier to get out of trouble.
The good thing is that this mostly means applying good design practices, and it is not that problematic.
The key is that data in an RDBMS IS values (i.e. immutable rows and tables) behind a mutable facade (INSERT, DELETE), so you can always take a snapshot of the data and apply reversible operations.
This is what most people do with "copy to SQL, CSV, do a backup, etc.", but it is in fact more built-in than that. You can totally side-step extra operational complexity if you know this fact (and the change is kinda small). I'm not saying don't make copies or backups; rather, that SQL is VERY friendly to mess with!
The major issue is when you turn your powerful RDBMS into a pathetic datastore, with a lot of MongoDB-style data/schemas, anti-patterns everywhere, and close-to-zero understanding of (advanced) SQL: usage of views, indexes, triggers, TEMP tables (note this!), administrative commands, writing your own functions, knowing what tools the db gives you for free, etc.
It sounds like a lot, but even a "complex" RDBMS like Postgres requires far less study than learning JS. It's just a matter of understanding that RDBMSs are not "dumb" datastores.
---
I forgot to make this actionable. Most RDBMSs have a comparable way to do this:
CREATE TEMP TABLE _address ON COMMIT DROP AS
SELECT customer_code, address_code, is_default FROM address;
Do the changes there. Your temp tables are "FREE" to get ruined. Then apply back with an update/insert from the temp table:
UPDATE address
SET ...
FROM _address
WHERE address.address_code = _address.address_code;
You can ALSO copy to another schema (PG) or db (SQLite) and make the mess there:
CREATE SCHEMA fix;
CREATE TABLE fix.name AS
SELECT ...;
--or
CREATE DATABASE fix WITH TEMPLATE old;
And this means you can also "rename" the old db/schema, make a copy of it, then check whether the copy was OK; if not, drop the schema and rename again. This is how I've done some of the most nerve-wracking fixes!