by vii on 10/27/20, 3:20 AM with 332 comments
by csmattryder on 10/27/20, 2:26 PM
I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.
by xamuel on 10/27/20, 1:51 PM
by harikb on 10/27/20, 6:36 PM
by roydivision on 10/27/20, 2:34 PM
https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...
by dgritsko on 10/27/20, 1:48 PM
by luhn on 10/27/20, 6:57 PM
by jasonpeacock on 10/27/20, 4:25 PM
Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.
Then you switch back to the original task that the confirmation box interrupted, and only then do you realize you made a mistake.
It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
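A minimal sketch of the "delay the command" flavor of undo, in Python. The class name `DelayedAction` and its methods are made up for illustration; this is not how GMail implements "Undo Send", just one way to hold an action for a grace period during which it can still be cancelled.

```python
import threading


class DelayedAction:
    """Run `action` after `delay_seconds` unless undo() is called first.

    Hypothetical helper for illustration only.
    """

    def __init__(self, action, delay_seconds):
        self._action = action
        self._lock = threading.Lock()
        self._state = "pending"  # pending -> done | undone
        self._timer = threading.Timer(delay_seconds, self._fire)
        self._timer.daemon = True
        self._timer.start()

    def _fire(self):
        # Claim the action under the lock so undo() cannot race with it.
        with self._lock:
            if self._state != "pending":
                return
            self._state = "done"
        self._action()

    def undo(self):
        """Cancel the pending action. Returns True only if it was still pending."""
        with self._lock:
            if self._state != "pending":
                return False
            self._state = "undone"
        self._timer.cancel()
        return True
```

The caller gets an honest answer from `undo()`: True means the action never happened, False means the grace period had already run out.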
by bronco21016 on 10/27/20, 8:04 PM
In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.
I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.
I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
by illumin8 on 10/27/20, 5:51 PM
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
by educationcto on 10/27/20, 2:02 PM
Terraform will perform the following actions:
# google_compute_instance.vm_instance will be created
+ resource "google_compute_instance" "vm_instance" {
+ ... <more>
Plan: 2 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
by remram on 10/27/20, 8:24 PM
[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...
by Darkphibre on 10/27/20, 9:00 PM
To all machines. Employee and servers alike.
Yes. Including the DNS servers.
Took them a day or two to work out how to roll that one back.
by tialaramex on 10/27/20, 11:35 PM
1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.
This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?
But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.
I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".
2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.
Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, the reason was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
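To make point 1 concrete, here is a rough Python sketch of force-with-lease applied to a "don't touch these machines" hold. `ConfigHold`, `LeaseMismatch`, and `force_with_lease` are hypothetical names, not the API of any real configuration management tool. The idea is a compare-and-swap: the override only goes through if the hold is exactly the one the operator said they were overriding.

```python
import threading


class LeaseMismatch(Exception):
    """Raised when the hold in place is not the one the caller expected."""


class ConfigHold:
    """A 'do not touch' marker with a human-readable reason (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.reason = None  # None means no hold is in place

    def place(self, reason):
        with self._lock:
            self.reason = reason

    def force_with_lease(self, expected_reason, apply_change):
        # Check and apply under one lock, so the hold cannot change
        # between the moment we compare it and the moment we override it.
        with self._lock:
            if self.reason != expected_reason:
                raise LeaseMismatch(
                    f"expected hold {expected_reason!r}, found {self.reason!r}"
                )
            return apply_change()
```

If someone replaces the hold with "emergency fix. Call Jerry" while you were reading the old one, the override is rejected instead of silently trampling state you never saw.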
by vondur on 10/27/20, 4:10 PM
by kbenson on 10/27/20, 7:48 PM
Automation to the point of minimal human contact, where you assume the human will read the presented information and make an informed decision, doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like the ones she proposes is definitely a step in the right direction, IMO.
by rossjudson on 10/27/20, 8:41 PM
by gabeio on 10/27/20, 8:24 PM
by jaclaz on 10/27/20, 3:14 PM
How many/which companies have more than one million Linux machines?
by Ayesh on 10/27/20, 4:56 PM
I have to type "danger" to bypass this restriction, and I thought it was pretty cool.
Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.
by ineedasername on 10/28/20, 12:43 AM
I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.
I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."
by aqme28 on 10/27/20, 9:49 PM
> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "
It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
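Roughly what I mean, as a Python sketch. The function names are made up, and the comma-separated display is just one locale's convention picked for the example.

```python
import re


def format_count(n):
    """Show the count with thousands separators so a human can read it."""
    return f"{n:,}"


def confirms(expected, typed):
    """Accept the typed confirmation if its digits match the expected count.

    Everything that is not a digit (commas, dots, spaces) is stripped
    before comparing, rather than being rejected.
    """
    digits = re.sub(r"\D", "", typed)
    return digits == str(expected)
```

With this, "123,456", "123 456" and "123456" all confirm a count of 123456, while "12345" or an empty string does not.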
by nemo1618 on 10/27/20, 3:34 PM
by tigger0jk on 10/27/20, 6:41 PM
by cle on 10/28/20, 2:05 AM
by rcarmo on 10/28/20, 7:59 AM
Or to all the machines, on one occasion.
(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)
by lqet on 10/27/20, 10:19 PM
by temporallobe on 10/27/20, 9:46 PM
by TravHatesMe on 10/27/20, 7:46 PM
by willvarfar on 10/27/20, 3:28 PM
Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.
by mcintyre1994 on 10/28/20, 6:47 AM
by heelix on 10/27/20, 8:36 PM
by Cthulhu_ on 10/27/20, 3:00 PM
by woliveirajr on 10/27/20, 9:24 PM
Edit: added name of software
by andrewfromx on 10/27/20, 7:37 PM
by sidpatil on 10/27/20, 3:04 PM
by vsnf on 10/27/20, 3:18 PM
The result of one too many mindlessly accidental pushes.
by regularfry on 10/28/20, 1:50 PM
by diebeforei485 on 10/28/20, 3:56 AM
by gitgud on 10/28/20, 2:27 AM
A few places!? What is an example of this?
by ComodoHacker on 10/27/20, 7:40 PM
by bnastic on 10/27/20, 10:54 PM
by Animats on 10/27/20, 6:41 PM
by larrik on 10/27/20, 2:32 PM
by RobRivera on 10/27/20, 8:59 PM
by wotton on 10/27/20, 10:47 PM
by konjin on 10/27/20, 8:20 PM
by eznzt on 10/27/20, 4:25 PM
by jerf on 10/27/20, 6:29 PM
Looks like https vs http link.
by jancsika on 10/27/20, 2:21 PM
Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.
by JoeAltmaier on 10/27/20, 2:23 PM
by outworlder on 10/27/20, 10:58 PM
"Do you care? (Y/N)"
Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously; do a rolling change instead.
Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).
If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.
In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).
The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.
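As a sketch of the "rolling, not all at once" part, in Python: `rolling_apply`, `apply`, and `healthy` are hypothetical names, and a real rollout would also want timeouts and a rollback path. The change goes out in small batches and stops at the first batch that does not come back healthy, so a bad change never reaches the whole fleet.

```python
def rolling_apply(machines, apply, healthy, batch_size):
    """Apply a change batch by batch, halting on the first unhealthy batch.

    Returns (touched, ok): the machines that were changed, and whether
    the rollout completed without a failed health check.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    touched = []
    for start in range(0, len(machines), batch_size):
        batch = machines[start:start + batch_size]
        for machine in batch:
            apply(machine)
        touched.extend(batch)
        # Stop before the next batch if anything in this one looks wrong.
        if not all(healthy(machine) for machine in batch):
            return touched, False
    return touched, True
```

The escape hatch here is simply that the blast radius is capped at one batch.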