from Hacker News

Type in the exact number of machines to proceed

by vii on 10/27/20, 3:20 AM with 332 comments

by csmattryder on 10/27/20, 2:26 PM
I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.
I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.
[1] https://en.wikipedia.org/wiki/Pointing_and_calling
by xamuel on 10/27/20, 1:51 PM
I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."
by harikb on 10/27/20, 6:36 PM
I have a habit of creating cli tools, which potentially do dangerous things, to default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts instead had a cheesy `--do-it` to be non-dry-run. It is annoying as hell to my colleagues, but saved the day many times.
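A minimal sketch of that dry-run-by-default pattern, assuming a hypothetical host-deletion tool (the tool, its `hosts` argument, and the message wording are illustrative, not taken from the commenter's scripts). Only `--do-it` comes from the comment above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Delete hosts (hypothetical tool).")
    parser.add_argument("hosts", nargs="+", help="hosts to act on")
    # Dry-run is the default; the destructive path must be requested explicitly.
    parser.add_argument("--do-it", action="store_true",
                        help="actually perform the changes")
    return parser


def run(argv):
    """Return the lines the tool would print for the given arguments."""
    args = build_parser().parse_args(argv)
    prefix = "DELETING" if args.do_it else "[dry-run] would delete"
    # A real tool would perform the deletion here when args.do_it is set.
    return [f"{prefix} {host}" for host in args.hosts]
```

With this shape, forgetting a flag produces a harmless preview rather than an irreversible action.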
by roydivision on 10/27/20, 2:34 PM
Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes.
https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...
by dgritsko on 10/27/20, 1:48 PM
Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.
by luhn on 10/27/20, 6:57 PM
One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped command removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3.
[1] https://aws.amazon.com/message/41926/
by jasonpeacock on 10/27/20, 4:25 PM
Raskin talks about the futility of this in his book The Humane Interface.
Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.
Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.
It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
by bronco21016 on 10/27/20, 8:04 PM
It amazes me that something like this can be done by a single person.
In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.
I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.
I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
by illumin8 on 10/27/20, 5:51 PM
This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: https://aws.amazon.com/message/41926/
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

by educationcto on 10/27/20, 2:02 PM

Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described but at least prevents some types of fat-fingering. Basically all changes with Terraform are risky as they usually involve bringing up and down infrastructure.

    Terraform will perform the following actions:

      # google_compute_instance.vm_instance will be created
      + resource "google_compute_instance" "vm_instance" {
      + ... <more>

    Plan: 2 to add, 0 to change, 0 to destroy.

    Do you want to perform these actions?
      Terraform will perform the actions described above.
      Only 'yes' will be accepted to approve.

      Enter a value: yes

by remram on 10/27/20, 8:24 PM
A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd.
[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...
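The hostname check can be sketched roughly as follows. This is not molly-guard's actual implementation; the function name `confirm_hostname` and the choice to compare only the short, lowercased name are assumptions made for illustration.

```python
import socket


def confirm_hostname(typed, actual=None):
    """Return True only if the typed name matches this machine's hostname."""
    if actual is None:
        actual = socket.gethostname()
    # Assumption: compare the short name (before the first dot),
    # case-insensitively, ignoring surrounding whitespace.
    short = actual.split(".")[0].lower()
    return typed.strip().lower() == short
```

A wrapper around a shutdown command would call this with whatever the operator typed and refuse to continue on a mismatch.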
by Darkphibre on 10/27/20, 9:00 PM
Reminds me of when the Fortune 50 company (150k employees) I worked for rolled out new firewall restrictions that blocked the DNS port.
To all machines. Employee and servers alike.
Yes. Including the DNS servers.
Took them a day or two to work out how to roll that one back.
by tialaramex on 10/27/20, 11:35 PM
So, related obviously correct designs:
1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.
This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?
But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.
I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".
2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.
Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, the reason was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
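To make the force-with-lease idea from point 1 concrete for the config-management case, here is a rough sketch. `HoldStore`, `LeaseMismatch`, and the method names are hypothetical and do not come from Git or any real tool; the point is only that the override carries the state the caller believes it is overriding.

```python
class LeaseMismatch(Exception):
    """Raised when the hold being overridden is not the one the caller saw."""


class HoldStore:
    """Hypothetical store of per-machine 'do not touch' notes."""

    def __init__(self):
        self._holds = {}

    def set_hold(self, machine, reason):
        self._holds[machine] = reason

    def force_with_lease(self, machine, expected_reason, apply_change):
        """Apply the change only if the current hold is the expected one."""
        actual = self._holds.get(machine)
        if actual != expected_reason:
            raise LeaseMismatch(
                f"{machine}: expected hold {expected_reason!r}, found {actual!r}"
            )
        return apply_change(machine)
```

This in-memory version is not safe across processes; a real system would need the check and the change to happen atomically, for example under a lock or via a compare-and-swap in the backing store.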
by vondur on 10/27/20, 4:10 PM
That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014. https://it.slashdot.org/story/14/05/17/051214/emory-universi...
by kbenson on 10/27/20, 7:48 PM
This is a topic near and dear to my heart, as I'm often that person arguing to make something slightly less automated because the small trade-off in time is insurance against some of the worst mistakes you can make. Automation to the point of removing humans leads to stupid problems that a human wouldn't cause if they looked at what was going on. So we automate to the point where we minimize human contact, presenting a summary of actions that as humans we can apply our wonderful brains to and prevent those problems. Except some percentage of the time we don't actually pay attention, and depending on how the human interaction was introduced instead of complete automation, some percentage (or multiple!) of errors still sneak through.
Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like the ones she proposes is definitely a step in the right direction, IMO.
by rossjudson on 10/27/20, 8:41 PM
This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)
by gabeio on 10/27/20, 8:24 PM
I do like this idea, this is I assume why github makes you type the repo name out in full. I wish AWS followed suit, when deleting any RDS (database) instance on AWS all you have to type is "delete me"... very easy to copy and paste as well as just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.
by jaclaz on 10/27/20, 3:14 PM
Side question.
How many/which companies have more than one million Linux machines?
by Ayesh on 10/27/20, 4:56 PM
I have an old laptop with a dead battery, and for a BIOS upgrade, it prevents me from updating without 50% battery.
I have to type "danger" to bypass this restriction, and I thought it was pretty cool.
Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.
by ineedasername on 10/28/20, 12:43 AM
Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records.
I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.
I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."
by aqme28 on 10/27/20, 9:49 PM
Nitpicking
> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "
It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
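A small sketch of the lenient check suggested here, with a hypothetical function name `matches_count`: drop everything that is not a digit and compare what remains against the expected machine count.

```python
import re


def matches_count(typed, expected):
    """Accept the typed confirmation if its digits equal the expected count."""
    # Remove separators such as ",", ".", or spaces; keep only digits.
    digits = re.sub(r"\D", "", typed)
    return digits != "" and int(digits) == expected
```

The trade-off is that this also accepts odd inputs such as "1-2-3" for 123, which seems acceptable for a confirmation step but would not be for general number parsing.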
by nemo1618 on 10/27/20, 3:34 PM
Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"
by tigger0jk on 10/27/20, 6:41 PM
I've typically used pdsh https://github.com/chaos/pdsh for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.
by cle on 10/28/20, 2:05 AM
Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.
by rcarmo on 10/28/20, 7:59 AM
This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because.
Or to all the machines, on one occasion.
(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)
by lqet on 10/27/20, 10:19 PM
GitHub has been doing this for quite a while now when you try to delete a repository - you have to type in the exact repository name to confirm.
by temporallobe on 10/27/20, 9:46 PM
This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.
by TravHatesMe on 10/27/20, 7:46 PM
Reminds me of a study where a test was given with questions that weren't difficult but where a silly error was likely. Around 85% of participants got at least one question wrong, but when they repeated the same test with a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work, use a terrible font.
by willvarfar on 10/27/20, 3:28 PM
I am so adding this to a query api I have, where it's all too easy to leave off constraints and end up asking for massive data sets by mistake.
Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.
by mcintyre1994 on 10/28/20, 6:47 AM
AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.
by heelix on 10/27/20, 8:36 PM
Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing host file games for Stage/Prod. Was in the room when they demo'ed something, did a restart of the system - and every pager in the room went off. Yah...
by Cthulhu_ on 10/27/20, 3:00 PM
I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.
by woliveirajr on 10/27/20, 9:24 PM
Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed).
Edit: added name of software
by andrewfromx on 10/27/20, 7:37 PM
I've done this before by displaying the Unix epoch time and asking the user to copy/paste that value WITHIN a 3 second window as an env var. i.e. if you up arrow and run the same TIMESTAMP=1603827448 ./foo it won't work because 1603827448 is now way too old.
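The freshness check behind that trick might look like the sketch below. The function name `token_is_fresh` and the injectable `now` parameter are assumptions added so the logic can be exercised without waiting on a real clock; the 3-second window and the `TIMESTAMP` variable come from the comment.

```python
import time


def token_is_fresh(token, now=None, max_age_seconds=3):
    """Return True if token is an epoch timestamp issued within the window."""
    if now is None:
        now = time.time()
    try:
        issued = int(token)
    except (TypeError, ValueError):
        # Missing or non-numeric value: treat as not confirmed.
        return False
    age = now - issued
    # Reject stale values (shell history) and values from the future.
    return 0 <= age <= max_age_seconds
```

A script would pass in `os.environ.get("TIMESTAMP")` and refuse to run when the result is false, printing the current epoch time so the user can retry.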
by sidpatil on 10/27/20, 3:04 PM
Hmm, it's conceptually like a combination of a CAPTCHA and a launch code.
by vsnf on 10/27/20, 3:18 PM
I do this with a git pre-push hook to the main branch of my repositories. It displays a prompt in red and forces me to type in the name of the branch.
The result of one too many mindlessly accidental pushes.
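A hedged sketch of the decision logic such a hook could use, written in Python rather than shell. Git feeds a pre-push hook one line per ref on stdin in the form `<local ref> <local sha> <remote ref> <remote sha>`; the function names, the `ask` callback, and the prompt text below are illustrative assumptions, not the commenter's actual hook.

```python
def protected_branches_pushed(stdin_lines, protected=("main",)):
    """Return the protected branch names that this push would update."""
    pushed = []
    for line in stdin_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # Skip anything that is not a well-formed ref line.
        remote_ref = parts[2]
        for name in protected:
            if remote_ref == "refs/heads/" + name:
                pushed.append(name)
    return pushed


def push_allowed(stdin_lines, ask):
    """Allow the push only if each protected branch name is typed back."""
    for branch in protected_branches_pushed(stdin_lines):
        answer = ask(f"Type the branch name to push to {branch}: ")
        if answer.strip() != branch:
            return False
    return True
```

In a real hook the prompt would have to read from the terminal (for example `/dev/tty`), because stdin is already used by Git, and the script would exit non-zero to block the push.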
by regularfry on 10/28/20, 1:50 PM
I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.
by diebeforei485 on 10/28/20, 3:56 AM
I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.
by gitgud on 10/28/20, 2:27 AM
> "I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million."
A few places!? What is an example of this?
by ComodoHacker on 10/27/20, 7:40 PM
In role-playing games, it's a common practice to confirm deletion of your character by typing in some word, like 'delete' or character name.
by bnastic on 10/27/20, 10:54 PM
Promise Pegasus (thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button
by Animats on 10/27/20, 6:41 PM
Yes. Github does that when you delete a repository. You have to confirm by typing in the name of the repository you are deleting.
by larrik on 10/27/20, 2:32 PM
I've seen this sort of thing in a few places, and I really do think it's a great idea.
by RobRivera on 10/27/20, 8:59 PM
Having babysat my fair share of critical clusters, i support this advice
by wotton on 10/27/20, 10:47 PM
Marketo, the marketing automation platform, does this when you try to do things to large data sets, very useful.
by konjin on 10/27/20, 8:20 PM
Finally the Roman numeral converter I programmed in university will be useful.
by eznzt on 10/27/20, 4:25 PM
Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.
by jerf on 10/27/20, 6:29 PM
https://news.ycombinator.com/item?id=24907002
Looks like https vs http link.
by jancsika on 10/27/20, 2:21 PM
It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag.
Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.
by JoeAltmaier on 10/27/20, 2:23 PM
Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.
by outworlder on 10/27/20, 10:58 PM
> 1221425541 machines will be affected
"Do you care? (Y/N)"
Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.
Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).
If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.
In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).
The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.