by fcmam5 on 7/30/23, 12:03 PM with 91 comments
by Animats on 7/31/23, 7:52 PM
- When your device is in use in the field, the user will be too hot, too cold, too windy, too dark, too tired, too wet, too rushed, or under fire. Mistakes will be made. Design for that environment. Simplify controls. Make layouts very clear. Military equipment uses connectors which cannot be plugged in wrong, even if you try to force them. That's why. (Former USMC officer.)
- Make it easy to determine what's broken. Self-test features are essential. (USAF officer.)
- If A and B won't interoperate, check the interface specification. Whoever isn't compliant with the spec is wrong. They have to fix their side. If you can't decide who's wrong, the spec is wrong. This reduces interoperability from an O(N^2) problem to an O(N) problem; the arithmetic is sketched after this comment. (DARPA program manager.)
- If the thing doesn't meet spec, have Q/A put a red REJECTED tag on it. The thing goes back, it doesn't get paid for, the supplier gets pounded on by Purchasing and Quality Control, and they get less future business. It's not your job to fix their problem. (This was from an era when DoD customers had more clout with suppliers.)
- There are not "bugs". There are "defects". (HP exec.)
- Let the fighter pilot drive. Just sit back and enjoy the world zooming by. (Navy aviator.)
Aerospace is a world with many hard-ass types, many of whom have been shot at and shot back, have landed a plane in bad weather, or both.
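To spell out the arithmetic behind that O(N^2)-to-O(N) claim (the numbers below are my own illustration, not from the comment): with N systems, debugging every pairing grows quadratically, while checking each system against a single spec grows linearly.

    \[
      \binom{N}{2} = \frac{N(N-1)}{2} \ \text{pairwise combinations}
      \quad\text{vs.}\quad
      N \ \text{conformance checks against one spec}
    \]
    % e.g. N = 20 systems: 190 pairings to argue over, but only 20 spec checks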
by gbacon on 7/31/23, 6:57 PM
Instrument training in FAA-land requires learners to understand the five hazardous attitudes: anti-authority ("the rules don't apply to me"), impulsivity ("gotta do something now!"), invulnerability ("I can get away with it"), macho ("watch this!"), and resignation ("I can't do anything to stop the inevitable"). Although the stakes are different, these attitudes apply to software development too. Before a situation gets out of hand, the pilot has to recognize and label the thought and then think of its antidote, e.g., "the rules are there to keep me safe" for anti-authority.
Part 121 (scheduled airline) operations owe their safety record to many layers of redundancy. Two highly trained and experienced pilots are in the cockpit talking to a dispatcher on the ground, for example. They're looking outside and also have Air Traffic Control watching out for them. The author mentioned automation: this is an area where DevSecOps pipelines can add lots of redundancy by leaving the tedious tasks to the machines that are good at them. As in the cockpit, it's important to understand and manage the automation rather than follow the magenta line right into cumulogranite.
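As a concrete (and entirely hypothetical) example of that kind of redundancy, a pre-merge gate can be a few lines of Python that run the tedious checks the same way every time. The tool names here (ruff, pytest, pip-audit) are assumptions; substitute whatever your project actually uses.

    # Hypothetical pre-merge gate: the tedious, repeatable checking that
    # belongs to the machine, not the human. Tool choices are assumptions.
    import subprocess
    import sys

    CHECKS = [
        ["ruff", "check", "."],   # lint: catches the "too tired" mistakes
        ["pytest", "-q"],         # tests: a redundant pair of eyes on behavior
        ["pip-audit"],            # dependency audit: a security layer
    ]

    def main() -> int:
        failures = []
        for cmd in CHECKS:
            print(f"--> running: {' '.join(cmd)}")
            result = subprocess.run(cmd)
            if result.returncode != 0:
                failures.append(cmd[0])
        if failures:
            print(f"REJECTED: {', '.join(failures)} failed; fix before merging.")
            return 1
        print("All checks passed.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The point is not the specific tools but that the machine repeats the checklist identically every time, while a human stays responsible for understanding what it checks.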
by WalterBright on 7/31/23, 9:47 PM
The hydraulic actuators (rams) have an input and an output port. Connecting the hydraulic lines to the wrong port results in control reversal. To defend against that:
1. One port has left handed threads, the other right handed threads
2. The ports are different sizes
3. The ports are color coded
4. The lines cannot be bent to reach the wrong port
5. Any work on it has to be checked, tested, and signed off by another mechanic
And finally:
6. Part of the preflight checklist is to verify that the control surfaces move the right way
I haven't heard of a control reversal on airliners built this way, but I have heard of it happening in older aircraft after an overhaul.
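A software analogue of "the lines cannot be bent to reach the wrong port" is to make the wrong hookup unrepresentable to the type checker. A minimal sketch in Python; the names (InputPort, OutputPort, connect) are mine, purely illustrative.

    # Make the wrong connection impossible to express, the way mismatched
    # threads make it impossible to torque down the wrong line.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class InputPort:
        actuator_id: str

    @dataclass(frozen=True)
    class OutputPort:
        actuator_id: str

    def connect(line_from: OutputPort, line_to: InputPort) -> None:
        """Only an output-to-input connection type-checks; swapping the
        arguments is rejected by a static type checker before anything moves."""
        print(f"connected {line_from.actuator_id} -> {line_to.actuator_id}")

    connect(OutputPort("ram-1"), InputPort("ram-1"))    # OK
    # connect(InputPort("ram-1"), OutputPort("ram-1"))  # flagged by mypy/pyright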
by WalterBright on 7/31/23, 9:59 PM
Safe Systems from Unreliable Parts https://www.digitalmars.com/articles/b39.html
Designing Safe Software Systems Part 2 https://www.digitalmars.com/articles/b40.html
by KolmogorovComp on 7/31/23, 6:44 PM
Nit: A is written as Alfa in the NATO alphabet [0] because that spelling makes the pronunciation easier to get right. For the same reason, J is written as Juliett (two t's), because in some languages a single final t would be silent.
by LorenPechtel on 7/31/23, 6:15 PM
And I hate systems that don't let you say "ignore *this* warning" without turning off all warnings. I have some Tile trackers with dead batteries--but there's no way I can tell the app to ignore *that* dead battery yet tell me about any new ones that are growing weak. (We haven't been using our luggage, why should I replace the batteries until such day as the bags are going to leave the house again?)
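What's being asked for here is per-item suppression rather than a global mute. A rough sketch of that shape, with hypothetical names (Tracker, snoozed, low_battery_alerts):

    # "Ignore *this* warning" without turning off all warnings: alerts are
    # suppressed per item, never globally. All names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Tracker:
        tracker_id: str
        battery_pct: int

    snoozed: set[str] = {"luggage-tag-1"}   # the bag that never leaves the house

    def low_battery_alerts(trackers: list[Tracker], threshold: int = 20) -> list[str]:
        """Alert on weak batteries, except for trackers the user explicitly snoozed."""
        return [
            t.tracker_id
            for t in trackers
            if t.battery_pct < threshold and t.tracker_id not in snoozed
        ]

    fleet = [Tracker("luggage-tag-1", 0), Tracker("keys", 15), Tracker("wallet", 80)]
    print(low_battery_alerts(fleet))   # ['keys'] -- the known-dead one stays quiet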
by warner25 on 7/31/23, 8:02 PM
I fit this myself: I grew up playing flight simulators, studied computer science as an undergrad, was a military helicopter pilot for a while, and then went to grad school for computer science. Along the way, I've personally met at least half a dozen other academic computer scientists with a pilot's license or military aviation background. Is it just selective attention / frequency illusion for me, or is there more to this?
by hcarvalhoalves on 7/31/23, 8:45 PM
This is important, but I'm not sure everybody necessarily agrees on what "fail safely" means.
Fail safely can mean one or more of:
- It doesn't fail silently
- It doesn't cause cascading failures
- It doesn't cause infinite failure loops
- It doesn't fail in ways that corrupt data
- It doesn't fail in ways you lose money
- You can safely retry
- You can safely retry anytime (not just today, or just this month); see the idempotency sketch below
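One concrete reading of the last two bullets is idempotency: if every mutating request carries a client-chosen key, replaying it later cannot double its effect. A minimal sketch, with hypothetical names (charge, _ledger, idempotency_key):

    # Replaying the same request next week charges the account once, not twice.
    _ledger: dict[str, str] = {}   # idempotency_key -> result; durable in real life

    def charge(account: str, cents: int, idempotency_key: str) -> str:
        """Apply a charge at most once per idempotency key."""
        if idempotency_key in _ledger:
            return _ledger[idempotency_key]          # retry: return the old result
        result = f"charged {account} {cents} cents"  # the one-and-only side effect
        _ledger[idempotency_key] = result
        return result

    print(charge("acct-42", 500, "order-9000"))   # performs the charge
    print(charge("acct-42", 500, "order-9000"))   # safe retry, even months later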
by jacquesm on 8/1/23, 12:41 AM
My own contribution is to recommend reading the RISKS Digest:
by deathanatos on 8/1/23, 12:05 AM
A lot of the difficulty boils down to an inverse NIH syndrome: we outsource monitoring and alerting … and the systems out there are, quite frankly, pretty terrible. We struggle with alert routing, because alert routing should really be a function that takes alert data in and figures out what to do with it … but PagerDuty doesn't support that. Datadog (monitoring) struggles (struggles) with sane units, and IME with aliasing. DD will also alert on things that … don't match the alert criteria? (We've still not figured that one out.)
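For what it's worth, here is a sketch of the "routing should just be a function" idea the comment wishes the vendors supported; Alert, Action, and route are hypothetical names, not any vendor's API.

    # Routing policy as plain code: testable, reviewable, version-controlled.
    from dataclasses import dataclass, field

    @dataclass
    class Alert:
        service: str
        severity: str                       # e.g. "page", "ticket", "info"
        tags: dict[str, str] = field(default_factory=dict)

    @dataclass
    class Action:
        target: str                         # who or what gets notified
        channel: str                        # "pager", "slack", "email"

    def route(alert: Alert) -> Action:
        """Turn alert data into a notification decision."""
        if alert.severity == "page" and alert.tags.get("env") == "prod":
            return Action(target=f"oncall-{alert.service}", channel="pager")
        if alert.severity == "page":
            return Action(target=f"team-{alert.service}", channel="slack")
        return Action(target="triage-queue", channel="email")

    print(route(Alert("billing", "page", {"env": "prod"})))   # pages the on-call
    print(route(Alert("billing", "info")))                    # quietly files it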
“Aviate, Navigate, Communicate” definitely is a good idea, but let me know if you figure out how to teach people to communicate. Many of my coworkers lack basic Internet etiquette. (And I'm pretty sure "netiquette" died a long time ago.)
The Swiss Cheese model isn't just about having layers to prevent failures. The inverse axiom is where the fun starts: the only failures you see, by definition, are the ones that go through all the holes in the cheese simultaneously. If they didn't, a layer of Swiss would have stopped the outage. That means "how can this be? N different things would have to be going wrong, all at the same time" isn't really an out in an outage: yes, exactly, by definition! All of this also assumes you know which holes are in your cheese, and often the cheese has far more holes than people seem to think.
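A back-of-the-envelope version of that axiom, with made-up numbers and the optimistic assumption that the layers fail independently:

    \[
      P(\text{outage}) \;=\; \prod_{i=1}^{4} p_i \;=\; 0.1^4 \;=\; 10^{-4}
    \]
    % four layers, each missing a given fault 10% of the time: the outages you
    % actually see are, by construction, the freak alignments. Correlated holes
    % (shared configs, shared assumptions) push the real number far higher.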
I'm always going to hard disagree with runbooks, though. Most failures are of the "it's a bug" variety: there is no possible way to write the runbook for them. If you can write a runbook, you're already aware of the bug: fix the bug instead. The rest are bugs you're unaware of, and writing a runbook for those would require clairvoyance. (There are limited exceptions: sometimes you cannot fix the bug, e.g., when it lies in a vendor's software and the vendor refuses to do anything about it¹; then you're just stuck and have to write down the next-best workaround, particularly if the workaround is hard to automate.) There are other pressures, like PMs who don't give devs the time to fix bugs, but in general runbooks are a drag on productivity: they're manual processes you follow in lieu of a working system. Be pragmatic about when you take them on (if you can).
> Have a “Ubiquitous language”
This one, this one is the real gem. I beg of you, please, do this. A solid ontology prevents bugs.
This gets back to the "teach communication" problem, though. I work with devs who seem to derive pleasure from inventing new terms to describe things that already have terms. Communicating with them is a never ending game of grabbing my crystal ball and decoding WTF it is they're talking about.
Also, I know the NATO alphabet (I'm not military/aviation). It is incredibly useful, and takes like 20-40 minutes of attempting to memorize it to get it. It is mind boggling that customer support reps do not learn this, given how shallow the barrier to entry is. (They could probably get away with like, 20 minutes of memorization & then learn the rest just via sink-or-swim.)
(I also have what I call malicious-NATO: "C, as in sea", "Q, as in cue", "I, as in eye", "R, as in are", "U, as in you", "Y, as in why")
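For anyone who wants to pick it up, the whole alphabet fits in one small lookup table; a toy helper in Python (spell() is a made-up name, for illustration only):

    # The ICAO/NATO spelling alphabet, in one dict.
    NATO = {
        "A": "Alfa", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
        "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
        "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
        "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
        "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray", "Y": "Yankee",
        "Z": "Zulu",
    }

    def spell(word: str) -> str:
        """Spell a word the way a support rep should read it back."""
        return " ".join(NATO.get(ch.upper(), ch) for ch in word)

    print(spell("gbacon"))   # Golf Bravo Alfa Charlie Oscar November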
> Don’t write code when you are tired.
Yeah, don't: https://www.cdc.gov/niosh/emres/longhourstraining/impaired.h...
And yet I regularly encounter orgs or people suggesting that deployments should occur well past the 0.05% BAC equivalent mark. "Unlimited PTO" … until everyone inevitably desires Christmas off and then push comes to shove.
Some of this intertwines with common PM failure modes, too: I have, any number of times, been pressed for time estimates on projects where we don't have a good estimate because there are too many unknowns. (Typically because whoever is PM really hasn't done the first part of their job, which is having even the foggiest understanding of what's actually involved, inevitably because the PM is non-technical. Having seen a computer is not technical.) When the work is then broken out and estimates assigned to the pieces, the total is rejected because PMs/management don't like the number. Then, inevitably, a date is chosen at random by management. (And the number of times I've had a Saturday chosen is absurd, too.) And then the deadline is missed. Sometimes projects skip right to the arbitrary-deadline step, which at least cuts out some pointless debate about whether, yes, what you're proposing really is that complicated.
That's stressful, PMs.
¹ cough Azure cough excuse me.