from Hacker News

Trying to become a better developer by learning more about aviation

by fcmam5 on 7/30/23, 12:03 PM with 91 comments

  • by Animats on 7/31/23, 7:52 PM

    Minor lessons from time at an aerospace company:

    - When your device is in use in the field, the user will be too hot, too cold, too windy, too dark, too tired, too wet, too rushed, or under fire. Mistakes will be made. Design for that environment. Simplify controls. Make layouts very clear. Military equipment uses connectors which cannot be plugged in wrong, even if you try to force them. That's why. (Former USMC officer.)

    - Make it easy to determine what's broken. Self-test features are essential. (USAF officer.)

    - If A and B won't interoperate, check the interface specification. Whoever isn't compliant with the spec is wrong. They have to fix their side. If you can't decide who's wrong, the spec is wrong. This reduces interoperability from an O(N^2) problem to an O(N) problem. (DARPA program manager.)

    - If the thing doesn't meet spec, have Q/A put a red REJECTED tag on it. The thing goes back, it doesn't get paid for, the supplier gets pounded on by Purchasing and Quality Control, and they get less future business. It's not your job to fix their problem. (This was from an era when DoD customers had more clout with suppliers.)

    - There are not "bugs". There are "defects". (HP exec.)

    - Let the fighter pilot drive. Just sit back and enjoy the world zooming by. (Navy aviator.)

    Aerospace is a world with many hard-ass types, many of whom have been shot at and shot back, have landed a plane in bad weather, or both.
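
    A rough illustration of the O(N^2)-to-O(N) point above, assuming N components that all need to interoperate: without a shared spec, every pair potentially needs its own interoperability agreement; with a spec, each component only has to be shown compliant with it.

      # Illustrative arithmetic only; N = 20 is an arbitrary example.
      N = 20
      pairwise = N * (N - 1) // 2     # every pair negotiates its own interface: 190
      against_spec = N                # each component is checked against the spec: 20
      print(pairwise, against_spec)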

  • by gbacon on 7/31/23, 6:57 PM

    Fun to consider as both a computer scientist and a CFI.

    Instrument training in FAA-land requires learners to understand the five hazardous attitudes: anti-authority ("the rules don't apply to me"), impulsivity ("gotta do something now!"), invulnerability ("I can get away with it"), macho ("watch this!"), and resignation ("I can't do anything to stop the inevitable"). Although the stakes are different, they have applicability to software development. Before a situation gets out of hand, the pilot has to recognize and label a particular thought and then think of the antidote, e.g., "the rules are there to keep me safe" for anti-authority.

    Part 121 (scheduled airline) travel owes its safety record to many layers of redundancy. Two highly trained and experienced pilots are in the cockpit talking to a dispatcher on the ground, for example. They're looking outside and also have Air Traffic Control watching out for them. The author mentioned automation. This is an area where DevSecOps pipelines can add lots of redundancy in a way that leaves machines doing tedious tasks that machines are good at. As in the cockpit, it's important to understand and manage the automation rather than following the magenta line right into cumulogranite.

  • by eschneider on 7/31/23, 5:59 PM

    If you want to become a better developer through aviation, I can't recommend anything more highly than reading through NTSB accident reports. Learn from others the many, many ways small problems and misjudgements become accidents. It'll change the way you build things.
  • by WalterBright on 7/31/23, 9:47 PM

    Control reversals, where the surfaces move opposite to the command, have happened. They nearly always result in a crash. How does Boeing prevent controls from being hooked up backwards?

    The hydraulic actuators (rams) have an input and an output port. Connecting the hydraulic lines to the wrong port results in control reversal. To defend against that:

    1. One port has left handed threads, the other right handed threads

    2. The ports are different sizes

    3. The ports are color coded

    4. The lines cannot be bent to reach the wrong port

    5. Any work on it has to be checked, tested, and signed off by another mechanic

    And finally:

    6. Part of the preflight checklist is to verify that the control surfaces move the right way

    I haven't heard of a control reversal on airliners built this way, but I have heard of it happening in older aircraft after an overhaul.
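
    A loose software analogue of the same idea, sketched with invented class names: make the two ends of a connection distinct types, so hooking them up backwards is rejected by a type checker and, as a second layer, at runtime.

      from dataclasses import dataclass

      @dataclass(frozen=True)
      class PressureLine:      # hypothetical type for the pressure side
          id: str

      @dataclass(frozen=True)
      class ReturnLine:        # hypothetical type for the return side
          id: str

      def connect(pressure: PressureLine, ret: ReturnLine) -> None:
          # Runtime check on top of the static types, in the spirit of the
          # layered defenses listed above.
          if not isinstance(pressure, PressureLine) or not isinstance(ret, ReturnLine):
              raise TypeError("lines connected to the wrong ports")
          print(f"connected {pressure.id} -> {ret.id}")

      connect(PressureLine("P1"), ReturnLine("R1"))    # OK
      # connect(ReturnLine("R1"), PressureLine("P1"))  # flagged by a type checker; raises here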

  • by WalterBright on 7/31/23, 9:59 PM

    As a former Boeing flight controls engineer, I wrote a couple articles about lessons that transfer to software:

    Safe Systems from Unreliable Parts https://www.digitalmars.com/articles/b39.html

    Designing Safe Software Systems Part 2 https://www.digitalmars.com/articles/b40.html

  • by KolmogorovComp on 7/31/23, 6:44 PM

    > NATO Phonetic alphabet (Alpha, Bravo, Charlie…).

    Nit: A is written as Alfa in the NATO alphabet [0] because that spelling makes its pronunciation unambiguous. For the same reason, J is written as Juliett (two t's), because in some languages a single final t can be silent.

    [0] https://en.wikipedia.org/wiki/NATO_phonetic_alphabet

  • by LorenPechtel on 7/31/23, 6:15 PM

    I very much believe in Swiss cheese safety systems. There *will* be errors; the goal is to keep them from becoming catastrophes.

    And I hate systems that don't let you say "ignore *this* warning" without turning off all warnings. I have some Tile trackers with dead batteries--but there's no way I can tell the app to ignore *that* dead battery yet tell me about any new ones that are growing weak. (We haven't been using our luggage, why should I replace the batteries until such day as the bags are going to leave the house again?)

  • by warner25 on 7/31/23, 8:02 PM

    It seems like there are almost daily HN front page items about aviation, and a lot of pilots in the comments. I've wondered about the reasons for such an overlap in interests among people here.

    I fit this myself: I grew up playing flight simulators, studied computer science as an undergrad, was a military helicopter pilot for a while, and then went to grad school for computer science. Along the way, I've personally met at least half a dozen other academic computer scientists with a pilot's license or military aviation background. Is it just selective attention / frequency illusion for me, or is there more to this?

  • by hcarvalhoalves on 7/31/23, 8:45 PM

    > Build for resiliency and designed to fail safely

    This is important, but I'm not sure everybody necessarily agrees on what "fail safely" means.

    "Fail safely" can mean any of:

    - It doesn't fail silently

    - It doesn't cause cascading failures

    - It doesn't cause infinite failure loops

    - It doesn't fail in ways that corrupt data

    - It doesn't fail in ways that lose money

    - You can safely retry

    - You can safely retry anytime (not just today, or just this month)
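
    For the last two points in particular, a common pattern is to pair a bounded retry loop with an idempotency key, so a replayed request can't double-charge or spin forever. A minimal sketch with made-up names (charge, processed) standing in for a real payment call and a durable result store:

      import time

      processed = {}  # stand-in for a durable store of results, keyed by idempotency key

      def charge(idempotency_key: str, amount_cents: int) -> str:
          """Pretend payment call: safe to replay because results are keyed."""
          if idempotency_key in processed:        # replay: return the stored outcome
              return processed[idempotency_key]
          result = f"charged {amount_cents}"      # the real side effect would happen here
          processed[idempotency_key] = result
          return result

      def retry(func, attempts: int = 3, delay_s: float = 0.1):
          """Bounded retries with backoff: no infinite loops, no silent failures."""
          last_error = None
          for i in range(attempts):
              try:
                  return func()
              except Exception as err:            # real code would catch narrower errors
                  last_error = err
                  time.sleep(delay_s * (2 ** i))  # exponential backoff
          raise last_error                        # surface the failure instead of hiding it

      print(retry(lambda: charge("order-42", 1999)))
      print(retry(lambda: charge("order-42", 1999)))  # replay: no second charge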

  • by rad_gruchalski on 7/31/23, 5:59 PM

    This article isn't complete without mentioning DO-178C: Design guidance for aviation software development.
  • by akhayam on 7/31/23, 9:24 PM

    I have taken so much inspiration from the aviation industry when designing and operating software systems. In addition to the inspirations mentioned in this blog, I find the idea of "antifragility" in aviation quite fascinating, where every near miss is studied, documented, and checklisted across the entire aviation industry. This means that every near miss improves the resilience of the entire industry. We need to build similar means of learning from others' mistakes in complex software systems as well.
  • by SoftTalker on 7/31/23, 8:09 PM

    Makes sense if your software is responsible for keeping people alive. Most of us don't need to work to such a standard (thankfully).
  • by jacquesm on 8/1/23, 12:41 AM

    Fantastic thread, this. Thank you fcmam5, I'm bookmarking it for future reference.

    My own contribution is to recommend reading risks digest:

    http://catless.ncl.ac.uk/Risks/

  • by vunderba on 7/31/23, 11:44 PM

    When I started working for more consumer-facing application companies, I tried to adopt the software developer's equivalent of "if you're the developer of a new experimental plane, you're the first to go up in said plane."
  • by maxbond on 7/31/23, 8:36 PM

    I had a similar experience, and found "aviate, navigate, communicate" to be an excellent model for responding to production incidents.
  • by r2on3nge on 8/1/23, 7:30 PM

    This is so fascinating! Here's to continuing to get better and better every day.
  • by deathanatos on 8/1/23, 12:05 AM

    It's a lot of good advice, but IME the next step is "but how do I actually do this?"

    A lot of the difficulty boils down to an inverse NIH syndrome: we outsource monitoring and alerting … and the systems out there are quite frankly pretty terrible. We struggle with alert routing, because alert routing should really be a function that takes alert data in and figures out what to do with it … but PagerDuty doesn't support that. Datadog (monitoring) struggles (struggles) with sane units, and IME with aliasing. DD will also alert on things that … don't match the alert criteria? (We've still not figured that one out.)
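
    A sketch of that "routing should be a function" idea, deliberately vendor-agnostic; the field names and targets below are invented:

      Alert = dict   # e.g. {"service": "billing", "severity": "critical", "env": "prod"}

      def route_alert(alert: Alert) -> str:
          """Routing as plain code: it can be unit tested, reviewed, and versioned."""
          if alert.get("env") != "prod":
              return "drop"                       # nobody gets paged for staging noise
          if alert.get("severity") == "critical":
              return f"page:{alert.get('service', 'unknown')}-oncall"
          return "ticket:backlog"

      # Because it's an ordinary function, the routing logic is testable:
      assert route_alert({"service": "billing", "severity": "critical", "env": "prod"}) == "page:billing-oncall"
      assert route_alert({"service": "billing", "severity": "warning", "env": "staging"}) == "drop"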

    “Aviate, Navigate, Communicate” definitely is a good idea, but let me know if you figure out how to teach people to communicate. Many of my coworkers lack basic Internet etiquette. (And I'm pretty sure "netiquette" died a long time ago.)

    The Swiss Cheese model isn't just about having layers to prevent failures. The inverse axiom is where the fun starts: the only failures you see, by definition, are the ones that go through all the holes in the cheese simultaneously. If they didn't, then by definition, a layer of Swiss has stopped the outage. That means "how can this be? like n different things would have to be going wrong, all at the same time" isn't really an out in an outage: yes, by definition! This, too, of course, assumes you know what holes are in your cheese, and often the cheese is far holier than people seem to think it is.
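
    A toy version of that arithmetic, assuming independent layers and invented catch rates (real incidents rarely honor the independence assumption): if each of four layers stops 90% of faults, only about one in ten thousand gets all the way through, so any outage you do see means every layer missed at once.

      # Illustrative only: independent layers, made-up catch rates.
      catch_rates = [0.9, 0.9, 0.9, 0.9]    # e.g. review, tests, canary, alerting
      p_through = 1.0
      for rate in catch_rates:
          p_through *= (1 - rate)           # a fault must slip past every layer
      print(p_through)                      # ~0.0001: about 1 fault in 10,000 becomes an outage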

    I'm always going to hard disagree with runbooks, though. Most failures are of the "it's a bug" variety: there is no possible way to write the runbook for them. If you can write a runbook, that means you're aware of the bug: fix the bug, instead. The rest is bugs you're unaware of, and to write a runbook would thus require clairvoyance. (There are limited exceptions to this: sometimes you cannot fix the bug, e.g., if the bug lies in a vendor's software and the vendor refuses to do anything about it¹, then you're just screwed and have to write down the next best workaround, particularly if the workaround is hard to automate. There are other pressures, too, like PMs who don't give devs the time to fix bugs.) But in general runbooks are a drag on productivity, as they're manual processes you're following in lieu of a working system. Be pragmatic about when you take them on (if you can).

    > Have a “Ubiquitous language”

    This one, this one is the real gem. I beg of you, please, do this. A solid ontology prevents bugs.

    This gets back to the "teach communication" problem, though. I work with devs who seem to derive pleasure from inventing new terms to describe things that already have terms. Communicating with them is a never ending game of grabbing my crystal ball and decoding WTF it is they're talking about.

    Also, I know the NATO alphabet (I'm not military/aviation). It is incredibly useful, and takes like 20-40 minutes of attempting to memorize it to get it. It is mind boggling that customer support reps do not learn this, given how shallow the barrier to entry is. (They could probably get away with like, 20 minutes of memorization & then learn the rest just via sink-or-swim.)

    (I also have what I call malicious-NATO: "C, as in sea", "Q, as in cue", "I, as in eye", "R, as in are", "U, as in you", "Y, as in why")

    > Don’t write code when you are tired.

    Yeah, don't: https://www.cdc.gov/niosh/emres/longhourstraining/impaired.h...

    And yet I regularly encounter orgs or people suggesting that deployments should occur well past the 0.05% BAC equivalent mark. "Unlimited PTO" … until everyone inevitably desires Christmas off and then push comes to shove.

    Some of this intertwines with common PM failure modes, too: I have, any number of times, been pressed for time estimates on projects where we don't have a good time estimate because there are too many unknowns in the project. (Typically because whoever is PM … really hasn't done their job in the first place of having even the foggiest understanding of what's actually involved, inevitably because the PM is non-technical. Having seen a computer is not technical.) When the work is then broken out and estimates assigned to the broken out form, the total estimate is rejected, because PMs/management don't like the number. Then inevitably a date is chosen at random by management. (And the number of times I've had a Saturday chosen is absurd, too.) And then the deadline is missed. Sometimes, projects skip right to the arbitrary deadline step, which at least cuts out some pointless debate about, yes, what you're proposing really is that complicated.

    That's stressful, PMs.

    ¹ cough Azure cough excuse me.