from Hacker News

Zero Bug Tolerance

by karlerss on 2/24/21, 8:51 AM with 37 comments

by skrebbel on 2/25/21, 4:17 PM
I really like the idea of zero-bug policies but I struggle with them in practice.
For those who do this (post author or anyone else here), how do you deal with low-impact bugs?
As a concrete example, we're building a chat toolkit. One customer observed that in some versions of Firefox, when combining a particular set of features in our product, the scroll position wouldn't be remembered. This was 100% a bug. It's also an edge case of an edge case that likely only happened for this one customer, and even there, had a relatively small impact on UX for a small subset of their users. It was essentially a browser bug, and fixing it would require a big workaround that made one component of our product significantly more complex (and thus more prone to other bugs).
With a zero bug policy, we'd have to fix that before shipping anything else. But it made no business sense to do so, very much in the same sense that building a niche feature used by tiny % of customers tends to make no sense.
But once you let that one fly, there's no zero-bug policy left, right? You can just declare any bug as "not important enough right now" and -poof!- zero bugs! Yay, time to ship features.
For context, I'm talking about a team as small and tight-knit as the author's.
by WesolyKubeczek on 2/25/21, 3:59 PM
I can see how it can work in a very tight-knit, small team. I fail to see how such a policy can work at a bigger company, especially once you have a middle management layer. Once your company grows large enough, you're going to have:
A) Prolonged discussions about what the exact and precise definition of the word "bug" is, and how whatever was released last night and caused mayhem between clients and support was not it
B) Bargaining so my pet stuff is released right after the holidays and I don't look like the unproductive schmuck
C) Using this to nitpick at and get rid of employees someone doesn't like
D) Stack ranking the employees by the number of bugs they let slip past them
E) A full-on war between developers and the QA department
F) Fear of making any progress at all because a bug might creep in
...and of course, anything else you might have seen in your favorite Kafka books, in "Brazil", in "1984", in Lem's "Memoirs Found in a Bathtub", you name it. Of course, in the metrics it's going to look as if the company exceeds any expectations in implementing its zero bug tolerance policy! The managers will work hard on the infographics to show you.
by Alex3917 on 2/25/21, 4:08 PM
At FWD:Everyone we always ensure there are zero known bugs in production at any given time. So if a bug is reported, it's always fixed within 24 hours, and no feature development is done until we're back to zero known bugs. If a user reports a bug before they go to bed, then more often than not they get an email with a postmortem and an explanation of the fix before they wake up. E.g. here is one that I published: https://www.fwdeveryone.com/t/Ebdvx32aSz2DAqxKBpee7w/feature...
IMHO in the long run this saves a lot of time and money. Even bugs with zero user impact can signal some deep misunderstanding about technology, and fixing the problem immediately before it gets replicated everywhere else in the codebase is hugely valuable. Several times there have been cases where there was an extremely inconsequential issue that led to us discovering and fixing all sorts of important bugs that we hadn't even known about.
by tantalor on 2/25/21, 3:57 PM
This is kind of silly. Some bugs are more severe than others. Some bugs cost you money, some do not. How can you prioritize fixing bugs vs. developing features when you have "zero tolerance" for bugs?
by mawise on 2/25/21, 6:25 PM
I recognize this as a valuable counterpoint to the "move fast and break things" ethos, but I disagree with the framing. A while ago I learned about a model of "error budgets" from Google[1] which really resonated with me. You want errors to happen sufficiently infrequently that when the user encounters an issue it usually isn't your fault (instead it's something with the user's hardware, or their ISP, etc). Optimizing beyond this point is a waste of time because you can't eliminate errors that are outside the control of your system. It provides a very well-defined framework for setting the threshold of "does this matter enough to slow down and fix it".
[1]: https://sre.google/sre-book/embracing-risk/
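To make the error budget idea concrete, here is a minimal sketch in Python. It assumes the budget is simply one minus an availability target (SLO) applied to the request count in some window; the function names, the 99.9% target, and the request counts are all made up for illustration and are not taken from the linked chapter.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests allowed in the window for a given SLO target."""
    # Hypothetical helper: budget = (1 - SLO) * traffic in the window.
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> int:
    """Budget left after the failures observed so far (negative = overspent)."""
    return error_budget(slo_target, total_requests) - failed


def should_pause_features(slo_target: float, total_requests: int, failed: int) -> bool:
    """Assumed policy: stop shipping features once the budget is spent."""
    return budget_remaining(slo_target, total_requests, failed) <= 0


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(should_pause_features(0.999, 1_000_000, failed=400))
    print(should_pause_features(0.999, 1_000_000, failed=1_200))
```

The point of the sketch is that "slow down and fix it" becomes a comparison against a number agreed on in advance, rather than a per-bug argument.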
by LegitGandalf on 2/25/21, 6:20 PM
It is useful to think of software change as being a mix of Value, Filler & Chaos. Value being something your customers need & use, Filler being something crafted, but customers say "meh" to, and Chaos being bugs, poor performance, etc.
If you accept that Chaos destroys Value (and it surely does), then it is a no brainer to do workflows that find and kill Chaos.
One value-add pattern that is really helpful for finding Chaos is using software health metrics to find the echoes of Chaos. Much like how we find black holes by looking for gravitational lensing, Chaos can be found by looking at metrics like software response times under Representative Load. Inconsistent response times are an indication of unhealthy contention in the solution (things waiting on other things that are waiting on other things, but some thing is pausing intermittently). Obviously, becoming slower over time is also an indication of poor health.
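As a rough sketch of what such a health check might look like, the Python below flags a set of response-time samples as inconsistent when their coefficient of variation (standard deviation divided by mean) is high, and as slowing down when the later half is noticeably slower than the earlier half. The function names and both thresholds are assumptions picked for illustration, not anything from the comment.

```python
import statistics


def coefficient_of_variation(samples_ms: list[float]) -> float:
    """Spread of the samples relative to their mean (0 = perfectly steady)."""
    mean = statistics.fmean(samples_ms)
    return statistics.pstdev(samples_ms) / mean


def is_inconsistent(samples_ms: list[float], max_cv: float = 0.5) -> bool:
    """Hypothetical threshold: a CV above max_cv hints at contention."""
    return coefficient_of_variation(samples_ms) > max_cv


def is_slowing_down(samples_ms: list[float], max_ratio: float = 1.2) -> bool:
    """Compare the mean of the later half against the earlier half."""
    half = len(samples_ms) // 2
    earlier = statistics.fmean(samples_ms[:half])
    later = statistics.fmean(samples_ms[half:])
    return later / earlier > max_ratio


if __name__ == "__main__":
    steady = [100, 102, 98, 101, 99, 100]
    spiky = [100, 100, 100, 100, 900]  # one intermittent pause
    print(is_inconsistent(steady), is_inconsistent(spiky))
```

In practice the samples would come from a load test that resembles real traffic; the thresholds would need tuning per system.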
Some other useful insights from the Value, Filler & Chaos model are:
• Teams run at 20% value or less. This really has to do with the nature of discovering new, valuable software embodiments. Discovery of new things requires many value attempts, most of which fail, but result in new learning
• Removing unused features is a win because you reduce Filler and sources of Chaos
• Mobile apps taken as a whole run about 1% value (positive revenue), the rest is all Chaos and Filler
• To know if something is Value vs Filler there has to be Traction. Chaos also destroys Traction. The article is a classic case of the team recognizing that Chaos was destroying Traction
by ufmace on 2/25/21, 7:28 PM
This is the kind of thing where everything is a judgement call and arbitrary policies applied strictly become absurd and useless.
It's very possible that this particular team could stand to put a higher priority on fixing bugs before implementing new features. That's ultimately going to be what it is, no matter what they call it. They are free to call it "Zero Bug Tolerance" as long as everybody understands that it's hyperbole and they don't get into endless bikeshedding on what constitutes a bug and if they really should fix it.
It's pretty obvious there will eventually be a bug that's too rare and weird to really troubleshoot, or too niche and complex to bother fixing, or more trouble than it's worth.
by r0s on 2/25/21, 5:20 PM
In my largish company, the CTO announced a similar "zero regressions in production" goal.
I came up with a coverage system that would be a huge improvement and a much tighter testing process, eventually moving "left" up the development pipeline:
https://eratestcoverage.org/
I proposed this and a bunch of other ideas, and the general reaction was flat. My boss said he didn't understand any of it and cut me off trying to explain.
I realize now, the goal set by the CTO was just talk, they had no interest in any real process change. And so, nothing changed.
The concepts are sound, granted they could be better explained, I'm working on it, just not being paid to do so.
by closeparen on 2/25/21, 5:36 PM
Our QA has zero working hours overlap with backend engineering, which has one working hour overlap with mobile engineering. QA’s bug reports never include the relevant IDs on the first pass, so if it’s potentially a backend issue where we need logs, we have to comment on the JIRA and wait 24 hours. It’s amazing. I wish we could have zero bug tolerance or drop everything to fix bugs. Local management would even like to. But a globally distributed cost-conscious company is physically incapable of collaborating that fast. All we could do is sit there and twiddle our thumbs while waiting for our peers around the world to wake up and see our messages. So we work on features.
by S_A_P on 2/25/21, 6:02 PM
I am all for this. I think it is something to strive for. I also think that bug fixing can sometimes take several hundred percent of the original development time. This is where initiatives like this lose steam. The complexity of setting up and reproducing intermittent bugs means explaining to a project manager and/or development manager that the task is going to miss the sprint. Or that the task is taking x number of hours, and this causes said manager to see what the bug actually costs to fix. Then they look at the feature backlog and something has to give. Can’t tell senior leadership that we are missing one of their arbitrary (or even well thought out and pragmatic) deadlines because then the PM worries that they will be perceived as losing control of the project. So priority changes and features are the focus. It’s just the way I have seen things go way too many times, whether I was involved or not.
by benibela on 2/25/21, 6:52 PM
I used to have a zero bug policy in my projects
But it gets hard when the users do not cooperate, and there is nothing to reproduce
In an open-source app I only got 3 bug reports in the bug tracker in nearly 15 years. And they were not really bugs either, one question and two https problems. I hope it is because I had tested any change for months and have thousands of automated tests. Or it is because the users do not find the bug tracker.
I do get a lot of mails. They are all useless. Most common is, "There is an error message 'Invalid password'". Then I reply that the message appears when they enter an invalid password. Then they do not respond. And then I do not know if there is a bug or whether they have entered a wrong password. Then I also test it for a few hours and see it sends exactly the password to the server that was entered
Another project, an HTTP client. Bug report: "untrusted https certificate" on someone else's server. I try it on my system, and it works fine. Then I ask for their OpenSSL version, and do not get an answer. Now what can I do about this? I try it on multiple computers, and it works on all of them
Another open-source project, much more popular. Bug report: it crashes frequently. Because it is much more popular and has competent users, I do not have to do anything about it, the users investigate it themselves. Two months later, the user has extracted the crashing code. They remove as much code as possible, until they obtain a minimal crashing program. It shares zero code with my project. All that remains are calls to an open-source library, and they report it upstream to the developers of the library. I guess there is nothing to do until they fix it there?
On that project I also get emails. They are also useless, because the competent users use the bug tracker. After moving to 64-bit, I get a lot of "it does not start anymore". Guess they use a 32-bit OS
by juancn on 2/26/21, 4:32 PM
It's a waste of resources. Bugs have to be triaged and prioritized. Not all bugs are equal: some are existential threats to the business, others have so little impact that they can be postponed (note that a bug fix has a chance of introducing a new bug, sometimes worse than the original).
What you have to do is quickly triage all bugs, and then make a call on when you're going to address each one.
Some you fix immediately, some you postpone to a scheduled release, some you never fix, just document them.
Before a release, you set a bar on what bugs are acceptable for release, but you make a call on them, involving quality, product and engineering teams.
A company has finite resources, you need to invest wisely.
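A triage rule like the one described above could be sketched as follows. This is only an illustration in Python: the `Bug` fields, the 1 to 5 scales, and every threshold are hypothetical, and a real team would make the call with quality, product and engineering in the room rather than with a formula.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    FIX_NOW = "fix immediately"
    SCHEDULE = "postpone to a scheduled release"
    DOCUMENT = "never fix, just document"


@dataclass
class Bug:
    title: str
    severity: int          # hypothetical scale: 1 (cosmetic) to 5 (existential threat)
    affected_share: float  # estimated fraction of users hit, 0.0 to 1.0
    fix_risk: int          # hypothetical scale: 1 (trivial) to 5 (likely to break more)


def triage(bug: Bug) -> Decision:
    """Assumed rule of thumb, not a standard: severity first, then impact vs. risk."""
    if bug.severity >= 4:
        return Decision.FIX_NOW
    # Tiny impact combined with a risky fix: document it and move on.
    if bug.severity * bug.affected_share < 0.05 and bug.fix_risk >= 4:
        return Decision.DOCUMENT
    return Decision.SCHEDULE


def release_blockers(bugs: list[Bug], bar: int) -> list[Bug]:
    """Bugs at or above the severity bar agreed on before a release."""
    return [b for b in bugs if b.severity >= bar]


if __name__ == "__main__":
    backlog = [
        Bug("data loss on save", severity=5, affected_share=0.3, fix_risk=3),
        Bug("scroll position lost in one browser", severity=2, affected_share=0.001, fix_risk=4),
        Bug("typo in settings page", severity=1, affected_share=0.5, fix_risk=1),
    ]
    for bug in backlog:
        print(f"{bug.title}: {triage(bug).value}")
```

The second backlog entry mirrors the kind of edge case raised earlier in the thread: low impact, risky workaround, so the sketch documents it instead of fixing it.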
by guenthert on 2/25/21, 11:36 PM
Zero bugs sounds neither practical nor necessarily desirable. I still remember being quite happy about the first bug report I received -- only then did I know that the software was actually being used and hadn't vanished into someone's vault.
Time to market is still a thing and incompatible with overzealous bug fixing. I would settle on a zero-regression policy. Then at least you keep happy customers happy and potentially reach fewer bugs in the future.
Of course it depends on the application. An automotive ABS system has different requirements than an IRC server.
by pwinnski on 2/25/21, 8:45 PM
It's easier to have a "strive for 0 bugs" policy after you've already built a bunch of features and attracted paying customers.
"In retrospect, maybe the strategy to reach approximate feature parity real fast was not the optimal one."
Or maybe that strategy was, and usually is, the only way to attract paying customers.
How to balance bug-fixes with new feature development is always a trade-off. It's never as simple as "Zero Bug Tolerance." Unless, I suppose, you're writing software for a space ship or deep sea vehicle.
by jjjeii3 on 2/25/21, 9:10 PM
You don't need to comply with GDPR, unless your business is located in the EU or you have a subsidiary in the EU. There is no legal framework that will allow the EU to enforce GDPR overseas, unless both countries have an agreement. Due to this limitation, some people including Edward Snowden called GDPR a "paper tiger".
by collyw on 2/25/21, 4:40 PM
Zero bugs sounds like the zero covid fantasy that some authoritarians are pushing for.
Zero code is pretty much the only way you can guarantee zero bugs.