from Hacker News

Tests that sometimes fail

by sams99 on 5/28/19, 7:44 AM with 127 comments

by matharmin on 5/28/19, 12:37 PM
We've had a couple of cases of flaky tests failing builds over the last two years at my company. Most often it's browser / end-to-end type tests (e.g. selenium-style tests) that are the most flaky. Many of them only fail in 1-3% of cases, but if you have enough of them the chances of a failing build is significant.
If you have entire builds that are flaky, you end up training developers to just click "rebuild" the first one or two times a build fails, which can drastically increase the time before realizing the build is actually broken.
An important realization is that unit testing is not a good tool for testing flakyness of your main code - it is simply not a reliable indicator of failing code. Most of the time it's the test itself that is flaky, and it's not worth your time making every single test 100% reliable.
Some things we've implemented that helps a lot:
1. Have a system to reproduce the random failures. It took about a day to build tooling that can run say 100 instances of any test suite in parallel in CircleCI, and record the failure rate of individual tests.
2. If a test has a failure rate of > 10%, it indicates an issue in that test that should be fixed. By fixing these tests, we've found a couple of techniques to increase overall robustness of our tests.
3. If a test has a failure rate of < 3%, it is likely not worth your time fixing it. For these, we retry each failing test up to three times. Not all test frameworks support retying out of the box, but you can usually find a workaround. The retries can be restricted to specific tests or classes of tests if needed (e.g. only retry browser-based tests).
by mceachen on 5/28/19, 7:28 PM
Every company I've founded or worked for has struggled with flaky tests.
Twitter had a comprehensive browser and system test suite that took about an hour to run (and they had a large CI worker cluster). Flaky tests could and did scuttle deploys. It was a never-ending struggle to keep CI green, but most engineers saw de-flaking (not just deleting the test) as a critical task.
PhotoStructure has an 8-job GitLab CI pipeline that runs on macOS, Windows, and Linux. Keeping the ~3,000 (and growing) tests passing reliably has proven to be a non-trivial task, and researching why a given task is flaky on one OS versus another has almost invariably led to discovery and hardening of edge and corner conditions.
It seems that TFA only touched on set ordering, incomplete db resets and time issues. There are many other spectres to fight as soon as you deal with multi-process systems on multiple OSes, including file system case sensitivity, incomplete file system resets, fork behavior and child process management, and network and stream management.
There are several aspects I added to stabilize CI, including robust shutdown and child process management systems. I can't say I would have prioritized those things if I didn't have tests, but now that I have it, I'm glad they're there.
by joosters on 5/28/19, 12:50 PM
In an old job, we had a frustrating test that passed well over 99 times in 100. It was shrugged off for a very long time until a developer eventually tracked it down to code that was generating a random SSL key pair. If the first byte of the key was 0, faulty code elsewhere would mishandle the key and the test failed.
Keeping the randomness in the test was the key factor in tracking down this obscure bug. If the test had been made completely deterministic, the test harness would never have discovered the problem. So although repeatable tests are in most cases a good thing, non-determinism can unearth problems. The trick is how to do this without sucking up huge amounts of bug-tracking time...
(Much effort was spent in making the test repeatable during debugging, but of course the crypto code elsewhere was deliberately trying to get as much randomness as it could source...)
by pytester on 5/28/19, 11:07 AM
What I found to be the major reasons for flaky tests:
* Non-determinism in the code - e.g. select without an order by, random number generators, hashmaps turned into lists, etc. - Fixed by turning non-deterministic code into deterministic code, testing for properties rather than outcomes or isolating and mocking the non-deterministic code.
* Lack of control over the environment - e.g. calling a third party service that goes down occasionally, use of a locally run database that gets periodically upgraded by the package manager - fixed by gradually bringing everything required to run your software under control (e.g. installing specific versions without package manager, mocking 3rd party services, intercepting syscalls that get time and replacing them with consistent values).
* Race conditions - in this case the test should really repeat the same actions so that it consistently catches the flakiness.
by roland35 on 5/28/19, 1:08 PM
There was one weird bug reported to me in an microcontroller based project I was recently working on which shut off half the LCD screen. I wrote a test which blasted the LCD screen with random characters and commands and did not see the same error for awhile... but it finally happened during a test! I was able to then see that when I was checking the LCD state between commands I only would toggle the chip select for the first half of the LCD (there were 2 driver chips built into the screen and you had to read each chip individually). There would be no way I could have recreated the bug without automated tests.
I have had to deal with non-deterministic tests with my embedded systems and robotic test suites and have found a few solutions to deal with them:
- Do a full power reset between tests if possible, or do it between test suites when you can combine tests together in suites that don't require a complete clean slate
- Reset all settings and parameters between tests. A lot of embedded systems have settings saved in Flash or EEPROM which can affect all sorts of behaviors, so make sure it always starts at the default setting.
- Have test commands for all system inputs and initialize all inputs to known values.
- Have test modes for all system outputs such as motors. If there is a motor which has a speed encoder you can make the test mode for the speed encoder input to match the commanded motor value, or also be able to trigger error inputs such as a stalled motor.
- Use a user input/dialog option to have user feedback as part of the test (for things like the LCD bug).
Robot Framework is a great tool which can do all these things with a custom Python library! I think testing embedded systems is generally much harder so people rarely do it, but I think it is a great tool which can oftentimes uncover these flaky errors.
by darekkay on 5/28/19, 10:10 AM
Related stories: "unit tests fail when run in Australia" [1] and "the case of the 500-mile email" [2]. There is a whole GitHub repository dedicated to some very interesting debugging stories [3].
[1] https://github.com/angular/angular.js/issues/5017
[2] http://www.ibiblio.org/harris/500milemail.html
[3] https://github.com/danluu/debugging-stories
by zubspace on 5/28/19, 9:43 AM
We call them Flip Floppers.
We do a lot of integration testing, more so than unit testing, and those tests, which randomly fail, are a real headache.
One thing I learned is that setting up tests correctly, independent of each other, is hard. It is even harder if databases, local and remote services are involved or if your software communicates with other software. You need to start those dependencies and take care of resetting their state, but there's always something: Services sometimes take longer to start, file handles not closing on time, code or applications which keeps running when another test fails... etc, etc...
There are obvious solutions: Mocking everything, removing global state, writing more robust test setup code... But who has time for this? Fixing things correctly can even take more time and usually does not guarantee that some new change in the future disregards your correct code...
by lukego on 5/28/19, 10:48 AM
I have learned to love non-deterministic tests.
The world is non-deterministic. A test suite that can represent non-determinism is much more powerful than one that cannot. To paraphrase Dijkstra, "Determinism is just a special case of non-determinism, and not a very interesting one at that."
If a test is non-deterministic then a test framework needs to characterize the distribution of results for that test. For example "Branch A fails 11% (+/- 2%) of the time and Branch B fails 64% (+/- 2%) of the time." Once you are able to measure non-determinism then you can also effectively optimize it away, and you start looking for ways to introduce more of it into your test suites e.g. to run each test on a random CPU/distro/kernel.
by throwaway5752 on 5/28/19, 2:24 PM
Call it a pet peeve, but if we call it "chaos engineering" it costs a ton and gets people conference talks when a sporadic system integration issue is found. But if you have the same thing happen in a plain old CI half the time it will be ignored or flagged flaky.
by mekane8 on 5/28/19, 6:41 PM
As soon as I saw that whole section on database-related flakiness my mind went from "flaky unit tests" to "tests called unit tests that are actually integration tests". I worked on a team where we labored under that misconception for a long, long time. By the time we finally realized that many of the tests in our suite were integration tests and not unit tests it was too late to change (due to budget and timeline pressure).
I really like the different approaches to dealing with these flaky tests, that is a good list.
by jonthepirate on 5/28/19, 2:19 PM
Hi - I'm Jon, creator of "Flaptastic" (https://www.flaptastic.com/) and passionate advocate for unit test health.
Having coded at both Lyft and at DoorDash, I noticed both companies had the exact same unit test health problems and I was forced to manually come up with ways to make the CI/CD reliable in both settings.
In my experience, most people want a turnkey solution to get them to a healthier place with their unit testing. "Flaptastic" is a flaky unit tests recognition engine written in a way that anybody can use it to clean up their flaky unit tests no matter what CI/CD or test suite you're already using.
Flaptastic is a test suite plugin that works with a SAAS backend that is able to differentiate between a unit test that failed due to broken application code versus tests that are failing with no merit and only because the tests are not written well. Our killer feature is that you get a "kill switch" to instantly disable any unit test that you know is unhealthy with an option to unkill it later when you've fixed the problem. The reason is this is so powerful is that when you kill an unhealthy test, you are able to immediately unblock the whole team.
We're now working on a way to accept the junit.xml file from your test suite. We can run it through the flap recognition engine allowing you to make decisions on what you will do next if you know all of the tests that failed did fail due to known flaky test patterns.
If Flaptastic seems interesting, contact us on our chat widget we'll let you use it for free indefinitely (for trial purposes) to decide if this makes your life easier.
by andrey_utkin on 5/28/19, 10:42 AM
At Undo we develop a "software flight recorder technology" - basically think of `rr` reversible debugger, it is our open source competitor.
One particular usecase for Undo (besides obviously recording software bugs per se) is recording execution of tests. Huge time saver. We do this ourselves - when a test fails in CI, engineers can download a recording file of a failing test and investigate it with our reversible debugger.
by bhaak on 5/28/19, 9:57 AM
At our place, we call them "peuteterli" (losely translated: "could-be-ish" constructed from the French "peut être" and slapped on the local German diminutive -li.
For the ID issue I have a monkey patch for Activerecord:
```
      if ["test", "cucumber"].include? Rails.env
        class ActiveRecord::Base
          before_create :set_id

          def set_id
            self.id ||= SecureRandom.random_number(999_999_999)
          end
        end
      end
```
Unique IDs are also helpful when scanning for specific objects during test development. When all objects of different classes start with 1, it is hard to following the connections.
by notacoward on 5/28/19, 12:11 PM
I deal with this issue a lot in my current job, and did in my last job too. IMX timing issues are by far the most common culprit. Usually it's because a test has to guess how long a background repair or garbage-collection activity will take, when in fact that duration can be highly variable. Shorter timeouts mean tests are unreliable. Longer timeouts mean greater reliability but tests that sometimes take forever. Speeding up the background processes can create CPU contention if tests are being run in parallel, making other tests seem flaky. Various kinds of race conditions in tests are also a problem, but not one I personally encounter that often. Probably has to do with the type of software I work on (storage) and the type of developers I consequently work with.
No matter what, developers complain and try to avoid running the tests at all. I'd love to force their hand by making a successful test run an absolute requirement for committing code, but the very fact that tests have been slow and flaky since long before I got here means that would bring development to a standstill for weeks and I lack the authority (real or moral) for something that drastic. Failing that, I lean toward re-running tests a few times for those that are merely flaky (especially because of timing issues), and quarantine for those that are fully broken. Then there's still a challenge getting people to fix their broken tests, but life is full of tradeoffs like that.
by Slartie on 5/28/19, 1:16 PM
We're usually calling them "blinker tests" in our integration test suite. Reasons for blinker tests vary, but most are in line with what others here have already stated: concurrency, especially correct synchronization of test execution with stuff happening in asynchronous parts of the (distributed) system under test, is by far the biggest cause for problematic tests. This one is often exagerrated by the difference in concurrent execution on developer machines with maybe 4-6 cores and the CI server with 50-80, which often leads to "blinking" behavior that never happens locally, but every few builds on the CI server.
Second biggest is database transaction management and incorrect assumptions over when database changes become visible to other processes (which are in some way also concurrency problems, so it basically comes down to that). Third biggest is unintentional nondeterminism in the software, like people assuming that a certain collection implementation has deterministic order, but actually it doesn't, someone was just lucky to get the same order all the time while testing on the dev machine.
by jonatron on 5/28/19, 9:45 AM
"Making bad assumptions about DB ordering" That's caught me out before. Postgres is just weird, I had to run the same test in a loop for an hour before it'd randomly change the order.
by adamb on 5/28/19, 4:29 PM
If anyone is looking for ideas for how to build tooling that fights flaky tests, I consolidated a number of lessons into a tool I open sourced a while ago.
https://github.com/ajbouh/qa
It will do things like separate out different kinds of test failures (by error message and stacktrace) and then measure their individual rates of incidence.
You can also ask it to reproduce a specific failure in a tight loop and once it succeeds it will drop you into a debugger session so you can explore what's going on.
There are demo videos in the project highlighting these techniques. Here's one: https://asciinema.org/a/dhdetw07drgyz78yr66bm57va
by pjc50 on 5/28/19, 10:11 AM
The two big problems seem to be concurrency (always a problem) and state, which immediately suggest that making things as functional as possible would help a lot.
Ideally all state that's used in a test would be reset to a known value at or before the start of the test, but this is quite hard for external non-mocked databases, clocks and so on.
For integration tests, do you run in a controllable "safe" environment and risk false-passes, or an environment as close as possible to production and risk intermittent failure?
A variant I've seen is "compiled languages may re-order floating point calculations between builds resulting in different answers", which is extremely annoying to deal with especially when you can't just epsilon it away.
by rrnewton on 5/28/19, 11:51 PM
Both this article and this comment thread include a number of different ideas regarding controlling (or randomizing) environmental factors: test ordering, system time, etc.
But why do all of this piecemeal? Our philosophy is to create a controlled test sandbox environment that makes all these aspects (including concurrency) reproducible:
https://www.cloudseal.io/blog/2018-04-06-intro-to-fixing-fla...
The idea is to guarantee that any flake is easy to reproduce. If people have objections to that approach, we'd love to hear them. Conversely, if you would be willing to test out our early prototype, get in touch.
by invertednz on 5/28/19, 9:16 PM
I used to work at a company with over 10,000 tests where we weren't able to get more than an 80% pass rate due to flaky tests. This article is great and covers a lot of the options for handling flaky tests. I founded Appsurify to make it easy for companies to handle flaky tests, with minimal effort.
First, don't delete them, flaky tests are still valuable and can still find bugs. We also had the challenge where a lot of the 'flakiness' was not the test or the application's fault but was caused by 3rd party providers. Even at Google "Almost 16% of our tests have some level of flakiness associated with them!" - John Micco, so just writing tests that aren't flaky isn't always possible.
Appsurify automatically raises defects when tests fail, and if the failure reason looks to be 'flakiness' (based on failure type, when the failure occurred, the change being made, previous known flaky failures) then we raise the defect as a "flaky" defect. Teams can then have the build fail based only on new defects and prevent it from failing when there are flaky test results.
We also prioritize the tests, which causes fewer tests to be run which are more likely to fail due to a real defect, which also reduces the number of flaky test results.
by pure-awesome on 5/28/19, 2:11 PM
> A few months back we introduced a game.
> We created a topic on our development Discourse instance. Each time the test suite failed due to a flaky test we would assign the topic to the developer who originally wrote the test. Once fixed the developer who sorted it out would post a quick post morterm.
What's the game here? It just seems like a process. Useful, sure, but not particularly fun...
by boothby on 5/28/19, 11:26 PM
I'm the primary developer for a heuristic, nondeterministic algorithm. It's both production software, and also a neverending research project. Specifically, I can't guarantee that a particular random seed will always produce identical results because that hobbles my ability to make future improvements to the heuristic. I've got reasonable coverage of my base classes and subroutines, but minor changes to the heuristic can have significant impact on the "power" of the heuristic.
My solution was to add a calibrated set of benchmarks. For each problem in the test suite, I measure the probability of failure. From that probability, I can compute the probability of n repeated failures. Small regressions are ignored, but large regressions (p < .001) splat on CI. It's fast enough, accurate enough, and brings peace of mind.
I understand that, and why, engineers hate this. But it's greatly superior to nothing.
by tom-jh on 5/29/19, 8:14 AM
We run in-browser end to end tests for our browser extension. There were several reasons for flakiness:
* Puppeteer (browser automation) bugs or improper use. Certain sequence of events could deadlock it, causing timeouts relatively rarely. The fix was sometimes upgrading puppeteer, sometimes debugging and working around the issue.
* Vendor API, particularly their oauth screen. When they smell automation, they will want to block the requests on security grounds. We have routed all requests through one IP address and reuse browser cookies to minimize this.
* Vendor API again, this time hitting limits on rare situations. We could have less parallel tests, but then you waste more time waiting.
Eventually, we will have to mock up this (fairly complex) API to progress. It's got to a point where I don't feel like adding more tests because they may cause further flakiness - not good.
by mariefred on 5/28/19, 1:13 PM
Flaky tests are indeed a big issue, the main concern being loss of confidence in the results.
The otherwise good advice for randomization has its drawbacks-
- it complicates issue reproduction, especially if the test flow itself is randomized and not just the data
- the same way it catches more issues, it might as well skip some
Something else that was mentioned but not stressed enough is the importance of clean environment as the basis for the test infrastructure.
A cleanup function is nice but using a virtual environment, Docker or a clean VM will save you a lot of debugging time finding environmental issues. The same goes for mocked or simplified elements if they contribute to the reproducibility of the system- a simpler in-memory database can help re creating a clean database for each test instead of reverting for example
by notacoward on 5/28/19, 12:00 PM
Here's a Google testing blog post about the same thing in 2016.
https://testing.googleblog.com/2016/05/flaky-tests-at-google...
by rellui on 5/28/19, 3:29 PM
Personally I've always called them flaky tests. I agree with the article that flaky tests shouldn't be ignored completely. But the issue is they take much more effort than usual test failures to debug. So it comes down to a balancing act of how much effort you're willing to spend debugging these vs the chance that it's an actual issue.
In my few years of automation experience, I've only seen 2 actual instances where the flaky tests were an actual issues and one of them should've been found by performance testing. Almost all of the rest were environment related issues. It's tough testing across all of the different platforms without running into some environment instability.
by mannykannot on 5/28/19, 12:00 PM
Tests are part of the system too, and if you accept lower standards for your test suite than you think you hold the product to, you have actually lowered your standards for the product to those you accept for the tests.
by ArturT on 5/31/19, 2:19 PM
For annoying flaky features tests, I use rspec-retry gem to repeat the test a few times before marking it as failed. It helped for integration tests with external sandbox API.
I noticed discourse had a lot of flaky tests while using their repo to test my knapsack_pro ruby gem to run test suite with CI parallelisation. A few articles with CI examples of parallelisation can be found here https://docs.knapsackpro.com
I need to try the latest version of discourse code, maybe now it will be more stable to run tests in parallel.
by chippy on 5/28/19, 1:39 PM
One recent test that was sometimes failing was ordering a list. It was due to how I made a sequence of my fixtures using numbers as a affix to a string so it was ordering correctly unless e.g. "string 8, string 9, string 10".
I fixed it for me by creating a random selection from /usr/share/dict/words to make a large array of sorted words to choose from. This made the fixtures have better and amusing names such as "string trapezoidal, string understudy"
by boyter on 5/28/19, 10:05 AM
These sort of tests are perfect examples for me to add to https://boyter.org/posts/expert-excuses-for-not-writing-unit... Tongue in cheek it is but I’m always on the lookout for additional examples to flesh it out.
by pavel_lishin on 5/28/19, 5:30 PM
Flaky tests are one of the factors that led me to leave a previous job. Test coverage was already so bad (and honestly, so was the code) that it was difficult to do anything with confidence - add to this that tests sometimes worked meant that writing code was basically a dice-roll. I got tired of the stress.
by piokoch on 5/28/19, 11:18 AM
"Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite."
"To this I would like to add that flaky tests are an incredible cost to businesses."
I think that the misconception here is that "tests should not fail", because they are "cost", "has to be analyzed and fixed", etc.
An integration or functional test that is guaranteed to never fail is kind of useless for me. Good test with a lot of assertions will fail occasionally since things are happening - unexpected data are provided, someone manually played with the database, ntp service was accidentally stopped and date in not accurate and filtering by date might be failing, someone plugged in some additional system that alters/locks data.
In case of unit tests, well, if everything is mocked and isolated then yes, such test probably should never fail, but unit tests are mostly useful only if there is some complicated logic involved.
by rgoulter on 5/28/19, 9:34 AM
"You won't have code like this obviously contrived example, but you might have code which is equivalent."
Ha, yes! The problem sounds super dumb and obvious once you explain it, but can be a PITA to track down or recognise in the code.
by revskill on 5/28/19, 12:35 PM
To me, unit tests only make sense for pure code.
For impure code, it made no sense to make a unit test.
Ability to separate pure vs impure code determines your test suites, where should be put in unit test, where should be put in integration test.
by jdlshore on 5/29/19, 5:44 AM
This is a great article. Grounded in experience, detailed, actionable. Nicely done.