
The Ultimate Guide to Flaky Tests

By Josh Marinacci, August 12, 2024

This is a guide to Flaky Tests. If you are interested in a new solution for dealing with Flaky Tests, please join our private beta for Trunk Flaky Tests.

What are Flaky Tests?

Has this ever happened to you? You fix a few bugs and add a feature or two, push your Pull Request to the CI system that runs all the tests, then go grab a cup of coffee. When you come back you see the build failed, even though it worked on your machine. You dig into the report and see that a test failed, a test completely unrelated to the part of the code you were working on. How did that break? You shrug and click the ‘rerun tests’ button. This time they passed. Musta been a glitch. But was it really? No, it wasn’t. What you encountered was that terror from the deep, that glitch in the matrix that drains the coder’s soul. What you’ve found is: a Flaky Test.

Flaky tests are automated tests (unit, integration, or end-to-end) that exhibit inconsistent behavior. Unlike stable tests, which consistently pass or fail, flaky tests flip between passing and failing unpredictably, despite no changes to your code or testing environment. How can this happen? Flaky tests usually indicate some form of nondeterminism. They are sometimes called heisenbugs because they feel as mysterious as quantum systems. Flaky tests are especially challenging to fix precisely because they are inconsistent and hard to reproduce.

Left unchecked, flaky tests cause your team to constantly re-run builds, reducing engineering velocity and increasing CI costs. Worse, because flaky tests can’t be trusted, real failures get dismissed as flakiness, masking issues that could lead to bugs in production. And finally, flaky tests undermine confidence in the testing process itself.

Keeping a handle on flaky tests is crucial for maintaining a healthy and productive engineering team.

Root Causes of Flaky Tests

Flaky tests can be caused by all sorts of things. This isn’t really surprising. If the causes were simple and straightforward, the tests wouldn’t be flaky. We’d have fixed them by now. ;)

In my experience the root causes of flaky tests fall into one of four categories: race conditions, external resources, lack of isolation, and orphan code. Let’s look at these one by one.

Race conditions

Race conditions occur when two things race to complete, and sometimes one finishes first, and sometimes the other does. This could be a timeout while waiting for a web page to render, or a request that doesn’t return in time. It could also be caused by timeouts that are too short, or by running multiple tests at once, so the system is under more load during the full suite than during a single test.

Another example is when multiple things are loaded in parallel. The actual order may differ from run to run, but the test fails only in certain orders. Consider loading test data into a database: if some records are inserted before others, queries could fail.

Fundamentally, race conditions are hidden dependencies: things need to be completed in a certain order but can happen out of order in some cases. The solution is to make the test setup more deterministic. If code in three different places needs to initialize the database, make sure they always run in the same order and that each part waits for the previous one to complete before moving on. Make sure your code rebuilds the starting state and cleans up properly between tests. You can also use transactions in database tests and roll back after each test, as sketched below.
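
Here is a minimal sketch of that transaction pattern, assuming a Jest-style runner and a hypothetical test database helper (the db module, table, and data are made up for illustration):

  // Each test runs inside a transaction that is rolled back afterwards,
  // so the database always starts from the same known state.
  const db = require('./test-db'); // hypothetical test database client

  let tx;

  beforeEach(async () => {
    tx = await db.beginTransaction();
    await tx.query("INSERT INTO users (id, name) VALUES (1, 'alice')");
  });

  afterEach(async () => {
    await tx.rollback(); // undo everything the test did, however it exited
  });

  test('finds the seeded user', async () => {
    const rows = await tx.query('SELECT name FROM users WHERE id = 1');
    expect(rows[0].name).toBe('alice');
  });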

Race conditions can be very tricky to diagnose. Sometimes even a log statement can cause a race condition. Be wary of singletons and static data.  This data is often designed assuming there will be only one thread working with it at a time, which may not be true any longer.

External Resources

If nothing in the code has changed when a test flakes then we have to look outside the code. The most common cause is some external resource changing. This could be a network service like a database that has gone offline or is having slower than normal response times. It could be something on the local machine like disk space or free RAM. Look for anything that could be different from one run to another.  

Here's an example. At a previous company we had a set of end-to-end tests running under headless Chrome. Sometimes these tests would pass and sometimes they would fail on the CI system, but they always passed on my local workstation. Whenever something works in one environment and fails in another, that’s a good clue that some external resource is different. The root cause turned out to be memory related. We had multiple runners set up in our CI system, but they were not identical. Some had more RAM than others. Sometimes (but not always) Chrome would run out of memory on the systems with less RAM, causing the test suite to fail. This never happened on my local machine because my laptop had plenty of memory.

Lack of Isolation

Race conditions and flaky external resources are often symptoms of a deeper root cause: lack of isolation between tests. Good tests are completely isolated. They each set up their own state: environment variables, dependencies, and test data (e.g., an in-memory database). They also each clean up any changes they made, so that the next test starts from a clean state. State leaking from one test to another is one of the biggest causes of flaky tests. Even if the root cause is a race condition, it is the interaction between tests that often makes them flaky.
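
As a rough sketch of that kind of isolation, again assuming a Jest-style runner (the in-memory store helper and the environment variable are hypothetical):

  const { createInMemoryStore } = require('./test-helpers'); // hypothetical helper

  let store;
  let savedEnv;

  beforeEach(() => {
    savedEnv = { ...process.env };                  // snapshot the environment
    process.env.PAYMENTS_API_URL = 'http://localhost:4000';
    store = createInMemoryStore();                  // fresh data for every test
    store.insert({ id: 1, status: 'active' });
  });

  afterEach(() => {
    process.env = savedEnv;                         // restore the environment
    store.clear();                                  // nothing leaks into the next test
  });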

Orphan Code

Larger code bases often have orphans in them: parts of the codebase that are not used anymore but are still in the repo and still have tests running against them. They often contain assumptions about the environment that were true when the code was actively used but have since changed.

One symptom of orphan code is resource leaks. Tests may allocate a resource (say a database table) and not properly clean it up. Because the test is an orphan no one realizes that this is the cause of a flaky test.

Orphan code can also leave old data and code vulnerabilities exposed, without the team even knowing it exists anymore!  Much like an abandoned house, the orphan code slowly bitrots and eventually tests begin to fail. The solution is usually to delete the orphan code and related tests. The old code is still available in the version control history if it really is needed again.

Other

Sometimes tests fail for reasons unrelated to the general categories above. Sometimes the failure truly is random. For example, I once wrote a test for a payment service. Each time it would randomly pick a price to pay, from one to ten dollars. However, I wrote my test with code like this: Math.floor(Math.random()*10). Can you guess the problem? This code returns prices in the range of 0-9, not 1-10. If the price ends up being zero, the payment service rejects it. This was fundamentally a logic bug in my code, but because it worked with random data it only failed about 10% of the time, so it looked flaky.
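
A reconstruction of that bug looks something like this (the variable names are illustrative, not the original code):

  // Buggy version: Math.random() returns a value in [0, 1), so this
  // produces prices from 0 to 9, and a price of 0 gets rejected.
  const buggyPrice = Math.floor(Math.random() * 10);

  // Fixed version: shift the range up by one to get 1 to 10.
  const fixedPrice = Math.floor(Math.random() * 10) + 1;

  // Better still: drop the randomness entirely and test the boundaries
  // explicitly, e.g. prices of 1 and 10, plus an invalid price of 0.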

Measuring & Mitigating Flaky Tests

Now that we know what flaky tests are and what causes them, how do we actually deal with them? The first step to getting out of a hole is to stop digging and see how deep we are. We need to know how many flaky tests we have and how bad they are. It might be painful, but it’s worth it.

Tracking

Start by tracking and cataloging your flaky tests. You can use your team’s issue tracking system to track flaky tests as well. Some teams also use a spreadsheet as a single place to coordinate during a bug bash. Either way, you need a record of the flaky tests, even if you have to push them all to the backlog. If other developers come across the same issue, they will at least know it’s flaky and can add additional details.

Along with the existence of the flaky test, you need to assess the test’s impact. Is this a test that affects only one small feature or is it blocking the entire project? Could you disable this test with little impact or is this test on the critical path for a core part of the product? 

A good way to judge the urgency of a flaky test is to calculate how many PRs have been impacted. Knowing the impact is the only way to judge the best way to move forward.

Tracking your flaky tests is vital, even if it means grooming through the backlog once a week. I highly suggest using a ‘flaky test’ tag to mark these tickets so you can find them again when you run a flaky test bug bash.

Whenever a flaky test is found it must be tracked by adding it to your issue tracking system. While it takes time to gather all of the information related to a flaky test, it’s the only way to be sure the test doesn’t get lost. Newer flaky test tracking tools can automate this process for you.

Mitigation and Quarantining

Once you have a handle on the scope of your flaky tests you can start mitigating them. First, configure your test runner to retry tests when they fail. Many test runners support automatic retries up to some limit. This isn’t a permanent solution, but it can mitigate the problem for a while and give you time to catch up.

Another strategy is to look for tests related to timing. You can increase the timeouts to see if that makes the test more reliable. This still usually isn’t a proper fix but it can get the build moving again.  
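
Both of these mitigations can usually be configured directly in the test runner. Here is a minimal sketch assuming a Jest-based suite (retries require Jest’s default jest-circus runner; other runners such as Playwright expose similar settings):

  // Place these at the top of a known-flaky test file.
  jest.retryTimes(3);      // rerun a failing test up to 3 times before reporting failure
  jest.setTimeout(30000);  // allow 30 seconds instead of Jest's 5-second default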

Finally, quarantine the flaky tests. This means continuing to run them, but don’t make them prevent merging PRs or hold back deployment. They should continue to run so that the team maintains awareness of the flakiness, but the tests no longer block the build. Quarantining is a way to maintain engineering velocity.

Ownership

It is not enough to have tests in your code base; tests must be owned. Test ownership is critical in managing flaky tests effectively. Specific individuals must be responsible for maintaining the tests, fixing them when they break, and addressing any issues they surface. Having ownership ensures that tests don’t become orphans. Ownership should be split among engineers by area of the code base, so that no one person is responsible for everything and becomes a bottleneck.

Having test owners also makes each owner responsible for the tough calls of whether a test should be deleted, fixed, or completely rewritten. Having accountability and responsibility encourages a healthy engineering culture and ensures flaky tests are handled quickly. 

Pro-tip: use a CODEOWNERS file. This is a file in your repo that indicates who is responsible for which parts of the code, specified with path patterns and usernames or team names. GitHub uses the CODEOWNERS file to show who owns a particular file and to suggest reviewers when creating PRs. Some flaky test handling tools can also use the code owners file to help create tickets for discovered flaky tests. See GitHub’s code owners docs for more info.
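
A small CODEOWNERS file might look something like this (the paths and handles are made up for illustration):

  # Each line is a path pattern followed by one or more owners.
  # The last matching pattern takes precedence, so put broad rules first.
  *                  @org/platform-team
  /payments/         @alice @bob
  /tests/e2e/        @org/qa-team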

Fixing Flaky Tests

Fixing tests is best done in three phases: reproduction, diagnosis, and finally fixing.

Reproduction

To fix a test you must first be able to reproduce the failure. This may be difficult, since one of the defining features of a flaky test is that it doesn’t reproduce consistently! Activities that help include:

  • Run through logs of previous test runs to see if the failure has happened before.

  • Identify when the flaky test first appeared in the codebase. Find the commit in your code history that first triggered it.

  • Run the test in isolation, then in different orderings with other tests (see the example commands after this list).

  • Run the test in parallel. If a test is only flaky when running the full suite in parallel mode, this is a good sign that it’s a concurrency issue.
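
For example, with a Jest-based suite (just one possible setup; the file name is made up), you might hammer a single suspect test and compare parallel and serial runs:

  # Rerun one suspect test file 50 times, stopping at the first failure.
  for i in $(seq 1 50); do npx jest src/checkout.test.js || break; done

  # Run the same file serially (no worker parallelism) to rule out concurrency.
  npx jest src/checkout.test.js --runInBand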

Root Cause Analysis

Now you can look into why the test is flaky in the first place. Is it a concurrency bug? Does it depend on some external resource? Try reordering code lines. Sometimes the order of two statements makes all the difference.

Consider adopting tools designed for managing flaky tests. They can connect data about previous failures of the test, giving you insight into the root cause.

Fixing the Flaky Test

How a test is fixed greatly depends on the root cause. The research paper An Empirical Analysis of Flaky Tests, one of the most cited papers on flaky test research, found that the most common causes of flaky tests are Async Wait (45%), Concurrency (20%), and Test Order Dependency (12%). The authors found that these top causes were fixed by:

  • Async Wait: 54% are fixed using waitFor, often removing flakiness completely.

  • Concurrency: Fixed using various methods such as adding locks (31%) and making code deterministic (25%).

  • Test Order Dependency: 74% are fixed by cleaning the shared state between test runs.

This paper mostly studied Apache Java projects which commonly use concurrency, so the focus on ordering and async bugs makes sense. The specific causes will vary greatly based on the language, platform, and type of algorithms your own codebase uses.

 

Here are some common solutions I’ve found to be helpful.

  • Replace fixed waits and sleeps with callbacks. Rather than waiting a set amount of time for something to happen, use a callback or waitFor-style helper to be notified when it actually happens. This is very common in UI testing frameworks (see the sketch after this list).

  • Make sure any tear down completely mirrors the setup. Flaky tests are often caused by something being created in the test setup and not being completely destroyed or reset in the tear down, which gives later tests an unclean environment. This often happens with test data in the database.

  • If the root cause is some resource being unavailable (ex: the database connection is down), then add a check for that resource before running the test itself. In the future if the resource is missing you’ll get a correct indication of where the issue actually is. Also consider using a local mock version of that resource.

  • Make sure dependencies are really set up in the order you think they are. In JavaScript, waiting on multiple promises at once does not guarantee the order in which they will resolve. This can be an even worse problem in languages with true concurrency and threading (like C++ and Rust).

  • Keep monitoring that the flaky test is really fixed over time. A flaky test can still be flaky even if it seems fixed once or twice. It is worth flagging your flaky tests for follow up a few weeks later to see if they have truly been solved.
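
As a sketch of the first point above, here is what replacing a fixed sleep with an explicit wait might look like, assuming a Testing Library-style DOM test (submitPaymentForm and the banner text are hypothetical):

  const { screen, waitFor } = require('@testing-library/dom');

  test('shows a confirmation banner after payment', async () => {
    submitPaymentForm(); // hypothetical helper that kicks off the async work

    // Flaky approach: sleep a fixed 500ms and hope the UI has updated.
    // await new Promise((resolve) => setTimeout(resolve, 500));

    // Reliable approach: retry until the element appears or a timeout is hit.
    await waitFor(() => screen.getByText('Payment accepted'));
  });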

Root causes can be very language and framework specific. Use a testing guide specific to your scenario. 

Maintaining the Testing Suite over Time

Don’t just restart the build and hope it passes this time. That’s just kicking the can down the road. Instead track and monitor your flaky tests over time. This is how you will know if the flaky test problem is getting better or worse.

If a test is flaky but is taking too much time to resolve, consider quarantining it. This means marking the test so that it will still be run along with the rest, but no longer blocks the build from merging. It no longer is required for acceptance. Quarantining is better than simply skipping the test because you still want to collect results from the flaky test for better diagnosis in the future. Sometimes a flaky test fully breaks and must be fixed ASAP. If the test is skipped then you’ll never know it broke. 

Consider having a bug bash day where everyone picks a flaky test from the backlog and fixes it for good. Pick the tests that have impacted the most PRs over time. Also do code reviews on the tests themselves; the tests are just as important as the rest of the code base. Setting guidelines for writing reliable tests will help your team have fewer flaky tests over time.

Isolate your tests into groups, separating unit tests from integration tests. Ideally unit tests should be pure functional code, meaning they don’t talk to the database or filesystem or communicate across the network. If a unit test is entirely self-contained then you know it depends only on itself; if it fails, it fails only because of internal logic. Put tests that do need external resources in a different group called integration tests. This reduces the surface area where flaky tests could appear, making them easier and faster to resolve.
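
One way to enforce that split, assuming a Jest-based suite and an illustrative file-naming convention, is Jest’s projects option:

  // jest.config.js -- a sketch; the naming convention is just one choice.
  module.exports = {
    projects: [
      {
        displayName: 'unit',
        testMatch: ['<rootDir>/src/**/*.unit.test.js'],        // pure, no I/O
      },
      {
        displayName: 'integration',
        testMatch: ['<rootDir>/src/**/*.integration.test.js'],
        globalSetup: '<rootDir>/test/start-test-db.js',         // external resources live here
      },
    ],
  };

You can then run the fast, isolated unit project on every commit (for example with Jest’s --selectProjects flag) and reserve the integration project for CI.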

This is a whole lot of work!

Maintaining a comprehensive suite of tests definitely is a lot of work. Making sure they remain useful and flake free is even more work, so many companies have taken matters into their own hands. 

Slack initiated Project Cornflake to build their own internal flaky test system that “reduced test job failure from 56.76% on July 27, 2020 to 3.85%” in less than a year. GitHub found flaky tests were such a big problem that they built an internal flaky test management system from scratch, which reduced flaky tests by 18x.

And it’s not just big companies. Open source projects have to deal with flaky tests too. The Fuchsia project has specific rules to ensure flaky tests are “removed from the critical path of CQ as quickly as possible.”  Juan Rodriguez, director of product management at the Qt Group, says that “Software is eroding before our eyes” and that only when we “run static code analysis and … run our functional tests to obtain results, that’s when you start to understand where the issues are.” 

As you can see, large engineering organizations are spending significant time and money to reduce flakiness. That’s why we’ve been working on a new product called Trunk Flaky Tests. Flaky Tests will monitor your CI pipelines, identify flaky tests, then help you do something about them like automatically filing tickets, quarantining specific tests, and sending alerts to team members when a test is acting up. 

If you want to avoid building your own solution and improve your engineering velocity by reducing the impact of flaky tests, join our private beta for Trunk Flaky Tests.

References

In Martin Fowler’s Eradicating Non-determinism in Tests, he cites the following reasons for flaky tests: lack of isolation, asynchronous behavior, remote services, time, and resource leaks.

An Empirical Analysis of Flaky Tests (pdf) by Luo, Hariri, et al., University of Illinois at Urbana-Champaign, is one of the earliest and most cited research papers on flaky tests. The authors analyzed 1129 commits that likely fixed flaky tests in 51 open source projects.

The Extent of Orphan Vulnerabilities from Code Reuse in Open Source Software (pdf) found “tens of thousands of open source projects that contain files with known vulnerabilities even though the vulnerabilities have been fixed in the original project from where the vulnerable file was copied”.
