
The Ultimate Guide to Flaky Tests

By Josh Marinacci, August 12, 2024
Testing

This blog post is a guide to fixing flaky tests. If you're interested, we just launched the public beta for Trunk Flaky Tests, our tool that helps you detect, quarantine, and eliminate flaky tests.

What is a flaky test?

Imagine this. You fix a few bugs, add a feature or two, push your pull request to trigger tests in CI, and then grab a cup of coffee.

When you come back, you see that your build failed, even though it worked on your machine! After digging into the report, it turns out that a test failed, a test completely unrelated to your PR and the code you were working on.

How did that break? You shrug and rerun the CI tests. This time, they pass, and you move on with your day.

Odds are, you were just the victim of a rogue flaky test. Flaky tests are automated tests that exhibit inconsistent behavior. Unlike stable tests, which consistently pass or fail, flaky tests unpredictably flip between passing and failing despite no changes in your code or testing environment. They are especially challenging to fix because they are inconsistent and hard to reproduce.

Why they're hard to fix

All sorts of crazy reasons can cause flaky tests. In my experience, the root causes of flaky tests fall into one of four categories: race conditions, external resources, lack of isolation, and orphan code. Let's look at these one by one.

Race conditions

Race conditions occur when multiple processes try to access or alter the same resource and interfere with one another. The problem here is that their behavior depends on their relative timing. So when actions happen out of order, it results in flakiness.

A classic example is when multiple tests run simultaneously and compete to read and write the same database table. If you're testing against a shared staging database with read-modify-write operations, tests run in parallel can end up flaky whenever they touch the same resource.

Another example is poorly written async code. It's generally bad practice to sleep for a fixed delay and hope a resource has become available; instead, explicitly await the resource itself.
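For instance, here's a minimal Jest-style sketch of the difference; waitForServer is a hypothetical helper that polls the service until it responds, and the URL is made up:

// Flaky: sleeps for a fixed delay and hopes the server is ready by then
test('dashboard responds (fixed delay)', async () => {
  await new Promise((resolve) => setTimeout(resolve, 5000));
  const res = await fetch('http://localhost:3000/dashboard');
  expect(res.status).toBe(200);
});

// More reliable: explicitly wait for the resource itself to become available
test('dashboard responds (explicit wait)', async () => {
  await waitForServer('http://localhost:3000'); // hypothetical polling helper
  const res = await fetch('http://localhost:3000/dashboard');
  expect(res.status).toBe(200);
});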

Fundamentally, race conditions are hidden dependencies: things need to complete in a certain order but can happen out of order in some cases. The solution is to make the test setup more deterministic. If code in three different places needs to initialize the database, make sure those pieces always run in the same order and that each one waits for the previous one to complete before moving on. Make sure your code rebuilds the starting state and cleans up properly between tests. You can also use transactions in database tests and roll back after each test.
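As a rough sketch of the transaction approach (assuming a Jest-style runner and a hypothetical db client that supports transactions):

const db = require('./db'); // hypothetical database client with transaction support
let tx;

beforeEach(async () => {
  tx = await db.beginTransaction(); // every test starts from the same known state
});

afterEach(async () => {
  await tx.rollback(); // undo this test's writes so nothing leaks into the next test
});

test('creates an order', async () => {
  await tx.insert('orders', { id: 1, total: 10 });
  const rows = await tx.query('SELECT * FROM orders WHERE id = 1');
  expect(rows.length).toBe(1);
});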

Race conditions can be very tricky to diagnose; sometimes even a log statement can introduce one. Be wary of singletons and static data. That data is often designed on the assumption that only one thread works with it at a time, which may no longer be true.

External Resources

If nothing in the code changes when a test flakes, then we have to look outside the code. The most common cause is external resource changes. This could be a network service like a database that has gone offline or is having slower than normal response times. It could be something on the local machine like disk space or free RAM. Look for anything that could be different from one run to another. Mocking is a decent strategy for testing specific behaviors, but it should not be used as a replacement for end-to-end testing.
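Where an external dependency can't be controlled, a mock can at least keep a unit-level test deterministic. Here's a minimal Jest-style sketch; the ./exchange-rates module and its fetchExchangeRate function are hypothetical:

// Replace the real network call with a predictable canned response
jest.mock('./exchange-rates', () => ({
  fetchExchangeRate: jest.fn().mockResolvedValue(1.08),
}));

const { fetchExchangeRate } = require('./exchange-rates');

test('returns the mocked exchange rate', async () => {
  const rate = await fetchExchangeRate('USD', 'EUR');
  expect(rate).toBe(1.08);
});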

Here's an example of an external resource causing flakiness. At a previous company, we had a set of end-to-end tests running under headless Chrome. Sometimes these tests would pass, and sometimes they would fail on the CI system, but they always passed on my local workstation. Whenever something works in one environment and fails in another, that's a good clue that some external resource is different.

The root cause turned out to be memory-related. We had multiple runners set up in our CI system, but they were not identical. Some had more RAM than others. Sometimes (but not always) Chrome would run out of memory on the systems with less RAM, causing the test suite to fail. This never happened on my local machine because my laptop had plenty of memory.

Lack of Isolation

Race conditions and flaky external resources are often symptoms of a deeper root cause: lack of isolation between tests. Robust tests are completely isolated from other tests. Each one sets up its own state: environment variables, dependencies, and test data.

Each one also cleans up any changes it made so that the next test starts with a clean slate. State leaking from one test to another is one of the biggest causes of flaky tests. Even if the root cause is a race condition, it is often the interaction between tests that makes them flaky.
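A minimal sketch of that discipline in a Jest-style suite; the FEATURE_FLAG variable is just an illustration:

const ORIGINAL_ENV = { ...process.env };

beforeEach(() => {
  process.env.FEATURE_FLAG = 'on'; // state this particular test needs
});

afterEach(() => {
  process.env = { ...ORIGINAL_ENV }; // restore everything so no state leaks into the next test
});

test('feature is enabled when the flag is on', () => {
  expect(process.env.FEATURE_FLAG).toBe('on');
});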

Orphan Code

Larger code bases often have orphans in them: parts of the codebase that are no longer used but are still in the repo and still have tests that run against them. They often contain assumptions about the environment that were true when the code was actively used but have since changed.

One symptom of orphan code is resource leaks. Tests may allocate a resource (say a database table) and not properly clean it up. Because the test is an orphan, no one realizes that this is the cause of a flaky test.

Orphan code can also leave old data and code vulnerabilities exposed without the team even knowing it exists anymore! Much like an abandoned house, orphan code slowly bitrots, and eventually its tests begin to fail. The solution is usually to delete the orphan code and its related tests. The old code is still available in the version control history if it is ever needed again.

Other

Sometimes tests fail for reasons unrelated to the general categories above. Sometimes the failure truly is random. For example, here's a line from a test I wrote in JS for a payment service. Each time it ran, it would randomly pick a price to pay, from one to ten dollars:

Math.floor(Math.random()*10)

Can you guess the problem? This code returns prices in the range 0-9, not 1-10. If the price ended up being zero, the payment service rejected it. This was fundamentally a logic bug in my code, but because it worked with random data, it only failed about 10% of the time, so it looked flaky.
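The fix is to shift the range up by one so the result always lands between one and ten:

Math.floor(Math.random()*10) + 1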

Tracking Flaky Tests

Now that we know what flaky tests are and what causes them, how do we deal with them? We need to find out which tests are flaky and how severe they are. It might be painful, but proper strategy and tools make it easier.

Tracking and detecting flakes

You can start by using your team’s issue-tracking system to track the flaky tests; I highly suggest using a ‘flaky test’ tag to mark these tickets so you can find them again when you run a flaky test bug bash. Some teams even use a spreadsheet as a source of truth for doing this!

However you choose to do it, it's important to keep a record of flaky tests, even if you end up pushing them all to the backlog. If other developers come across the same issue, they'll know it's flaky and can avoid wasting time debugging it. Doing this helps you understand the impact of flakiness on your velocity and can also help uncover dangerous product bugs.


This is a lot of work, so I recommend you automate this process with a tool. You can build your own or try our tool, Trunk Flaky Tests. Our users automatically detect flaky tests by logging and tracking every test job they run.

How to decide which flaky tests to fix

Along with tracking a flaky test's existence, you need to assess its impact. Does it affect only one small feature, or is it blocking the entire project? Could you disable it with little impact, or is it critical for your product? One way to measure a flaky test's severity and priority is "PRs impacted": how many PRs have had their pass/fail result affected by that specific flaky test.

Common strategies for mitigating flaky tests

Once you have a handle on the scope of your flaky tests, you can start mitigating them. An easy first step is to configure your test runner to retry tests when they fail. Many test runners support automatic retries up to some limit. This isn't a real fix, but it can soften the blow while you catch up.
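If you use Jest, for example, retries can typically be enabled per file with something like the following (the retry count is arbitrary):

// Retry each failing test in this file up to 3 times before reporting it as failed
jest.retryTimes(3);

test('occasionally flaky upload', async () => {
  // test body unchanged; retries are a mitigation, not a fix
});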

Another popular strategy is to look for tests related to timing. It's pretty common to increase the timeouts to see if that makes the test more reliable. This usually isn’t a proper fix, but it can get the build moving again. Anecdotally, this has worked for me, but your experience could be different.

At Microsoft, research by Wing Lam et al. found that "developers thought they “fixed” the flaky tests by increasing some time values in them, but our empirical experiments show that these time values actually have no effect."

In my opinion, that is why the best strategy is to quarantine flaky tests: keep running them, but don't let them block merging PRs or deployments. They should continue to run so the team stays aware of the flakiness, and you maintain engineering velocity without having to delete or disable tests. Tracking their results over time also lets you see whether any “fixes” actually helped.

Fixing individual flaky tests, step-by-step

Fixing flaky tests is best done in three phases: diagnosis, mitigation, and fixing.

Diagnosis

When you run into a test failure, you first have to determine whether it's a real failure or a flaky one. To fix a flaky test, you then need to understand what the test is doing, which can be difficult because the defining feature of a flaky test is that it produces inconsistent results.

To gather that understanding, some activities that help include:

  • Run through logs of previous test runs to see if the failure has happened before.

  • Identify when the flaky test first appeared in the codebase.

  • Find the commit in your code history that first triggered it.

  • Run the test in isolation, then in different orderings with other tests.

  • Run the test in parallel. If a test is only flaky when running the full suite in parallel mode, this is a good sign that it’s a concurrency issue.

Doing this manually for each suspected flaky test is unscalable and painful, which is why you should collect test run data over time.

Consider implementing a service that gives a snapshot summary of codebase health for each pull request; this is how we automatically leave PR comments on GitHub, for example.

Fixing the Flaky Test

How a test is fixed greatly depends on the root cause. An Empirical Analysis of Flaky Tests found that the most common causes of flaky tests are Async Wait (45%), Concurrency (20%), and Test Order Dependency (12%).

Some numbers around the most common fixes for each category:

  • Async Wait: 54% are fixed by awaiting a response instead of static timeouts, often removing flakiness completely.

  • Concurrency: 31% are fixed using various methods such as adding locks, and 25% are fixed by making code deterministic.

  • Test Order Dependency: 74% are fixed by cleaning the shared state between test runs.

This paper mostly studied Apache Java projects, which commonly use concurrency, so the focus on ordering and async bugs makes sense. Specific causes vary greatly based on the language, platform, and type of algorithms your codebase uses.

Here are some common solutions I’ve found to be helpful.

  • Replace fixed waits and timeouts with callbacks. Rather than waiting a set amount of time for something to happen, use a callback to be notified when it actually happens. This is very common in UI testing frameworks.

  • Make sure any teardown completely mirrors the setup. Flaky tests are often caused by something being created in the test setup and not being completely destroyed or reset in the teardown, which leaves later tests with an unclean environment. This often happens with test data in the database.

  • If the root cause is some resource being unavailable (e.g., the database connection is down), then add a check for that resource before running the test itself. That way, if the resource is missing in the future, you'll get a correct indication of where the issue actually is. Also, consider using a local mock version of that resource.

  • Make sure dependencies are really set up in the order you think they are. In JavaScript, waiting on multiple promises at once does not guarantee the order in which they will be resolved (see the sketch after this list). This can be an even worse problem in languages with true concurrency and threading, like C++ and Rust.

  • Keep monitoring that the flaky test is really fixed over time. A flaky test can still be flaky even if it seems fixed once or twice. It is worth flagging your flaky tests for follow-up a few weeks later to see if they have truly been solved.
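To illustrate the promise-ordering point from the list above (seedUsers and seedOrders are hypothetical setup functions): Promise.all waits for all of the work to finish, but the individual tasks can complete in any order, so any hidden dependency between them is a latent race.

// Inside an async setup function or test:
// Risky: both calls run concurrently and can complete in either order
await Promise.all([seedUsers(), seedOrders()]);

// Safer when seedOrders depends on seedUsers having finished
await seedUsers();
await seedOrders();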

Root causes can be very language and framework-specific.

Maintaining your testing suite over time

Keeping your code base green is a marathon (though if your team is doing a bug bash, it can also be a sprint). Rerunning your build jobs and hoping the test suites eventually pass is just kicking the can down the road, and it's a symptom of having no confidence in your codebase.

"Insanity is doing the same thing over and over again and expecting different results."

Consider having a bug day where everyone picks a flaky test from the backlog and fixes it for good. Pick the tests that have impacted the most PRs over time. Code reviews should also cover the tests themselves; the tests are just as important as the rest of the code base. Setting guidelines for writing reliable tests will help your team produce fewer flaky tests over time.

Isolate your tests into groups, separating unit tests from integration tests. Ideally, unit tests should be pure functional code, meaning they don't talk to the database or filesystem or communicate across the network. If a unit test is entirely self-contained, then you know it depends only on itself; if it fails, it fails only because of its internal logic. Put tests that do need external resources in a separate group of integration tests. This reduces the surface area where flaky tests can appear, making them easier and faster to resolve.
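One lightweight way to enforce that split, assuming a Jest-based project and a tests/unit vs. tests/integration folder layout, is a rough config like this:

// jest.config.js: run unit and integration tests as separate groups
module.exports = {
  projects: [
    { displayName: 'unit', testMatch: ['<rootDir>/tests/unit/**/*.test.js'] },
    { displayName: 'integration', testMatch: ['<rootDir>/tests/integration/**/*.test.js'] },
  ],
};

Your CI can then gate merges on the fast unit group while the slower, external-resource-heavy integration group runs separately.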

Many practices contribute to maintaining code quality, from individual developers writing better tests to adopting tools that mitigate flakiness.

Why bother fixing flaky tests?

In my experience, manually maintaining a comprehensive suite of tests is a lot of work, and convincing management that code quality is important is an uphill battle. It's hard to quantify, and a lot of the time, I'd rather finish my tickets and move on instead of fixing someone else's tests.

The keyword there is manually, which is why many large companies build systems to automatically detect, quarantine, and eliminate flaky tests.

Slack initiated Project Cornflake to build their own internal flaky test system that “reduced test job failure from 56.76% on July 27, 2020, to 3.85%” in less than a year.

GitHub found flaky tests were such a big problem that they built an internal flaky test management system from scratch, which reduced flaky builds by 18x.

Maintaining test suites is even harder in large open-source projects, where code quality is often inconsistent. The Fuchsia project has specific rules to ensure flaky tests are “removed from the critical path of CQ as quickly as possible.”

Large engineering organizations place a huge emphasis on DevEx, and for good reason. That being said, not everyone has the privilege of spending significant time on building custom tools and infrastructure.

That's why our mission at Trunk is to help developers land code faster and develop happier.

Appendix

While writing this blog post, I realized that flaky tests go by many other names: intermittent failures, non-deterministic tests, brittle tests, unstable tests, heisenbugs, flappers, and more. The variety of names likely reflects a lack of standardization across software testing strategies and the global, diverse culture of software development. Regardless of what you call them, unreliable software is a language-agnostic problem that must be solved.
