Testing

Eradicating Flaky Tests

By Matt MathesonMatt MathesonMarch 24, 2025

The most important role of automated tests in software is as bug detectors. Automated test suites serve as regression tests, alerting when a code change introduces a bug to the system.

Flaky tests are the bane of any automated test suite. They provide no value in detecting bugs because they fail even when no bug has been introduced. The impact of flaky tests is twofold:

  1. They undermine confidence in the quality of the tests and the system under test

  2. They interrupt development. When a flaky test causes their CI pipeline to fail and require rebuild(s), developers are forced to perform time-consuming CI failure diagnosis, which negatively impacts the development cycle time and developer experience.

Resolving a flaky test in a timely manner is critical to maintaining project velocity and developer satisfaction.

Smaller teams are usually able to address flaky tests as they are found. However, as teams grow, this eventually breaks down, and developers resort to re-running CI to "get past" a flaky test.

Strong engineering organizations like Google, Slack, Dropbox, Reddit, Flexport, and more use quarantining as a way to keep their CI stable and their automated tests healthy.

Quarantine your flaky tests

Quarantining your flaky tests means they will still run, but their flaky failures will not cause CI jobs to fail.

Quarantining has benefits compared to disabling or skipping flaky tests:

  1. Quarantining does not require introducing a code change and can be done immediately when a flake is detected. Manually disabling a test requires merging and pulling a code change into all open branches.

  2. Since quarantined tests are still executed, stack traces of every failure are collected and added to the evidence available when it comes time to determine the root cause and fix the issue. More evidence means greater confidence in accurately determining the root cause and fixing the issue.

  3. If a test is disabled, it is not doing its job as a bug detector, which could introduce regressions into the system. Quarantined tests will continue to run, and Trunk will detect when a bug is introduced by marking the test as broken.

  4. Quarantining tests creates a centralized repository to track which tests must be addressed.

Automatic vs Manual quarantining

Manual quarantining means a human must set a test to be quarantined. When a test is set to be quarantined, an explanation of why can be included. Trunk will note who took the action and provide an explanation.

Automatic quarantining means Trunk will automatically mark any flaky test as quarantined.

Teams that are more cautious about which tests get quarantined may prefer manual quarantining. However, manually taking action on every flaky test on very large teams may not scale well, so these teams prefer to use automatic quarantining.

For teams using automatic quarantining, some tests may validate that critical behavior does not regress. If so, those tests should be marked as "never quarantine" to prevent the system from quarantining them. For very high-value tests, incurring the hit to productivity when they become flaky is more valuable than allowing them to be quarantined.

Even when using automatic quarantining, you may decide to quarantine a test manually if you know it is problematic, but the system has not yet gathered enough evidence of flakiness.

Quarantining is a tool, not the whole solution

A quarantined test should not be viewed as a solved problem but rather as a "TODO." Using quarantining without an established process for fixing or removing flaky tests leads to an ever-growing backlog of flaky tests. It causes the same erosion of trust in the testing system that flaky tests do.

Quarantining flaky tests is a powerful technique to improve CI stability and developer productivity, but it must be managed with discipline.

Establish and enforce a policy about fixing quarantined tests

As a team, set an SLA on how quickly quarantined tests must be resolved and socialize it across teams. Decide what will happen to a test that is quarantined beyond that SLA. If a test is quarantined but not deemed a high priority to fix within the determined SLA, it should be disabled or deleted.

Assign and track work to fix each Quarantined Test

Trunk emits webhooks when tests are marked as flaky. These webhooks can be used to schedule and track work to fix flaky tests. They include information about the test, including the file path, code owner, and an explanation of why the test was marked as flaky.

Use these webhooks to alert the code owner in Slack and create a ticket or issue in your project management system.

Learn more about Webhooks in Trunk

Review quarantined tests regularly

You should review your quarantined tests regularly, whether you are using automatic or manual quarantining.

  • Regularly review all newly quarantined tests to determine their urgency. Determine the impact of quarantining the test—which features are no longer being regression tested? Communicate with the test owner to establish a timeline for fixing the test.

  • Regularly review all quarantined tests that are nearing or exceeding the SLA established to fix broken tests. For each test, decide whether it should be fixed as a priority or deleted.

Prioritize fixing broken tests

Occasionally, when a test is quarantined, it may break or allow a bug to be introduced into the system. One benefit of quarantining over disabling tests is that this behavior will be detected, and the test will be marked as broken. Broken tests should be addressed with urgency.

Resources

The success of quarantining hinges on solid detection, integration with task management software like Linear and Jira, and automatic reporting to generate noise so quarantined tests aren't forgotten. It's a piece of a larger puzzle to tackle flaky tests successfully.

Try it yourself or
request a demo

Get started for free

Try it yourself or
Request a Demo

Free for first 5 users