
Stop Flaky Tests From Sabotaging Your Merge Queue

By Riley Draward, March 5, 2025

TL;DR: Automatically detecting and quarantining flaky tests can save hundreds or thousands of hours of wasted CI time when using a merge queue to protect the mainline branch.

Optimizing the CI workflow for large monorepos with hundreds of contributors can be difficult. Due to the high volume of contributions, PRs quickly become outdated, and conflicting changes from different committers frequently break the mainline branch.

Organizations often reach for a merge queue to help solve these problems but discover that flaky tests in their CI pipeline have a massive negative impact on CI throughput once a queue is in place.

Screenshot of a Hacker News thread. The original poster asks, "I presume merge queues are useless if you have flaky tests?" Others respond: "They're super annoying if you have flaky tests."

Flaky tests don’t have to render a merge queue useless. Automatically detecting and quarantining flaky tests at runtime allows developers to take full advantage of a merge queue while also tracking and handling any test flakes in a repo.

How flaky tests impacted Uber’s merge queue

As the number of employees and the size of the monorepo scaled up at Uber, engineers started to see problems with mainline branch stability. A team was created to investigate, and it found that the iOS mainline was green only 52% of the time when measured over a week!

To combat mainline instability, Uber engineers implemented a merge queue (named SubmitQueue). This project was successful — their mainline was protected against failures, and their merge queue scales to thousands of daily commits. 92% of engineers responding to a survey agreed that keeping the mainline green was a positive change.

You can read more about Uber’s SubmitQueue implementation in their paper: Keeping Master Green at Scale.

But SubmitQueue had a drawback. Failures from flaky tests were causing the queue to continually re-run CI jobs, backing up the queue and wasting massive amounts of CI and developer time.

From the blog post Flaky Tests Overhaul at Uber:

Flaky tests undermine the reliability of our CI pipeline, leading to chaos in developer experience–one bug becomes more bugs.

Furthermore, with our SubmitQueue speculation architecture, failing a revision can have cascading effects invalidating other revisions in the queue and causing blockage. (...) This may lead to developers constantly retrying their builds until the build becomes green, wasting engineering hours and CI resources.

It became an urgent need to develop an effective, scalable, and configurable system that can be easily adopted and responsive to thousands of tests’ state changes.

These problems aren’t unique to Uber’s merge queue. Any merge queue, including GitHub merge queues or GitLab merge trains, would suffer from the pain caused by flaky tests in CI.

So, Uber built another internal system, this time to detect and manage flaky tests in CI. That system, Testopedia, analyzes historical test runs to identify flakes and track their state. Testopedia successfully improved the health of Uber’s CI pipeline and unclogged their merge queue.

In the Go Monorepo, we are steadily detecting around 1000 flaky tests out of 600K in total and 1K/350K in Java. We also observed significant improvement in reliability of CI and huge reduction of retries.

Uber succeeded in building systems to protect their mainline branch and deal with CI throughput issues caused by flaky tests. But most organizations cannot dedicate the developer time and resources required to build these internal systems.

This is why Trunk (whose founders just happen to have worked together at Uber on the self-driving car project, Uber ATG) built Flaky Tests alongside Merge Queue. Large development teams can finally use a single CI toolkit to protect their mainline branch and maximize CI throughput while automatically detecting and handling flaky tests.

“This is crazy, (...) but myself and the rest of engineering leadership were all convinced that our workflow problems were the biggest problems holding the project back. Not solving the unsolved problem that is making a car drive better, and be self-driving.”

- David Apirian, Trunk co-founder

Let’s look at how flaky tests affect merge queues and the best strategy for handling flaky tests in any CI pipeline. But first, a quick review of how merge queues work and why CI failures are so damaging to throughput in a queue.

A quick review of merge queues

A merge queue is often necessary when a project or monorepo has a high volume of PRs from many different contributors, and the mainline branch becomes unstable. Conflicts in PRs from different contributors lead to CI failures and a broken mainline.

A band-aid solution is to add a manual process and require developers to rebase before merging. This saps engineering time and does not scale as the repo and the number of contributions grow.

A merge queue is a better solution. Developers submit approved PRs to the queue, which stabilizes the mainline branch by:

  • Automatically testing PRs in the queue with the HEAD of the mainline branch and any changes from PRs already in the queue.

  • Merging PRs into the mainline branch once CI is finished.

  • Ejecting failed PRs from the queue.

Merge queue optimizations can also increase CI throughput so enqueued PRs don’t sit in the queue forever. These optimizations can include techniques like optimistic merging, batching, and parallel queues.
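
To make that flow concrete, here is a minimal Python sketch of the basic (unoptimized) queue behavior. The names (process_queue, run_ci) are invented for illustration and don’t correspond to any particular merge queue’s API: each PR is tested against mainline HEAD plus the changes queued ahead of it, merged on success, and ejected on failure.

    # Minimal sketch of the queue behavior described above.
    # `run_ci` is a stand-in for a real CI system; all names are hypothetical.
    def process_queue(prs, run_ci):
        mainline, ahead_of_me = [], []
        for pr in prs:
            # Each PR is tested against mainline HEAD plus the queued changes ahead of it.
            if run_ci(pr, ahead_of_me):
                mainline.append(pr)       # merge once CI passes
                ahead_of_me.append(pr)    # later PRs build on this change
            else:
                print(f"ejected {pr}")    # failed PRs are removed from the queue
        return mainline

    # Toy CI function: pretend PR "B" always fails.
    merged = process_queue(["A", "B", "C"], lambda pr, ahead: pr != "B")
    print(merged)  # prints "ejected B", then ['A', 'C']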

If you want an in-depth explanation of merge queues, you can read our blog post explaining what they are and how they optimize CI throughput.

What happens to a failed CI job in a merge queue?

When a PR’s CI job fails in a merge queue, that PR needs to be removed from the queue. This has a major impact on CI throughput because every PR enqueued behind the failed job must be restarted without the failed PR’s changes. The result is a build-up of PRs in the queue, which triggers even more restarted CI jobs and leaves you with a tremendously inefficient, slow CI pipeline.

Merge queues have optimizations to help deal with failures. Optimistic merging can be combined with a pending failure depth to combat the slowdown caused by occasional failures in a queue. A failed CI job can sit in the queue, and if any enqueued PR behind the failure succeeds, both PRs can be merged.

For example:

  • The pending failure depth for a merge queue is set to 2

  • PR1 is enqueued

  • PR2 is enqueued

  • PR3 is enqueued

  • PR1 fails but will stay in the queue until the CI jobs for PR2 and PR3 finish

If PR2 and PR3 both fail, PR1 is ejected from the queue, and CI needs to be re-run for PR2 and PR3.

A marble diagram with two steps. The first step shows 3 PRs in a merge queue, all failing, with the first PR marked as ejected. The second step shows PRs 2 and 3 being re-run.

If either PR2 or PR3 is successful, optimistic merging determines that PR1 can also be merged: a later PR passed with PR1’s changes included, so PR1’s failure was most likely caused by a flaky test.

A marble diagram of a merge queue with 2 PRs. The first PR is failing, but the second one is successful, so both will be merged
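
As a rough illustration, the pending-failure decision in this example boils down to something like the following toy Python sketch. It is a simplified model with invented names, not Trunk’s or Uber’s actual implementation.

    # Toy model of the decision above: with a pending failure depth of 2, a
    # failed PR waits for the next two speculative CI jobs before the queue decides.
    def resolve_pending_failure(failed_pr, later_results, depth=2):
        # later_results: CI outcomes for the PRs queued behind failed_pr, each of
        # which was tested with failed_pr's changes speculatively included.
        window = later_results[:depth]
        if any(window):
            # A later PR passed with failed_pr's changes included, so the earlier
            # failure was probably spurious (a flaky test): merge both.
            return f"merge {failed_pr} and the passing PR"
        # Every job in the window failed too: eject failed_pr and re-run the rest.
        return f"eject {failed_pr} and re-run the PRs behind it"

    print(resolve_pending_failure("PR1", [False, True]))   # PR3 passed -> merge
    print(resolve_pending_failure("PR1", [False, False]))  # both failed -> eject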

These optimizations still have limitations. CI time is still wasted, especially on genuine test failures: the queue waits for the pending-failure jobs to finish before ejecting bad PRs and rerunning CI jobs. This slows your CI process and can lead to a backup of PRs in your queue.

This is why it is best to have an automated system that detects and quarantines flaky tests: CI throughput is optimized, and test flakes are handled without wasting CI time or requiring developer intervention.

How do flaky tests impact a merge queue?

Flaky tests are tests that have a non-deterministic outcome. Sometimes, a test will succeed; sometimes, that same test will fail.
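
For a concrete picture, here is a contrived but representative example of the kind of test that flakes, written pytest-style with invented names: it races background work against a timeout that is usually, but not always, long enough.

    import random
    import threading
    import time

    def test_report_is_generated_quickly():
        # Contrived flaky test: the background "work" takes a variable amount of
        # time, and the assertion races it against a fixed timeout.
        done = threading.Event()

        def generate_report():
            time.sleep(random.uniform(0.05, 0.15))  # duration varies with machine load
            done.set()

        threading.Thread(target=generate_report).start()
        # Usually passes, sometimes fails: same code, same test, different outcome.
        assert done.wait(timeout=0.1)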

And they can devastate a merge queue.

The “randomized” test failures from flaky tests lead to PRs being ejected from the queue, which results in the re-running of every other enqueued PR. Test suites that suffer from flaky tests can see extremely poor CI throughput with a merge queue, just like what was seen with Uber’s SubmitQueue.

For example:

  • PR1 is added to a merge queue

  • A test flake is encountered on PR1, it is rerun

  • PR2 is enqueued, CI is run with changes from PR1

  • PR1 fails again, is ejected from the merge queue

  • PR2 must be re-run without changes from PR1

  • PR2 encounters a test flake, it is rerun

A marble diagram with 4 steps. In step 1, a PR is added to a queue, but it fails and is re-run due to a flaky test. In step 2, PR2 is enqueued, and PR1 is ejected. Step 3 shows PR2 being re-run without PR1's changes. Step 4 shows PR2 encountering a test flake and being re-run.

This is an extreme example, but it illustrates how flakes can cause massive amounts of churn in a merge queue.

Bad solutions to flaky tests in CI

Re-running failed tests is often the first thing developers do to deal with flaky tests in CI. Combined with merge queue features like optimistic merging and pending failure depth, this can be enough to keep getting PRs into the mainline branch promptly, although retries do hurt overall CI throughput. Re-running failing tests also does not scale as a project, its organization, and the number of tests run in CI continue to grow.
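
As a rough sketch of what "just retry it" amounts to (a generic wrapper invented for illustration, not a specific CI feature), note that every extra attempt re-runs the full test, so green builds get slower while the flake itself goes untracked:

    # Sketch of a generic "retry on failure" wrapper (invented for illustration).
    # Each attempt re-runs the whole test, so flaky-but-eventually-green tests
    # quietly multiply CI time.
    def run_with_retries(test_fn, attempts=3):
        for attempt in range(1, attempts + 1):
            try:
                test_fn()
                return attempt                      # how many runs it took to go green
            except AssertionError:
                continue
        raise AssertionError(f"still failing after {attempts} attempts")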

Ambitious engineers will try to hunt down and fix every flaky test. Flaky tests are a pain to fix because they are hard to reproduce locally, and even finding the root cause of a flake can be difficult. In fact, research indicates that in most cases flaky tests are difficult to fix, and initial attempts often don’t actually eliminate the flakiness. Time spent hunting down and fixing flakes could be better spent on feature work or fixing user-facing bugs. It is worth spending this time on flakes in critical tests, but for most flakes, the effort isn’t worth it.

You can always remove flaky tests from your test suite entirely, but that requires additional code changes and CI runs and isn’t appropriate for critical tests.

The best way to deal with flaky tests in CI is to automatically detect and quarantine these flakes.

The solution: Automatically detect and quarantine flaky tests to unblock your merge queue

To get the most out of a merge queue, it is best to automatically detect and quarantine flaky tests.

Don’t take it from me! Li Haoyi built the first flaky test management systems at Dropbox and Databricks, and wrote a great article on flaky tests, including recommendations for managing flakes. The drawbacks of test retries and advantages of quarantining are covered in great detail!

Quarantined tests are still run as part of a test suite in CI, but a failure will not prevent CI from passing. Instead, the test output can be logged, and the PR will be merged into the mainline branch. This means that flaky test failures will not fail a CI job in the merge queue, and CI throughput can be maximized without requiring any developer time.

Because quarantined tests continue to run as part of CI, you can also collect useful information about your test flakes that helps prioritize fixes. You will see which flakes affect the most PRs in your CI pipeline, you can monitor whether a flake was actually fixed (once again, many flakes aren’t fixed on the first attempt), and genuine test failures can still be detected.
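
Here is a minimal sketch of the quarantining idea, with an invented test runner and quarantine list (an illustration of the concept, not Trunk’s actual mechanism): quarantined tests still run and their failures are recorded, but only non-quarantined failures can fail the CI job.

    # Illustrative quarantine-aware test runner; names and the flake list are made up.
    QUARANTINED = {"test_report_is_generated_quickly"}  # hypothetical known flake

    def run_suite(tests):
        # tests: list of (name, callable) pairs
        hard_failures, quarantined_failures = [], []
        for name, test_fn in tests:
            try:
                test_fn()
            except AssertionError as err:
                if name in QUARANTINED:
                    quarantined_failures.append((name, err))  # logged, not blocking
                else:
                    hard_failures.append((name, err))
        # Only genuine failures block the merge; flake data is kept for triage.
        ci_passes = not hard_failures
        return ci_passes, quarantined_failures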

For example, if flaky tests from the example in the previous section are automatically detected and quarantined:

  • PR1 is added to a merge queue

  • A test flake is encountered on PR1

  • The flaky test is detected and quarantined automatically

  • PR2 is enqueued

  • PR1 succeeds, it is merged into main

  • PR2 succeeds, it is merged into main

A marble diagram of a merge queue with 4 steps. Step 1 shows PR1 failing with a flaky test, but the test is auto-quarantined. Step 2 shows PR2 being enqueued and PR1 merged. Step 3 shows PR2 continuing to run, and step 4 shows PR2 successful and merged.

No test reruns are necessary, and no PRs need to be ejected from the queue. Instead, flaky tests are handled automatically, and the PRs are merged into main.

With automated detection and quarantining, engineering time spent tracking down or manually dealing with test flakes is saved, and no code changes are required to deal with test flakes. Large projects with lots of contributors should see an increase in CI throughput, and the merge queue will stabilize the repo’s mainline branch.

This is why Trunk’s Merge Queue and Flaky Tests features work great together. Large monorepos get all the benefits and optimizations that come with a merge queue without being slowed down by flaky tests in CI.

Trunk Flaky Tests automatically detects and quarantines flaky tests. Test output, including stack traces, will be tracked across runs, and AI is used to detect and group similar failures to help developers debug and fix critical flakes.

Trunk Merge Queue and Flaky Tests will also help keep your engineering team up to speed with any failures or status changes. Both Trunk Flaky Tests and Merge Queue results will be automatically posted as comments to your GitHub PRs, and webhooks can be used to power Slack, Microsoft Teams, or any other messaging service. The built-in Jira integration can create and assign tickets automatically, or a custom integration can be powered by webhooks to accomplish the same thing in other ticketing systems like Linear.

With Trunk Merge Queue and Flaky Tests, organizations or projects with large monorepos can protect the mainline branch while also optimizing CI throughput.


Want to try Trunk Merge Queue and Flaky Tests? Get in contact with the Trunk team today!
