Merge Queue

Merge fast or merge cheap: Fine-tuning merge queues to handle an increase in PRs from AI agents

By Riley Draward, June 13, 2025

TL;DR: We ran simulations to show how advanced merge queues can be configured to speed up merge times or save CI costs, which becomes increasingly important as AI tools and agents boost dev productivity and expose new bottlenecks in CI.

Anthropic's CPO (Chief Product Officer. I had to look it up too.), Mike Krieger, recently shared that 90-95% of changes to Claude Code’s codebase are now written by Claude Code. Dev productivity increased so suddenly that new bottlenecks emerged in Anthropic’s delivery pipeline, with the merge queue chief among them:

“And we really rapidly became bottlenecked on other things like our merge queue, which is the get in line to get your change accepted by the system that then deploys into production. We had to completely re-architect it because so much more code was being written and so many more pull requests were being submitted that it just completely blew out the expectations of it.”

- Mike Krieger, CPO, Anthropic

CI bottlenecks aren’t new, but a sudden increase in PR volume surfaces them sooner than expected. (Assuming code review isn’t a major blocker; agents can “LGTM” too.)

And Anthropic won’t be the only company running into unexpected CI bottlenecks. Large organizations already using merge queues may need to optimize them to shorten a PR's time to merge. Smaller teams might find themselves in need of a merge queue to protect their mainline branch and save on ballooning CI costs.

Advanced merge queues can be configured to solve either of these problems (but not both at once!).

We’ve run merge queue simulations internally at Trunk to measure the impact of different queue configurations on CI throughput and time to merge, and I thought now would be a good time to share some results.

But first, a quick refresher on merge queues. (If you’re already familiar with merge queues and optimizations like batching and parallel queues, you can skip ahead to the simulation section.)

Refresher: What is a merge queue?

Merge queues keep your mainline branch green by testing PRs before they are merged into the mainline branch (such as main or develop). This is especially critical for large teams contributing to the same codebase, such as large monorepos, because merge queues protect against logical merge conflicts.

The key concept behind merge queues is predictive testing: PRs are tested with all the changes from PRs ahead of them in the queue, so the most up-to-date version of what the mainline would look like post-merge is always being tested.
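To make predictive testing concrete, here is a minimal sketch (my own illustration, not Trunk's implementation) of how a queue might build the speculative state each PR is tested against:

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    number: int
    branch: str

def speculative_state(queue: list[PullRequest], index: int, mainline: str = "main") -> list[str]:
    """Return the branches merged, in order, to test queue[index].

    The PR is tested against mainline plus every PR ahead of it, so CI sees
    the exact state mainline would have after all earlier PRs merge.
    """
    ahead = [pr.branch for pr in queue[: index + 1]]
    return [mainline, *ahead]

queue = [PullRequest(101, "feat/login"), PullRequest(102, "fix/timeout"), PullRequest(103, "chore/deps")]
# PR 103 is tested with main + feat/login + fix/timeout + chore/deps applied.
print(speculative_state(queue, 2))
```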

Advanced merge queues have different settings and configurations to batch PRs together for testing and to test and merge unrelated PRs in parallel (among other optimizations). And because merge queues solve problems traditionally encountered in large monorepos, they need to scale to handle a large number of PRs.

This should be enough info to understand the simulation and results! If you want a more in-depth look at merge queues and their optimizations with examples (and fun videos using marbles to explain how merge queues work), you can check out this guide.

CI cost vs time to merge (mo’ money, mo’ merges)

There is a happy place on the continuum of “cheaper CI to faster merges” for every dev org using a merge queue.

Two settings that can be configured to determine where a merge queue fits between cheap and fast are batching and parallel queues.

  • Batching, or testing multiple PRs together in a queue, saves CI cost by running fewer CI jobs. When a batch fails, it is bisected and PRs are retested to isolate the failure, which can slow down CI and increase total costs (see the bisection sketch after this list).

  • Testing in parallel can be combined with batching and is great if you have available CI workers (and money to spend on additional CI resources). Parallel queues are only an option when a build system like Bazel is used in your repo.
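As a rough illustration of why failed batches get expensive (a sketch of the general bisection idea, not Trunk's exact algorithm), each halving step is another full CI run:

```python
def first_failing_pr(batch: list[str], ci_passes) -> str:
    """Bisect a failed batch to find the first PR that breaks CI.

    ci_passes(prs) runs CI against mainline with the given PRs stacked in order.
    Invariant: batch[:lo] is known to pass, batch[:hi] is known to fail.
    Every probe is another CI run, which is why rejections slow down as batches grow.
    """
    lo, hi = 0, len(batch)          # batch[:0] trivially passes; the full batch already failed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ci_passes(batch[:mid]):  # prefix is clean, so the break is later in the batch
            lo = mid
        else:                       # prefix already fails, so the break is at or before mid
            hi = mid
    return batch[hi - 1]            # the PR whose addition first makes CI fail

broken = "PR-18"
print(first_failing_pr(["PR-12", "PR-15", "PR-18", "PR-22"], lambda prs: broken not in prs))
# -> "PR-18"
```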

To show how these concepts affect CI spend and time to merge, we ran simulations with different batch sizes and parallel queues.

SimMergeQueue™

At Trunk, we have a public repo that automates PR creation and submission to a merge queue. The simulation’s parameters are configured in .config/mq.toml, and a cron job is used to automate pull requests.

We ran 8 different scenarios to test the effects of batch size on the time spent running CI and the total time to merge for each PR.

Simulations were run with batch sizes of 1, 4, 8, and 16, and parallel queues were used for batch sizes greater than 1. Each batch size was tested twice, once with a logical conflict rate of 1/1000 (0.1%) and again with a rate of 1/500 (0.2%).
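For reference, the scenario grid can be written out as data. This is a hypothetical mirror of the parameters described above, not the actual schema of .config/mq.toml:

```python
from itertools import product

BATCH_SIZES = [1, 4, 8, 16]
CONFLICT_RATES = [0.001, 0.002]   # 1/1000 and 1/500 logical conflict rates

scenarios = [
    {
        "batch_size": size,
        "logical_conflict_rate": rate,
        "parallel_queues": size > 1,  # parallel queues were used for batch sizes > 1
    }
    for size, rate in product(BATCH_SIZES, CONFLICT_RATES)
]
assert len(scenarios) == 8  # matches the 8 simulated scenarios
```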

We also measured the mean time to reject PRs with flaky tests or logical merge conflicts, which is important because bisecting batches to isolate PR failures can be a long process as batch size grows. 

We tested our merge queue using the following assumptions for every simulation, where Flake Rate is the percentage of PRs that fail due to flaky tests.

(Sidenote: Flaky tests are often considered merge queue killers due to the cost of bisecting failed batches and re-running PRs in the queue. That doesn’t have to be the case! Quarantining flaky tests at runtime reduces this friction and is the best way to manage flakes in CI, whether you’re using a merge queue or not.)
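As an illustration of the idea (not Trunk's Flaky Tests API), runtime quarantining boils down to letting known-flaky tests run without letting their failures fail the batch:

```python
KNOWN_FLAKY = {"test_checkout_retries", "test_websocket_reconnect"}  # hypothetical test names

def batch_verdict(results: dict[str, bool]) -> bool:
    """Pass the batch if every non-quarantined test passed.

    Quarantined tests still run and report results, but their failures don't
    fail the batch, so the queue never bisects over a flake.
    """
    return all(passed for test, passed in results.items() if test not in KNOWN_FLAKY)

print(batch_verdict({"test_login": True, "test_checkout_retries": False}))  # -> True
```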

Here are our results:

So what does this all mean? The simulation results show:

As the batch size increases, so does the average time to merge:

  • Comparing batch size 1 to batch size 16 at a 0.1% logical conflict rate, PRs merge into mainline 54.9% slower, from 41m 56s up to 1h 5m.

  • However, the CI time per PR decreases by 77.5%, from an average of 41m 28s down to 9m 19s. Big batch = less CI spend.
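The percentages follow directly from the reported times; a quick sanity check:

```python
def pct_change(before_s: int, after_s: int) -> float:
    return (after_s - before_s) / before_s * 100

merge_time = pct_change(41 * 60 + 56, 65 * 60)      # 41m 56s -> 1h 5m
ci_per_pr  = pct_change(41 * 60 + 28, 9 * 60 + 19)  # 41m 28s -> 9m 19s
print(f"time to merge: {merge_time:+.1f}%")   # ~ +55.0% with the rounded times shown; 54.9% reported
print(f"CI time per PR: {ci_per_pr:+.1f}%")   # ~ -77.5%
```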

The time to reject logical conflicts and flakes also increases because batches need to be bisected to isolate failing PRs.

The lowest peak concurrency occurs with the largest batch size (16) and the lower logical conflict rate (0.1%), which means fewer infrastructure resources are needed for CI runners.

Repos with a lower logical conflict rate reject test flakes much faster, so they may be better suited for larger batch sizes if increasing the average time to merge is not a major issue. (Remember, flaky tests can also be automatically quarantined at runtime.)

That said, a higher logical conflict rate does not have a significant impact on time to merge or CI time per PR.

(Sidenote 2: Let’s throw another wrench in things: for these set batch sizes, PRs wait in the queue until a batch is “full” before the merge queue starts testing. Advanced merge queues may also have a configurable maximum wait time for batches. After that set amount of time has passed, the batch is tested even if it isn’t full. That configuration is not included here, but it can reduce the average time to merge for larger batch sizes.)
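A minimal sketch of that dispatch rule, assuming a queue exposes batch_size and max_wait settings (names are illustrative):

```python
import time

def should_dispatch(queued_prs: int, oldest_enqueued_at: float,
                    batch_size: int = 16, max_wait_s: float = 300.0) -> bool:
    """Start testing a batch when it's full OR the oldest PR has waited long enough.

    The timeout keeps large batch sizes from stalling merges during quiet periods.
    """
    if queued_prs >= batch_size:
        return True
    return queued_prs > 0 and (time.time() - oldest_enqueued_at) >= max_wait_s
```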

Wait, how does this apply to AI agents again?

With increased AI adoption, more PRs can be opened in parallel, and open PRs can be reviewed faster.

A February 2025 study reported that developers saw a 26.08% increase in productivity using Copilot compared to devs who did not use gen-AI tooling. (Not to be missed: junior devs bring the average up, and the study acknowledges that the measured data is noisy.)

A 2022 study shows how code review bots can focus developer discussion, leading to an increase in merged PRs in open source projects. And this study predates the release of modern AI review tools!

Effective merge queues enable dev teams to keep up with this increased PR volume. They help protect your mainline branch against failures and logical merge conflicts, while providing the configurability needed to reduce either CI costs or a PR's time to merge.

Better agents require better merge queues

As AI agents continue to drive higher throughput in the development process, the need for an optimized merge queue becomes even more pronounced.

While organizations like Anthropic and OpenAI may have the resources to handle high CI costs, most teams will need to consider the trade-offs between speed and cost. Fortunately, merge queues can be tuned and optimized to address the growing challenges of CI bottlenecks. With the right configurations, organizations can significantly reduce delays and improve their development cycles.

"I would expect that a year from now, the way that we are conceiving of building and shipping software changes a lot. Because it's gonna be very painful to do it the current way."

- Mike Krieger, CPO, Anthropic

Trunk’s Merge Queue already supports batching and parallel queues, and plugs in with a Flaky Tests system that quarantines flakes and failures at runtime so your queue isn’t slowed down. Try Merge Queue today.
