
How Brex Cut P90 Time-To-Merge by 30%

30% P90 time-to-merge reduction
50% fewer merge queue failures
100,000+ tests monitored

"We would have never known how many flaky tests we had exactly, or to what level each of them was flaky. Now we say it meets this level, and that's it - it's quarantined."

Joshua Inoa, Engineering Manager, Release Infrastructure

Company Context

Brex is the intelligent finance platform that empowers growing companies to spend smarter and move faster by combining the world’s smartest corporate card with intuitive spend management software and banking. Their engineering organization of ~400 developers works out of a monorepo, pushing roughly 100 PRs per day through Bazel and Buildkite, with 10 to 20 PRs sitting in their GitHub merge queue at any given time. Each PR runs only the subset of their 100,000+ tests affected by the change. Until mid-2025, flaky tests were a large source of friction in that pipeline.

The Visibility Gap

Flaky tests at Brex weren't a dramatic crisis. They were a slow, compounding drag that the team couldn't easily measure, prioritize, or build a business case around.

"Flakes become a huge disruptor," said Ellen Wiberg, Senior Software Engineer at Brex and a five-year veteran of the team.Not because any single flaky test was catastrophic, but because they were constant. Most of the failures hitting Brex's merge queue weren't real bugs. They were flakes, ejecting PRs and triggering full rebuild cycles for problems that had nothing to do with the code being merged.

The process for handling them was fully manual. Someone would notice a flaky failure, report it in a Slack channel, track down the code owner, and submit a PR to skip or fix the test. That skip PR then had to merge through the same backed-up queue. Meanwhile, every flaky failure in the merge queue triggered a full rebuild cycle, multiplying build minutes and Buildkite compute costs across all 100 daily PRs.

The team knew flaky tests were a problem. They just couldn't prove how much. There was no data on which tests were flaky, how often they failed, or how much time and compute they wasted. Developer surveys consistently flagged CI speed and reliability as a frustration, but without hard numbers, the Release Infrastructure team had no way to quantify the impact or justify investment.

"No one has true incentives to fix flakes," Ellen said. "Fixing one flake isn't really going to help the system either. You need to fix all flakes." But without visibility into which tests were actually flaky, "all flakes" was an undefined set.

Joshua Inoa, Engineering Manager for Release Infrastructure, saw the same pattern when he joined Brex in April 2025. "Historically, they've busted their butts to identify and surface issues, but then no one acts," he said. The team had been grinding to keep CI healthy with no data to turn frustration into action.

Why Trunk

Brex's team had the engineering talent to build flaky test detection in-house. They chose not to. The complexity of supporting multiple test types across Kotlin, Python, Go, and JavaScript in a unified quarantine lifecycle was significant, and the team needed a solution that worked across their entire stack from day one.

Josh justified buying Trunk by framing it around two things his leadership already cared about: CI reliability improvements and potential cost savings. "Retries happen many times, rebuilds at merge queue, and we get some multiplier of build minutes," he explained.
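Josh's multiplier framing can be made concrete with a little arithmetic. The sketch below is a toy model, and the parameters (flake-ejection rate, rebuild length) are invented for illustration rather than Brex's actual figures:

```python
def extra_build_minutes(prs_per_day, flake_rate, rebuild_minutes, avg_retries=1):
    """Toy estimate of daily build minutes burned on flake-triggered rebuilds.

    flake_rate is the chance a PR gets ejected from the merge queue by a
    flaky (not real) failure; each ejection costs a full rebuild cycle,
    possibly more than once if the flake recurs on retry.
    """
    flaky_ejections = prs_per_day * flake_rate
    return flaky_ejections * rebuild_minutes * avg_retries

# Illustrative only: at ~100 PRs/day, a hypothetical 10% flake-ejection
# rate, and a ~25-minute rebuild, flakes alone would burn ~250 build
# minutes of CI compute per day.
daily_waste = extra_build_minutes(prs_per_day=100, flake_rate=0.10,
                                  rebuild_minutes=25)
```

Even with modest assumptions, the multiplier adds up quickly, which is why framing the purchase around build-minute waste landed with leadership.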

The budget was secured by tying the investment to existing CI improvement initiatives and demonstrating potential cost savings in CI time and resources.


"Instances running within Buildkite start to accumulate cost and obviously delay development. Looked at it purely from that perspective and it seemed like a no-brainer."

Joshua Inoa, Engineering Manager, Release Infrastructure

The Implementation

Calling Trunk's CLI to upload test results from Buildkite was "trivial," Ellen said. Trunk's native Bazel support was a "huge advantage" given Brex's build system, and the team was collecting data within days.
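For concreteness, the upload step in a Buildkite pipeline is roughly a one-liner against Trunk's CLI. The snippet below is an illustrative sketch: the exact flag names and the environment variables are assumptions, so consult Trunk's Flaky Tests CLI documentation for the precise interface.

```shell
# Run the affected tests via Bazel, continuing past failures so that
# every result is recorded, then upload the JUnit XML output to Trunk.
# TRUNK_ORG_SLUG / TRUNK_API_TOKEN and the flag names are placeholders.
bazel test //... --keep_going

trunk flakytests upload \
  --junit-paths "bazel-testlogs/**/test.xml" \
  --org-url-slug "$TRUNK_ORG_SLUG" \
  --token "$TRUNK_API_TOKEN"
```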

They initially ran Trunk in shadow mode, collecting data without quarantining, to build confidence in the detection. Once they enabled auto-quarantine for the backend test suite, the impact was immediate. Since then, they've expanded quarantining to front-end tests as well, steadily widening coverage across the entire monorepo. Ellen uses the Trunk dashboard to monitor test health across the org. "I'm using the dashboard all the time," she said.
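The shadow-mode-then-quarantine lifecycle can be pictured with a toy rule like the one below. This is a hypothetical illustration, not Trunk's actual detection logic, and both thresholds are invented:

```python
def should_quarantine(runs, flaky_failures, min_runs=50, max_flake_rate=0.05):
    """Toy quarantine rule: only act once there is enough history.

    While a test has too little recorded history ("shadow mode"), we only
    collect data and never quarantine. Past that point, a test whose
    observed flake rate exceeds the threshold gets quarantined so its
    failures stop ejecting PRs from the merge queue.
    """
    if runs < min_runs:
        return False  # still gathering data; detection only, no action
    return flaky_failures / runs > max_flake_rate
```

In practice the hard part is classifying a failure as flaky rather than real (for example, via pass-on-retry signals and per-test history), which is what the vendor tooling handles.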

Results

Within weeks of enabling auto-quarantining, the outcomes were clear.

Developers stopped complaining

Platform teams rarely get thank-you notes. The real signal is silence. Anecdotally, after quarantining went live, CI disruption complaints stemming from flakiness dropped off. For a team that had been fielding those complaints for years, the quiet was the clearest proof that the problem was improving.

50% fewer merge queue failures

With 10 to 20 PRs in the merge queue at any given time and ~100 PRs flowing through per day, flaky failures were triggering rebuild cycles constantly. After quarantining, half of all merge queue failures disappeared. These weren't real test failures. They were noise. Removing them meant developers stopped getting kicked out of the queue for problems that had nothing to do with their code, and every avoided rebuild saved minutes of Buildkite compute across the entire queue.

30% reduction in P90 time to merge

P90 time to merge dropped from ~30 minutes to ~25 minutes. Ellen's team had set this as their primary success metric going in, and the improvement was immediate and sustained.

On-call shifted from firefighting to proactive work

Before Trunk, the on-call engineer for Release Infrastructure spent their time watching for failures, determining if they were flakes, and manually intervening. With auto-quarantining handling the most common failure mode, on-call health improved and the team could focus on higher-leverage work.


"Trunk is worth the cost. Improvements we've seen in CI, build type minutes, and on-call health have improved."

Joshua Inoa, Engineering Manager, Release Infrastructure

Looking Forward

With quarantining running across the full test suite, Brex is building automation on top of Trunk's data: weekly reports to code owners, automated Linear tickets for quarantined tests, and ownership pings via Trunk's APIs. More recently, they've started feeding Trunk's historical execution data into an AI agent that can propose fixes automatically within Brex's VPC. "Take this context you guys are giving us, feed it into some AI agent that has the ability to run tools internally," Josh said.

For engineering leaders evaluating whether flaky test tooling is worth the investment, Josh was direct: "Flaky tests exist in every company at this point. Having a tool that enables you for quarantining and lets you see historical execution can empower you to make business decisions and then invest. I would definitely recommend Trunk."
