Descript

How Descript Merges ~1,450 PRs a Month Without Flaky-Test Retries

~1,450 PRs merged in 30 days
Hundreds of flaky tests auto-quarantined
In-house flake detection retired

"Agent-based PRs is more of a thing now. And also non-engineers opening PRs is a thing now, so CI reliability has become extremely important, and our CI has scaled to meet that demand."
Roland Zeng, Builder Experience Engineer

A Thousand Paper Cuts

Before Trunk, flaky tests were a persistent drag on engineering velocity at Descript. A typical CI run took 30 to 40 minutes. When a test broke, genuinely or intermittently, the whole org was blocked until someone opened a PR to skip it, and that PR had to wait again in the same slow queue.

"Skipping the test isn't healthy because ... who's going to fix it later?" said Roland Zeng, Builder Experience Engineer. Neither retries nor manual triage scaled. "It was just a thousand paper cuts."

The team had also built an in-house flaky test detection system, but it couldn't keep up. "We had an in-house solution so we know the cost of maintaining something like that, and clearly with the rate of CI flakiness we were encountering it wasn't working," said Yōu Wu, Senior Software Engineer for Infra. In late 2025, Descript made CI reliability a cross-team priority, with Infra leading and engineers from across the org pulled in.

AI Agents and Non-Engineers Are Opening PRs

The CI reliability Descript built with Trunk turned out to be a prerequisite for a shift the whole industry is navigating. Agents and non-engineers are now opening pull requests, and the infrastructure Descript already had in place is what keeps that workflow viable.

"Agent-based PRs is more of a thing now. And also non-engineers opening PRs is a thing now," Roland said. "So CI reliability has become extremely important, and our CI has scaled to meet that demand."

When a seasoned engineer gets a flaky failure, they rerun and move on, but when an agent or non-engineer gets one, the signal is just noise they can't debug. CI has to be trustworthy or it's irrelevant. Because Descript had already built a reliability layer with Trunk, the team didn't have to scramble when PR traffic shifted.

Today, a substantial share of PRs merged through Descript's queue originate from agent-assisted workflows. Every one runs against the same test suite and passes through the same merge queue. Without auto-quarantine catching flaky failures, that volume would generate a noise level no two-person platform team could triage manually.

What Mattered During Eval

Before committing to Trunk, Yōu ran a structured evaluation against half a dozen alternatives: third-party hosted tools, self-hosted open-source options, Buildkite's test engine, and continuing to build in-house. She assessed each across seven capabilities including hosting model, monorepo support, dashboards, team notifications, and cost.

Three requirements eliminated most of the field. First, monorepo support. Descript runs over 150,000 test cases in a single repo, and most alternatives couldn't categorize test suites by package or group them into one run per repo. Second, hosted infrastructure. Some open-source options offered strong dashboards but required self-hosting on Kubernetes, which meant more infrastructure for a lean platform team to maintain. And third, cost that scales. One alternative charged per test after a small free tier, and at Descript's volume that math doesn't close.

Trunk checked the most boxes: hosted, monorepo-native, dashboards, team notifications via Slack, and transparent pricing. What also helped was responsiveness. "It helped that y'all opened a Slack channel, so the feedback loop was always short," Yōu said.

"We are a pretty lean team so the philosophy is we will buy if this is not related to our core offering to external customers."
Yōu Wu, Senior Software Engineer, Descript

The Budget: "The Math Is Easy"

"Having transparent pricing is great, a lot of competitors are very opaque about it," Yōu said. At Descript's scale, the annual contract came in at a fraction of what a single engineer would cost to maintain the equivalent tooling.

Internal approval was fast.

"It's pretty easy for Ryan - then head of platform, now head of engineering - to make the case of budgeting given the eng hours we will be saving on green CI checks and one less in-house system to maintain."
Yōu Wu, Senior Software Engineer, Descript

The Solution

Descript runs Trunk Flaky Tests and Merge Queue together. Flaky Tests monitors pass-fail signals on the main branch and auto-quarantines anything intermittent: the test keeps running for signal but no longer blocks a merge. Merge Queue runs at concurrency 1, deliberately optimizing for CI cost over raw speed, and processes roughly 50 PRs per day at that setting.
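To make the quarantine behavior concrete, here is a minimal sketch of the idea in Python. It is not Trunk's detection algorithm or API: the `Outcome` type, the `is_flaky` heuristic, and the `min_runs` window are all assumptions made for illustration. It only shows the two properties described above: a test counts as intermittent when it both passes and fails on main, and a quarantined test still runs but its failure no longer blocks a merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One test result from a CI run (hypothetical shape)."""
    name: str
    passed: bool


def is_flaky(history, min_runs=10):
    """Assumed heuristic: a test is intermittent if its last `min_runs`
    results on main contain both a pass and a failure."""
    recent = history[-min_runs:]
    has_enough_runs = len(recent) >= min_runs
    return has_enough_runs and (True in recent) and (False in recent)


def blocking_failures(results, quarantined):
    """Return the failed tests that should still block the merge.

    Quarantined tests are still executed and reported, but their
    failures are filtered out of the merge decision.
    """
    return [r.name for r in results if not r.passed and r.name not in quarantined]


if __name__ == "__main__":
    run = [Outcome("test_export", True), Outcome("test_sync", False)]
    # test_sync is quarantined, so nothing blocks this merge.
    print(blocking_failures(run, quarantined={"test_sync"}))
```

A real detector would weigh far more signal than a fixed window of booleans. The point of the sketch is the separation between running a test and letting it gate a merge.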

The team treats Trunk as a data platform, not just a tool. They built a custom observability layer using Trunk's webhooks and APIs to give every team visibility into their own quarantined tests and nudge them to fix flakes before too many pile up and erode trust in the suite.
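As a rough illustration of what such an observability layer might compute, the sketch below groups quarantined tests by owning team and flags the teams whose backlog has grown past a threshold. The record shape (`test` and `team` keys), the function names, and the threshold of 5 are all hypothetical. They are not Trunk's webhook payload or Descript's actual implementation.

```python
from collections import defaultdict


def quarantined_by_team(records):
    """Group quarantined test names by owning team.

    `records` is an iterable of dicts with assumed keys "test" and "team".
    """
    by_team = defaultdict(list)
    for rec in records:
        by_team[rec["team"]].append(rec["test"])
    return {team: sorted(tests) for team, tests in by_team.items()}


def teams_to_nudge(by_team, threshold=5):
    """Return teams whose quarantined-test count has reached the threshold."""
    return sorted(team for team, tests in by_team.items() if len(tests) >= threshold)


if __name__ == "__main__":
    records = [
        {"test": "test_timeline_scrub", "team": "editor"},
        {"test": "test_export_mp4", "team": "media"},
        {"test": "test_undo_stack", "team": "editor"},
    ]
    grouped = quarantined_by_team(records)
    print(grouped)
    print(teams_to_nudge(grouped, threshold=2))
```

In practice the records would come from whatever the webhooks and APIs actually return, and the nudge would likely land in a team channel or ticket rather than on stdout.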

Results

The most visible change is what stopped happening. Trunk quarantines hundreds of flaky tests out of the ~50,000 that run on every PR, so intermittent failures no longer block merges and engineers don't have to rerun jobs and hope for green.

Trunk's merge queue easily handles the volume. On a typical weekday, 50–80 PRs pass through, and over the last 30 days ~1,450 PRs merged through the queue. The in-house flaky test detection system the team had been maintaining is also gone.

"Engineers complain less about CI, so there's less CI maintenance that we need to do," Roland said. Builder Experience now spends its time on AI agent enablement, remote-VM infrastructure, and faster deploy cycles.

One feature that landed particularly well was Trunk's "direct merge to main," where a PR already current with main skips the queue entirely.

"I remember that was like an aha moment for a lot of our engineers, like, oh wow, my PR immediately merged."
Roland Zeng, Builder Experience Engineer

What's Next

Descript is building more of its CI observability on top of Trunk's data: per-team test-health reports and deeper integration with Trunk's failure fingerprinting so Linear tickets carry richer debugging context from the start. Longer-term, the team plans to route quarantined-test investigations to AI coding agents, exactly the kind of workflow that becomes possible when CI infrastructure returns rich, contextual signal rather than raw pass/fail.

"Engineers don't have to worry - when they're ready, they can just enqueue their PRs and have everything just work."
Roland Zeng, Builder Experience Engineer