
Don't build agents, build context enrichment

We’ve all heard our fair share of software rewrite stories. We built an AI Coding Agent to fix flaky tests, and threw it all out to start over. Here’s why.

February 2, 2026
6 min read
Tyler Jang
Engineer

Our First Approach

At Trunk, we’ve accumulated years of CI data, Git histories, and test results. Ever since we started working on detecting flaky tests, our customers have been asking us to help fix them, too.

We built a monolithic agent that would review your flaky tests and open PRs to fix them. We used the best models we could find and gave it tools to access all the types of data we collected on tests. Sometimes, it worked great. Most of the time, the agent would get tunnel vision from distracting information and disregard the real root cause of the flakiness. Once, the agent saw a warning log about a missing environment variable in CI and decided to treat that as the source of all other problems.

We spent weeks on prompt engineering, tool engineering, and eventually breaking our monolith into a series of subagents. Some of these subagents were procedural; others were agentic. Below is the final multiagent architecture we implemented. Each subagent had its own goal and RAG tools for analyzing the codebase. Each was specially tuned to question and verify the output of the previous subagents to avoid hallucinations. For every assertion from the root cause analyzer, the fix proposer was empowered to validate it with tool calls. We saw better results on our own repos and shifted most of our time to reviewing the agent’s traces from runs on our beta partners’ repos.
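
To make that shape concrete, here’s a rough sketch of the verify-then-build pipeline we’re describing. The `Finding` and `Subagent` types and the `runPipeline` helper are illustrative names, not our production code.

```typescript
// Illustrative shape of the verify-then-build pipeline described above.
// The names (Finding, Subagent, runPipeline) are hypothetical.
interface Finding {
  claim: string;      // e.g. "failures correlate with commit fe2ae4d4"
  evidence: string[]; // tool-call results backing the claim
}

interface Subagent {
  // Re-check the previous stage's claims with fresh tool calls, dropping
  // anything that can't be confirmed, to limit compounding hallucinations.
  verify(previous: Finding[]): Promise<Finding[]>;
  // Produce this stage's own findings (root cause analysis, fix proposal, ...).
  run(confirmed: Finding[]): Promise<Finding[]>;
}

async function runPipeline(stages: Subagent[], seed: Finding[]): Promise<Finding[]> {
  let findings = seed;
  for (const stage of stages) {
    const confirmed = await stage.verify(findings);
    findings = await stage.run(confirmed);
  }
  return findings;
}
```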

The Realization

We showed some results to our beta partners, who had a strong appetite to try it out. Most of the critical feedback looked like this:

  • “This analysis seems plausible, but these proposed fixes don’t follow our repo’s conventions.”

  • “This root cause completely misses the fact that these are smoke tests against our deployed staging environment.”

  • “Can you tweak the output to be less opinionated? I want to feed it into Cursor and prompt it myself.”

  • “When I rerun the agent, I get completely different results. How am I supposed to trust it?”

Our agent lacked memory and repo context, so we started working toward adding these capabilities. However, something felt off. As flashy and powerful as these LLMs were, we either needed significantly more human-in-the-loop review or massively more context about the repo’s test setup.

With that realization, we decided to throw it all out.

The Experiment

We knew that we weren’t going to be able to compete with other coding agents, so we trimmed our multiagent down to just our preagents. We hypothesized that if we removed the code generation pieces of our pipeline, we could step back and let existing coding agents work their magic. The insights our agent relied on, such as git culpability and patterns of test failures, had resonated with customers, so we focused on surfacing those.

To validate our hypothesis, we ran a controlled test across Copilot, Claude Code, and Cursor. Same Playwright test failures, two conditions:

  • Minimal context: Test name, failure summary, and sample error logs.

  • Enriched context: Everything above plus our proprietary signals - run conclusion statistics, flakiness trends over time, git blame correlation, and the specific commit where the failure pattern first appeared (roughly the shape sketched below).
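
For a sense of what “enriched” means in practice, here’s a rough sketch of the two payload shapes. The field names are illustrative assumptions, not our actual API; the real signals are assembled from our CI and git data.

```typescript
// Rough shape of the two experimental conditions. Field names are illustrative.
interface MinimalContext {
  testName: string;
  failureSummary: string;
  sampleErrorLogs: string[];
}

interface EnrichedContext extends MinimalContext {
  runConclusionStats: { passed: number; failed: number; cancelled: number };
  flakinessTrend: Array<{ week: string; failureRate: number }>;
  gitBlameCorrelation: Array<{ commit: string; correlation: number }>;
  firstFailingCommit: string; // where the failure pattern first appeared
}
```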

The difference was stark. We analyzed a test that had started failing 10% of the time due to a change in seed data that manifested as nondeterministic ordering. With just the test failures, all the agents we tested assumed it was a rendering error. When we gave them our insights and git analysis, they correctly identified that either additional ordering needed to be applied, or the seed data needed to change. Even our own multiagent, complete with the same enriched input, oscillated between blaming the environment variable and proclaiming there was a rendering error.
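
As an illustration of the “apply additional ordering” fix, here’s a hedged Playwright sketch; the selectors, route, and seed values are invented for the example, not the customer’s actual test.

```typescript
import { test, expect } from "@playwright/test";

// Hypothetical reconstruction of the flake: rows come from seed data whose
// insertion order is nondeterministic, so positional assertions fail sometimes.
test("renders the seeded items", async ({ page }) => {
  await page.goto("/items"); // assumes baseURL is set in playwright.config

  // Flaky version: assumes the first row is always "alpha".
  // await expect(page.locator("tbody tr").first()).toContainText("alpha");

  // Stable version: compare sorted row contents so ordering no longer matters.
  const rows = await page.locator("tbody tr td:first-child").allTextContents();
  expect([...rows].sort()).toEqual(["alpha", "beta", "gamma"]);
});
```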

What we built instead

This experiment validated what we had started to understand from our own development and from customer feedback. We couldn’t build a great coding agent that works reliably across architectures and repos. Plus, prompt engineering taught us that new models can regress or improve at context gathering and reasoning, requiring constant iteration on our end. The result wasn’t as flashy, but it was an objectively better user experience, and it let us focus on our strengths.

Tools like Claude Code already have the missing context that we needed. Our moat wasn’t a coding agent that fixes flaky tests; it was the data and insights to do so. And so we continued our evolution of making our agent smaller. And better.

We built subagents that extract insights from CI and git data, then expose those insights via MCP to any compatible coding tool. When you ask Claude Code to fix a flaky test, it can pull our analysis:

```
Test: renders the flaky test information

Key Diagnostic Metrics:

- Overall pass rate: 90%
- Failure mode consistency: 92% (same error)
- Pass duration: 1.94s avg, 1.69s min, 2.92s max | Fail duration: 16.68s avg, 15.43s min, 16.82s max

Failure Pattern: Element never appears when test fails (times out waiting 5s, retries 3x = 15s total). When test passes, element appears immediately (~2s)

- Important signal: Test either passes quickly (well below the timeout threshold), or fails after exhausting all retries looking for the element. All failures exhaust all retries. This suggests flakiness caused by nondeterministic test data setup, NOT an issue with waiting for data to load

Historical Context:

- Flaky since introduction (commit `fe2ae4d4`)
- Failures evenly distributed across branches
```

That's context that an agent can use. No hallucinated imports. No guessing which file to edit.
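
To sketch how that kind of handoff can be wired up, here’s a minimal MCP server exposing one insight tool with the TypeScript SDK. The tool name, schema, and `fetchAnalysis` stub are illustrative assumptions, not our actual implementation.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Placeholder for whatever backend assembles a report like the one above.
async function fetchAnalysis(testName: string): Promise<string> {
  return `Analysis for ${testName}: ...`; // real signals come from CI and git data
}

const server = new McpServer({ name: "flaky-test-insights", version: "0.1.0" });

// One tool a coding agent can call when it's asked to fix a flaky test.
server.tool(
  "get_flaky_test_analysis",
  "Historical, statistical, and git context for a flaky test",
  { testName: z.string() },
  async ({ testName }) => ({
    content: [{ type: "text" as const, text: await fetchAnalysis(testName) }],
  })
);

// Run as an ES module; the client connects over stdio.
await server.connect(new StdioServerTransport());
```

Serving it over stdio keeps the integration editor-agnostic: anything that speaks MCP can discover and call the tool before attempting a fix.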

Here's a quick demo of the handoff in action:

https://www.loom.com/share/69a2dd3f98a7487da1d98180ecb5ef60

We're not alone in this

Sentry recently shipped something similar: their Seer AI can now trigger Cursor agents to fix bugs, passing along error context and stack traces. Same thesis: don't rebuild the coding agent; augment it with the context you're uniquely positioned to provide.

The handoff patterns are emerging. Copilot lets you assign issues via API. Cursor has a cloud agent launch endpoint. The flow for a specialized insight provider augmenting a general-purpose agent is materializing.
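
For a sense of what that flow could look like from the insight provider’s side, here’s a heavily hedged sketch of dispatching a fix request to an agent launch endpoint. The URL, payload fields, and environment variables are placeholders; Copilot’s and Cursor’s real APIs each define their own request shapes, so consult their docs for the actual ones.

```typescript
// Hypothetical handoff: post a flaky-test insight bundle to a coding agent's
// launch endpoint. AGENT_LAUNCH_URL and the payload fields are placeholders.
const AGENT_LAUNCH_URL = process.env.AGENT_LAUNCH_URL ?? "";

export async function dispatchFix(testName: string, analysis: string): Promise<void> {
  const res = await fetch(AGENT_LAUNCH_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.AGENT_API_KEY}`,
    },
    body: JSON.stringify({
      prompt: `Fix the flaky test "${testName}". Start from this analysis:\n${analysis}`,
      repository: "github.com/example/repo", // placeholder
    }),
  });
  if (!res.ok) throw new Error(`Agent launch failed: ${res.status}`);
}
```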

Making the world safer for agentic engineers

Here's why this matters beyond flaky tests. We're entering an era where engineers increasingly delegate to AI agents. Code gets written, reviewed, and merged with less direct human oversight. That's leverage, not a problem. But it creates a new failure mode: agents confidently shipping broken code because they lacked the context to know better.

Flaky tests are a useful litmus test for agentic solutions. They're well-scoped: you can know when you've actually fixed one. And unlike a lot of agent tasks, you get a clear signal when you haven't.

We're building the context layer that makes agentic engineering safer. Not by competing with coding agents, but by giving them the historical, temporal, and statistical insights they need to actually get things right.

The smaller agent, it turns out, is the more important one.

Try it yourself or request a demo.
