
What We Learned From Analyzing 20.2 Million CI Jobs In Trunk Flaky Tests - Part 1

By Vincent Ge · November 12, 2024
Testing

If you encounter problems with CI stability, it’s probably due to flaky tests. This has proven to be a recurring theme in our conversations with our CI Analytics and Merge Queue users. So, continuing with our goal to eliminate problems that engineers hate, we embarked on a journey to help developers eliminate flaky tests. 

Flaky tests are ubiquitous in teams with automated integration and end-to-end tests, where developers suffer in silence. They’re the biggest challenge to CI stability and a developer pain we’re looking to solve. Even the best teams like Google, Meta, Spotify, and Uber waste countless developer hours due to flaky tests, and wasted engineering hours are expensive.

It’s easy to dismiss flaky tests as a “skill issue” and blame bad test code. But the deeper we dug into test flakiness, the more it looked like a dark, confusing rabbit hole. With our closed-beta partners, we analyzed test results uploaded from 20.2 million CI jobs. We learned that addressing flaky tests in practice requires nuance: you can’t just fix all of them, because that’s impractical. You also can’t ignore them, because they poison your trust in your test results and slow you down.

While building Flaky Tests, we’ve learned and suffered, and we can’t wait to share our discoveries with you, both technical and non-technical. If any of what you read resonates with you, or you’re also down the rabbit hole with us looking for a better way to tackle flaky tests, we’d love your feedback in the Trunk Flaky Tests Public Beta.

What’s in Part 1?

This blog is divided into two parts. Part one (what you’re reading) focuses on the problem space. It discusses what we learned about flaky tests, how they impact teams, why they’re prevalent, and why they’re so hard to fix. Part two covers the challenges of building a tool for flaky tests. 

You can read part 2 here (Coming next week).

You Will Always Have Some Flaky Tests

From our research, we observed a phenomenon in which organizations that start writing more tests and more realistic tests see a sharp increase in the flakiness of their tests. 

Roy Williams from Meta described the problem like this: as the number of tests grows, teams hit a “knee” or “hockey stick” in the number of flaky tests detected in their code base. Jeff Listfield at Google wrote in one blog post that “over the course of a week, 0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14% of our large tests were flaky.” When you consider that Google has over 4.2 million tests, that’s a huge number of flaky tests.

A significant contributor to this phenomenon is that your tests become more complex when you write more realistic tests, like integration tests and end-to-end tests. In an end-to-end test, there can be hundreds of moving parts. Your tests can be flaky for various reasons, like running out of RAM, a service container not spinning up in time, or misconfigured DNS. 

Testing is about confidence, and you won't be confident if you don’t have any tests that span your whole system. Dylan Frankland, a lead engineer here at Trunk, puts it like this:

“You can try to make your tests completely un-flakable. But you will have very low confidence in your code. You can mock out every piece and never have a flake, but at the end of the day, then, you're testing mocks, not real code.” 

With that complexity, writing tests that never flake is practically impossible; some flakiness is inevitable.

Suffering in Silence

Software engineers often suffer silently. Tenacity, independence, and resourcefulness are often considered standout qualities of great software engineers. However, when tackling flaky tests, these qualities work against them by masking the problem.

When an engineer encounters a flaky test, they might debug it for 30 minutes, decide that this can’t be related to their change, and rerun the test. This sometimes fixes the problem. Since the problem is not critical, they sweep it under the rug, and no one else learns about it. This lets flaky tests accumulate until rerunning no longer clears failures.

If you’re like Uber, who in their Go Monorepo reports 1000+ flaky tests, you’ll rarely see PRs without flaky failures. Even if each test only has a 0.1% flake rate, the chances of having at least 1 flaky failure are 1 - 0.999^1000 or ~63%. That means 63% of PRs will need at least 1 rerun to clear the flaky failure and distract an engineer to debug the failures. If you let the innocent-looking flaky tests accumulate silently, when you finally notice it, it'll already be a monster that paralyzes your PR velocity.
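The arithmetic above is easy to check yourself. A minimal sketch (illustrative only, not Trunk code):

```python
# Probability that at least one of n independent flaky tests fails
# on a given CI run, where each test flakes with probability p.
def p_at_least_one_flake(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

# 1000 tests, each with a 0.1% flake rate, matching the Uber example.
print(round(p_at_least_one_flake(1000, 0.001), 2))  # ~0.63
```

Plug in your own test count and flake rate to see how quickly even tiny per-test flake rates compound at the PR level.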

We made a calculator to help you better understand how flaky tests could impact your team, and we recommend that you experiment with the numbers yourself.

We had this problem at Trunk, too. Federico, one of our TLMs, noted after dogfooding Trunk Flaky Tests: "We knew we had flaky tests because we were experiencing failures, but the tool revealed the full extent of the issue." It wasn’t until he saw the numbers in the Flaky Tests dashboard that he realized just how many PRs were being blocked by flaky tests.

Our beta partners also shared this sentiment. They frequently underestimated the scale of their flaky test problems and were surprised by the data they saw.

What’s The Impact?

When we interviewed our closed-beta partners about the value of eliminating flaky tests, the top response was the number of engineering hours saved. The CI cost saved by not rerunning flaky tests, which we intuitively expected to matter most, didn’t even come close. We think flaky tests impact engineering velocity in three main ways: time lost to blocked PRs, time lost to context switching, and time lost managing flaky tests.

Time lost from flaky tests blocking PRs is easy to understand. If your CI jobs aren’t passing, you can’t merge a PR. Most repositories have branch protections that prevent PRs from being merged unless all checks/tests are passing. If your tests are very flaky, trying to merge perfectly good code still feels like a dice roll. Combined with long-running tests, blocked PRs become very painful.

In the study Cost of Interrupted Work, researchers pointed to ~23 minutes of productivity lost per context switch. I wouldn’t take that number at face value, but if you observe your coworkers in an office, you might find it isn’t far off.

You might see a series of events like this:

1. 10:00:00 - INFO - Bob submits a pull request (PR).
2. 10:15:00 - ERROR - CI job fails on Bob's PR.
3. 10:20:00 - INFO - Bob reruns the tests after identifying the failure is unrelated to his changes.
4. 10:25:00 - INFO - Bob takes a break for coffee and begins a new task.
5. 10:45:00 - INFO - Bob remembers to check the test results for his PR.
6. 10:46:00 - SUCCESS - Bob merges his PR.
7. 11:15:00 - ALERT - Bob is paged due to a flaky test failure in the main branch.

This list is an exaggeration, but if your tests are flaky and your team is large, those little distractions add up fast.
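To see how those distractions add up across a team, here is a back-of-the-envelope estimate. All the inputs are assumptions chosen for illustration, except the 63% flake rate from the Uber-style example and the ~23-minute context-switch cost from the study cited above:

```python
# Back-of-envelope: weekly engineering hours lost to flaky-test interruptions.
# All inputs are illustrative assumptions.
prs_per_week = 200               # assumed PR volume for a mid-size team
flaky_block_rate = 0.63          # share of PRs hit by at least one flaky failure
minutes_per_context_switch = 23  # from the interrupted-work study

lost_hours = prs_per_week * flaky_block_rate * minutes_per_context_switch / 60
print(round(lost_hours, 1))  # ~48.3 hours per week
```

Swap in your own PR volume and flake rate; even conservative numbers tend to add up to a full engineer-week per month.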

Looping back, the third way flaky tests impact engineering velocity is time lost managing them. This is all the time spent replying to email threads, Slack messages, and GitHub comments about which tests are flaky. It’s time lost creating tickets, maintaining a list of known flaky tests, and updating that list when a test is fixed. Team leads, who usually have the most insight into a team’s test health and are responsible for ensuring code standards, feel this the most. They’re the ones who get pinged, and they manage the tickets.

The Importance of Communication

Making it easier to triage flaky tests and communicate about them with your team is essential. A lot of the time wasted on flaky tests goes to interruptions and duplicated debugging work. Each time a flaky failure pops up in a CI job, an engineer context switches to debug the test. If the fix is non-trivial and a rerun clears the failure, they’ll give up and move on.

The problem is that they don’t tell anyone else about this, and the same debugging process is repeated for every engineer who comes across this flaky test. As mentioned before, each context switch also takes time to recover from. If your tests take 20 minutes to run, you’ll be distracted by a notification 20 minutes into the next task you picked up after submitting your PR. That’s right when you enter deep focus.

Offloading these spontaneous disruptions and turning them into planned work reduces context switching. Communicating who owns which flaky test and if the ticket is being worked on can prevent wasted time from duplicated efforts to debug the same flaky test.

Fixing Every Flaky Test Isn’t The Solution

Flaky tests can be caused by flaky test code, flaky infrastructure, or flaky production code. That last cause, real bugs surfacing as flakes, is why many teams intuitively aim to fix every flaky test.

Although we’ve seen some of our beta customers adopt on-call rotations to swat flaky tests as they appear, most decided against trying to fix every flaky test, because it’s impractical.

Dropbox tried to offload the burden to an engineering rotation to fix broken tests, and they noted, “The operational load of that rotation became too high, and we distributed the responsibility to multiple teams, which all felt the burden of managing build health.” At some point, the cure became worse than the disease.

Addressing flaky tests is intended to speed up development and reduce pain. This means fewer blocked PRs, fewer developer hours wasted debugging flaky failures, and fewer wasted CI resources. If this is the goal, then not every flaky test is worth fixing. You can’t spend more engineering hours fixing flaky tests than the time lost when ignoring them; the investment has to be economical.

Some flaky failures appear once every few weeks and disappear again, while others might block every other CI job. Fixing the flaky tests that block most developers first and reducing the felt impact of other flaky tests yields the most bang for the buck. Trunk helps you make more cost-effective decisions by letting you sort tests by PRs impacted. 

For the difficult-to-fix and lower-impact flaky tests, you still want to reduce their impact even if fixing them is impractical. The sweet spot here seems to be quarantining flaky tests so they don’t block PRs. Quarantining lets you isolate results from flaky tests at run time. Quarantined tests still run and report results, but they don’t block PRs. This also avoids disabling tests outright, because disabled tests are rarely revisited or re-enabled. They just rot, and you lose test coverage. When you do fix a flaky test, this continued tracking also makes it clear whether the fix worked.
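As a rough illustration of the quarantine idea (hypothetical test names and structure, not Trunk’s actual implementation), a CI step could treat failures of quarantined tests as non-blocking while still recording them:

```python
# Hypothetical sketch: decide whether a CI run should block a PR.
# Failures of quarantined tests are recorded but never block the merge.
QUARANTINED = {"test_checkout_e2e", "test_dns_resolution"}  # hypothetical names

def should_block(results: dict) -> bool:
    """results maps test name -> 'passed' | 'failed'."""
    blocking_failures = [
        name for name, status in results.items()
        if status == "failed" and name not in QUARANTINED
    ]
    return bool(blocking_failures)

run = {"test_login": "passed", "test_checkout_e2e": "failed"}
print(should_block(run))  # False: only a quarantined test failed
```

The key property is that quarantined tests keep running and keep reporting, so their history is still there when someone picks up the fix.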

Another important observation is that flaky tests are sometimes caused by the underlying infrastructure, which is often outside the team’s control. For tests like that, there’s very little you can do to fix the test itself. These could be special tests that span many services, require specific hardware, or rely on external APIs. A “spillway,” if you will, where failures due to causes outside a team’s practical control get handled, is useful in its own right.

What’s more, studies have shown that “fixes” submitted by developers for flaky tests often don’t work anyway, which is all the more reason to have a proper monitoring solution. You need to see that a fix actually worked before marking a flaky test healthy again; otherwise, it just erodes your developers’ trust in your tests.

Cleaning Under The Bed

One of the main concerns about quarantining tests is that they will be forgotten once the friction is gone. Quarantining fixes the “boy who cried wolf” problem, where teams ignore their CI results because the tests are so flaky that they’ve lost trust in them. But quarantine can potentially do the opposite: flaky tests cause too little friction, and people ignore them entirely.

To prevent this, we need good reporting and insight to guard against erosion of test coverage over time. We’re currently trying a few ideas. One is time-series data on the number of quarantined test runs over time. You want to see this number go down, not up.

Another is periodic reports. These should surface overall trends in test health: whether your tests are becoming flakier, which flaky tests are new, and whether the top flaky tests have changed.

What’s In The Next Part

To recap: flaky tests are prevalent if you write end-to-end tests, but teams underestimate how many of their tests flake and how much time is wasted dealing with them. Flaky tests are disruptive enough to have a substantial impact on productivity, but for many teams fixing every one costs more than simply rerunning tests. The challenge, then, is to create a workflow that reduces the impact of flaky tests without making the cure worse than the disease.


Read part two for more about the challenges we faced building a solution to detect, quarantine, and eliminate flaky tests.

Try Trunk Flaky Tests

We don’t think flaky tests are a problem most teams can realistically or practically tackle in-house. For this exact reason, we’re opening Trunk Flaky Tests for public beta so you can help us build a more effective solution for everyone. The sooner we stop suffering in isolation and begin pooling our feedback into a single solution, the sooner we’ll end flaky tests.


Join the Trunk Flaky Tests Public Beta and read part two of this blog here (Coming next week).

Try it yourself or
request a demo

Get started for free


Free for first 5 users