How Spotify Identifies and Deals with Flaky Tests
Spotify has become a leader in music streaming, but it also faces unique challenges in software testing. One big issue it tackles is flaky tests. In this article, you'll learn what flaky tests are, why they're problematic, and how Spotify deals with them. You'll also find out about the tools and methods Spotify uses to keep its testing reliable.
What Are Flaky Tests?
Flaky tests are tests that pass and fail randomly, even when the code hasn't changed. This means you can run the same test in the exact same environment and get different results each time. Imagine running a test on your app today and it passes, but tomorrow it fails for no apparent reason. This inconsistency makes it hard to trust the tests.
The impact of flaky tests is significant. When tests are unreliable, developers lose confidence in the testing process. This can lead to more time spent re-running tests and trying to figure out if a failure is due to a real bug or just a flaky test. Over time, this erodes trust in the entire testing system.
Several misconceptions about flaky tests are common. Some people think flaky tests are just a minor nuisance, but they can actually cause major delays and waste resources. Others believe that flaky tests are unavoidable, but there are ways to reduce their occurrence. Clearing up these misconceptions is the first step in tackling the problem effectively.
Why Flaky Tests Are Problematic
Flaky tests cause several issues that can slow down development and make the testing process less reliable. First, they waste time and resources. When a test fails due to flakiness, developers often need to retrigger builds and wait for the Continuous Integration (CI) system to complete. This can lead to long delays as developers wait for the CI to confirm that the tests pass. Imagine having to wait hours for a build to complete, only to find out that the failure was due to a flaky test and not an actual issue in the code.
Inconsistent results are another major problem. When tests produce different outcomes without any changes to the code, it becomes difficult to trust the test results. This inconsistency can make it hard to determine whether a failure is due to a real bug or just a flaky test. As a result, developers may spend unnecessary time investigating and fixing issues that don't actually exist.
Flaky tests also impact Continuous Delivery. Continuous Delivery relies on the ability to deploy changes confidently and quickly. However, if tests are flaky, it becomes risky to deploy new code. Developers might hesitate to push updates because they're unsure if the tests are giving accurate results. This hinders the overall goal of Continuous Delivery, which is to make frequent, reliable releases.
The long-term costs of flaky tests can be significant. Over time, the delays and reduced productivity caused by flaky tests accumulate. Developers spend more time re-running tests and less time writing new code or improving existing features. This can lead to slower development cycles and increased frustration among the team. In the long run, the costs of not addressing flaky tests can far outweigh the effort needed to fix them.
What Causes Flaky Tests?
Inconsistent Assertion Timing
Flaky tests often happen due to inconsistent assertion timing. The state of an application can vary between test runs, leading to different outcomes even if the code hasn't changed. For example, if a test checks for a specific value before the application has fully updated, the test might fail one time and pass the next.
To solve this, it's crucial to wait for the application to reach a consistent state before making assertions. Instead of using fixed wait times, use conditions that check for the desired state. This ensures that the test only proceeds when the application is ready, reducing the chances of flakiness.
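To make the difference concrete, here is a minimal JUnit 5 sketch. The asynchronous update and the `waitUntil` helper are hypothetical stand-ins, not Spotify code, but they show how polling for a condition replaces a fixed sleep.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;
import org.junit.jupiter.api.Test;

class AsyncStateTest {

    // Stand-in for application state that is updated asynchronously,
    // e.g. after a background refresh finishes.
    private final AtomicInteger trackCount = new AtomicInteger(0);

    @Test
    void stateEventuallyReachesExpectedValue() {
        // Simulate asynchronous work: the state becomes visible only after a delay.
        CompletableFuture.runAsync(() -> {
            sleep(300);
            trackCount.set(10);
        });

        // Flaky approach: sleep for a fixed time and assert -- sometimes the update
        // has not happened yet. Stable approach: wait for the condition itself.
        waitUntil(() -> trackCount.get() == 10, Duration.ofSeconds(5));

        assertEquals(10, trackCount.get());
    }

    // Minimal polling helper; libraries such as Awaitility offer the same idea
    // with a richer API.
    private static void waitUntil(BooleanSupplier condition, Duration timeout) {
        Instant deadline = Instant.now().plus(timeout);
        while (!condition.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            sleep(50); // short poll interval between checks
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("Interrupted while sleeping", e);
        }
    }
}
```

The key point is that the test waits for the state it actually cares about, so it passes as soon as the application is ready and fails only when the state never arrives.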
Reliance on Test Order
Another common cause of flaky tests is reliance on test order. Some tests depend on the global state set by previous tests. If these tests run out of order or in isolation, they might fail because the required state isn't present.
Resetting the global state between tests can help. This involves cleaning up any changes made during a test run so that each test starts with a fresh state. This makes tests more reliable and ensures they can run independently of each other.
Running tests in isolation poses its own challenges. It's important to design tests in a way that they don't rely on the specific sequence in which they run. This might involve more setup and teardown steps but leads to more robust tests.
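A common way to do this in JUnit is to reset shared state in a setup method that runs before every test. The `FLAGS` map below is a hypothetical stand-in for any global state, such as a static cache, a singleton, or rows in a shared test database.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class FeatureFlagTest {

    // Stand-in for global state shared across tests.
    private static final Map<String, Boolean> FLAGS = new HashMap<>();

    @BeforeEach
    void resetGlobalState() {
        // Reset before every test so no test depends on what ran before it.
        FLAGS.clear();
        FLAGS.put("offline-mode", false);
    }

    @Test
    void offlineModeIsDisabledByDefault() {
        // Without the reset above, this would pass or fail depending on test order.
        assertEquals(false, FLAGS.get("offline-mode"));
    }

    @Test
    void enablingOfflineModeOnlyAffectsThisTest() {
        FLAGS.put("offline-mode", true);
        assertTrue(FLAGS.get("offline-mode"));
        // The mutation is wiped out by resetGlobalState() before the next test runs.
    }
}
```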
End-to-End Tests
End-to-end tests, which simulate user interactions from start to finish, are inherently prone to flakiness. These tests cover multiple components and systems, making them more likely to encounter intermittent issues. Network latency, third-party services, and timing issues can all cause end-to-end tests to fail unpredictably.
Reducing the number of end-to-end tests can minimize flakiness. Focus on writing fewer, more targeted end-to-end tests that cover critical user flows. For other scenarios, consider using unit or integration tests, which are less prone to flakiness. This approach helps maintain test coverage while reducing the chances of encountering flaky tests.
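One way to do this is to extract the logic a broad end-to-end test was really checking and cover it with a unit test against a fake dependency. The `CatalogClient` and `PlaybackEligibility` names below are hypothetical, but the pattern is general: deterministic, fast, and no network involved.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class PlaybackEligibilityTest {

    // Hypothetical dependency that an end-to-end test would exercise over the
    // network -- and that makes the end-to-end test flaky.
    interface CatalogClient {
        boolean isAvailableInRegion(String trackId, String region);
    }

    // Hypothetical logic under test, extracted so it can be verified without
    // spinning up the whole system.
    static class PlaybackEligibility {
        private final CatalogClient catalog;

        PlaybackEligibility(CatalogClient catalog) {
            this.catalog = catalog;
        }

        boolean canPlay(String trackId, String region) {
            return catalog.isAvailableInRegion(trackId, region);
        }
    }

    @Test
    void playableOnlyWhenTrackIsAvailableInRegion() {
        // A fake replaces the live service, so the result is deterministic.
        CatalogClient fake = (trackId, region) -> "SE".equals(region);
        PlaybackEligibility eligibility = new PlaybackEligibility(fake);

        assertTrue(eligibility.canPlay("track-123", "SE"));
        assertFalse(eligibility.canPlay("track-123", "US"));
    }
}
```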
How Spotify Tracks Flaky Tests
Odeneye System
Spotify uses a tool called Odeneye to track flaky tests. This system visualizes test suites, making it easier to spot both flaky tests and infrastructure issues. Imagine a grid where each row represents an individual test and each column represents a test run. If you see scattered orange dots, these indicate flaky tests — tests that pass and fail inconsistently. Solid columns of failures, on the other hand, usually point to infrastructure problems like network failures.
Odeneye helps engineers quickly identify problem areas. By distinguishing between flaky tests and infrastructure issues, engineers can focus on fixing the right problems. This targeted approach saves time and resources, enabling Spotify to maintain a more reliable testing process.
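To make the idea concrete, here is a toy sketch of how a pass/fail grid can separate the two patterns. This is not Spotify's Odeneye implementation, just an illustration of the concept: a column where every test fails suggests an infrastructure problem, while scattered failures in a row suggest a flaky test.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration: rows are tests, columns are runs, true means "failed".
public class FlakinessGrid {

    public static void main(String[] args) {
        boolean[][] failed = {
                // runs:  0      1      2      3      4
                {false, true,  false, true,  false},  // testA
                {false, false, false, true,  false},  // testB
                {false, false, false, true,  false},  // testC
        };
        int runs = failed[0].length;

        // A column where every test fails looks like an infrastructure problem,
        // such as a network outage, rather than flaky tests.
        List<Integer> infraRuns = new ArrayList<>();
        for (int run = 0; run < runs; run++) {
            int failures = 0;
            for (boolean[] row : failed) {
                if (row[run]) failures++;
            }
            if (failures == failed.length) {
                infraRuns.add(run);
            }
        }
        System.out.println("Runs that look like infrastructure failures: " + infraRuns);

        // Scattered failures outside those runs point at flaky tests instead.
        for (int t = 0; t < failed.length; t++) {
            boolean sawPass = false;
            boolean sawFail = false;
            for (int run = 0; run < runs; run++) {
                if (infraRuns.contains(run)) continue; // ignore bad-infrastructure runs
                if (failed[t][run]) sawFail = true;
                else sawPass = true;
            }
            if (sawPass && sawFail) {
                System.out.println("test" + (char) ('A' + t) + " looks flaky");
            }
        }
    }
}
```

Running this prints run 3 as an infrastructure failure and flags only testA as flaky, since testB and testC fail nowhere else.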
Simple Table Tool
Another tool Spotify uses is a Simple Table Tool. This tool displays the performance of tests, categorizing them into fast, slow, and flaky. The simplicity of this table allows teams to easily see which tests need attention. If a test is slow or flaky, it stands out, prompting engineers to investigate further.
This tool has had a significant impact. Within two months of its implementation, Spotify reduced its test flakiness from 6% to 4%. By providing clear visibility into test performance, the Simple Table Tool empowers teams to take swift action, improving the overall reliability of their test suites.
Flakybot Tool
To further enhance test reliability, Spotify developed a tool called Flakybot. Engineers use Flakybot to check for test flakiness before merging their code. When a pull request is made, Flakybot runs the tests and generates a report. If any tests are identified as flaky, engineers receive immediate feedback.
This quick feedback loop builds confidence in the test process. Engineers know whether their changes are introducing new flaky tests, allowing them to address issues before merging the code. This proactive approach helps maintain the integrity of the codebase and ensures that only stable, reliable tests make it into production.
Taking Control of Testing
Taking control of flaky tests starts with reliable detection and prevention. Trunk is building a tool to conquer flaky tests once and for all. You'll get all of the features of the big players' internal systems without the headache of managing them. With Trunk Flaky Tests, you'll be able to:
Autodetect the flaky tests in your build system
See them in a dashboard across all your repos
Quarantine tests with one click or automatically
Get detailed stats to target the root cause of the problem
Get reports weekly, nightly, or instantly sent right to email and Slack
Intelligently file tickets to the right engineer
If you’re interested in getting beta access, sign up here.