What Are Flaky Tests and Why Do They Matter?
Flaky tests are tests that show inconsistent results even when the code has not changed. Imagine running the same test on the same code: sometimes it passes, other times it fails. That inconsistency is what makes flaky tests so frustrating.
Flaky tests can have a big impact on development. They cause delays in deployment because developers spend extra time figuring out why a test failed. This can be both time-consuming and annoying. For example, you might push code that works perfectly on your local machine, but it fails when tested on another environment. This makes it hard to know if your code is really ready for deployment.
Examples of flaky tests include:
Tests that pass or fail based on the time they run.
Tests that fail because of differences in testing environments.
Tests that use random data, causing unpredictable results.
Addressing flaky tests is crucial for several reasons. It helps maintain the efficiency of Continuous Integration/Continuous Deployment (CI/CD) pipelines. When tests are reliable, you can trust the results and move your code to production faster. It also boosts developer confidence. Knowing that the tests are accurate means you can focus on writing good code instead of debugging test issues.
Why Do Flaky Tests Occur?
Flaky tests occur for several reasons, making them a common issue in software development. Understanding why they happen can help you prevent them.
Environmental Inconsistencies
Different testing environments can yield different results. For example, your code might pass tests on your local machine but fail on a CI server. This happens because the environments are not identical. Differences in operating systems, hardware, or software versions can cause these inconsistencies. Ensuring that all testing environments are as similar as possible helps reduce this issue.
Test Data Issues
Not refreshing test data between runs can lead to failures. Imagine a test that relies on a specific dataset. If this dataset isn't reset or refreshed before each test run, the test results become unreliable. This can cause tests to fail randomly, even if the underlying code hasn't changed. Ensuring that test data is consistent and fresh for every test run is essential.
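One way to do that is a fixture that rebuilds its data before every run. The sketch below uses pytest and SQLite purely for illustration; the fixture, table, and test names are made up.

```python
import sqlite3
import pytest

@pytest.fixture
def orders_db(tmp_path):
    # Build a brand-new database for every test so no state leaks between runs.
    conn = sqlite3.connect(str(tmp_path / "orders.db"))
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO orders (status) VALUES ('pending')")
    conn.commit()
    yield conn
    conn.close()

def test_pending_order_count(orders_db):
    # The assertion holds on every run because the fixture rebuilds the data.
    count = orders_db.execute(
        "SELECT COUNT(*) FROM orders WHERE status = 'pending'"
    ).fetchone()[0]
    assert count == 1
```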
Timing and Time Zone Problems
Time-dependent tests can fail unpredictably due to timing and time zone issues. For instance, a test might check if an event happens within a certain time frame. If the test runs in a different time zone, it might fail because the expected time doesn't match the actual time. Similarly, tests running at different times of the day might produce different results. Handling time zones and ensuring that tests are not overly dependent on specific times can help solve this problem.
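As an illustration, here is a hypothetical test that mixes a UTC timestamp with the machine's local date. It tends to pass on a CI server configured for UTC, but on a laptop in a distant time zone, or near midnight, it can fail even though nothing in the code changed.

```python
from datetime import datetime, timezone

def is_same_day(timestamp_utc: datetime) -> bool:
    # Hypothetical code under test: compares a UTC timestamp with the *local* date.
    return timestamp_utc.date() == datetime.now().date()

def test_event_logged_today():
    # Near midnight, or on a machine far from UTC, the local date differs from
    # the UTC date and this assertion fails even though no code changed.
    event_time = datetime.now(timezone.utc)
    assert is_same_day(event_time)
```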
Dependencies and Order of Execution
Tests influenced by the order they are run can also be flaky. Some tests might depend on the state set by previous tests. If the order of execution changes, these tests might fail. For example, a test that relies on a database entry created by a previous test might fail if that entry isn't there. Ensuring that each test is independent and doesn't rely on the order of execution helps make them more reliable.
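A minimal illustration of the problem, with made-up test names: the second test only passes if the first one already ran.

```python
# Module-level state shared between tests makes them order-dependent.
cache = {}

def test_creates_user():
    cache["user"] = {"id": 1, "name": "Ada"}
    assert cache["user"]["id"] == 1

def test_reads_user():
    # Passes only if test_creates_user ran first; fails when run alone
    # or when the runner shuffles test order.
    assert cache["user"]["name"] == "Ada"
```

A more reliable version would have each test build the data it needs, for example through a fixture, instead of reading whatever a previous test left behind.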
By understanding these common causes, you can take steps to reduce flaky tests in your projects. Keeping environments consistent, refreshing test data, managing time-related issues, and ensuring test independence are key strategies.
How GitHub's System Manages Flaky Tests
GitHub's system for managing flaky tests showcases significant improvements. Initially, the flaky test rate was high: 1 in 11 commits was affected. This meant developers often faced unexpected test failures, causing delays and frustration.
Initial Flaky Test Rate
Imagine pushing code and finding that 1 out of every 11 commits had a test fail for no clear reason. This happened frequently at GitHub. With such a high rate of flaky tests, developers wasted time diagnosing issues that weren't even related to their code changes. This slowed down the development process considerably.
Current Improvements
To address this, GitHub introduced a system designed to handle flaky tests. This system brought about a notable improvement. Now, less than 0.5% of commits are affected by flaky tests. This reduction signifies an 18x improvement, making tests more reliable and the development process smoother.
The System's Process
How does this system work? It's quite sophisticated. The system detects flaky tests automatically. When a test fails, the system checks if the failure is consistent or if it's a random occurrence. If the test results vary without any code changes, the system flags it as flaky.
Detection: The system compares the test results across multiple builds.
Retrying: It retries tests in different environments to see if they pass under varying conditions.
Flagging: If inconsistencies persist, the system marks the test as flaky.
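GitHub hasn't published this system as code, but the detection and flagging steps above boil down to a simple idea that can be sketched roughly like this: group results by commit and test, then flag any test that both passed and failed on the same code.

```python
from collections import defaultdict

def find_flaky_tests(build_results):
    """Flag tests whose outcome varies across builds of the same commit.

    build_results: iterable of (commit_sha, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for sha, test, passed in build_results:
        outcomes[(sha, test)].add(passed)
    # A test is suspect when the same code produced both a pass and a fail.
    return {test for (_, test), seen in outcomes.items() if len(seen) > 1}

results = [
    ("abc123", "test_checkout", True),
    ("abc123", "test_checkout", False),  # same commit, different outcome
    ("abc123", "test_login", True),
]
print(find_flaky_tests(results))  # {'test_checkout'}
```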
Developer Impact
One of the key benefits of this system is its impact on developers. Instead of notifying the entire team, the system only alerts the author of the flaky test. This targeted notification ensures that the person most likely to understand and fix the issue gets the information. It prevents unnecessary interruptions for other team members, allowing them to focus on their tasks.
By managing flaky tests efficiently, GitHub's system improves the overall workflow. Developers can trust that most test failures are genuine issues rather than random flukes. This trust in the system helps maintain a smooth and productive development pipeline.
How GitHub Detects Flaky Tests
Flaky test detection is crucial for maintaining a reliable CI/CD pipeline. A flaky test fails inconsistently, meaning it sometimes passes and sometimes fails without any changes to the code. This unpredictability can waste developers' time and slow down the release process.
Two-Pronged Approach
GitHub employs a two-pronged strategy to detect flaky tests:
Comparing Test Results Across Builds: The system monitors test outcomes over multiple builds. When a test fails in one build but passes in another with the same codebase, it raises a red flag. This comparison helps identify tests that don't behave consistently.
Retrying Tests: If a test fails, the system automatically retries it. By running the test again, GitHub can determine if the failure was a fluke. If the test passes on a retry, it's likely a flaky test.
Enhanced Retry Strategy
To further improve detection, GitHub adopts an enhanced retry strategy. This involves running tests in different environments and under various conditions:
Same Process Retry: The system first retries the test under the same conditions—same virtual machine, database, and host. If the test passes, it might indicate a random issue or a race condition.
Future Shift Retry: Next, the system retries the test with a time shift. This helps catch time-dependent flaky tests. If the test passes after the time shift, the failure could be due to incorrect time assumptions.
Different Host Retry: Lastly, the system runs the test on a completely different host. This checks if the failure was environment-specific. If the test passes here but fails in the initial retries, it suggests a dependency on the test order or shared state.
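As a rough, hypothetical sketch (not GitHub's actual code), the ladder of retries above can be expressed as a simple decision chain, where each callable re-runs the failing test under one of the conditions described:

```python
import datetime

def classify_failure(retry_same_host, retry_with_time_shift, retry_on_other_host):
    # Each argument is a callable that re-runs the failing test under one of
    # the conditions above and returns True if the test passes on retry.
    if retry_same_host():
        return "likely a random flake or race condition (passed on the same host)"
    if retry_with_time_shift(datetime.timedelta(hours=13)):
        return "likely time-dependent (passed after shifting the clock)"
    if retry_on_other_host():
        return "likely order- or shared-state-dependent (passed on a fresh host)"
    return "consistent failure - probably a real bug"
```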
Automation Benefits
Automation plays a crucial role in GitHub's flaky test detection. By automating these strategies, GitHub identifies 90% of flaky failures automatically. This high detection rate means developers spend less time diagnosing test failures and more time writing code.
Efficiency: Automated detection speeds up the identification process. The system works continuously, catching flaky tests as soon as they appear.
Accuracy: With multiple retries and environment checks, the system accurately distinguishes between genuine and flaky test failures.
Developer Relief: Automation reduces the burden on developers. Instead of manually checking tests, they can rely on the system to flag inconsistencies.
By using these methods, GitHub ensures that flaky tests are identified quickly and accurately, keeping the development process smooth and efficient.
How to Fix Flaky Tests
Fixing flaky tests is essential for maintaining a stable and efficient CI/CD pipeline. Here are effective strategies to tackle flaky tests:
Isolate the Test
First, make sure the test is isolated. External dependencies can cause tests to fail unpredictably.
Remove External Dependencies: Ensure the test does not rely on external services or databases. Use mock objects or stubs to simulate these dependencies. This isolation helps control the test environment and makes outcomes more predictable.
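For instance, here is a small pytest-style sketch using Python's unittest.mock; the PaymentGateway class and process_order function are hypothetical stand-ins for code that would normally call a real service.

```python
from unittest.mock import Mock

class PaymentGateway:
    """Hypothetical wrapper around an external payment service."""
    def charge(self, amount_cents: int) -> str:
        raise NotImplementedError("talks to a real network service in production")

def process_order(gateway: PaymentGateway, amount_cents: int) -> str:
    # Code under test depends on the gateway only through its interface.
    receipt = gateway.charge(amount_cents)
    return f"charged:{receipt}"

def test_process_order_with_mocked_gateway():
    gateway = Mock(spec=PaymentGateway)
    gateway.charge.return_value = "rcpt_42"
    # No real service is contacted, so the outcome is deterministic every run.
    assert process_order(gateway, 500) == "charged:rcpt_42"
    gateway.charge.assert_called_once_with(500)
```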
Eliminate Randomness
Tests should be deterministic, meaning they produce the same result every time they run. Random elements can lead to inconsistent test outcomes.
Remove Random Elements: If a test includes random data or relies on random functions, replace these with fixed values or a seeded generator. This ensures the test behaves the same way each time it runs (see the sketch after this list).
Control External Factors: Factors like network latency or system load can introduce randomness. Use tools to simulate these conditions consistently during each test run.
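Here is one way to remove randomness in Python: the code under test accepts a random generator, and the test passes in a seeded one. The generate_coupon_code function is a made-up example.

```python
import random

def generate_coupon_code(rng: random.Random, length: int = 8) -> str:
    # Hypothetical code under test that builds a random coupon code.
    alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
    return "".join(rng.choice(alphabet) for _ in range(length))

def test_coupon_code_is_stable():
    # Seeding the generator removes run-to-run variation entirely.
    code = generate_coupon_code(random.Random(1234))
    assert len(code) == 8
    assert code == generate_coupon_code(random.Random(1234))
```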
Stabilize Environments
Consistent testing environments are crucial. Differences in environments can lead to different test outcomes.
Standardize Test Environments: Use containerization tools like Docker to create consistent environments for tests. This standardization ensures that tests run in the same conditions every time.
Environment Configuration: Make sure the configuration of the testing environment is the same across all runs. This includes system settings, software versions, and hardware characteristics.
Address Time-Based Issues
Time-dependent tests can fail due to changes in the environment's time settings or time-related logic errors.
Mock Time Functions: Use libraries that allow you to mock time functions. This approach helps simulate a consistent time environment.
Avoid Real-Time Dependencies: Design tests to avoid dependencies on the current time. Instead, use fixed time values or mock objects to simulate time-based operations.
Review Time-Based Logic: Ensure that the code under test correctly handles edge cases like leap years, daylight saving time changes, and different time zones. Tests should account for these scenarios to avoid unexpected failures.
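Libraries such as freezegun can pin the clock for you; another option, sketched below with a hypothetical is_expired function, is to pass "now" into the code under test so the test can supply a fixed timestamp.

```python
from datetime import datetime, timezone
from typing import Optional

def is_expired(expires_at: datetime, now: Optional[datetime] = None) -> bool:
    # Hypothetical code under test: accepting "now" lets tests pin the clock.
    now = now or datetime.now(timezone.utc)
    return now >= expires_at

def test_token_expiry_with_fixed_clock():
    fixed_now = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
    expires_at = datetime(2024, 3, 10, 12, 30, tzinfo=timezone.utc)
    # The result no longer depends on when or where the suite runs.
    assert not is_expired(expires_at, now=fixed_now)
    assert is_expired(expires_at, now=fixed_now.replace(hour=13))
```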
By applying these strategies, you can effectively reduce the occurrence of flaky tests, leading to a more reliable and efficient development process.
How GitHub Measures the Impact of Flaky Tests
GitHub employs a robust system to measure the impact of flaky tests. This system helps identify which flaky tests need immediate attention by tracking their frequency and spread.
Impact Scoring
Impact scoring is a crucial method used by GitHub.
Tracking Failures: The system tracks how often a specific test fails and on how many branches. This data provides insight into the test's reliability.
Spread of Failures: It also monitors the number of developers and projects affected by the flaky test. Tests that fail across multiple branches and affect many developers receive higher impact scores.
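GitHub hasn't published the exact formula, but the idea can be sketched with a toy score that weights failure count, branch spread, and the number of developers affected; the weights below are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class FlakyTestStats:
    failures: int        # failures seen without any related code change
    branches: set        # branches where the flaky failure appeared
    developers: set      # developers whose builds were disrupted

def impact_score(stats: FlakyTestStats) -> int:
    # Illustrative weighting only: broad, frequent flakes rank highest.
    return stats.failures + 5 * len(stats.branches) + 10 * len(stats.developers)

checkout = FlakyTestStats(failures=12, branches={"main", "release"}, developers={"ada", "lin"})
login = FlakyTestStats(failures=3, branches={"main"}, developers={"ada"})
assert impact_score(checkout) > impact_score(login)
```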
Automated Issue Creation
Once the system identifies a high-impact flaky test, it automatically opens an issue.
Issue Details: The issue includes detailed information about the flaky test, such as the frequency of failures and affected branches. This information helps developers understand the scope and severity of the problem.
Immediate Notification: Developers receive notifications as soon as the system creates an issue. This prompt alert ensures that flaky tests do not go unnoticed.
Developer Assignment
To address flaky tests efficiently, GitHub links issues to the relevant developers and commits.
Commit Analysis: The system analyzes the commit history to identify the most likely source of the flaky test. It then assigns the issue to the developer responsible for the problematic code.
Developer Responsibility: By linking issues directly to developers, GitHub ensures that the individuals most familiar with the code can investigate and resolve the issue quickly.
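As a toy illustration of the idea (not GitHub's actual assignment logic), one simple heuristic is to ask git who last touched the failing test file:

```python
import subprocess

def likely_owner(test_file: str) -> str:
    # Toy heuristic: the author of the most recent commit that touched the
    # failing test file is a reasonable first guess for who should investigate.
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ae", "--", test_file],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(likely_owner("tests/test_checkout.py"))  # e.g. "ada@example.com"
```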
Prioritizing Fixes
Not all flaky tests are equally disruptive. GitHub prioritizes fixes based on the impact scores.
High-Impact Focus: Tests with the highest impact scores are addressed first. This prioritization ensures that the most disruptive flaky tests are resolved promptly, minimizing their impact on the development process.
Ongoing Monitoring: Even after resolving a flaky test, the system continues to monitor it. If the test starts failing again, it can be re-prioritized and addressed quickly.
By employing these strategies, GitHub effectively manages flaky tests, ensuring a smooth and efficient CI/CD pipeline. This system not only identifies and tracks flaky tests but also prioritizes and assigns them to the right developers for quick resolution.
Taking Control of Testing
Taking control of flaky tests starts with reliable detection and prevention. Trunk is building a tool to conquer flaky tests once and for all. You'll get the features of the big guys' internal systems without the headache of building and managing them yourself. With Trunk Flaky Tests, you'll be able to:
Autodetect the flaky tests in your build system
See them in a dashboard across all your repos
Quarantine tests with one click or automatically
Get detailed stats to target the root cause of the problem
Get reports weekly, nightly, or instantly sent right to email and Slack
Intelligently file tickets to the right engineer
If you’re interested in getting beta access, sign up here.