
How Uber manages Flaky Tests

By The Trunk Team, August 20, 2024


How Uber's CI Process Handles Large Codebases

Uber faces a major challenge with flaky tests. The company handles over 2,500 pull requests daily, which demands a strong Continuous Integration (CI) process. That process ensures new code doesn't break the existing codebase. Let's explore how Uber manages this.

Overview of Uber's CI Infrastructure: Handling 2,500+ Pull Requests Daily

Uber's CI system is robust. Each day, it processes more than 2,500 pull requests. This high volume of changes means that the CI system must be efficient and effective. Every pull request triggers thousands of tests to ensure new code doesn't introduce bugs. This setup helps maintain code quality and reliability.

Challenges with Monorepos: Complexity and Scale of Managing Tests

Monolithic repositories, or monorepos, present unique challenges. Uber uses monorepos to manage their codebase. This means all code, across different projects, lives in a single repository. The complexity and scale of managing tests in such an environment are significant. When a single change can affect many parts of the codebase, tests must be thorough and wide-ranging. This increases the risk of flaky tests, which can cause false test failures and slow down development.

Need for Scalability: Building Solutions That Grow with the Repository Size

As Uber's codebase grows, their CI system must scale. The system should handle more tests and more data without slowing down. To achieve this, Uber builds solutions that can grow with their repositories. They use dynamic partitioning and cone queries to manage data efficiently. This approach ensures that the CI system remains fast and reliable, even as the codebase expands.

Keeping the Main Branch Green: Importance of Reliable CI Pipelines

Keeping the main branch "green" means ensuring it is always in a good state. This is crucial for Uber's development process. A reliable CI pipeline helps achieve this by quickly identifying and fixing flaky tests. When the main branch stays green, developers can confidently merge their changes. This reduces downtime and keeps the development process smooth. Flaky tests undermine this reliability, making it essential to address them promptly.

Uber's approach to handling flaky tests involves a mix of strong infrastructure, scalable solutions, and a focus on reliability. By addressing these areas, they maintain a smooth and efficient development process.

Why Visibility and Customization are Crucial for Uber's Testing Strategy

Legacy Challenges: Lack of Detailed Visibility in Test Categorization

Uber's early testing systems struggled with visibility. The initial tools lacked detailed insights into test categorization. This meant engineers couldn't easily see which tests were flaky or why. Without this information, diagnosing and fixing flaky tests became a cumbersome task. Flaky tests would often go unnoticed or unresolved, causing delays and frustration.

Enhanced Visibility: Detailed Historical Data and Test Stats

To improve, Uber focused on enhancing visibility. They started collecting detailed historical data and test statistics. Every test run now logs extensive metadata, including execution time, failure rates, and specific error messages. This data helps engineers quickly identify patterns and root causes of flakiness. With better visibility, they can make informed decisions about which tests need attention and which ones can be trusted.

Extensible Customization: Supporting Various Strategies for Different Languages and Repos

Customization plays a key role in Uber's testing strategy. Not all projects are the same, and different languages or repositories have unique needs. Uber's system supports various strategies tailored to these differences. They built a flexible framework that allows teams to implement custom analyzers and strategies. This means a test in a Java project might be treated differently from one in a Python project. By supporting diverse strategies, Uber ensures that each project gets the specific attention it needs to manage flaky tests effectively.

Ownership and Accountability: Assigning Responsibility for Flaky Tests

Assigning ownership is another critical aspect of Uber's approach. They ensure that every flaky test has a designated owner. This means specific teams or individuals are responsible for addressing these tests. By creating clear accountability, Uber makes sure that flaky tests don't fall through the cracks. When a test starts failing, the responsible team gets notified and can take prompt action. This approach not only improves the speed of resolution but also enhances the overall quality of the codebase.

Uber's focus on visibility and customization in their testing strategy addresses the complexities of managing a large and diverse codebase. By enhancing data insights, supporting tailored strategies, and ensuring clear ownership, they tackle the challenge of flaky tests head-on. This methodical approach helps maintain a robust and reliable CI pipeline, essential for their fast-paced development environment.

Scalable Data Management: Cone Queries and Dynamic Partitioning

Cone Queries: Efficiently Querying Tests by Prefix

Uber deals with a massive number of tests daily. Efficiently querying these tests is crucial for performance. Cone queries allow Uber to search for tests using a prefix, such as "golang.unit/src/uber.com/infrastructure/*". This method speeds up the process by narrowing down the search to specific paths in the repository. Instead of sifting through millions of records, the system quickly locates relevant tests under a given prefix. This approach saves time and computational resources, making the CI pipeline more efficient.
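To make the idea concrete, here is a minimal sketch of what a cone query means semantically: match every test whose fully qualified name falls under a path prefix. The test names and the helper are illustrative, not Uber's code (in practice the lookup is served by the prefix table described below rather than a linear scan).

```python
# Hypothetical sketch of a cone query: return tests whose fully
# qualified name falls under a given path prefix.

TESTS = [
    "golang.unit/src/uber.com/infrastructure/storage:TestWrite",
    "golang.unit/src/uber.com/infrastructure/cache:TestEvict",
    "golang.unit/src/uber.com/rider/pricing:TestSurge",
]

def cone_query(tests, prefix):
    """Return every test whose name starts with the given path prefix."""
    # Strip a trailing wildcard so "a/b/*" and "a/b/" behave the same.
    prefix = prefix.rstrip("*")
    return [t for t in tests if t.startswith(prefix)]

matches = cone_query(TESTS, "golang.unit/src/uber.com/infrastructure/*")
print(matches)  # the two infrastructure tests
```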

Dynamic Partitioning: Flexible Bucketing Algorithm for Scalable Data Storage

Dynamic partitioning is another innovative method Uber uses. The system employs a flexible bucketing algorithm to manage data storage. Here's how it works:

  • Generate Bucket ID: Each test is assigned a unique bucket ID.

  • Identify Prefixes: The system identifies the first few prefixes of the test's path.

  • Store in Prefix Table: The bucket ID is stored along with these prefixes in a table.

For example, a test with the path "golang.unit/a/b/c/d:test" might be assigned bucket ID 10. The prefixes "a/b/c", "a/b", and "a" are stored with this bucket ID in a prefix table. This method allows for efficient data retrieval, as the system can quickly locate the relevant bucket IDs for a given prefix.
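The write path above can be sketched as follows. The prefix depth, the in-memory table, and the helper names are assumptions for illustration; in Uber's system the prefix table lives in MySQL.

```python
from collections import defaultdict

# Hypothetical sketch of the dynamic-partitioning write path: assign a
# test a bucket ID, then record the first few path prefixes so later
# cone queries can find the right buckets without a full scan.

prefix_table = defaultdict(set)  # prefix -> set of bucket IDs

def register_test(path, bucket_id, depth=3):
    # "golang.unit/a/b/c/d:test" -> suite "golang.unit", parts a/b/c/d
    suite, rest = path.split("/", 1)
    parts = rest.split(":")[0].split("/")
    # Store "a", "a/b", "a/b/c" against the bucket ID.
    for i in range(1, min(depth, len(parts)) + 1):
        prefix_table["/".join(parts[:i])].add(bucket_id)

def buckets_for_prefix(prefix):
    """Cone query support: which buckets may hold tests under this prefix?"""
    return prefix_table.get(prefix, set())

register_test("golang.unit/a/b/c/d:test", bucket_id=10)
print(buckets_for_prefix("a/b"))    # {10}
print(buckets_for_prefix("a/b/c"))  # {10}
```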

Read and Write Operations: Optimizing for Performance and Parallel Processing

Optimizing read and write operations is essential for handling large volumes of data. Uber's system uses parallel processing to manage these operations. By dividing tasks into smaller, concurrent operations, the system enhances performance. This approach ensures that the CI pipeline remains responsive even under heavy load.

  • Parallel Writes: Data is written to multiple partitions simultaneously, reducing the time needed for storage.

  • Parallel Reads: Queries are executed across multiple partitions in parallel, speeding up data retrieval.

This parallel processing framework ensures that Uber's CI pipeline can handle the massive scale of operations without bottlenecks.
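A minimal sketch of the parallel-read idea, fanning a query out across partitions and merging the results. The in-memory partitions stand in for real database shards; this is an illustration of the pattern, not Uber's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: execute one read per partition concurrently, then flatten.

PARTITIONS = {
    0: ["testA:pass", "testB:fail"],
    1: ["testC:pass"],
    2: ["testD:fail"],
}

def read_partition(pid):
    # Stand-in for a real partition query against the backing store.
    return PARTITIONS[pid]

def parallel_read(partition_ids):
    with ThreadPoolExecutor(max_workers=len(partition_ids)) as pool:
        results = pool.map(read_partition, partition_ids)
    return [row for rows in results for row in rows]

rows = parallel_read([0, 1, 2])
print(len(rows))  # 4
```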

MySQL Backend: Using Traditional Relational Databases for High Performance

Uber leverages traditional relational databases like MySQL for their backend. Despite the high volume of data, MySQL provides the required performance and reliability. The choice of MySQL allows Uber to benefit from:

  • Scalability: MySQL's support for sharding and partitioning helps manage large datasets.

  • Reliability: Known for its robustness, MySQL ensures data integrity and availability.

  • Performance: With optimized queries and indexing, MySQL delivers fast data access.

By combining these traditional databases with innovative querying and partitioning techniques, Uber achieves a scalable and efficient data management system. This setup supports their extensive CI pipeline, ensuring reliable and timely test results.

Configurable Analyzers: Tailoring Test Analysis to Specific Needs

Analyzer Interface: Common Interface for Custom and Default Analyzers

Uber's testing strategy hinges on a versatile analyzer interface. This interface allows for both custom and default analyzers to operate seamlessly. By providing a common interface, Uber ensures that all analyzers can interact with the CI pipeline in a standardized manner. This not only simplifies integration but also makes it easier to maintain and update the system. Custom and default analyzers can both plug into this interface, allowing for flexible and scalable test analysis.
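The shape of such an interface might look like the sketch below: every analyzer, default or custom, receives a test's recent results and decides whether it is flaky. The method name and signature are illustrative assumptions, not Uber's actual API.

```python
from abc import ABC, abstractmethod

# Hedged sketch of a common analyzer interface. Any analyzer that
# implements is_flaky() can plug into the pipeline the same way.

class Analyzer(ABC):
    @abstractmethod
    def is_flaky(self, results):
        """results is a recent window of outcomes, e.g. ["pass", "fail"]."""

class NeverFlaky(Analyzer):
    # Trivial analyzer showing how a custom strategy slots in.
    def is_flaky(self, results):
        return False

print(NeverFlaky().is_flaky(["pass", "fail"]))  # False
```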

Linear Analyzer: Default Strategy for Identifying Flaky Tests

The linear analyzer serves as Uber’s default strategy for detecting flaky tests. This analyzer examines test results over a specified window of runs. If a test fails even once within this window, it is flagged as flaky. This straightforward approach helps identify unreliable tests quickly and efficiently. The linear analyzer works well for most scenarios, providing a reliable baseline for test analysis. However, Uber recognizes that one size does not fit all, which is why they support custom analyzers for more complex situations.
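The rule above is simple enough to state in a few lines. The window size is an assumed parameter for illustration:

```python
# Sketch of the default "linear" strategy as described: look at the
# last N runs and flag the test as flaky if any of them failed.

def linear_is_flaky(results, window=20):
    return "fail" in results[-window:]

print(linear_is_flaky(["pass"] * 19 + ["fail"]))  # True
print(linear_is_flaky(["pass"] * 20))             # False
```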

Custom Analyzers: Tailored Solutions for Different Test Types and Error Patterns

Custom analyzers offer tailored solutions for various test types and error patterns. Different tests have unique characteristics and may require specialized analysis methods. For instance:

  • Integration Tests: These tests might be more prone to timeouts and failures due to their complexity. A percentage-based analyzer could be more effective here, flagging tests if their failure rate exceeds a certain threshold.

  • Resource-Intensive Tests: Tests that consume significant computational resources might need an analyzer that accounts for resource variability. This could involve analyzing the test’s performance under different load conditions to identify flakiness.

By allowing for custom analyzers, Uber ensures that each test type receives the most appropriate analysis. This flexibility helps in accurately identifying and addressing flaky tests, improving the overall reliability of the CI pipeline.
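The percentage-based analyzer mentioned for integration tests could be sketched like this, with an illustrative threshold and lookback window:

```python
# Sketch of a percentage-based custom analyzer: flag the test only if
# its failure rate over the lookback window crosses a threshold.

def percentage_is_flaky(results, threshold=0.1, window=100):
    recent = results[-window:]
    if not recent:
        return False
    failure_rate = recent.count("fail") / len(recent)
    return failure_rate > threshold

results = ["fail"] * 15 + ["pass"] * 85  # 15% failure rate
print(percentage_is_flaky(results))  # True
```

Under the same interface, this analyzer can replace the linear one for test suites where occasional failures are expected, without changing the rest of the pipeline.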

Reducing False Positives: Balancing Accuracy and Coverage in Test Detection

A critical challenge in test analysis is balancing accuracy and coverage to reduce false positives. False positives occur when a test is incorrectly flagged as flaky, which can lead to unnecessary reruns and wasted resources. Uber addresses this by fine-tuning its analyzers to strike the right balance.

  • Thresholds and Lookback Windows: Adjusting the thresholds for failure rates and the lookback windows for test runs helps in minimizing false positives. For example, a higher threshold might be set for tests known to be more stable, reducing the chances of incorrectly flagging them.

  • Error Pattern Recognition: Analyzers can be configured to recognize specific error patterns that are more indicative of true flakiness. This involves analyzing the types of errors and their frequency, allowing the system to differentiate between transient issues and genuine flaky tests.
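The two knobs above can be combined in a single check, as in this illustrative sketch: only failures matching known flaky signatures (e.g. timeouts) count toward a per-test failure-rate threshold. The patterns and threshold are assumptions, not Uber's configuration.

```python
import re

# Sketch: count only failures whose error message matches a known
# flaky signature, then compare the rate against a tunable threshold.

FLAKY_PATTERNS = [re.compile(p) for p in (r"timed? ?out", r"connection reset")]

def is_flaky(failure_messages, total_runs, threshold=0.05):
    matching = [msg for msg in failure_messages
                if any(p.search(msg.lower()) for p in FLAKY_PATTERNS)]
    return total_runs > 0 and len(matching) / total_runs > threshold

failures = ["request timed out", "assertion failed: got 3 want 4"]
print(is_flaky(failures, total_runs=10))  # True (1 timeout in 10 runs = 10%)
```

Note that the assertion failure is ignored: a deterministic wrong answer is a real bug, not flakiness, so it should fail the build rather than trigger quarantine.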

By focusing on these aspects, Uber ensures that its analyzers provide accurate and reliable results. This approach not only improves the efficiency of the CI pipeline but also builds developer confidence in the testing process.

Uber’s configurable analyzers, with their common interface, default strategies, and tailored solutions, form the backbone of their effective test management system. By reducing false positives and providing accurate test analysis, they maintain a robust and reliable CI pipeline.

Managing Flaky Tests: Strategies and Notifications

Treating Flaky Tests

Uber’s CI process emphasizes avoiding flaky tests whenever possible. Flaky tests can disrupt the entire CI pipeline, causing delays and confusion. To mitigate this, Uber provides general guidance to developers, recommending best practices for writing stable tests. This includes avoiding dependencies on external systems, ensuring tests are repeatable, and isolating test cases from each other. However, there are exceptions where running flaky tests is unavoidable.

For instance, some tests are deemed critical and must run regardless of their flakiness. Critical tests often cover vital functionalities that cannot be skipped. Uber’s CI system identifies these tests and ensures they are executed every time, even if they exhibit flaky behavior. This approach ensures that essential features remain under constant scrutiny, maintaining the integrity of the main codebase.

Notification System

Uber integrates its CI system with JIRA to manage flaky tests effectively. When a test is identified as flaky, the CI system automatically creates a JIRA ticket. This ticket includes details about the flaky test, such as the error patterns observed and the frequency of failures. Automated ticket creation streamlines the process, ensuring that no flaky test goes unnoticed.
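The automated ticket might carry a payload like the sketch below. The project key, fields, and helper are hypothetical; a real integration would POST this to JIRA's issue-creation REST endpoint.

```python
# Hedged sketch of automated flaky-test ticket creation: build the
# issue payload a CI system might send to JIRA. All names here are
# illustrative assumptions, not Uber's actual setup.

def build_flaky_ticket(test_name, failure_rate, sample_error, project="FLAKY"):
    return {
        "fields": {
            "project": {"key": project},
            "issuetype": {"name": "Bug"},
            "summary": f"Flaky test: {test_name}",
            "description": (
                f"Failure rate over lookback window: {failure_rate:.0%}\n"
                f"Sample error: {sample_error}"
            ),
        }
    }

ticket = build_flaky_ticket("storage:TestWrite", 0.12, "request timed out")
print(ticket["fields"]["summary"])  # Flaky test: storage:TestWrite
```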

Assigning ownership of these tickets is crucial. Each ticket is assigned to the responsible team or individual, ensuring accountability. This practice fosters a sense of responsibility among developers, prompting them to address flaky tests promptly. By assigning ownership, Uber ensures that flaky tests are not ignored and are resolved in a timely manner.

Continuous Improvement

Continuous improvement is a core principle in Uber’s approach to managing flaky tests. Feedback loops play a significant role in this process. By collecting data on test performance, Uber can identify trends and patterns in flaky tests. This data is then used to refine and improve the CI processes. For example, if a particular type of test frequently fails, Uber can investigate and address the underlying issues, leading to more stable tests in the future.

Monitoring and adjusting strategies based on test performance is another key aspect. Uber continuously monitors the effectiveness of its test management strategies. If a strategy proves ineffective, adjustments are made to improve its efficacy. This might involve tweaking the parameters for identifying flaky tests, updating the algorithms used by analyzers, or changing the notification thresholds. This adaptive approach ensures that Uber’s CI system evolves with the changing needs of its codebase, maintaining high standards of reliability and efficiency.

Taking Control of Testing

Taking control of flaky tests starts with reliable detection and prevention. Trunk is building a tool to conquer flaky tests once and for all. You’ll get all of the features of the big players’ internal systems without the headache of managing them. With Trunk Flaky Tests, you’ll be able to:

  • Autodetect the flaky tests in your build system

  • See them in a dashboard across all your repos

  • Quarantine tests with one click or automatically

  • Get detailed stats to target the root cause of the problem

  • Get reports weekly, nightly, or instantly sent right to email and Slack

  • Intelligently file tickets to the right engineer

If you’re interested in getting beta access, sign up here.

Try it yourself or
request a demo

Get started for free


Free for first 5 users