

By The Trunk Team | September 10, 2024

How to avoid and detect flaky tests in Playwright?

Playwright is a great tool for automated end-to-end testing of modern web apps. However, as with any approach to end-to-end testing, flaky tests can be a big problem. Flaky tests are tests that sometimes pass and sometimes fail without any changes in the code. This “flakiness” is often the result of problems in the underlying infrastructure or production code, not just buggy test code. It can cause headaches, especially in Continuous Integration (CI) pipelines. You need to understand what flaky tests are, why they happen, and how to avoid them to keep your testing smooth.

What Are Flaky Tests in Playwright?

Flaky tests produce different results from run to run. One day they pass; the next day they fail, even though you haven't changed anything in the code. This inconsistency makes it hard to trust your test results. Flaky tests with a relatively low flake rate are especially hard to reproduce: a 5-10% flake rate is difficult to trigger while you're working on a fix, yet still shows up frequently enough to have a noticeable effect in high-velocity repos.

In Playwright, flaky tests are particularly tricky because Playwright handles browser interactions. Browsers are complex and can behave differently based on various factors such as network speed or system load. When a Playwright test passes one time but fails the next, you likely have a flaky test on your hands.

Impact of Flaky Tests on CI Pipelines

Flaky tests cause major problems in CI pipelines. CI systems, like Jenkins or GitHub Actions, rely on consistent test results to verify code changes. When a flaky test fails, it can block your pipeline, delaying deployments.

When a test is flaky, it is difficult to know whether a failure indicates a real problem. This wastes both CI resources and time. If a developer manually investigates the failed CI job, they might waste an hour debugging before realizing the test is flaky. If your tests already have a reputation for being flaky, your engineers might default to rerunning every CI failure, which wastes CI resources.

Regardless of how your engineers react to a flaky failure in CI, the largest impact is when PRs are blocked from being merged. If your flake rate is high or your tests take a long time to run, flaky tests can block a PR for hours before it finally gets merged.

Why Do Flaky Tests Occur?

Flaky tests are a common issue in automated testing, and understanding why they happen can help you avoid them. Here’s a closer look at the main causes of flaky tests in Playwright.

Race Conditions Caused by Concurrent Operations

Race conditions occur when two or more operations run simultaneously and interfere with each other. They can result from buggy server-side code or from poor test isolation when tests run in parallel. For example, suppose two tests modify or depend on the same row in your database: if one test deletes the row while the other reads from it, the two tests race and one of them fails. These bugs are extremely difficult to reproduce because you usually run tests one at a time locally, which bypasses the race condition entirely.
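As a rough sketch of this kind of shared-state problem, imagine two Playwright tests that touch the same seeded database row through a hypothetical db test helper. Run in parallel, whichever test loses the race fails:

import { test, expect } from '@playwright/test';
import { db } from './helpers/db'; // hypothetical test database helper

test('deleting a user removes them from the admin list', async ({ page }) => {
  await db.users.delete(42); // mutates shared state other tests rely on
  await page.goto('https://localhost:3000/admin/users');
  await expect(page.getByText('user-42@example.com')).toHaveCount(0);
});

test('user profile page shows the user email', async ({ page }) => {
  // Reads the same row; fails if the test above deleted it first.
  await page.goto('https://localhost:3000/users/42');
  await expect(page.getByText('user-42@example.com')).toBeVisible();
});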

Slowdowns Due to Machine Performance Variability

The performance of the machine running your tests can vary due to factors such as CPU load or available memory, and those variations can cause tests to behave differently each time they run. Many CI runners are low-spec machines with limited memory, and if your pool of CI runners contains machines with different specs, your tests might occasionally run out of memory. These failures are hard to reproduce because your local dev machine is likely much more powerful than the CI runner that hit them.

Bugs in Test Scripts, Such as Hard-Coded Timeouts or Bad Setup/Teardown

A very common cause of flaky tests is simply low-quality test code. Common examples are hard-coded timeouts and bad setup/teardown code.

Hard-coded timeouts are unreliable in end-to-end tests because the timing of a complex system varies from run to run. If a test sleeps for a fixed interval, for example while waiting for a resource like a database or queue to be set up, or for a background job to finish, that wait is fragile. If the timeout isn't long enough, the test fails occasionally; if it's too long, it wastes CI resources. Better practices include polling for the resource or awaiting a promise or callback.
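As a rough illustration of polling instead of sleeping, here is how a fixed wait for a background job might be replaced with Playwright's expect.poll, assuming a hypothetical /api/jobs/latest status endpoint:

import { test, expect } from '@playwright/test';

test('export job eventually completes', async ({ page, request }) => {
  await page.goto('https://localhost:3000/exports');
  await page.getByRole('button', { name: 'Start Export' }).click();

  // Instead of `await page.waitForTimeout(30000)`, poll the job status
  // until it reports "done" or the timeout is exhausted.
  await expect
    .poll(async () => {
      const res = await request.get('https://localhost:3000/api/jobs/latest'); // hypothetical endpoint
      return (await res.json()).status;
    }, { timeout: 30_000, intervals: [1_000] })
    .toBe('done');
});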

Poor setup and teardown code is another example. If your tests leave behind artifacts, such as rows in a staging database or cache files, those artifacts affect subsequent test runs that read the same resources. To prevent this, each test should create the resources it needs and then clean up whatever it leaves behind.
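A minimal sketch of that pattern with Playwright's per-test hooks, assuming hypothetical createTestUser and deleteTestUser helpers:

import { test, expect } from '@playwright/test';
import { createTestUser, deleteTestUser } from './helpers/users'; // hypothetical helpers

let userId: string;

test.beforeEach(async () => {
  // Create the data this test needs instead of relying on leftovers from earlier runs.
  userId = await createTestUser({ name: 'cleanup-demo' });
});

test.afterEach(async () => {
  // Remove what the test created, even when the test itself failed.
  await deleteTestUser(userId);
});

test('new users can sign in', async ({ page }) => {
  await page.goto('https://localhost:3000/login');
  await expect(page.getByLabel('Email')).toBeVisible();
  // ...
});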

Collective Contribution of Multiple Factors to Test Flakiness

Often, it’s not just one thing that makes a test flaky but a combination of several factors. For example:

  • A slowdown in machine performance might exacerbate a race condition.

  • A bug in the test script might only show up when the server responds slower than usual.

These combined factors make pinpointing the exact cause of a flaky test challenging. To improve test reliability, you need to consider all potential issues and address them collectively.

Understanding these common causes can help you prevent flaky tests in Playwright. By being aware of race conditions, machine performance variability, and script bugs, you can take proactive steps to make your tests more stable.

How Do I Deal With Flaky Tests?

Flaky tests are inevitable if you write e2e tests at scale. Your software is complex, so you need some complex tests to provide confidence and validate your changes. If your tests are complex, you will have some flakiness in your tests. It’s a matter of economics; you can make your tests complex and reliable, but that will require many engineering hours that you may be unable to justify. Instead, you should focus on reducing the impact of flaky tests and tackle high-impact flaky tests efficiently.

To effectively reduce the impact of flaky tests, you should do the following:

  • Avoid flaky tests by learning common anti-patterns

  • Detect flaky tests with automated tools and communicate them with your team

  • Quarantine known flaky tests to mitigate their impact on the team

  • Fix high-impact flaky tests to get the best bang for your buck

Let’s walk through each of these steps in more detail.

How to Avoid Writing Flaky Tests in Playwright

Creating stable and reliable tests in Playwright requires specific practices. Here’s how you can avoid writing flaky tests:

Never Rely on Hard Waits

Fixed time delays, known as hard waits, are a common source of flaky tests. They can make tests unreliable and slow. Hard waits cause tests to pause for a fixed amount of time, regardless of whether the condition being waited for has been met. This can lead to tests passing in some environments but failing in others due to differences in execution speed, and it wastes time if the awaited event is completed faster than expected.

Example of a Test Using Hard Waits:

1test('"Load More" button loads new products', async ({ page }) => {
2 await page.goto('https://localhost:3000/products');
3 const loadMoreButton = await page.getByRole('button', { name: 'Load More' });
4 await loadMoreButton.click();
5 await page.waitForTimeout(10000); // Hard wait
6 const productNodes = await page.locator('.product').count();
7 expect(productNodes).toBe(20);
8});

This test waits 10 seconds regardless of whether the products have loaded.

Replacing Hard Waits with Web-First Assertions: Use Playwright’s built-in web-first assertions to wait for conditions dynamically:

1test('"Load More" button loads new products', async ({ page }) => {
2 await page.goto('https://localhost:3000/products');
3 const loadMoreButton = await page.getByRole('button', { name: 'Load More' });
4 await loadMoreButton.click();
5 await expect(page.locator('.product')).toHaveCount(20, { timeout: 10000 });
6});
7

This approach waits up to 10 seconds for the products to appear but doesn’t delay the test if they load sooner. By following these practices, you can minimize the occurrence of flaky tests in Playwright, ensuring more stable and reliable test results.

Control Your Testing Environment

If your test environment changes from run to run, your tests are much more likely to be flaky. Playwright e2e tests exercise your entire system, so their results depend on your entire system's state. If you run them against a persistent test environment or a shared dev/staging environment, that environment might differ each time the tests run.

If you’re using a persistent test environment, you must ensure that artifacts created by your tests, such as new database tables and rows, files created in storage, and new background jobs started, are properly cleaned up. Leftover data from run to run might affect the results of future test runs. For example, if a test creates a random user and the user persists in the test environment, future test runs may fail if the new unique ID collides with an old user’s ID. These flaky failures are hard to reproduce and debug, so it’s important to avoid them when the tests are first written.

Similarly, you should be careful when testing against a shared development or staging environment. Other developers create and destroy data in these environments as part of their work. This constantly changing environment can cause flakiness if someone accidentally updates or deletes tables and files used during testing, creates resources with unique IDs that collide with a test, or uses up the environment's compute resources and causes tests to time out.

The easiest way to avoid these problems is to use a fresh environment whenever possible when you run your tests. A transient environment that is set up before each test run and destroyed at the end helps ensure that every run starts from a consistent state with no artifacts left behind. If that's impossible, pay close attention to how you design your tests' setup and teardown to remove all dependencies between runs.
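If you manage transient environments yourself, Playwright's globalSetup and globalTeardown hooks are one place to wire in the provisioning. Here is a minimal sketch, assuming a hypothetical provisionEnvironment helper:

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './global-setup.ts',       // provisions a fresh environment before the run
  globalTeardown: './global-teardown.ts', // destroys it so nothing leaks into the next run
});

// global-setup.ts (sketch)
// import { provisionEnvironment } from './helpers/env'; // hypothetical helper
// export default async () => { process.env.BASE_URL = await provisionEnvironment(); };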

Beware of Random Numbers and Datetimes 

It can be tempting to use random numbers and current datetimes as test inputs, but they often introduce hard-to-catch flaky tests. Random numbers and datetimes can work in tests, but they also introduce flakiness when edge cases aren't properly considered.

For example, if you’re testing a money transfer feature with a random amount:

await page.getByLabel('Transfer Amount').fill(String(Math.random() * 10));
await page.getByText('Transfer').click();

This test will pass most of the time, but if the random amount is less than $0.01, it gets rounded to $0.00 and the test fails because you (presumably) can't transfer $0.00. That happens about 0.1% of the time: rare enough to be difficult to reproduce, but frequent enough to be annoying, especially if several tests have similar bugs.

Similar issues arise with datetimes. If you always use the current datetime, you can hit dates that are considered “invalid,” such as scheduling events outside of work hours or during holidays. Again, these failures can be hard to notice and reproduce because they might never appear while you're working during the day.
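One way to keep varied inputs without the edge cases is to constrain the random range and pin the browser clock. Here is a sketch; the URL and labels are assumptions, and the Clock API requires Playwright 1.45 or newer:

import { test } from '@playwright/test';

test('transfers work for typical amounts', async ({ page }) => {
  // Constrain the random range so rounding can never produce $0.00.
  const amount = (0.01 + Math.random() * 9.99).toFixed(2);

  // Pin the clock to a known-good business hour so the test never drifts
  // into weekends, holidays, or off-hours.
  await page.clock.setFixedTime(new Date('2024-09-10T10:00:00'));

  await page.goto('https://localhost:3000/transfer');
  await page.getByLabel('Transfer Amount').fill(amount);
  await page.getByText('Transfer').click();
});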

Limit External Resources

While it’s important to test all parts of your app, you should limit the number of tests that involve external resources, like external APIs and third-party integrations. These resources may go down, they might change, and you may hit rate limits. They’re all potential sources of flaky failures.

Mocking is a decent strategy for testing specific behaviors, but it should not be used as a replacement for end-to-end testing. You can’t entirely avoid external resources, but you should be mindful of which test suites involve these resources. Have some dedicated tests that involve external resources and mock them for other tests.
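For the tests where you do want to mock, Playwright can intercept network calls with page.route. A small sketch, assuming a hypothetical third-party shipping-rate API:

import { test, expect } from '@playwright/test';

test('checkout shows the shipping quote', async ({ page }) => {
  // Stub the third-party rate API so its outages and rate limits can't fail
  // this test; keep a separate, dedicated suite that hits the real service.
  await page.route('https://api.shippingpartner.example/**', (route) =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ quote: 4.99, currency: 'USD' }),
    })
  );

  await page.goto('https://localhost:3000/checkout');
  await expect(page.getByText('$4.99')).toBeVisible();
});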

Test Order and Concurrency

Running tests out of order or concurrently can cause some of the hardest-to-debug flaky tests. You should always aim to write tests that can run both out of order and concurrently, because as your suite grows, you'll need to run tests concurrently to keep CI job times reasonable.

Tests that can’t be run out of order or concurrently are a sign of poor isolation between tests. If one test relies on another test as setup, or two tests update and then read the same resource, they can cause nasty bugs. They also make the suite more flaky over time as you add new tests or modify existing ones when features change.
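To surface ordering and isolation problems early, you can run the whole suite fully in parallel locally. A minimal config sketch (the worker count is an arbitrary example):

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Run every test in every file in parallel; tests that secretly depend on
  // each other's order or shared state tend to fail here instead of later in CI.
  fullyParallel: true,
  workers: 4,
});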

Configure Automatic Retry

A possible way to deal with flaky tests is to configure Playwright to retry failed tests. If you only have a few flaky tests, this can keep them from failing your builds.

You can do so by passing the retries option:

# Give failing tests 3 retry attempts
npx playwright test --retries=3

Or by updating your Playwright config:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Give failing tests 3 retry attempts
  retries: 3,
});

Retries can be dangerous because flaky tests often indicate underlying problems with your infrastructure or production code. Sudden changes in test flakiness can signal real problems or bugs. Retries completely mask these issues until they become out of control.

Detecting Flaky Tests in Playwright

If you only have a few flaky tests, a great place to start is by having a central place to report them, such as a simple spreadsheet. To identify flaky tests, you can use commit SHAs and test retries. A commit SHA is a unique identifier for a specific commit in your version control system, like Git. If a test passes for one commit SHA but fails for the same SHA later, it’s likely flaky since nothing has changed. 

Collecting test results in CI and comparing results on the same commit is a great starting point for detecting flaky tests. Some flaky tests, however, are much harder to catch because their failures are infrequent. That forces you to rely on engineers to manually investigate, debug, and report flaky tests, which makes them context switch and eats up their time.

Instead, approach detecting and fixing flaky tests systematically. One strategy is to run new tests many times on the same commit to test for potential flakiness before they’re first introduced.
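For example, you can repeat a newly added spec on the same commit before merging it; the spec path here is just a placeholder:

# Run only the new spec 20 times on the same commit to surface flakiness early
npx playwright test tests/new-feature.spec.ts --repeat-each=20 --workers=4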

There are some nuances to consider for the different branches on which tests run. A test should not normally fail on the `main` branch, but it can be expected to fail more often on PR and feature branches. How much weight you give to flakiness signals from different branches will depend on your team's circumstances.

If you’re looking for a tool for this, Trunk Flaky Tests automatically detects and tracks flaky tests for you. Trunk also aggregates and displays the detected flaky tests, their past results, relevant stack traces, and flaky failure summaries on a single dashboard.

For example, you’ll get a GitHub comment on each PR, calling out if a flaky test caused the CI jobs to fail.

Learn more about Trunk here.

Quarantining Flaky Tests in Playwright

What do you do with flaky tests after you detect them? In an ideal world, you’d fix them immediately. In reality, they’re usually sent to the back of your backlog. You have project deadlines to meet and products to deliver, all of which deliver more business value than fixing a flaky test. What’s most likely is that you’ll always have some known but unfixed flaky tests in your repo, so the goal is to reduce their impact before you can fix them.

We’ve written in a past blog that flaky tests are harmful for most teams because they block PRs and reduce trust in tests. So once you know a test to be flaky, it’s important to stop it from producing noise in CI and blocking PRs. We recommend you do this by quarantining these tests.

Quarantining is the process of continuing to run a flaky test in CI without allowing failures to block PR merge. We recommend this method over disabling or deleting tests because disabled tests are usually forgotten and swept under the rug. We want our tests to produce less noise, not 0 noise. It’s important to note that studies have shown that initial attempted fixes for flaky tests usually don’t succeed. You need to have a historical record to know if the fix reduced the flake rate or completely fixed the flaky test. 

To quarantine tests in Playwright, you can use the tag feature. For example, you can tag tests you wish to quarantine like this:

import { test, expect } from '@playwright/test';

test('test login page', {
  tag: '@quarantined',
}, async ({ page }) => {
  // ...
});

To run all unquarantined tests, configure your CI job to fail on a non-zero exit code and use the following command:

npx playwright test --grep-invert @quarantined

To run quarantined tests, configure a separate CI job to continue on error and use the following command:

npx playwright test --grep @quarantined

A better way to approach this is to quarantine at runtime, which means quarantining failures without updating the code. As tests are labeled flaky or return to a healthy status, they should be quarantined and unquarantined automatically. This is especially important for large repos with many contributors. One possible approach is to keep a record of known flaky tests, run all tests, and then check whether every failure comes from a known flaky test. If so, override the exit code of the CI job so it passes.
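A rough sketch of that check follows; it is not Trunk's implementation. It assumes the suite ran with `npx playwright test --reporter=json > results.json` and that the report exposes a nested suites tree whose specs carry a title and an ok flag (the exact shape can vary across Playwright versions):

// quarantine-check.ts
import { readFileSync } from 'fs';

const KNOWN_FLAKY = new Set(['test login page']); // source this from your flaky-test record

type Spec = { title: string; ok: boolean };
type Suite = { suites?: Suite[]; specs?: Spec[] };

// Walk the report tree and collect the titles of failed specs.
function failedSpecs(suite: Suite): string[] {
  const own = (suite.specs ?? []).filter((s) => !s.ok).map((s) => s.title);
  const nested = (suite.suites ?? []).flatMap(failedSpecs);
  return [...own, ...nested];
}

const report = JSON.parse(readFileSync('results.json', 'utf-8')) as Suite;
const blocking = failedSpecs(report).filter((title) => !KNOWN_FLAKY.has(title));

if (blocking.length > 0) {
  console.error('Non-quarantined failures:', blocking);
  process.exit(1); // keep blocking the PR
} else {
  console.log('Only known flaky tests failed; letting the job pass.');
  process.exit(0);
}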

If you’re looking for a way to quarantine flaky tests at runtime, Trunk Flaky Tests can help you here. Trunk will check whether failed tests are known to be flaky and unblock your PRs if all failures can be quarantined. Learn more about Trunk’s Quarantining.

Fixing Flaky Tests in Playwright

We’ve covered some common anti-patterns earlier to help you avoid flaky tests, but if your flaky test is due to a more complex reason, how you approach a fix will vary heavily. We can’t show you how to fix every way your tests flake; it can be very complex. Instead, let’s cover prioritizing which tests to fix and reproducing flaky tests.

When deciding which tests to fix first, you need a way to rank them by impact. Working with our beta partners, we found that a good measure of impact is the number of PRs a flaky test blocks: fix the tests that block the most PRs first. Ultimately, we want to eliminate flaky tests because they block PRs from being merged when they fail in CI. You can track this by counting how often a known flaky test fails in CI on PR branches, either manually for smaller projects or automatically with a tool.

This also helps you justify the engineering effort put towards fixing flaky tests. When you reduce the number of blocked PRs, you save expensive engineering hours. You can further extrapolate the number of engineering hours saved by factoring in context-switching costs, which some studies estimate at roughly 23 minutes per context switch for knowledge workers. Justifying the time spent to leadership is often harder than fixing the flaky tests themselves.

After deciding on which tests to fix, you need to be able to reliably reproduce the flaky test. You can’t root cause and fix a bug if you can’t reproduce it. If you’re using a tool like Trunk Flaky Tests, you’ll see summaries of past failures and their full stack traces, making this much easier. But if you’re not, here are some tips to reproduce difficult flaky tests.

  • Reproduce the bug on the machines where it was first discovered. Some bugs occur because of memory or CPU limitations.

  • Pay attention to network setup. If you see failed-to-connect or connection timed-out errors, they’re more likely to be network-related.

  • Run the tests concurrently and in random order if this is how they’re run in CI. Keep track of the order they’re run. Many flaky tests are caused by unexpected side effects of one test impacting another.

  • Check for leftover processes, files, and database/cache entries. In many cases, flaky tests appear because of the accumulation of leftover artifacts. Reproducing in a clean environment might be very difficult.

Some flaky tests occur very rarely and are a pain to reproduce. While they won't be a high priority to fix, you shouldn't give up on reproducing them: they might reveal bugs in your production code, which are important to investigate. In these cases, your best bet is to add better telemetry and logging in CI and wait for the flake to reappear.

If you’re looking for a straightforward way to report flaky tests, see their impact on your team, find the highest impact tests to fix, and track past failures and stack traces, you can try Trunk Flaky Tests. Learn more about Trunk Flaky Tests dashboards.

Taking Control of Flaky Tests

Eliminating flaky tests takes a combination of well-written tests, good use of your test framework's capabilities, and good tooling. If you write e2e tests with Playwright and face flaky tests, know that you're not alone. You don't need to invent your own tools. Trunk can give you the tools needed to tackle flaky tests:

  • Autodetect the flaky tests in your build system

  • See them in a dashboard across all your repos

  • Quarantine tests manually or automatically

  • Get detailed stats to target the root cause of the problem

  • Get reports weekly, nightly, or instantly sent right to email and Slack

  • Intelligently file tickets to the right engineer

Try Trunk Flaky Tests

Try it yourself or request a demo.

Get started for free


Free for first 5 users