7 steps for fixing flaky tests

Every once in a while you might encounter a flaky unit test. I know, unit tests can’t be flaky by definition … but life gets in the way, so sometimes it happens. Such a test will consistently pass on most developer machines, but fail in a CI environment every once in a while. In this post I’ll share how I go about fixing such problems.

1. Error message, logs & stacktrace

The first step is to collect as much information about the problem as possible. If the unit test is well written, the error message will tell you exactly where the problem is. If that’s not the case, there are two immediate actions:

Action 1: Raise a new ticket in your work / bug tracking system to improve the error messaging for this test. This will save you time in the future, should a similar failure occur. It is important to track it, otherwise it won’t happen.

Action 2: Fall back to collecting logs & stacktraces from your CI in an attempt to get an indication of what’s wrong.

This first step has two main purposes:
– confirm whether the failure always happens in the same place (it should)
– get an indication of what the issue is
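As a rough sketch of what "better error messaging" could mean for a date comparison, here is a small plain-Kotlin helper. `assertSameInstant` is a hypothetical name, not part of any test library; the point is only that the message shows both values down to the millisecond, which `Date.toString()` leaves out.

```kotlin
import java.text.SimpleDateFormat
import java.util.Date
import java.util.Locale
import java.util.TimeZone

// Hypothetical helper: fails with a message that shows both dates including
// milliseconds, plus the difference between them.
fun assertSameInstant(expected: Date, actual: Date, context: String) {
    if (expected.time == actual.time) return
    // Fixed locale and zone so the message looks the same on every machine.
    val format = SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ", Locale.ROOT).apply {
        timeZone = TimeZone.getTimeZone("UTC")
    }
    throw AssertionError(
        "$context: expected ${format.format(expected)} but was ${format.format(actual)} " +
            "(difference ${actual.time - expected.time} ms)"
    )
}
```

With a message like this, a CI log alone is often enough to tell whether two dates differ by a day or by a single millisecond.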

2. Is this a real problem?

Now that you have an idea where the problem is, look at its surroundings – other tests, major classes and components impacted, screens where it should happen, etc. Play around with your app to see if that’s a real issue. Ask your wider team if they have seen it – other DEVs, QAs, Designers, Product Owners, etc. Check the Play Store reviews to see if users are complaining about this (or similar) issue.

The purpose is to get a feel for whether you have a real business problem or just something wrong in your tests. The former indicates you haven’t tested this part of the code correctly, so you need to revise it from scratch. The latter is slightly better, as your overall testing strategy is correct, only you’ll need to learn something new by fixing the flakiness 🙂

3. Reproduce it locally

A reproducible bug is a fixable bug. But the test passes locally, right?

[Image: the unit test passing when run in isolation]

IntelliJ / Android Studio has a handy tool to help us here – Run Configurations. You can access it by clicking on the Top Toolbar -> Test being run -> Edit Configurations -> Select your failing test -> Configuration -> Repeat.

[Image: the first step to edit a Run Configuration is to click on the name of your test in the top toolbar]

[Image: select Edit Configurations from the dropdown menu]

[Image: in the Run Configurations dialog, select Configuration - Repeat - Until Failure]

Run the flaky test Until Failure and see what happens. If you get a failure within say ~1000 runs, that’s great – now you have a reproducible, thus fixable bug.

If the test doesn’t fail, try running the whole class Until Failure. If it fails that’s great as well – you have a reproducible bug. You also learned the test doesn’t fail in isolation, but only when run together with others. This indicates the problem is state leaking between tests.
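To make "leaking state" concrete, here is a deliberately tiny sketch in plain Kotlin. All names (`UserCache`, the two "tests") are made up for illustration; the tests are ordinary functions returning `true` when they pass, so the example does not depend on a test framework.

```kotlin
// Hypothetical shared state: a singleton cache that outlives a single test.
object UserCache {
    val names = mutableListOf<String>()
}

// "Test" A leaves data behind in the shared cache.
fun testAddsUser(): Boolean {
    UserCache.names.add("alice")
    return UserCache.names.contains("alice")
}

// "Test" B assumes the cache starts empty, so it only passes if it runs first.
fun testCacheStartsEmpty(): Boolean = UserCache.names.isEmpty()

// Resetting the shared state before every test removes the order dependency.
fun resetSharedState() = UserCache.names.clear()
```

Run in isolation, both functions pass. Run `testAddsUser()` first and `testCacheStartsEmpty()` fails, which is the same pattern as a test that only breaks when the whole class is executed.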

If your project is nicely split into modules, repeat this process on the module level. Same rule as above – a failure will give you more information about the issue.
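Outside the IDE, the "Until Failure" idea can be approximated with a small loop. This is only a sketch: `runUntilFailure` is a hypothetical helper, not part of JUnit or IntelliJ, and it treats any thrown `Throwable` (including `AssertionError`) as a failed run.

```kotlin
// Runs the given test body up to maxRuns times and returns the 1-based index
// of the first failing run, or null if every run passed.
fun runUntilFailure(maxRuns: Int, test: (run: Int) -> Unit): Int? {
    for (run in 1..maxRuns) {
        try {
            test(run)
        } catch (e: Throwable) {
            println("Run $run failed: ${e.message}")
            return run
        }
    }
    return null
}
```

Calling something like `runUntilFailure(1000) { myFlakyCheck() }` (where `myFlakyCheck` stands for your own test body) gives you the run number of the first failure, which is handy when you want to compare logs from a passing and a failing run.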

4. Can you see the issue?

Sometimes the truth is before our eyes. We were lucky above and reproduced the issue locally:

[Image: running the test "Until Failure" allows us to see the failing test locally after a number of runs]

The test failure itself is pretty much unreadable:

The Diff is slightly better, but still takes too much effort to comprehend:

[Image: the standard Diff & Merge dialog, where the error is really hard to see due to the default colour scheme]

The screenshot above has the standard Diff&Merge dialog which is used in multiple places in a typical workflow. The most common are:
– when reviewing code changes
– when fixing conflicts
– when checking test failures

Its style can be tweaked via Android Studio -> Preferences -> Editor -> Color scheme -> Diff & Merge. For this particular use case, the important settings are the three Changed Lines -> Changed colours (remember to uncheck Inherit ignored color). This combination worked for me, but feel free to tweak it to your liking:

[Image: my custom Diff & Merge dialog colour scheme]

Now it’s better – it’s clear something is wrong with our dates.

[Image: the same Diff & Merge dialog, but the error is clearly visible due to the improved colour scheme]

5. Common pitfalls

This is the most generic part, as it depends on your project and tests. Below are a few things to watch out for. If a failing test employs some of these, more often than not they’ll be causing the issue.

5.1 Test Double issues?

If your test is using any kind of Test Doubles, be sure to double-check if they are used correctly. As tests usually follow the pattern of setup – execution – verification, be sure to check the setup and verification phases for your doubles. Some logs in this area can help. Top tip -> be sure to read the famous Mocks Aren’t Stubs article by Martin Fowler to challenge your understanding of Test Doubles.
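As a refresher on two common kinds of doubles, here is a hand-written sketch without any mocking library. `PriceService`, `Basket` and both doubles are hypothetical names invented for this example. The stub only supplies canned answers (you verify the resulting state), while the recording double also remembers how it was called (you verify the interaction), which is close to what Fowler's article calls a spy.

```kotlin
// The collaborator our code under test depends on (hypothetical example).
interface PriceService {
    fun priceOf(item: String): Int
}

// Code under test.
class Basket(private val prices: PriceService) {
    fun total(items: List<String>): Int = items.sumOf { prices.priceOf(it) }
}

// Stub: returns canned answers so the test can verify the resulting state.
class StubPriceService(private val canned: Map<String, Int>) : PriceService {
    override fun priceOf(item: String): Int = canned.getValue(item)
}

// Recording double: also remembers the calls it received,
// so the test can verify the interaction itself.
class RecordingPriceService(private val delegate: PriceService) : PriceService {
    val requestedItems = mutableListOf<String>()
    override fun priceOf(item: String): Int {
        requestedItems.add(item)
        return delegate.priceOf(item)
    }
}
```

When a flaky test uses doubles, it helps to ask which of the two styles each one is meant to be: a wrong canned answer shows up in the setup phase, a wrong expectation in the verification phase.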

5.2. Test Data issues?

This usually goes hand-in-hand with the above. If you are creating TestData for your tests, either manually or using a library (like Java-Faker, jFairy or DataFactory), it can be sabotaging your tests. Again, some logs can help. If you are manually generating it, make sure it is valid as per your business logic. And pay special attention to the next three points, as time, timezones and locales are often different on your CI environment.
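If your test data is randomised, one option worth considering is to drive it from an explicit seed and log that seed, so a failing CI run can be replayed with identical data. The sketch below uses only `kotlin.random.Random`; `TestUser` and `generateUsers` are hypothetical names and not related to any of the libraries mentioned above.

```kotlin
import kotlin.random.Random

data class TestUser(val id: Int, val name: String)

// Builds test users from an explicit seed, so a failing run can be replayed
// with exactly the same data by reusing the seed from the log.
fun generateUsers(seed: Long, count: Int): List<TestUser> {
    val random = Random(seed)
    return List(count) { index ->
        TestUser(id = index, name = "user-" + random.nextInt(0, 10_000))
    }
}
```

Two calls with the same seed return equal lists, which turns "it failed once on CI" into something you can re-run locally.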

5.3 Time

Time is a tricky concept to get right in programming in general. If your tests depend on Time, ensure you are in control of it by using concrete, specific moments in time or by using Test Doubles. Don’t forget the milliseconds, which must be set separately on a Calendar, but are part of date comparisons. If you are working with Calendars, the safest way is to explicitly set all of its components:

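val testDate = Calendar.getInstance().apply {
    // Note: Calendar months are zero-based, so 1 means February.
    set(2020, 1, 1, 14, 30, 30)
    // Milliseconds are not covered by the set(...) call above.
    set(Calendar.MILLISECOND, 123)
}
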
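A slightly stricter variant of the same idea, as a sketch: pin the time zone too and call `clear()` first, so nothing from "now" survives in the Calendar. `fixedCalendar` is a hypothetical helper name, the UTC zone is an assumption made for this example, and `Calendar.FEBRUARY` is simply the named constant for the month value `1`.

```kotlin
import java.util.Calendar
import java.util.TimeZone

// Builds a Calendar with every field set explicitly, including milliseconds
// and the time zone, so two calls always describe the same instant.
fun fixedCalendar(): Calendar =
    Calendar.getInstance(TimeZone.getTimeZone("UTC")).apply {
        clear()                                      // drop whatever "now" put in
        set(2020, Calendar.FEBRUARY, 1, 14, 30, 30)
        set(Calendar.MILLISECOND, 123)
    }
```

Because every component is fixed, `fixedCalendar().timeInMillis` is identical on every call and on every machine, which is exactly the property a date used in assertions needs.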
5.4 TimeZone

TimeZones are platform- or setup-specific. If they play a part in your tests, ensure you are in control of them by using a specific TimeZone or Test Doubles. A good practice is to have something like this in test classes that utilise TimeZone:

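fun setUp() {
    // The zone ID is an example; use whichever fixed zone your assertions assume.
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
}
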
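One thing to keep in mind is that the default TimeZone is global to the JVM, so overriding it in one test class can itself become leaking state for others. A possible way around that, sketched here with a hypothetical `withDefaultTimeZone` helper, is to restore the previous default once the test body has finished.

```kotlin
import java.util.TimeZone

// Runs the block with a fixed default TimeZone and restores the previous
// default afterwards, so the override cannot leak into other tests.
fun <T> withDefaultTimeZone(zoneId: String, block: () -> T): T {
    val previous = TimeZone.getDefault()
    TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
    try {
        return block()
    } finally {
        TimeZone.setDefault(previous)
    }
}
```

The same save-and-restore pattern also works split across a set-up and a tear-down method if you prefer to keep the test bodies untouched.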
5.5 Locale

Similar to the above, Locale can be an overlooked source of trouble. If used in your tests (directly or indirectly), ensure you are in control of it. You can add:

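fun setUp() {
    // The locale is an example; use whichever fixed locale your assertions assume.
    Locale.setDefault(Locale.UK)
}
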
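A complementary option is to pass the Locale explicitly wherever text is formatted, so the result no longer depends on the machine's default at all. The sketch below uses a hypothetical `formatPrice` function to show how the same number formats differently per Locale.

```kotlin
import java.util.Locale

// Formats a price with an explicit Locale instead of relying on the default,
// which may differ between a developer machine and CI.
fun formatPrice(amount: Double, locale: Locale): String =
    String.format(locale, "%.2f", amount)
```

For example, `formatPrice(1234.5, Locale.US)` gives `1234.50`, while `formatPrice(1234.5, Locale.GERMANY)` gives `1234,50`. An assertion on the formatted string would pass on one machine and fail on the other if the Locale were left implicit.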
5.6 Mutating Data

Mutating data in multiple places and abstraction levels is a programming anti-pattern that deserves a blog post on its own. In this context, just keep in mind that excessive data mutation is a common source of issues. If you feel you are doing this, consider validating your state at various stages in the process. Finding a wrong one is a clear indication of a problem.
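One frequent flavour of this problem is aliasing: two places hold a reference to the same mutable collection, and a change in one silently affects the other. The sketch below uses a hypothetical `Order` class to contrast sharing a list with taking a snapshot of it.

```kotlin
// Hypothetical order model used only for this illustration.
data class Order(val items: List<String>)

// Risky: the Order keeps a reference to the caller's list, so later
// mutations of that list silently change the Order as well.
fun orderSharingList(items: MutableList<String>): Order = Order(items)

// Safer: take a read-only snapshot at construction time.
fun orderWithSnapshot(items: MutableList<String>): Order = Order(items.toList())
```

If the caller adds an item after creating both orders, the first one changes and the second one does not. In a test this shows up as test data that is "sometimes" different from what the setup code appears to build.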

6. Debugging

There are many ways to debug an issue. A good summary of the most popular approaches can be found HERE. For a more in-depth look, I also recommend the book “Debugging” by David J. Agans.

As we are dealing with a flaky test, it’s best to start with the more passive debugging strategies.

6.1 Logging

A few logs in the right place can hand us the problem on a silver platter. The pitfalls from step 5 are the best places to start adding logs. Do this iteratively, as too much logging makes the output unreadable, thus less helpful.

6.2 Assertions

Assertions in your code and tests can help you track down the issue. They validate the state is correct at a given place, thus reducing the part of code you need to inspect using other debugging methods. Just don’t forget to remove these assertions from your Production code after fixing the issue.
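In Kotlin, the standard library's `check` (and its sibling `require`) is a lightweight way to express such a state validation. The sketch below uses a hypothetical `Booking` class; the invariant itself is only an example.

```kotlin
// Hypothetical booking model used only for this illustration.
data class Booking(val startMillis: Long, val endMillis: Long)

// Validates the state at a boundary; check() throws IllegalStateException
// with the given message as soon as the invariant is broken.
fun validated(booking: Booking): Booking {
    check(booking.startMillis < booking.endMillis) {
        "Booking ends before it starts: $booking"
    }
    return booking
}
```

Wrapping values in `validated(...)` at a few points along the data flow narrows down where the state first goes wrong: everything before the first failing check can be ruled out.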

6.3 Problem Simplification

This is a more involved technique where you substitute components of your app with simpler versions or, where possible, remove them altogether. The idea is to reduce complexity, dependencies and functionality, gradually eliminating possible sources of issues. Not only does this help pinpoint problems, it usually leads to ideas for a better design of your code.
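A minimal sketch of such a substitution, assuming the suspect component can be hidden behind an interface. `SettingsStore` and `InMemorySettingsStore` are hypothetical names; the real implementation might involve disk access, caching or threads, none of which exist in the simplified version.

```kotlin
// The component we suspect, hidden behind an interface (hypothetical names).
interface SettingsStore {
    fun read(key: String): String?
    fun write(key: String, value: String)
}

// Simplified stand-in: no disk, no threads, no caching. If the test stops
// flaking with this in place, the real implementation becomes the suspect.
class InMemorySettingsStore : SettingsStore {
    private val values = mutableMapOf<String, String>()
    override fun read(key: String): String? = values[key]
    override fun write(key: String, value: String) {
        values[key] = value
    }
}
```

If the flakiness survives the swap, you have ruled out one component and can move on to simplifying the next one.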

7. Putting it all together

How did these steps play out for our example?

In step 1 we saw the failure is happening in the same place. After asking around and testing ourselves, we don’t think it’s a real issue (step 2). Luckily we were able to reproduce the issue when running the test “Until Failure” (step 3). After tweaking the colour scheme in step 4 we can clearly see the issue has something to do with dates. This is a potential pitfall (5.3) alongside the Mocks (5.1) and TestData (5.2) used in the test. There is even a crossing point of all three where we are setting up a time-related mock. After adding a one-line log (6.1) there, we get the following output for the failing test:

Setting up mocks: 
-> 2020-07-30T16:49:12.564+0100 will be mocked to Sun Jul 19 16:49:12 BST 2020
-> 2020-07-31T16:49:12.564+0100 will be mocked to Tue Jul 28 16:49:12 BST 2020
Setting up mocks:
-> 2020-07-30T16:49:12.564+0100 will be mocked to Thu Jul 30 16:49:12 BST 2020
-> 2020-07-31T16:49:12.564+0100 will be mocked to Sat Aug 08 16:49:12 BST 2020

The error becomes clear – for some reason we are mocking the same datetime value to two different Date objects. In our mocking framework (Mockito) the last stubbing for a given call wins, which results in the wrong setup for our test. The next step is to backtrack and understand why these dates are the same.
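The effect can be imitated without any mocking library. The sketch below is a hand-rolled registry keyed by the input value; `DateStubRegistry` is a hypothetical class that only mimics the "last setup wins" behaviour and is not how Mockito is implemented.

```kotlin
import java.util.Date

// A hand-rolled stand-in for a stubbing registry, keyed by the argument.
// It only mimics the "last setup wins" behaviour; it is not Mockito.
class DateStubRegistry {
    private val answers = mutableMapOf<Long, Date>()

    fun whenAskedFor(inputMillis: Long, answer: Date) {
        answers[inputMillis] = answer   // an equal key silently replaces the old answer
    }

    fun answerFor(inputMillis: Long): Date? = answers[inputMillis]

    val size: Int get() = answers.size
}
```

Registering two answers for the same input leaves only one entry behind, and it is the second one. Nothing fails at setup time, which is why the problem only surfaces later as a confusing assertion failure.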

It turns out the below is called multiple times when building the TestData (5.2) needed for our tests:

private fun constructDummyDate(inFuture: Boolean) = Date(System.currentTimeMillis() + if (inFuture) TIME_OFFSET else NEGATIVE_TIME_OFFSET)

We are not in full control of our Time (5.3). If System.currentTimeMillis() returns the same value for two invocations of this method (which is possible when both happen within the same millisecond), we’ll be in trouble. Our tests need dynamically generated dates, so adding a random time offset into the mix makes such a collision very unlikely and fixes our flakiness:

private fun randomTimeOffset() = randomGenerator.nextInt()

private fun constructDummyDate(inFuture: Boolean) = Date(System.currentTimeMillis() + randomTimeOffset() + if (inFuture) TIME_OFFSET else NEGATIVE_TIME_OFFSET)
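A random offset makes a collision unlikely rather than impossible. If you need a hard guarantee, a different technique is to replace the clock and the randomness with a fixed base instant plus a counter. This is a sketch of that alternative, not the fix used in this post; `DummyDateFactory` and its parameters are hypothetical.

```kotlin
import java.util.Date
import java.util.concurrent.atomic.AtomicLong

// Hypothetical alternative to a random offset: start from a fixed base
// instant and add a counter, so every generated Date is unique and the
// sequence is identical on every run.
class DummyDateFactory(private val baseMillis: Long, private val offsetMillis: Long) {
    private val counter = AtomicLong()

    fun next(inFuture: Boolean): Date {
        val direction = if (inFuture) 1 else -1
        // Future dates land after the base instant, past dates before it.
        return Date(baseMillis + direction * (offsetMillis + counter.incrementAndGet()))
    }
}
```

The trade-off is that the dates are no longer tied to the real "now", so this only fits tests that do not compare against the system clock elsewhere.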

Even with the “Until Failure” option, our test passes all the time now:

[Image: confirmation that the test is passing now - it was run 10000 times and was still successful]

Although pretty high-level, I hope you’ll find these tips useful. Please focus on the described process, not the specific example. And if I have missed your favourite debugging strategy, please let me know in the comments.

Happy debugging!
