Flaky tests are:
- Tests that both pass and fail with the same code.
- Tests that used to pass but now sometimes fail.
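To make the definition concrete, here is a minimal sketch that reruns the same check against unchanged code and labels it flaky when the outcomes disagree. The name `classify_runs`, the labels, and the default of 10 attempts are assumptions for illustration, not part of any library.

```ruby
# Run the same check several times against unchanged code and classify it.
# `classify_runs` and its labels are hypothetical names for this sketch.
def classify_runs(attempts: 10)
  outcomes = Array.new(attempts) do
    begin
      yield
      :pass
    rescue StandardError
      # Any StandardError counts as a failure in this sketch.
      :fail
    end
  end

  if outcomes.uniq.size > 1
    :flaky        # both passes and failures with identical code
  elsif outcomes.first == :pass
    :stable_pass
  else
    :stable_fail
  end
end

# A deliberately nondeterministic check: it fails on every third call.
calls = 0
flaky_check = lambda do
  calls += 1
  raise "boom" if (calls % 3).zero?
end

puts classify_runs { flaky_check.call }   # prints "flaky"
puts classify_runs { 1 + 1 }              # prints "stable_pass"
```

Note that a test that fails every time is broken, not flaky; only mixed outcomes on the same code count.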
Flaky tests cause problems like:
- They cost engineering time. They are very hard to fix, and the person who fixes one is usually not its author.
- They are very expensive for your team. They get in the way when you need to deploy a hotfix.
- People lose confidence in your test suite. They stop writing tests, or become less willing to.
- Mitigations for flaky tests are costly: quarantining, rerunning flaky tests, retrying a few times, rerunning your tests hourly or daily.
Flaky tests are usually caused by:
- Testing against a fixed time and timezone
- Concurrency: things running at the same time
- Waits: asserting the state of elements that do not show up quickly enough in the UI
- Untidy cleanup between tests, or dependence on the order of tests
- Database results coming back in a different order than you expect; behavior can differ between local and CI
- 3rd-party libraries (Capybara)
- Checking datetime, time, or elapsed-time fields when the precision is not enough
- Checking for ambiguous HTML elements (two `.btn` on the page?)
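A few of these causes have small, deterministic counterparts. The sketch below is plain Ruby with no test framework; the helper names (`morning?`, `same_rows?`, `close_in_time?`, `wait_until`) and the default tolerances are made up for illustration and would need adapting to a real suite.

```ruby
# 1. Fixed time and timezone: pass the clock in instead of calling Time.now
#    inside the code, and do the calculation in UTC.
def morning?(now: Time.now)
  now.getutc.hour < 12
end

# 2. Database ordering: without an explicit ORDER BY the row order is not
#    guaranteed, so compare order-insensitively (or sort in the query).
def same_rows?(actual, expected)
  actual.sort == expected.sort
end

# 3. Time precision: a stored timestamp may lose sub-second digits,
#    so compare within a tolerance (in seconds) instead of with ==.
def close_in_time?(a, b, tolerance: 1.0)
  (a - b).abs <= tolerance
end

# 4. Waits: poll for a condition until a deadline instead of asserting
#    immediately or sleeping for a fixed amount of time.
def wait_until(timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

fixed_now = Time.utc(2024, 1, 1, 9, 0, 0)
puts morning?(now: fixed_now)                      # true

puts same_rows?([3, 1, 2], [1, 2, 3])              # true

written   = Time.utc(2024, 1, 1, 12, 0, 0, 123_456) # microseconds kept in memory
read_back = Time.utc(2024, 1, 1, 12, 0, 0)          # as if truncated to seconds
puts written == read_back                          # false
puts close_in_time?(written, read_back)            # true

ready = false
worker = Thread.new { sleep 0.1; ready = true }
puts wait_until { ready }                          # true
worker.join
```

The common thread is removing the hidden input (the wall clock, row order, sub-second digits, rendering speed) from what the assertion depends on.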
Flaky tests are usually tackled by:
- Automatic retry
- Rerunning until they pass
- Having a team that fixes them
- Skipping them and recording them somewhere
- Rerunning regularly
- Bisecting against test database state (seed) (minitest-bisect)
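Automatic retry and "record somewhere" can be combined so that a retry never hides a flaky test silently. The following is a rough sketch in plain Ruby; `with_retries` and `FLAKY_LOG` are hypothetical names, the in-memory array stands in for whatever tracker a team actually uses, and this is not the API of rspec-retry or any other gem.

```ruby
# Hypothetical in-memory log; a real setup would more likely write to a
# file, a dashboard, or an issue tracker.
FLAKY_LOG = []

# Retry a block up to max_attempts times. If it only passes after a
# failure, record it as flaky so that someone can follow up.
def with_retries(name, max_attempts: 3)
  attempt = 0
  begin
    attempt += 1
    result = yield(attempt)
    FLAKY_LOG << { name: name, attempts: attempt } if attempt > 1
    result
  rescue StandardError
    retry if attempt < max_attempts
    raise # still failing after the last attempt: surface the real error
  end
end

# Simulated flaky check: fails on the first attempt, passes on the second.
with_retries("checkout spec") do |attempt|
  raise "timeout" if attempt < 2
  :ok
end

puts FLAKY_LOG.length            # 1
puts FLAKY_LOG.first[:attempts]  # 2
```

Logging the retried tests keeps the cost visible: the build stays green, but the list of tests that needed a second attempt is what a fixing team or a regular rerun would work from.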
- Test Flakiness – Methods for identifying and dealing with flaky tests
  - Spotify's take on test flakiness.
- Eradicating Non-Determinism in Tests by Martin Fowler
  - Introduced the idea of quarantine.
- Athena: Our automated build health management system by Dropbox
- Flaky tests by GitLab
  - rspec-retry + quarantine
- “Why are my tests so slow?” A list of likely suspects, anti-patterns, and unresolved personal trauma
- Reducing flaky builds by 18x (GitHub Engineering)
  - The dashboard looks useful for finding what to fix.
-
  - Practical tips on how to tackle flaky tests.
- Broken windows theory: once one window is broken, more windows will get broken.
- An empirical study of flaky tests
  - The majority of flaky tests are caused by asynchronous waits, concurrency, and test order dependency.
  - Most flaky tests are flaky from the moment they are written; 15% became flaky later.
  - Google sees 1.6M test failures per day, of which 73K (4.5%) are flaky. Tests are rerun 10 times before being marked as flaky.
  - 97% of unit test failures in Apache are harmless (out of 21% that are flaky).
- Empirical Study of Restarted and Flaky Builds on Travis CI
  - Developers restart 1.72% of builds (961 builds of 56552). More mature and more complex projects are more prone to restarted builds. The restarted builds are caused by flakiness, network issues, and execution timeouts. Flakiness slows down the developer workflow, increasing merge time from 16h to 48h.
- Google Testing Blog: Where do our flaky tests come from?
  - Large tests
  - Some tools are flakier than others
- Google Testing Blog: Flaky Tests at Google and How We Mitigate Them
  - 1.5% to 2% of tests are flaky.
- Google Testing Blog: What Test Engineers do at Google: Building Test Infrastructure
- Google Testing Blog: Testing on the Toilet: What Makes a Good End-to-End Test?