Flaky

Flaky tests are:

Tests that both pass and fail with the same code.
Tests that passed before but now sometimes fail.

Flaky tests cause problems like:

Lost engineering time. They are very hard to fix, and the person who fixes one is usually not its author.
They are very expensive for your team. They get in the way when you need to deploy a hotfix.
People lose confidence in the test suite. They stop writing tests, or become less willing to.
Mitigations for flaky tests are costly: quarantine, rerunning flaky tests, retrying a few times, rerunning the suite hourly or daily.

Flaky tests are usually caused by:

Testing against a fixed time and time zone
Concurrency: things running at the same time
Waits: asserting on the state of elements that do not show up quickly enough in the UI
Untidy cleanup between tests, so results depend on the order of tests
The order of results from the database is not what you assume: it does not behave the same locally and on CI
Third-party libraries (Capybara)
Checking datetime, time, or elapsed-time fields when the precision is not sufficient
Checking for ambiguous HTML elements (two .btn on the page?)
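
Three of the causes above (clock-dependent logic, unordered database results, and lost timestamp precision) can be illustrated with a minimal plain-Ruby sketch. All helper names here (`greeting`, `fetch_user_names`, `same_members?`, `close_in_time?`) are hypothetical and exist only for illustration; `fetch_user_names` merely simulates a query without an explicit ordering.

```ruby
# Fragile: logic that reads the wall clock directly behaves differently
# near a boundary or in another time zone. Sturdier: pass the time in,
# so a test can pin it to a known value.
def greeting(now)
  now.hour < 12 ? "Good morning" : "Good afternoon"
end

# Hypothetical stand-in for a query without ORDER BY: rows come back
# in an order the caller cannot rely on.
def fetch_user_names(rows)
  rows.shuffle
end

# Sturdier than comparing arrays directly: compare order-insensitively
# (or add an explicit ordering to the query itself).
def same_members?(actual, expected)
  actual.sort == expected.sort
end

# Sturdier than exact equality: compare timestamps within a tolerance
# that matches the precision the storage layer actually keeps.
def close_in_time?(a, b, tolerance_seconds = 1.0)
  (a - b).abs <= tolerance_seconds
end

written = Time.utc(2024, 1, 1, 12, 0, 0, 987_654) # microseconds in memory
stored  = Time.at(written.to_i).utc                # truncated to whole seconds

puts greeting(Time.utc(2024, 1, 1, 9, 0, 0))
puts same_members?(fetch_user_names(%w[alice bob carol]), %w[alice bob carol])
puts written == stored             # exact comparison fails after truncation
puts close_in_time?(written, stored)
```

The common thread is that each assertion only depends on something the code under test actually guarantees: an injected time, set membership rather than row order, and a tolerance rather than exact equality.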

Flaky tests are usually tackled by:

Automatic Retry
Rerunning until they pass
Having a team that fixes them
Skipping them and recording them somewhere
Rerunning regularly
Bisecting against test database state (seed) (minitest-bisect)
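
As a rough illustration of combining "automatic retry" with "record somewhere", here is a minimal plain-Ruby sketch. `run_with_retry` and `FLAKY_LOG` are hypothetical names, not part of any library. The sketch rescues `StandardError` only; assertion failures in real test frameworks may not descend from it, so an actual setup would more likely rely on the framework's own hooks or a gem such as rspec-retry.

```ruby
# Hypothetical in-memory record of tests that needed a retry to pass.
FLAKY_LOG = []

# Run the block up to `attempts` times. If it passes only after at least
# one failure, record it as flaky; if every attempt fails, re-raise.
def run_with_retry(name, attempts: 3)
  failures = 0
  begin
    yield
  rescue StandardError => e
    failures += 1
    retry if failures < attempts
    raise e
  end
  FLAKY_LOG << { name: name, failures: failures } if failures > 0
  :passed
end

# Simulated flaky check: fails twice, then passes on the third attempt.
calls = 0
result = run_with_retry("checkout spec") do
  calls += 1
  raise "element not found" if calls < 3
end

puts result
puts FLAKY_LOG.inspect
```

Recording the retries matters as much as the retry itself: without the log, the retry hides the flakiness instead of surfacing a list of tests to fix.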

Test Flakiness – Methods for identifying and dealing with flaky tests
Spotify's take on test flakiness.
Eradicating Non-Determinism in Tests by Martin Fowler
- Introduced the idea of quarantine
Athena: Our automated build health management system by Dropbox
Flaky tests by GitLab
rspec-retry + quarantine
“Why are my tests so slow?” A list of likely suspects, anti-patterns, and unresolved personal trauma
Reducing flaky builds by 18x – GitHub Engineering
- The dashboard looks useful for finding what to fix.
Tests that sometimes fail
- Practical tips on how to tackle them
- Broken windows theory: when one window is broken, more windows will get broken.
Discourse’s flaky tests
An empirical study of flaky tests
- The majority of flaky tests are caused by asynchronous waits, concurrency and test order dependency.
- Most flaky tests are already flaky when they are written. 15% became flaky later.
- Google sees 1.6M test failures per day, of which 73K (4.5%) are flaky. Tests are repeated 10 times before being marked as flaky.
- 97% of unit test failures in Apache are harmless (of which 21% are flaky)
Empirical Study of Restarted and Flaky Builds on Travis CI
- Developers restart 1.72% of builds (961 builds of 56552). More mature or complex projects are more prone to restarting builds. Restarted builds are caused by flaky tests, network issues, and execution timeouts. Flakiness slows down the developer workflow and increases merge time from 16h to 48h.
Google Testing Blog: Where do our flaky tests come from?
- Large Tests
- Some tools are more flaky¹
Google Testing Blog: Flaky Tests at Google and How We Mitigate Them
- 1.5% to 2% of tests are flaky
Google Testing Blog: What Test Engineers do at Google: Building Test Infrastructure
Google Testing Blog: TotT: Avoiding Flakey Tests
Google Testing Blog: My Selenium Tests Aren’t Stable!
Google Testing Blog: Testing on the Toilet: What Makes a Good End-to-End Test?

Good posts on Flaky