DOI: 10.1145/3379597.3387482

What is the Vocabulary of Flaky Tests?

Published: 18 September 2020

Abstract

Flaky tests are tests whose outcomes are non-deterministic. Despite recent research activity on this topic, no effort has been made to understand the vocabulary of flaky tests. This work proposes to automatically classify tests as flaky or not based on their vocabulary. Static classification of flaky tests is important, for example, to detect the introduction of flaky tests and to search for flaky tests after they are introduced into regression test suites.
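To make the "vocabulary" idea concrete, here is a minimal Python sketch of extracting a test's vocabulary from its source code. It assumes NLTK for stemming; the regular expression, the vocabulary helper name, and the example test body are illustrative assumptions, not taken from the paper's artifact.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def vocabulary(source: str) -> list[str]:
    """Split identifiers (camelCase, PascalCase, digit runs) into stemmed,
    lowercased tokens. Regex and helper name are illustrative assumptions."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", source)
    return [stemmer.stem(w.lower()) for w in words]

print(vocabulary("void testJobServiceAction() { service.submitJob(); }"))
# -> ['void', 'test', 'job', 'servic', 'action', 'servic', 'submit', 'job']
```

Splitting compound identifiers before stemming is what lets individual words such as job or action be counted across differently named tests.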
We evaluated the performance of various machine learning algorithms to solve this problem. We constructed a data set of flaky and non-flaky tests by running each test case in a set of 64k tests 100 times (6.4 million test executions). We then used machine learning techniques on the resulting data set to predict which tests are flaky from their source code. Based on features such as counts of stemmed tokens extracted from source-code identifiers, we achieved an F-measure of 0.95 for the identification of flaky tests. The best prediction performance was obtained when using Random Forest and Support Vector Machines. Among the code identifiers most strongly associated with test flakiness, we noted that job, action, and services appear frequently in flaky tests. Overall, our results provide initial yet strong evidence that static detection of flaky tests is effective.
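As a rough, hedged sketch of the classification step, assuming a scikit-learn stack (this page does not state the authors' actual toolkit), the snippet below feeds a bag of stemmed identifier tokens to a Random Forest. The two toy samples and their labels are hypothetical stand-ins for the 64k tests that the study labeled by running each one 100 times.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def vocabulary(source: str) -> list[str]:
    # Same identifier-splitting tokenizer as the previous sketch.
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", source)
    return [stemmer.stem(w.lower()) for w in words]

# Toy stand-ins for the study's data set: each sample is one test's source
# code; label 1 = flaky, 0 = non-flaky (hypothetical examples).
test_sources = [
    "void testJobService() { service.submitJob(); Thread.sleep(500); }",
    "void testAddition() { assertEquals(4, add(2, 2)); }",
]
labels = [1, 0]

# Bag of stemmed tokens -> Random Forest, one of the two best-performing
# learners reported in the abstract.
model = make_pipeline(
    CountVectorizer(tokenizer=vocabulary, token_pattern=None),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(test_sources, labels)
print(model.predict(["void testActionTimeout() { action.await(); }"]))  # e.g. [1]
```

An SVM (e.g. sklearn.svm.SVC) could be swapped in for the Random Forest; note that the reported F-measure of 0.95 refers to the authors' full data set, not to a toy example like this one.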



Information & Contributors

Information

Published In

cover image ACM Conferences
MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories
June 2020
675 pages
ISBN: 9781450375177
DOI: 10.1145/3379597

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Regression testing
  2. Test flakiness
  3. Text classification



