DOI: 10.1145/3379597.3387482

What is the Vocabulary of Flaky Tests?

Published: 18 September 2020

Abstract

Flaky tests are tests whose outcomes are non-deterministic. Despite recent research activity on this topic, no effort has been made to understand the vocabulary of flaky tests. This work proposes to automatically classify tests as flaky or not based on their vocabulary. Static classification of flaky tests is important, for example, to detect the introduction of flaky tests and to search for flaky tests after they are introduced into regression test suites.
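To make the "vocabulary" idea concrete, here is a minimal Python sketch of extracting a test's vocabulary from its source code. It assumes NLTK for stemming; the regular expression, the vocabulary helper name, and the example test body are illustrative assumptions, not taken from the paper's artifact.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def vocabulary(source: str) -> list[str]:
    """Split identifiers (camelCase, PascalCase, digit runs) into stemmed,
    lowercased tokens. Regex and helper name are illustrative assumptions."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", source)
    return [stemmer.stem(w.lower()) for w in words]

print(vocabulary("void testJobServiceAction() { service.submitJob(); }"))
# -> ['void', 'test', 'job', 'servic', 'action', 'servic', 'submit', 'job']
```

Splitting compound identifiers before stemming is what lets individual words such as job or action be counted across differently named tests.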
We evaluated the performance of various machine learning algorithms to solve this problem. We constructed a data set of flaky and non-flaky tests by running each test case in a set of 64k tests 100 times (6.4 million test executions). We then used machine learning techniques on the resulting data set to predict which tests are flaky from their source code. Based on features such as counts of stemmed tokens extracted from source-code identifiers, we achieved an F-measure of 0.95 for the identification of flaky tests. The best prediction performance was obtained when using Random Forest and Support Vector Machines. Among the code identifiers most strongly associated with test flakiness, we noted that job, action, and services appear frequently in flaky tests. Overall, our results provide initial yet strong evidence that static detection of flaky tests is effective.
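As a rough, hedged sketch of the classification step, assuming a scikit-learn stack (this page does not state the authors' actual toolkit), the snippet below feeds a bag of stemmed identifier tokens to a Random Forest. The two toy samples and their labels are hypothetical stand-ins for the 64k tests that the study labeled by running each one 100 times.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def vocabulary(source: str) -> list[str]:
    # Same identifier-splitting tokenizer as the previous sketch.
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", source)
    return [stemmer.stem(w.lower()) for w in words]

# Toy stand-ins for the study's data set: each sample is one test's source
# code; label 1 = flaky, 0 = non-flaky (hypothetical examples).
test_sources = [
    "void testJobService() { service.submitJob(); Thread.sleep(500); }",
    "void testAddition() { assertEquals(4, add(2, 2)); }",
]
labels = [1, 0]

# Bag of stemmed tokens -> Random Forest, one of the two best-performing
# learners reported in the abstract.
model = make_pipeline(
    CountVectorizer(tokenizer=vocabulary, token_pattern=None),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(test_sources, labels)
print(model.predict(["void testActionTimeout() { action.await(); }"]))  # e.g. [1]
```

An SVM (e.g. sklearn.svm.SVC) could be swapped in for the Random Forest; note that the reported F-measure of 0.95 refers to the authors' full data set, not to a toy example like this one.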



Information & Contributors

Information

Published In

cover image ACM Conferences
MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories
June 2020
675 pages
ISBN: 9781450375177
DOI: 10.1145/3379597

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Regression testing
  2. Test flakiness
  3. Text classification



