research-article

Open access

FlakiMe: laboratory-controlled test flakiness impact assessment

Authors: Renaud Rwemalika, Adriano Franci, Mike Papadakis, Mark Harman

ICSE '22: Proceedings of the 44th International Conference on Software Engineering

Pages 982 - 994

https://doi.org/10.1145/3510003.3510194

Published: 05 July 2022

Abstract

Much research on software testing implicitly assumes that test failures are deterministic, so that a failure always witnesses the presence of the same defects. However, this assumption does not always hold, because some test failures are due to so-called flaky tests, i.e., tests with non-deterministic outcomes. To help testing researchers better investigate flakiness, we introduce a test flakiness assessment and experimentation platform called FlakiMe. FlakiMe supports the seeding of a controllable degree of flakiness into the behaviour of a given test suite. FlakiMe thereby equips researchers with ways to investigate the impact of test flakiness on their techniques under laboratory-controlled conditions. To demonstrate the application of FlakiMe, we use it to assess the impact of flakiness on mutation testing and program repair (the PRAPR and ARJA methods). The results indicate that 10% flakiness is sufficient to affect the mutation score, but the effect size is modest (2%-5%), while it reduces the number of patches produced for repair by 20% up to 100% of repair problems, a devastating impact on this application of testing. Our experiments with FlakiMe demonstrate that flakiness affects different testing applications in very different ways, thereby motivating the need for a laboratory-controllable flakiness impact assessment platform and approach such as FlakiMe.
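
The idea of seeding a controllable degree of flakiness into a test suite can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not FlakiMe's implementation or API: the names `seed_flakiness`, `flake_rate` and `observed_failure_rate` are hypothetical, and the sketch assumes that flakiness is modelled by turning a passing outcome into a failure with a fixed probability.

```python
import random
from typing import Callable

Test = Callable[[], bool]  # a test returns True on pass, False on failure


def seed_flakiness(test: Test, flake_rate: float, rng: random.Random) -> Test:
    """Wrap a deterministic test so that a passing run fails with
    probability `flake_rate`.

    Hypothetical helper for illustration only; it is not FlakiMe's API.
    """
    if not 0.0 <= flake_rate <= 1.0:
        raise ValueError("flake_rate must be in [0, 1]")

    def flaky_test() -> bool:
        passed = test()
        # Assumption: only passing outcomes are flipped, so genuine
        # failures are always preserved.
        if passed and rng.random() < flake_rate:
            return False
        return passed

    return flaky_test


def observed_failure_rate(test: Test, runs: int) -> float:
    """Run `test` `runs` times and return the fraction of failing runs."""
    failures = sum(1 for _ in range(runs) if not test())
    return failures / runs


def always_pass() -> bool:
    return True


def always_fail() -> bool:
    return False


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed keeps the experiment repeatable
    flaky = seed_flakiness(always_pass, flake_rate=0.10, rng=rng)
    rate = observed_failure_rate(flaky, runs=10_000)
    print(f"observed failure rate at 10% seeded flakiness: {rate:.3f}")
```

Passing an explicitly seeded `random.Random` instance rather than using the module-level generator is a design choice made here so that a given flakiness level can be replayed exactly, which is what a laboratory-controlled experiment requires.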

Published In

ICSE '22: Proceedings of the 44th International Conference on Software Engineering

May 2022

2508 pages

ISBN:9781450392211

DOI:10.1145/3510003

General Chair: Matthew B Dwyer (University of Virginia)

Program Chairs: Daniela Damian (University of Victoria, Canada), Andreas Zeller (CISPA, Germany)

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Luxembourg National Research Funds
European Research Council

Conference

ICSE '22

Sponsor:

SIGSOFT

ICSE '22: 44th International Conference on Software Engineering

May 21 - 29, 2022

Pittsburgh, Pennsylvania

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
