DOI: 10.1145/3293882.3330568
Research Article
Public Access

Mitigating the effects of flaky tests on mutation testing

Published: 10 July 2019

Abstract

Mutation testing is widely used in research as a metric for evaluating the quality of test suites. Mutation testing runs the test suite on generated mutants (variants of the code under test), where a test suite kills a mutant if any of the tests fail when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code under test. Without reliable test outcomes, mutation testing can produce unreliable results; e.g., in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%.
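The abstract's central mechanism, strategically re-running tests, can be made concrete with a small sketch. The Java snippet below is an illustration only, not the authors' actual modification to PIT; the RerunClassifier class, its method, and the re-run bound are hypothetical names and policies introduced here. It classifies a single mutant-test pair by re-running the test a bounded number of times and reporting UNKNOWN when the outcome flips.

```java
import java.util.function.BooleanSupplier;

// Possible statuses for a mutant-test pair. UNKNOWN marks pairs whose
// outcome could not be established reliably because the test is flaky.
enum MutantStatus { KILLED, SURVIVED, UNKNOWN }

final class RerunClassifier {

    // Classifies one mutant-test pair. testOnMutant runs the test against
    // the mutated code and returns true iff the test passes; testOnOriginal
    // runs the same test against the unmutated code.
    static MutantStatus classify(BooleanSupplier testOnMutant,
                                 BooleanSupplier testOnOriginal,
                                 int maxReruns) {
        boolean sawPass = false;
        boolean sawFail = false;
        for (int i = 0; i < maxReruns; i++) {
            if (testOnMutant.getAsBoolean()) {
                sawPass = true;
            } else {
                sawFail = true;
            }
            // The outcome flipped between runs: the pair is flaky.
            if (sawPass && sawFail) {
                return MutantStatus.UNKNOWN;
            }
        }
        if (sawPass) {
            return MutantStatus.SURVIVED; // passed on every re-run
        }
        // The test failed on the mutant every time. Confirm the failure is
        // caused by the mutant by checking that the test still passes on the
        // original code; otherwise the test itself is unreliable.
        return testOnOriginal.getAsBoolean()
                ? MutantStatus.KILLED
                : MutantStatus.UNKNOWN;
    }
}
```

One design choice in this (assumed) policy: a single failing run is never trusted on its own. A KILLED verdict additionally requires the test to pass on the unmutated code, which filters out tests that fail for reasons unrelated to the mutant, and pairs whose outcome keeps flipping are reported as unknown rather than silently counted toward the mutation score.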



Published In

ISSTA 2019: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2019
451 pages
ISBN: 9781450362245
DOI: 10.1145/3293882

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. flaky tests
  2. mutation testing
  3. non-deterministic coverage

Conference

ISSTA '19

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Article Metrics

  • Downloads (Last 12 months)168
  • Downloads (Last 6 weeks)25
Reflects downloads up to 12 Sep 2024

Cited By

  • (2024) Large Language Models for Equivalent Mutant Detection: How Far Are We? Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1733-1745. https://doi.org/10.1145/3650212.3680395. Online publication date: 11-Sep-2024.
  • (2024) Reproducing Timing-Dependent GUI Flaky Tests in Android Apps via a Single Event Delay. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1504-1515. https://doi.org/10.1145/3650212.3680377. Online publication date: 11-Sep-2024.
  • (2024) Mutation Testing of Java Bytecode: A Model-Driven Approach. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, pages 237-248. https://doi.org/10.1145/3640310.3674103. Online publication date: 22-Sep-2024.
  • (2024) Ripples of a Mutation — An Empirical Study of Propagation Effects in Mutation Testing. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1-13. https://doi.org/10.1145/3597503.3639179. Online publication date: 20-May-2024.
  • (2024) WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests. Proceedings of the ACM Web Conference 2024, pages 3043-3052. https://doi.org/10.1145/3589334.3645628. Online publication date: 13-May-2024.
  • (2024) Flakyrank: Predicting Flaky Tests Using Augmented Learning to Rank. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 872-883. https://doi.org/10.1109/SANER60148.2024.00095. Online publication date: 12-Mar-2024.
  • (2024) 230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers. 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), pages 257-268. https://doi.org/10.1109/ICST60714.2024.00031. Online publication date: 27-May-2024.
  • (2024) Test Code Flakiness in Mobile Apps: The Developer’s Perspective. Information and Software Technology, volume 168, article 107394. https://doi.org/10.1016/j.infsof.2023.107394. Online publication date: Apr-2024.
  • (2023) Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2067-2071. https://doi.org/10.1145/3611643.3613089. Online publication date: 30-Nov-2023.
  • (2023) To Kill a Mutant: An Empirical Study of Mutation Testing Kills. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 715-726. https://doi.org/10.1145/3597926.3598090. Online publication date: 12-Jul-2023.
