DOI: 10.1145/3293882.3330568
Research Article
Public Access

Mitigating the effects of flaky tests on mutation testing

Published: 10 July 2019

Abstract

Mutation testing is widely used in research as a metric for evaluating the quality of test suites. Mutation testing runs the test suite on generated mutants (variants of the code under test), where a test suite kills a mutant if any of the tests fail when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code under test. Without reliable test outcomes, mutation testing can produce unreliable results; e.g., in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%.
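The abstract's central mechanism, strategically re-running tests, can be made concrete with a small sketch. The Java snippet below is an illustration only, not the authors' actual modification to PIT; the RerunClassifier class, its method, and the re-run bound are hypothetical names and policies introduced here. It classifies a single mutant-test pair by re-running the test a bounded number of times and reporting UNKNOWN when the outcome flips.

```java
import java.util.function.BooleanSupplier;

// Possible statuses for a mutant-test pair. UNKNOWN marks pairs whose
// outcome could not be established reliably because the test is flaky.
enum MutantStatus { KILLED, SURVIVED, UNKNOWN }

final class RerunClassifier {

    // Classifies one mutant-test pair. testOnMutant runs the test against
    // the mutated code and returns true iff the test passes; testOnOriginal
    // runs the same test against the unmutated code.
    static MutantStatus classify(BooleanSupplier testOnMutant,
                                 BooleanSupplier testOnOriginal,
                                 int maxReruns) {
        boolean sawPass = false;
        boolean sawFail = false;
        for (int i = 0; i < maxReruns; i++) {
            if (testOnMutant.getAsBoolean()) {
                sawPass = true;
            } else {
                sawFail = true;
            }
            // The outcome flipped between runs: the pair is flaky.
            if (sawPass && sawFail) {
                return MutantStatus.UNKNOWN;
            }
        }
        if (sawPass) {
            return MutantStatus.SURVIVED; // passed on every re-run
        }
        // The test failed on the mutant every time. Confirm the failure is
        // caused by the mutant by checking that the test still passes on the
        // original code; otherwise the test itself is unreliable.
        return testOnOriginal.getAsBoolean()
                ? MutantStatus.KILLED
                : MutantStatus.UNKNOWN;
    }
}
```

One design choice in this (assumed) policy: a single failing run is never trusted on its own. A KILLED verdict additionally requires the test to pass on the unmutated code, which filters out tests that fail for reasons unrelated to the mutant, and pairs whose outcome keeps flipping are reported as unknown rather than silently counted toward the mutation score.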



Published In

ISSTA 2019: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2019
451 pages
ISBN: 9781450362245
DOI: 10.1145/3293882

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. flaky tests
  2. mutation testing
  3. non-deterministic coverage

Conference

ISSTA '19

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Article Metrics

  • Downloads (Last 12 months)168
  • Downloads (Last 6 weeks)25
Reflects downloads up to 12 Sep 2024

Cited By

  • (2024) Large Language Models for Equivalent Mutant Detection: How Far Are We? Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1733-1745. https://doi.org/10.1145/3650212.3680395. Online publication date: 11-Sep-2024.
  • (2024) Reproducing Timing-Dependent GUI Flaky Tests in Android Apps via a Single Event Delay. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1504-1515. https://doi.org/10.1145/3650212.3680377. Online publication date: 11-Sep-2024.
  • (2024) Mutation Testing of Java Bytecode: A Model-Driven Approach. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, pages 237-248. https://doi.org/10.1145/3640310.3674103. Online publication date: 22-Sep-2024.
  • (2024) Ripples of a Mutation — An Empirical Study of Propagation Effects in Mutation Testing. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1-13. https://doi.org/10.1145/3597503.3639179. Online publication date: 20-May-2024.
  • (2024) WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests. Proceedings of the ACM Web Conference 2024, pages 3043-3052. https://doi.org/10.1145/3589334.3645628. Online publication date: 13-May-2024.
  • (2024) Flakyrank: Predicting Flaky Tests Using Augmented Learning to Rank. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 872-883. https://doi.org/10.1109/SANER60148.2024.00095. Online publication date: 12-Mar-2024.
  • (2024) 230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers. 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), pages 257-268. https://doi.org/10.1109/ICST60714.2024.00031. Online publication date: 27-May-2024.
  • (2024) Test Code Flakiness in Mobile Apps: The Developer’s Perspective. Information and Software Technology, volume 168, article 107394. https://doi.org/10.1016/j.infsof.2023.107394. Online publication date: Apr-2024.
  • (2023) Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2067-2071. https://doi.org/10.1145/3611643.3613089. Online publication date: 30-Nov-2023.
  • (2023) To Kill a Mutant: An Empirical Study of Mutation Testing Kills. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 715-726. https://doi.org/10.1145/3597926.3598090. Online publication date: 12-Jul-2023.
