
QuoTe: Quality-oriented Testing for Deep Learning Systems

Published: 22 July 2023

Abstract

Recently, there has been significant growth of interest in applying software engineering techniques to the quality assurance of deep learning (DL) systems. One popular direction is DL testing: given a property to test, defects of DL systems are found either by fuzzing or by guided search with the help of certain testing metrics. However, recent studies have revealed that the neuron coverage metrics used by most existing DL testing approaches are not necessarily correlated with model quality (e.g., robustness, the most studied model property), nor are they an effective measure of confidence in the model quality after testing. In this work, we address this gap by proposing a novel testing framework called QuoTe (Quality-oriented Testing). A key part of QuoTe is a quantitative measure of (1) the value of each test case in enhancing the model property of interest (often via retraining) and (2) the convergence quality of the model property improvement. QuoTe uses this metric to automatically select or generate valuable test cases for improving model quality; the same metric also serves as a lightweight yet strong indicator of how well the improvement has converged. Extensive experiments on both image and tabular datasets with a variety of model architectures confirm the effectiveness and efficiency of QuoTe in improving DL model quality, namely robustness and fairness. As a generic quality-oriented testing framework, QuoTe can also be adapted to other domains (e.g., text) and other model properties.
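For illustration, the following is a minimal PyTorch sketch of quality-oriented test selection in the spirit described above: each candidate test case is scored with a first-order-loss (FOL)-style metric (the appendix figures plot FOL distributions), computed here as the L2 norm of the loss gradient with respect to the input, and the highest-scoring cases are kept for retraining. The function names, the budget parameter, and the exact scoring formula are assumptions made for exposition, not the article's implementation.

```python
# Hypothetical sketch: rank candidate test cases by a FOL-style score
# (L2 norm of the input gradient of the loss) and keep the top `budget`
# cases for retraining. Names and formula are illustrative assumptions.
import torch
import torch.nn.functional as F

def fol_score(model, x, y):
    """One score per test case: ||grad_x loss(model(x), y)||_2."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    # Flatten all non-batch dimensions, then take a per-sample L2 norm.
    return grad.flatten(1).norm(p=2, dim=1).detach()

def select_valuable_tests(model, candidates, labels, budget):
    """Keep the `budget` candidates with the largest FOL-style scores."""
    scores = fol_score(model, candidates, labels)
    top = scores.argsort(descending=True)[:budget]
    return candidates[top], labels[top]
```

Tracking how such a score distribution shrinks across retraining iterations is one plausible reading of how a lightweight metric can double as a convergence indicator, as the abstract claims for QuoTe's metric.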
Appendix

A Additional Figures

Figure A.1. The FOL distribution of FGSM and PGD attacks for different models.
Figure A.2. The FOL distribution of AEQUITAS and ADF testing algorithms for different models.
Figure A.3. The trend of robustness improvement (ATTACK) over iterations on the MNIST and CIFAR-10 datasets (with 95% confidence intervals).
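For context on Figure A.1, below is a hedged one-step sketch of FGSM, the attack that generates the inputs whose FOL distribution is plotted; PGD iterates the same signed-gradient step with projection back into the perturbation ball. The step size and the [0, 1] clamp assume normalized image inputs and are not details taken from the article.

```python
# Minimal FGSM sketch (single signed-gradient step); eps and the [0, 1]
# clamp are assumptions for normalized image inputs.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """x' = clamp(x + eps * sign(grad_x loss), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()
```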



Published In

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5 (September 2023), 905 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3610417
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 22 July 2023
Online AM: 10 February 2023
Accepted: 16 December 2022
Revised: 04 December 2022
Received: 02 March 2022
Published in TOSEM Volume 32, Issue 5

Author Tags

  1. Deep learning
  2. testing
  3. robustness
  4. fairness

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • Key R&D Program of Zhejiang
  • NSFC Program
  • Fundamental Research Funds for Central Universities
