
Explaining poor performance of text-based machine learning models for vulnerability detection

Empirical Software Engineering

Abstract

As software vulnerabilities increase in severity, machine learning models are being adopted to combat this threat, and research in this area has introduced a variety of approaches. Although models may differ in performance, there is an overall lack of explainability in understanding how a model learns and predicts. Furthermore, recent research suggests that models perform poorly when they detect vulnerabilities by interpreting source code as text, known as “text-based” models. To help explain this poor performance, we explore the dimensions of explainability. Building on recent studies of text-based models, we experiment with removing features that overlap between the training and testing datasets, which we deem “cross-cutting”. We conduct scenario experiments in which such “cross-cutting” data is removed and model performance is reassessed. Based on the results, we examine how removing these “cross-cutting” features affects model performance. Our results show that removing “cross-cutting” features may improve model performance in general, leading to explainable dimensions regarding data dependency and agnostic models. Overall, we conclude that model performance can be improved, and that explainable aspects of such models can be identified, through empirical analysis of model performance.
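
To make the experimental setup concrete, the sketch below illustrates the core idea of the “cross-cutting” removal scenarios: tokens that occur in both the training and testing splits are dropped from the feature space before a text-based model is retrained and re-evaluated with average precision. This is a minimal, hypothetical reconstruction assuming a bag-of-words representation and scikit-learn estimators; the function name evaluate, the Random Forest settings, and the tokenization are illustrative assumptions, not the study's exact pipeline.

    # Minimal sketch (hypothetical reconstruction, not the study's exact pipeline):
    # drop "cross-cutting" tokens shared by the training and testing splits,
    # then re-evaluate a text-based model with average precision.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import average_precision_score

    def evaluate(train_code, y_train, test_code, y_test, drop_cross_cutting=True):
        # Treat each source code sample as plain text (bag of whitespace-separated tokens).
        vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
        X_train = vectorizer.fit_transform(train_code)
        X_test = vectorizer.transform(test_code)

        if drop_cross_cutting:
            # "Cross-cutting" features: tokens appearing in both splits.
            # Drop their columns from both feature matrices (assumes at least
            # one non-overlapping token remains).
            train_tokens = {t for doc in train_code for t in doc.split()}
            test_tokens = {t for doc in test_code for t in doc.split()}
            overlap = train_tokens & test_tokens
            keep = [i for t, i in vectorizer.vocabulary_.items() if t not in overlap]
            X_train, X_test = X_train[:, keep], X_test[:, keep]

        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]  # predicted probability of "vulnerable"
        return average_precision_score(y_test, scores)

Comparing the returned score with and without drop_cross_cutting mirrors, in simplified form, the scenario experiments whose results are reported in the appendix tables.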


Data Availability

The datasets generated and/or analyzed during the current study are available in the “explainability_data” repository, https://github.com/krn65/explainability_data

Notes

  1. https://cve.mitre.org/

  2. https://nvd.nist.gov/

  3. https://github.com/krn65/explainability_data

  4. https://python.org

  5. https://scikit-learn.org/stable/

  6. https://keras.io

  7. https://tensorflow.org

  8. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
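
The performance results in this article are reported as average precision scores (APS), computed with the scikit-learn function referenced in note 8 above. A minimal usage example follows; the labels and scores are purely illustrative, not values from the study.

    from sklearn.metrics import average_precision_score

    # Illustrative values: 1 = vulnerable sample, 0 = non-vulnerable sample.
    y_true = [0, 0, 1, 1, 0, 1]
    y_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]  # model-predicted probabilities

    # Summarizes the precision-recall curve as a single score in [0, 1].
    print(average_precision_score(y_true, y_scores))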


Author information

Corresponding author

Correspondence to Kollin Napier.

Ethics declarations

Conflicts of Interest

The authors of this manuscript have no conflicts of interest.

Additional information

Communicated by: Yuan Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Additional Data and Results

Table 25 Average precision scores from Random Forest (RF) model when removing “cross-cutting” features from NC training data for projects
Table 26 Average precision scores from Linear Support Vector Classification (LSVC) model when removing “cross-cutting” features from NC training data for projects
Fig. 3 Correlation of overlap % of features and APS % in different scenarios. (a) Testing within projects. (b) Testing across projects. (c) Testing within CWE vulnerability types. (d) Testing across CWE vulnerability types

Fig. 4 Average cross-testing scores for RF models trained on projects

Table 27 Average precision scores from Multi-layer Perceptron (MLP) model when removing “cross-cutting” features from NC training data for projects
Table 28 Average precision scores from Bidirectional Long Short-Term Memory (BiLSTM) model when removing “cross-cutting” features from NC training data for projects
Table 29 Average precision scores from Random Forest (RF) model when removing “cross-cutting” features from NC training data for CWE vulnerability types
Table 30 Average precision scores from Linear Support Vector Classification (LSVC) model when removing “cross-cutting” features from NC training data for CWE vulnerability types
Table 31 Average precision scores from Multi-layer Perceptron (MLP) model when removing “cross-cutting” features from NC training data for CWE vulnerability types
Table 32 Average precision scores from Bidirectional Long Short-Term Memory (BiLSTM) model when removing “cross-cutting” features from NC training data for CWE vulnerability types
Table 33 Average precision scores from Random Forest (RF) model when removing “cross-cutting” features from NC training data and FC testing data for projects
Table 34 Average precision scores from Linear Support Vector Classification (LSVC) model when removing “cross-cutting” features from NC training data and FC testing data for projects
Table 35 Average precision scores from Multi-layer Perceptron (MLP) model when removing “cross-cutting” features from NC training data and FC testing data for projects
Table 36 Average precision scores from Bidirectional Long Short-Term Memory (BiLSTM) model when removing “cross-cutting” features from NC training data and FC testing data for projects
Fig. 5 Average cross-testing scores for LSVC models trained on projects

Fig. 6 Average cross-testing scores for MLP models trained on projects

Fig. 7 Average cross-testing scores for BiLSTM models trained on projects

Table 37 Average precision scores from Random Forest (RF) model when removing “cross-cutting” features from NC training data and FC testing data for CWE vulnerability types
Table 38 Average precision scores from Linear Support Vector Classification (LSVC) model when removing “cross-cutting” features from NC training data and FC testing data for CWE vulnerability types
Table 39 Average precision scores from Multi-layer Perceptron (MLP) model when removing “cross-cutting” features from NC training data and FC testing data for CWE vulnerability types
Table 40 Average precision scores from Bidirectional Long Short-Term Memory (BiLSTM) model when removing “cross-cutting” features from NC training data and FC testing data for CWE vulnerability types
Fig. 8 Average cross-testing scores for RF models trained on CWE vulnerability types

Fig. 9 Average cross-testing scores for LSVC models trained on CWE vulnerability types

Fig. 10 Average cross-testing scores for MLP models trained on CWE vulnerability types

Fig. 11 Average cross-testing scores for BiLSTM models trained on CWE vulnerability types

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Napier, K., Bhowmik, T. & Chen, Z. Explaining poor performance of text-based machine learning models for vulnerability detection. Empir Software Eng 29, 113 (2024). https://doi.org/10.1007/s10664-024-10519-8
