Research-Article

Active Code Learning: Benchmarking Sample-Efficient Training of Code Models

Published: 13 March 2024

Abstract

The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially for teams with limited budgets. Efficiently training code models with less human effort has therefore become an urgent problem. Active learning is a technique that addresses this issue by allowing developers to train a model on a reduced dataset while still achieving the desired performance; it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark for this critical problem: active code learning. Specifically, we collect 11 acquisition functions (the data-selection strategies used in active learning) from existing work and adapt them to code-related tasks. We then conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that the choice of features strongly affects active learning, and that using the model's output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective, producing models whose performance falls more than 29.64% short of the expected performance. Furthermore, we explore future directions for active code learning with an exploratory study: we propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
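To make the benchmarked setup concrete, below is a minimal sketch of one common output-vector-based acquisition function, a DeepGini-style uncertainty score that ranks unlabeled samples by the impurity of the model's softmax outputs. This is an illustration under stated assumptions, not the paper's implementation; the function names and toy probabilities are invented for the example.

```python
import numpy as np

def gini_scores(probs: np.ndarray) -> np.ndarray:
    """DeepGini-style impurity over softmax outputs: 1 - sum(p^2).
    Higher scores indicate samples the model is less certain about."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def select_batch(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain unlabeled samples."""
    scores = gini_scores(probs)
    return np.argsort(scores)[::-1][:budget]

# Toy usage: output vectors for five unlabeled code snippets in a binary
# task (e.g., clone detection), with a labeling budget of two samples.
probs = np.array([
    [0.90, 0.10],  # confident prediction -> low impurity
    [0.50, 0.50],  # maximally uncertain -> highest impurity
    [0.60, 0.40],
    [0.80, 0.20],
    [0.55, 0.45],
])
print(select_batch(probs, budget=2))  # -> [1 4]
```

In a full active learning round, the selected samples would be labeled by humans, added to the training set, and the model retrained before the next query.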



Published In

IEEE Transactions on Software Engineering, Volume 50, Issue 5, May 2024, 291 pages

Publisher

IEEE Press

