
xASTNN: Improved Code Representations for Industrial Practice

Published: 30 November 2023

Abstract

The application of deep learning techniques in software engineering has become increasingly popular. One key problem is developing high-quality, easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique into industrial practice. The proposed xASTNN has three advantages. First, xASTNN is based entirely on widely used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely related designs guarantee the effectiveness of xASTNN: a statement subtree sequence for code naturalness, a gated recursive unit for syntactical information, and a gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two downstream code comprehension tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that xASTNN improves on the state of the art while being faster than the baselines.
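To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: each statement subtree is encoded bottom-up by a gated recursive unit, and the resulting sequence of subtree vectors is fed to a gated recurrent unit (GRU). The class names, the toy (type_id, children) subtree format, the vocabulary size, and the max-pooling choice are illustrative assumptions; the paper's exact gating equations and its dynamic batching algorithm are not reproduced here.

import torch
import torch.nn as nn

class GatedRecursiveUnit(nn.Module):
    """Encodes one statement subtree bottom-up (syntactical information)."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(1000, dim)  # AST node-type vocabulary; size assumed
        self.gate = nn.Linear(2 * dim, dim)
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, node):
        # node is a toy (type_id, children) tuple assumed for this sketch.
        type_id, children = node
        h = self.embed(torch.tensor(type_id))
        for child in children:
            c = self.forward(child)  # recurse into the subtree
            z = torch.sigmoid(self.gate(torch.cat([h, c])))  # gate parent vs. child
            h = z * h + (1 - z) * torch.tanh(self.combine(torch.cat([h, c])))
        return h

class XASTNNSketch(nn.Module):
    """Statement-subtree sequence -> GRU (sequential information) -> code vector."""
    def __init__(self, dim=64):
        super().__init__()
        self.subtree_encoder = GatedRecursiveUnit(dim)
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, statement_subtrees):
        # One vector per statement subtree, kept in source order (code naturalness).
        vecs = torch.stack([self.subtree_encoder(t) for t in statement_subtrees])
        out, _ = self.gru(vecs.unsqueeze(0))     # (1, num_statements, 2 * dim)
        return out.max(dim=1).values.squeeze(0)  # max-pool over statements

# Toy usage: two "statements", each a tiny (type_id, children) subtree.
subtrees = [(1, [(2, []), (3, [(4, [])])]), (5, [(2, [])])]
print(XASTNNSketch()(subtrees).shape)  # torch.Size([128]) with dim=64, bidirectional

With dim=64 and a bidirectional GRU, this sketch yields a 128-dimensional code vector that a downstream classifier or clone detector could consume.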


Cited By

  • Trident: Detecting SQL Injection Attacks via Abstract Syntax Tree-based Neural Network. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (2024), 2225–2229. https://doi.org/10.1145/3691620.3695289
  • Fine-grained vulnerability detection for medical sensor systems. Internet of Things (2024), 101362. https://doi.org/10.1016/j.iot.2024.101362


Information

Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN: 9798400703270
DOI: 10.1145/3611643
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big code
  2. code feature learning
  3. neural code representation

Qualifiers

  • Research-article

Conference

ESEC/FSE '23

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 246
  • Downloads (Last 6 weeks): 33
Reflects downloads up to 05 Nov 2024
