
xASTNN: Improved Code Representations for Industrial Practice

Published: 30 November 2023

Abstract

The application of deep learning techniques in software engineering has become increasingly popular. One key problem is developing high-quality, easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique into industrial practice. The proposed xASTNN has three advantages. First, xASTNN is based entirely on widely used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely related designs guarantee the effectiveness of xASTNN: a statement subtree sequence for code naturalness, a gated recursive unit for syntactical information, and a gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two downstream code comprehension tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that xASTNN improves on the state of the art while being faster than the baselines.
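To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: each statement subtree is encoded bottom-up by a gated recursive unit, and the resulting sequence of subtree vectors is fed to a gated recurrent unit (GRU). The class names, the toy (type_id, children) subtree format, the vocabulary size, and the max-pooling choice are illustrative assumptions; the paper's exact gating equations and its dynamic batching algorithm are not reproduced here.

import torch
import torch.nn as nn

class GatedRecursiveUnit(nn.Module):
    """Encodes one statement subtree bottom-up (syntactical information)."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(1000, dim)  # AST node-type vocabulary; size assumed
        self.gate = nn.Linear(2 * dim, dim)
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, node):
        # node is a toy (type_id, children) tuple assumed for this sketch.
        type_id, children = node
        h = self.embed(torch.tensor(type_id))
        for child in children:
            c = self.forward(child)  # recurse into the subtree
            z = torch.sigmoid(self.gate(torch.cat([h, c])))  # gate parent vs. child
            h = z * h + (1 - z) * torch.tanh(self.combine(torch.cat([h, c])))
        return h

class XASTNNSketch(nn.Module):
    """Statement-subtree sequence -> GRU (sequential information) -> code vector."""
    def __init__(self, dim=64):
        super().__init__()
        self.subtree_encoder = GatedRecursiveUnit(dim)
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, statement_subtrees):
        # One vector per statement subtree, kept in source order (code naturalness).
        vecs = torch.stack([self.subtree_encoder(t) for t in statement_subtrees])
        out, _ = self.gru(vecs.unsqueeze(0))     # (1, num_statements, 2 * dim)
        return out.max(dim=1).values.squeeze(0)  # max-pool over statements

# Toy usage: two "statements", each a tiny (type_id, children) subtree.
subtrees = [(1, [(2, []), (3, [(4, [])])]), (5, [(2, [])])]
print(XASTNNSketch()(subtrees).shape)  # torch.Size([128]) with dim=64, bidirectional

With dim=64 and a bidirectional GRU, this sketch yields a 128-dimensional code vector that a downstream classifier or clone detector could consume.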


Cited By

  • Trident: Detecting SQL Injection Attacks via Abstract Syntax Tree-based Neural Network. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (2024), 2225–2229. https://doi.org/10.1145/3691620.3695289
  • Fine-grained vulnerability detection for medical sensor systems. Internet of Things (2024), 101362. https://doi.org/10.1016/j.iot.2024.101362


Information

Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN: 9798400703270
DOI: 10.1145/3611643
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big code
  2. code feature learning
  3. neural code representation

Qualifiers

  • Research-article

Conference

ESEC/FSE '23

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 246
  • Downloads (Last 6 weeks): 33
Reflects downloads up to 05 Nov 2024
