research-article

Syntax-guided program reduction for understanding neural code intelligence models

Authors:

Md Rafiqul Islam Rabin,

Mohammad Amin Alipour

MAPS 2022: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming

Pages 70 - 79

https://doi.org/10.1145/3520312.3534869

Published: 13 June 2022

Abstract

Neural code intelligence (CI) models are opaque black boxes that offer little insight into the features they use to make predictions. This opacity may lead to distrust in their predictions and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in input programs and thereby improve the transparency of CI models. However, this approach is syntax-unaware: it does not consider the grammar of the programming language.

In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided technique is faster than the syntax-unaware approach and yields smaller sets of key tokens in the reduced programs. We also show that the key tokens can be used to generate adversarial examples for up to 65% of the input programs.
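To make the idea of prediction-preserving, syntax-guided reduction concrete, here is a minimal sketch. It is not the authors' implementation: it uses Python's standard `ast` module on a Python snippet, and `toy_model` is a hypothetical stand-in for a CI model. The reducer deletes whole statements (never raw tokens), so every candidate it tries is syntactically valid, and it keeps a deletion only when the model's prediction is unchanged.

```python
import ast

def toy_model(source: str) -> str:
    """Hypothetical stand-in for a CI model that predicts a method name."""
    return "sum" if "+=" in source else "unknown"

def reduce_statements(source: str, predict) -> str:
    """Greedily delete whole statements while the prediction is preserved.

    Working on AST statements instead of raw tokens means every candidate
    program parses, which is the core idea behind syntax-guided reduction.
    Requires Python 3.9+ for ast.unparse.
    """
    target = predict(source)
    tree = ast.parse(source)
    changed = True
    while changed:
        changed = False
        for node in ast.walk(tree):
            body = getattr(node, "body", None)
            if not isinstance(body, list):
                continue  # node has no statement block
            for i in range(len(body)):
                if len(body) == 1:
                    break  # a block must keep at least one statement
                removed = body.pop(i)
                if predict(ast.unparse(tree)) == target:
                    changed = True  # deletion kept; rescan the smaller tree
                    break
                body.insert(i, removed)  # prediction changed; undo
            if changed:
                break
    return ast.unparse(tree)

program = '''
def f(values):
    total = 0
    print("start")
    for v in values:
        total += v
    return total
'''
reduced = reduce_statements(program, toy_model)
print(reduced)
```

The statements that survive are the ones the toy model's prediction depends on; in the paper's setting, the tokens remaining in the reduced program play the role of key features.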

Index Terms

Syntax-guided program reduction for understanding neural code intelligence models
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Published In

MAPS 2022: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming

June 2022

79 pages

ISBN:9781450392730

DOI:10.1145/3520312

General Chairs: Swarat Chaudhuri (University of Texas at Austin, USA), Charles Sutton (Google Research, USA)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Conference

MAPS '22: 6th ACM SIGPLAN International Symposium on Machine Programming

June 13, 2022

San Diego, CA, USA

Cited By

Cheng B, Zhao S, Wang K, Wang M, Bai G, Feng R, Guo Y, Ma L, Wang H (2024). Beyond Fidelity: Explaining Vulnerability Localization of Learning-Based Detectors. ACM Transactions on Software Engineering and Methodology 33:5 (1-33). Online publication date: 4-Jun-2024.
https://dl.acm.org/doi/10.1145/3641543
Shi C, Zhu T, Zhang T, Pang J, Pan M (2023). Structural-semantics Guided Program Simplification for Understanding Neural Code Intelligence Models. Proceedings of the 14th Asia-Pacific Symposium on Internetware (1-11). Online publication date: 4-Aug-2023.
https://dl.acm.org/doi/10.1145/3609437.3609438
Gharachorlu G, Sumner N, Just R, Fraser G (2023). Type Batched Program Reduction. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (398-410). Online publication date: 12-Jul-2023.
https://dl.acm.org/doi/10.1145/3597926.3598065
Islam Rabin M, Hussain A, Suneja S, Alipour M (2023). Study of Distractors in Neural Models of Code. 2023 IEEE/ACM International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE) (1-7). Online publication date: May-2023.
https://doi.org/10.1109/InteNSE59150.2023.00005
Abid S, Cai X, Jiang L (2023). Interpreting CodeBERT for Semantic Code Clone Detection. 2023 30th Asia-Pacific Software Engineering Conference (APSEC) (229-238). Online publication date: 4-Dec-2023.
https://doi.org/10.1109/APSEC60848.2023.00033
Rabin M, Hussain A, Alipour M, Hellendoorn V (2023). Memorization and generalization in neural code intelligence models. Information and Software Technology 153:C. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1016/j.infsof.2022.107066
Islam Rabin M, Amin Alipour M (2022). Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code. 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA) (843-848). Online publication date: Dec-2022.
https://doi.org/10.1109/ICMLA55696.2022.00140
