short-paper

Leveraging Artificial Intelligence on Binary Code Comprehension

Author:

Yifan Zhang

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering

Article No.: 125, Pages 1 - 3

https://doi.org/10.1145/3551349.3559564

Published: 05 January 2023

Abstract

Understanding binary code is an essential but complex software engineering task for reverse engineering, malware analysis, and compiler optimization. Unlike source code, binary code has limited semantic information, which makes it challenging for humans to comprehend. At the same time, compiling source to binary code, or transpiling among different programming languages (PLs), can provide a way to introduce external knowledge into binary comprehension. We propose to develop Artificial Intelligence (AI) models that aid human comprehension of binary code. Specifically, we propose to incorporate domain knowledge from large corpora of source code (e.g., variable names, comments) to build AI models that capture a generalizable representation of binary code. Lastly, we will use human studies of comprehension to investigate metrics for assessing the performance of models applied to binary code.



Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering

October 2022

2006 pages

ISBN:9781450394758

DOI:10.1145/3551349

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Short-paper
Research
Refereed limited

Conference

ASE '22

ASE '22: 37th IEEE/ACM International Conference on Automated Software Engineering

October 10 - 14, 2022

Rochester, MI, USA

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
