DOI: 10.1145/3551349.3559564
short-paper

Leveraging Artificial Intelligence on Binary Code Comprehension

Published: 05 January 2023

Abstract

Understanding binary code is an essential but complex software engineering task for reverse engineering, malware analysis, and compiler optimization. Unlike source code, binary code carries limited semantic information, which makes it hard for humans to comprehend. At the same time, compiling source code to binary code, or transpiling among different programming languages (PLs), can provide a way to introduce external knowledge into binary comprehension. We propose to develop Artificial Intelligence (AI) models that aid human comprehension of binary code. Specifically, we propose to incorporate domain knowledge from large corpora of source code (e.g., variable names, comments) to build AI models that capture a generalizable representation of binary code. Lastly, we will use human studies of comprehension to investigate metrics for assessing the performance of models applied to binary code.

          Published In

          ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
          October 2022
          2006 pages
          ISBN:9781450394758
          DOI:10.1145/3551349

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • Short-paper
          • Research
          • Refereed limited

          Conference

          ASE '22

          Acceptance Rates

          Overall Acceptance Rate 82 of 337 submissions, 24%

          Article Metrics

          • Total Citations: 0
          • Total Downloads: 134
          • Downloads (Last 12 months): 53
          • Downloads (Last 6 weeks): 4
          Reflects downloads up to 10 Nov 2024
