DOI: 10.1145/3597503.3623306

Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

Published: 06 February 2024
Abstract

Code review is an essential activity for ensuring the quality and maintainability of software projects, but it is time-consuming and often error-prone, which can significantly slow the development process. Recently, ChatGPT, a cutting-edge language model, has demonstrated impressive performance on a variety of natural language processing tasks, suggesting its potential to automate parts of the code review process. However, it is still unclear how well ChatGPT performs on code review tasks. To fill this gap, we conduct the first empirical study of ChatGPT's capabilities in code review, focusing on automated code refinement based on given review comments. For the study, we use the existing CodeReview benchmark and additionally construct a new, high-quality code review dataset. We take CodeReviewer, a state-of-the-art code review model, as the baseline for comparison. Our results show that ChatGPT outperforms CodeReviewer on code refinement: on the high-quality dataset, ChatGPT achieves Exact Match (EM) and BLEU scores of 22.78 and 76.44, respectively, while the state-of-the-art method achieves only 15.50 and 62.88. We further identify the root causes of the cases in which ChatGPT underperforms and propose several strategies to mitigate them. Our study provides insights into the potential of ChatGPT for automating the code review process and highlights promising directions for future research.
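To make the task concrete: given a code snippet and a reviewer's comment, the model must produce the revised snippet, which is then scored against the developer's actual revision with EM (does the prediction match the ground truth?) and BLEU (n-gram overlap). The sketch below illustrates that setup; it is a minimal illustration, not the paper's pipeline. The prompt wording, the whitespace normalization inside exact_match, and the BLEU smoothing are assumptions, and the model name gpt-3.5-turbo is one member of the ChatGPT API family the study targets.

    # Minimal sketch (not the paper's pipeline): ask a ChatGPT-family model to
    # refine code given a review comment, then score the output with EM and BLEU.
    # The prompt template, normalization, and BLEU smoothing are assumptions.
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
    from openai import OpenAI  # assumes the openai>=1.0 Python client

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def refine_code(old_code: str, review_comment: str) -> str:
        """Ask the model to apply a reviewer's comment to a code snippet."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # ChatGPT API family evaluated in the study
            temperature=0,          # low randomness for reproducible evaluation
            messages=[
                {"role": "system",
                 "content": "You are an expert developer. Reply with only the revised code."},
                {"role": "user",
                 "content": (f"Code:\n{old_code}\n\n"
                             f"Review comment:\n{review_comment}\n\n"
                             "Revise the code to address the comment.")},
            ],
        )
        return response.choices[0].message.content.strip()

    def exact_match(candidate: str, reference: str) -> bool:
        """EM: the prediction equals the developer's actual revision
        (here compared after collapsing whitespace -- an illustrative choice)."""
        norm = lambda s: " ".join(s.split())
        return norm(candidate) == norm(reference)

    def bleu(candidate: str, reference: str) -> float:
        """4-gram BLEU on whitespace tokens, smoothed for short snippets."""
        return sentence_bleu(
            [reference.split()], candidate.split(),
            smoothing_function=SmoothingFunction().method1,
        )

Averaging these two measures over a test set yields dataset-level EM and BLEU figures comparable in kind, though not necessarily in implementation detail, to the scores reported above.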




Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
May 2024, 2942 pages
ISBN: 9798400702174
DOI: 10.1145/3597503
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        In-Cooperation

        • Faculty of Engineering of University of Porto

        Publisher

Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 06 February 2024


        Qualifiers

        • Research-article

        Conference

        ICSE '24

        Acceptance Rates

Overall Acceptance Rate: 276 of 1,856 submissions (15%)



Cited By

• Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study. Journal of Medical Internet Research 26 (2024), e54419. DOI: 10.2196/54419. Online: 22 Apr 2024.
• Automated Commit Intelligence by Pre-training. ACM Transactions on Software Engineering and Methodology (2024). DOI: 10.1145/3674731. Online: 1 Jul 2024.
• CYCLE: Learning to Self-Refine the Code Generation. Proceedings of the ACM on Programming Languages 8, OOPSLA1 (2024), 392-418. DOI: 10.1145/3649825. Online: 29 Apr 2024.
• Generative AI-Enabled Conversational Interaction to Support Self-Directed Learning Experiences in Transversal Computational Thinking. Proceedings of the 6th ACM Conference on Conversational User Interfaces (2024), 1-12. DOI: 10.1145/3640794.3665542. Online: 8 Jul 2024.
• Research of Applicability of Natural Language Processing Models to the Task of Analyzing Technical Tasks and Specifications for Software Development. 2024 XXVII International Conference on Soft Computing and Measurements (SCM) (2024), 200-203. DOI: 10.1109/SCM62608.2024.10554107. Online: 22 May 2024.
• Evaluating the quality of student-generated content in learnersourcing: A large language model based approach. Education and Information Technologies (2024). DOI: 10.1007/s10639-024-12851-4. Online: 17 Jul 2024.
