Reassessing automatic evaluation metrics for code summarization tasks

Research article · Open access · DOI: 10.1145/3468264.3468588
Published: 18 August 2021

Abstract

In recent years, research in the domain of source code summarization has adopted data-driven techniques pioneered in machine translation (MT). Automatic evaluation metrics such as BLEU, METEOR, and ROUGE are fundamental to the evaluation of MT systems and have been adopted as proxies for human evaluation in the code summarization domain. However, the extent to which automatic metrics agree with the gold standard of human evaluation has not been assessed on code summarization tasks. Despite this, marginal improvements in metric scores are often used to discriminate between the performance of competing summarization models. In this paper, we present a critical exploration of the applicability and interpretation of automatic metrics as evaluation techniques for code summarization tasks. We conduct an empirical study with 226 human annotators to assess the degree to which automatic metrics reflect human evaluation. Results indicate that metric improvements of less than 2 points do not guarantee systematic improvements in summarization quality, and are unreliable as proxies of human evaluation. When the difference between metric scores for two summarization approaches increases but remains within 5 points, some metrics such as METEOR and chrF become highly reliable proxies, whereas others, such as corpus BLEU, remain unreliable. Based on these findings, we make several recommendations for the use of automatic metrics to discriminate model performance in code summarization.
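To make concrete what one of the metrics named above computes, the following is a minimal pure-Python sketch of a chrF-style score (a simplification of Popović's character n-gram F-score). It is not the paper's evaluation code, and real evaluations should use an established implementation such as sacreBLEU; the function names and the whitespace-stripping choice here are illustrative assumptions.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts; whitespace is dropped (a common chrF choice)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision/recall, F-beta.

    beta=2 weights recall twice as heavily as precision, as in the
    original metric. Returns a score in [0, 1].
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if sum(hyp.values()) and sum(ref.values()):
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because such scores aggregate surface n-gram overlap, two summaries a human would judge very differently can sit within a point or two of each other, which is the kind of small gap the study above finds unreliable for ranking models.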


Cited By

  • (2024) SimLLM: Calculating Semantic Similarity in Code Summaries using a Large Language Model-Based Approach. Proceedings of the ACM on Software Engineering, 1(FSE), 1376–1399. DOI: 10.1145/3660769
  • (2024) Do Code Summarization Models Process Too Much Information? Function Signature May Be All That Is Needed. ACM Transactions on Software Engineering and Methodology, 33(6), 1–35. DOI: 10.1145/3652156
  • (2024) FastLog: An End-to-End Method to Efficiently Generate and Insert Logging Statements. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 26–37. DOI: 10.1145/3650212.3652107
  • (2024) An Extractive-and-Abstractive Framework for Source Code Summarization. ACM Transactions on Software Engineering and Methodology, 33(3), 1–39. DOI: 10.1145/3632742
  • (2024) Deep Is Better? An Empirical Comparison of Information Retrieval and Deep Learning Approaches to Code Summarization. ACM Transactions on Software Engineering and Methodology, 33(3), 1–37. DOI: 10.1145/3631975
  • (2024) Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization. IEEE Transactions on Software Engineering, 50(8), 2077–2095. DOI: 10.1109/TSE.2024.3422274
  • (2024) Automatic Commit Message Generation: A Critical Review and Directions for Future Work. IEEE Transactions on Software Engineering, 50(4), 816–835. DOI: 10.1109/TSE.2024.3364675
  • (2024) Duration-Based Investigation of User Content Choices in the Exit of Filter Bubbles. 2024 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 1–8. DOI: 10.1109/HORA61326.2024.10550588
  • (2024) Transformers in source code generation: A comprehensive survey. Journal of Systems Architecture, 153, 103193. DOI: 10.1016/j.sysarc.2024.103193
  • (2024) Automatic smart contract comment generation via large language models and in-context learning. Information and Software Technology, 168. DOI: 10.1016/j.infsof.2024.107405


Published In

ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, August 2021, 1690 pages.
ISBN: 9781450385626
DOI: 10.1145/3468264
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. automatic evaluation metrics
  2. code summarization
  3. machine translation

Conference

ESEC/FSE '21
Overall Acceptance Rate: 112 of 543 submissions, 21%


