skip to main content
10.1145/3238147.3240471acmconferencesArticle/Chapter ViewAbstractPublication PagesaseConference Proceedingsconference-collections
research-article

A neural framework for retrieval and summarization of source code

Published: 03 September 2018 Publication History

Abstract

Code retrieval and summarization are two tasks often employed by software developers to reuse code that spreads over online repositories. In this paper, we present a neural framework that allows bidirectional mapping between source code and natural language to improve these two tasks. Our framework, BVAE, is designed to have two Variational AutoEncoders (VAEs) to model bimodal data: C-VAE for source code and L-VAE for natural language. Both VAEs are trained jointly to reconstruct their input as much as possible with regularization that captures the closeness between the latent variables of code and description. BVAE could learn semantic vector representations for both code and description and generate completely new descriptions for arbitrary code snippets. We design two instance models of BVAE for retrieval and summarization tasks respectively and evaluate their performance on a benchmark which involves two programming languages: C# and SQL. Experiments demonstrate BVAE’s potential on the two tasks.

References

[1]
Miltos Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In ICML. 2123–2132.
[2]
Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Science of Computer Programming 79 (2014), 241–259.
[3]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
[4]
Jeffrey S Beis and David G Lowe. 1997. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In CVPR. IEEE, 1000–1006.
[5]
Stephen Blott and Roger Weber. 1997. A simple vector-approximation file for similarity search in high-dimensional vector spaces. ESPRIT Technical Report TR19, ca (1997).
[6]
Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. 10–21.
[7]
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8]
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 249–256.
[9]
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the 40th International Conference on Software Engineering. ACM, 933–944.
[10]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[11]
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In ICPC.
[12]
Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing Source Code with Transferred API Knowledge. In IJCAI. 2269–2275.
[13]
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In ACL, Vol. 1. 2073– 2083.
[14]
Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In ICSE. ACM, 664–675.
[15]
Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In ICLR.
[16]
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. ACL, 177–180.
[17]
Georgia Koutrika, Alkis Simitsis, and Yannis E Ioannidis. 2010. Explaining structured queries in natural language. In ICDE. IEEE, 333–344.
[18]
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP. 1412âĂŞ– 1421.
[19]
Paul W McBurney and Collin McMillan. 2016. Automatic source code summarization of context for java methods. IEEE Transactions on Software Engineering 42, 2 (2016), 103–119.
[20]
Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In ICPC. IEEE, 23–32.
[21]
Dana Movshovitz-Attias and William W. Cohen. 2013. Natural language models for predicting programming comments. In ACL. 35–40.
[22]
Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Christina Unger, Jens Lehmann, and Daniel Gerber. 2013. Sorry, i don’t speak SPARQL: translating SPARQL queries into natural language. In WWW. ACM, 977–988.
[23]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. ACL, 311–318.
[24]
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML.
[25]
Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP. 379âĂŞ–389.
[26]
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, 43–52.
[27]
Giriprasad Sridhara, Lori Pollock, and K Vijay-Shanker. 2011. Automatically detecting and describing high level actions within methods. In ICSE. ACM, 101– 110.
[28]
Quan Wang and Suya You. 2006. Fast similarity search for high-dimensional dataset. In Multimedia, 2006. ISM’06. Eighth IEEE International Symposium on. IEEE, 799–804.
[29]
Wei Wang, Beng Chin Ooi, Xiaoyan Yang, Dongxiang Zhang, and Yueting Zhuang. 2014. Effective multi-modal retrieval based on stacked auto-encoders. Proceedings of the VLDB Endowment 7, 8 (2014), 649–660.

Cited By

View all
  • (2024)C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTMApplied Sciences10.3390/app1413579514:13(5795)Online publication date: 2-Jul-2024
  • (2024)Adversarial Attack and Robustness Improvement on Code SummarizationProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering10.1145/3661167.3661173(17-27)Online publication date: 18-Jun-2024
  • (2024)A Survey of Source Code Search: A 3-Dimensional PerspectiveACM Transactions on Software Engineering and Methodology10.1145/365634133:6(1-51)Online publication date: 28-Jun-2024
  • Show More Cited By

Index Terms

  1. A neural framework for retrieval and summarization of source code

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ASE '18: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering
    September 2018
    955 pages
    ISBN:9781450359375
    DOI:10.1145/3238147
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 September 2018

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Code retrieval
    2. code summarization
    3. neural framework

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    ASE '18
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 82 of 337 submissions, 24%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)44
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 01 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTMApplied Sciences10.3390/app1413579514:13(5795)Online publication date: 2-Jul-2024
    • (2024)Adversarial Attack and Robustness Improvement on Code SummarizationProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering10.1145/3661167.3661173(17-27)Online publication date: 18-Jun-2024
    • (2024)A Survey of Source Code Search: A 3-Dimensional PerspectiveACM Transactions on Software Engineering and Methodology10.1145/365634133:6(1-51)Online publication date: 28-Jun-2024
    • (2024)Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and DisengagementACM Transactions on Software Engineering and Methodology10.1145/364033633:4(1-27)Online publication date: 10-Jan-2024
    • (2024)Learning to Generate Structured Code Summaries From Hybrid Code ContextIEEE Transactions on Software Engineering10.1109/TSE.2024.343956250:10(2512-2528)Online publication date: Oct-2024
    • (2024)A survey on machine learning techniques applied to source codeJournal of Systems and Software10.1016/j.jss.2023.111934209:COnline publication date: 14-Mar-2024
    • (2024)A review of automatic source code summarizationEmpirical Software Engineering10.1007/s10664-024-10553-629:6Online publication date: 7-Oct-2024
    • (2024)Enrich Code Search Query Semantics with Raw DescriptionsCollaborative Computing: Networking, Applications and Worksharing10.1007/978-3-031-54521-4_16(284-302)Online publication date: 23-Feb-2024
    • (2023)Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A ReviewEntropy10.3390/e2506088825:6(888)Online publication date: 1-Jun-2023
    • (2023)Survey of Code Search Based on Deep LearningACM Transactions on Software Engineering and Methodology10.1145/362816133:2(1-42)Online publication date: 23-Dec-2023
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media