DOI: 10.1145/3106237.3106290

Are deep neural networks the best choice for modeling source code?

Published: 21 August 2017

Abstract

Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as: frequent changes, larger, changing vocabularies, deeply nested scopes, etc. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add & remove text, and mix & swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N-gram, as well as RNN, and LSTM deep-learning language models, and release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN and LSTM based deep-learning models.
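The cache-modeling idea the abstract builds on can be illustrated with a small sketch: a global n-gram model interpolated with a recency cache that adapts to local regularities in the file being modeled. This is a minimal, hypothetical illustration of the general cache language-model technique, not the authors' released toolkit or their multi-level nested model; the class name and parameters are invented for exposition.

```python
from collections import Counter, defaultdict

class CacheBigramModel:
    """Minimal cache-augmented bigram model: interpolates a global
    corpus-level bigram estimate with a recency-based unigram cache,
    in the spirit of cache language models (illustration only)."""

    def __init__(self, lam=0.8, cache_size=100):
        self.lam = lam                        # weight on the global model
        self.cache_size = cache_size
        self.bigrams = defaultdict(Counter)   # context -> next-token counts
        self.unigrams = Counter()
        self.cache = []                       # recent tokens (local component)

    def train(self, tokens):
        # Count bigrams and unigrams over the training corpus.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.bigrams[prev][nxt] += 1
            self.unigrams[nxt] += 1
        self.unigrams[tokens[0]] += 1

    def _global_prob(self, prev, nxt):
        ctx = self.bigrams.get(prev)
        if ctx and sum(ctx.values()) > 0:
            return ctx[nxt] / sum(ctx.values())
        total = sum(self.unigrams.values())
        return self.unigrams[nxt] / total if total else 0.0

    def _cache_prob(self, nxt):
        if not self.cache:
            return 0.0
        return self.cache.count(nxt) / len(self.cache)

    def prob(self, prev, nxt):
        # Mix the static corpus-level estimate with the dynamic cache.
        return (self.lam * self._global_prob(prev, nxt)
                + (1 - self.lam) * self._cache_prob(nxt))

    def observe(self, token):
        # Update the cache as text is processed, so the model adapts
        # to local token usage (the "add text" ability, simplified).
        self.cache.append(token)
        if len(self.cache) > self.cache_size:
            self.cache.pop(0)
```

After `train`, calling `observe` on each token of the file under analysis raises the probability of locally recurring tokens, which is why such models suit the strong locality of source code.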




    Published In

    ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
    August 2017
    1073 pages
    ISBN:9781450351058
    DOI:10.1145/3106237

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. language models
    2. naturalness
    3. software tools

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE'17
    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%


    Cited By

    • (2024) A Combined Alignment Model for Code Search. IEICE Transactions on Information and Systems, E107.D(3), 257–267. DOI: 10.1587/transinf.2023MPP0002. Online: 1 Mar 2024.
    • (2024) On the Heterophily of Program Graphs: A Case Study of Graph-based Type Inference. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 1–10. DOI: 10.1145/3671016.3671389. Online: 24 Jul 2024.
    • (2024) An Empirical Study on Code Review Activity Prediction and Its Impact in Practice. Proceedings of the ACM on Software Engineering, 1(FSE), 2238–2260. DOI: 10.1145/3660806. Online: 12 Jul 2024.
    • (2024) Non-Autoregressive Line-Level Code Completion. ACM Transactions on Software Engineering and Methodology, 33(5), 1–34. DOI: 10.1145/3649594. Online: 26 Feb 2024.
    • (2024) Context-Aware Name Recommendation for Field Renaming. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. DOI: 10.1145/3597503.3639195. Online: 20 May 2024.
    • (2024) Code Search is All You Need? Improving Code Suggestions with Code Search. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. DOI: 10.1145/3597503.3639085. Online: 20 May 2024.
    • (2024) Toward a Theory of Causation for Interpreting Neural Code Models. IEEE Transactions on Software Engineering, 50(5), 1215–1243. DOI: 10.1109/TSE.2024.3379943. Online: May 2024.
    • (2024) BAFLineDP: Code Bilinear Attention Fusion Framework for Line-Level Defect Prediction. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 1–12. DOI: 10.1109/SANER60148.2024.00036. Online: 12 Mar 2024.
    • (2024) API Completion Recommendation Algorithm Based on Programming Site Context. 2024 10th International Symposium on System Security, Safety, and Reliability (ISSSR), 251–257. DOI: 10.1109/ISSSR61934.2024.00037. Online: 16 Mar 2024.
    • (2024) Enhancing User Experience on Q&A Platforms: Measuring Text Similarity Based on Hybrid CNN-LSTM Model for Efficient Duplicate Question Detection. IEEE Access, 12, 34512–34526. DOI: 10.1109/ACCESS.2024.3358422. Online: 2024.
