DOI: 10.1145/3106237.3106290

Are deep neural networks the best choice for modeling source code?

Published: 21 August 2017

Abstract

Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as: frequent changes, larger, changing vocabularies, deeply nested scopes, etc. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add & remove text, and mix & swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N-gram, as well as RNN, and LSTM deep-learning language models, and release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN and LSTM based deep-learning models.
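The cache-modeling idea the abstract builds on can be illustrated with a small sketch: a global n-gram model interpolated with a recency cache that adapts to local regularities in the file being modeled. This is a minimal, hypothetical illustration of the general cache language-model technique, not the authors' released toolkit or their multi-level nested model; the class name and parameters are invented for exposition.

```python
from collections import Counter, defaultdict

class CacheBigramModel:
    """Minimal cache-augmented bigram model: interpolates a global
    corpus-level bigram estimate with a recency-based unigram cache,
    in the spirit of cache language models (illustration only)."""

    def __init__(self, lam=0.8, cache_size=100):
        self.lam = lam                        # weight on the global model
        self.cache_size = cache_size
        self.bigrams = defaultdict(Counter)   # context -> next-token counts
        self.unigrams = Counter()
        self.cache = []                       # recent tokens (local component)

    def train(self, tokens):
        # Count bigrams and unigrams over the training corpus.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.bigrams[prev][nxt] += 1
            self.unigrams[nxt] += 1
        self.unigrams[tokens[0]] += 1

    def _global_prob(self, prev, nxt):
        ctx = self.bigrams.get(prev)
        if ctx and sum(ctx.values()) > 0:
            return ctx[nxt] / sum(ctx.values())
        total = sum(self.unigrams.values())
        return self.unigrams[nxt] / total if total else 0.0

    def _cache_prob(self, nxt):
        if not self.cache:
            return 0.0
        return self.cache.count(nxt) / len(self.cache)

    def prob(self, prev, nxt):
        # Mix the static corpus-level estimate with the dynamic cache.
        return (self.lam * self._global_prob(prev, nxt)
                + (1 - self.lam) * self._cache_prob(nxt))

    def observe(self, token):
        # Update the cache as text is processed, so the model adapts
        # to local token usage (the "add text" ability, simplified).
        self.cache.append(token)
        if len(self.cache) > self.cache_size:
            self.cache.pop(0)
```

After `train`, calling `observe` on each token of the file under analysis raises the probability of locally recurring tokens, which is why such models suit the strong locality of source code.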




    Published In

    ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
    August 2017
    1073 pages
    ISBN:9781450351058
    DOI:10.1145/3106237

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. language models
    2. naturalness
    3. software tools

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE'17
    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%


    Cited By

    • (2024) A Combined Alignment Model for Code Search. IEICE Transactions on Information and Systems, E107.D(3), 257–267. DOI: 10.1587/transinf.2023MPP0002. Online: 1 Mar 2024.
    • (2024) On the Heterophily of Program Graphs: A Case Study of Graph-based Type Inference. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 1–10. DOI: 10.1145/3671016.3671389. Online: 24 Jul 2024.
    • (2024) An Empirical Study on Code Review Activity Prediction and Its Impact in Practice. Proceedings of the ACM on Software Engineering, 1(FSE), 2238–2260. DOI: 10.1145/3660806. Online: 12 Jul 2024.
    • (2024) Non-Autoregressive Line-Level Code Completion. ACM Transactions on Software Engineering and Methodology, 33(5), 1–34. DOI: 10.1145/3649594. Online: 26 Feb 2024.
    • (2024) Context-Aware Name Recommendation for Field Renaming. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. DOI: 10.1145/3597503.3639195. Online: 20 May 2024.
    • (2024) Code Search is All You Need? Improving Code Suggestions with Code Search. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. DOI: 10.1145/3597503.3639085. Online: 20 May 2024.
    • (2024) Toward a Theory of Causation for Interpreting Neural Code Models. IEEE Transactions on Software Engineering, 50(5), 1215–1243. DOI: 10.1109/TSE.2024.3379943. Online: May 2024.
    • (2024) BAFLineDP: Code Bilinear Attention Fusion Framework for Line-Level Defect Prediction. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 1–12. DOI: 10.1109/SANER60148.2024.00036. Online: 12 Mar 2024.
    • (2024) API Completion Recommendation Algorithm Based on Programming Site Context. 2024 10th International Symposium on System Security, Safety, and Reliability (ISSSR), 251–257. DOI: 10.1109/ISSSR61934.2024.00037. Online: 16 Mar 2024.
    • (2024) Enhancing User Experience on Q&A Platforms: Measuring Text Similarity Based on Hybrid CNN-LSTM Model for Efficient Duplicate Question Detection. IEEE Access, 12, 34512–34526. DOI: 10.1109/ACCESS.2024.3358422. Online: 2024.
