DOI: 10.1145/3368089.3417058

IntelliCode Compose: Code Generation Using Transformer

Published: 08 November 2020
Abstract

    In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, the majority of IDEs only support completion of method and API names, or arguments.
    In this paper, we introduce IntelliCode Compose, a general-purpose multilingual code completion tool capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages a state-of-the-art generative transformer model trained on 1.2 billion lines of source code in the Python, C#, JavaScript, and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, an efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebooks.
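    The abstract only names the beam search decoder; the minimal sketch below illustrates the general technique of expanding a fixed number of candidate token sequences per step until an end-of-line token is produced. The model interface (`next_log_probs`), the `<eol>` token, and the toy bigram table are illustrative assumptions, not the authors' implementation.

```python
# Minimal beam-search sketch for line completion; NOT the paper's implementation.
# Assumes a language model exposed as next_log_probs(prefix), returning a mapping
# from candidate next tokens to log-probabilities.
import heapq

EOL = "<eol>"  # hypothetical end-of-line token


def beam_search(next_log_probs, prefix, beam_width=8, max_len=32):
    """Keep the `beam_width` best hypotheses per step until each emits EOL."""
    beams = [(0.0, list(prefix))]       # (cumulative log-prob, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == EOL:          # hypothesis already ends the line
                finished.append((score, seq))
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        if not candidates:              # every hypothesis has finished
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Toy usage with a hard-coded bigram table standing in for the transformer:
TABLE = {"print": {"(": -0.1}, "(": {"x": -0.2}, "x": {")": -0.1}, ")": {EOL: -0.05}}
toy = lambda seq: TABLE.get(seq[-1], {EOL: -0.01})
print(beam_search(toy, ["print"]))  # ['print', '(', 'x', ')', '<eol>']
```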
    Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for the Python programming language.
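    For readers unfamiliar with these two metrics, the sketch below computes them under common definitions: edit similarity as one minus the Levenshtein distance normalized by the longer string, and perplexity as the exponential of the average negative token log-likelihood. These definitions are an assumption; the full paper pins down the exact formulations used for the reported numbers.

```python
# Hedged sketch of the two reported metrics, under common definitions.
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """1 - normalized edit distance, in [0, 1]; 1.0 means identical strings."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))


def perplexity(token_log_probs) -> float:
    """Exponential of the average negative log-likelihood per token."""
    lps = list(token_log_probs)
    return math.exp(-sum(lps) / len(lps))


print(round(edit_similarity("x.append(y)", "x.append(z)"), 3))  # 0.909
print(round(perplexity([math.log(0.5)] * 4), 3))                # 2.0
```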

    Supplementary Material

    Auxiliary Teaser Video (fse20ind-p75-p-teaser.mp4)
    Auxiliary Presentation Video (fse20ind-p75-p-video.mp4)




    Published In

    ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    November 2020
    1703 pages
    ISBN:9781450370431
    DOI:10.1145/3368089
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. code completion
    2. naturalness of software
    3. neural networks

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '20

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%


