All you need is logs: improving code completion by learning from anonymous IDE usage logs

Authors: Vitaliy Bibaev, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, Timofey Bryksin

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 1269 - 1279

https://doi.org/10.1145/3540250.3558968

Published: 09 November 2022

Abstract

In this work, we propose an approach for collecting completion usage logs from users in an IDE and using them to train a machine-learning-based model for ranking completion candidates. We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. We used the logs to build a dataset of code completions from users and employed it to train a CatBoost ranking model. We then evaluated the model in two settings: on a held-out set of the collected completions, and in a separate A/B test on two different groups of users in the IDE. Our evaluation shows that a simple ranking model trained on past user behavior logs significantly improved the code completion experience. Compared to the default heuristics-based ranking, our model decreased the number of typing actions necessary to perform a completion in the IDE from 2.073 to 1.832.
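As a rough illustration of the ranking setup described in the abstract, the sketch below groups logged candidates by completion session and compares two scoring functions by the average position of the candidate the user actually selected. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record layout, the feature names `prefix_match` and `recent_use`, and both scoring functions are hypothetical, and the hand-written `learned_stand_in` merely takes the place that a trained CatBoost ranker would occupy.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable, List

# Hypothetical log record: one row per candidate shown in a completion popup.
@dataclass(frozen=True)
class Candidate:
    session_id: int             # which completion popup the candidate belonged to
    features: Dict[str, float]  # anonymized numeric features (names are made up)
    selected: bool              # True if the user picked this candidate

Scorer = Callable[[Dict[str, float]], float]

def group_by_session(rows: Iterable[Candidate]) -> Dict[int, List[Candidate]]:
    """Collect candidates into ranking groups, one group per completion session."""
    sessions: Dict[int, List[Candidate]] = {}
    for row in rows:
        sessions.setdefault(row.session_id, []).append(row)
    return sessions

def selected_position(session: List[Candidate], scorer: Scorer) -> int:
    """Zero-based position of the selected candidate after sorting by score."""
    ranked = sorted(session, key=lambda c: scorer(c.features), reverse=True)
    for position, candidate in enumerate(ranked):
        if candidate.selected:
            return position
    raise ValueError("session has no selected candidate")

def mean_selected_position(rows: Iterable[Candidate], scorer: Scorer) -> float:
    """Average position of the selected candidate over all sessions (lower is better)."""
    return mean(selected_position(s, scorer) for s in group_by_session(rows).values())

# Hypothetical baseline: rank only by how well the candidate matches the typed prefix.
def heuristic(features: Dict[str, float]) -> float:
    return features["prefix_match"]

# Hypothetical stand-in for a trained ranker that also uses a usage-based feature.
def learned_stand_in(features: Dict[str, float]) -> float:
    return 0.5 * features["prefix_match"] + features["recent_use"]

# Tiny hand-made log with two sessions.
LOG = [
    Candidate(1, {"prefix_match": 1.0, "recent_use": 0.0}, selected=False),
    Candidate(1, {"prefix_match": 0.8, "recent_use": 1.0}, selected=True),
    Candidate(2, {"prefix_match": 0.9, "recent_use": 0.0}, selected=True),
    Candidate(2, {"prefix_match": 0.5, "recent_use": 0.0}, selected=False),
]

baseline_position = mean_selected_position(LOG, heuristic)
stand_in_position = mean_selected_position(LOG, learned_stand_in)
```

On this toy log the baseline places the selected candidate second in one of the two sessions, while the stand-in scorer places it first in both. The position-based metric here is a simplification chosen for the sketch; the abstract reports the number of typing actions, which is a different measure.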

The approach adheres to privacy requirements and legal constraints: it does not require collecting personal information, and it performs all the necessary anonymization on the client side. Importantly, it can be improved continuously by implementing new features, collecting new data, and evaluating new models; we have been using it this way in production since the end of 2020.
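The idea of anonymizing on the client side can be sketched as follows: before anything leaves the IDE, the raw completion event is reduced to derived features that contain no source text or file paths. This is only an illustration under assumptions. The abstract does not list the collected features, so every field name below (`typed_prefix`, `candidate_text`, `candidate_kind`, and so on) is hypothetical.

```python
from typing import Any, Dict

# Hypothetical allow-list of raw fields assumed to carry no user code or identity.
SAFE_FIELDS = ("candidate_kind", "position", "selected")

def anonymize_event(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Build the record that would be sent, using only derived, non-textual values."""
    record: Dict[str, Any] = {name: raw[name] for name in SAFE_FIELDS}
    # Keep lengths instead of the strings themselves.
    record["prefix_length"] = len(raw["typed_prefix"])
    record["candidate_length"] = len(raw["candidate_text"])
    return record

# Hypothetical raw event as it might exist only inside the IDE process.
raw_event = {
    "file_path": "/home/user/project/src/Main.java",
    "typed_prefix": "getUs",
    "candidate_text": "getUserName",
    "candidate_kind": "METHOD",
    "position": 2,
    "selected": True,
}

sent_record = anonymize_event(raw_event)
```

The allow-list design means that a newly added raw field is never transmitted unless it is explicitly marked as safe, which is one conservative way to keep source text and paths out of the logs.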

Published In

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 2022

1822 pages

ISBN: 9781450394130

DOI: 10.1145/3540250

General Chair: Abhik Roychoudhury (National University of Singapore, Singapore)

Program Chairs: Cristian Cadar (Imperial College London, UK), Miryung Kim (University of California at Los Angeles, USA)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ESEC/FSE '22

ESEC/FSE '22: 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 14 - 18, 2022

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
