DOI: 10.1109/ICSE-SEIP52600.2021.00022
research-article

Learning autocompletion from real-world datasets

Published: 17 December 2021

    Abstract

    Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in predicting completion suggestions on synthetic benchmarks. However, a recent study, "When Code Completion Fails: A Case Study on Real-World Completions" [12], demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy, respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers' actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models.
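    The comparison described in the abstract can be illustrated with a toy sketch. It is not the authors' method: the bigram counter, the function names, and all data below are hypothetical, chosen only to show how top-1 accuracy on held-out logged completions can differ depending on whether a model is trained on committed code or on logged completion events.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count next-token frequencies for each preceding token."""
    counts = defaultdict(Counter)
    for tokens in sequences:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict(model, prev):
    """Return the most frequent continuation of `prev`, or None if unseen."""
    if prev not in model or not model[prev]:
        return None
    return model[prev].most_common(1)[0][0]


def top1_accuracy(model, examples):
    """Fraction of (context token, accepted token) pairs predicted exactly."""
    if not examples:
        return 0.0
    hits = sum(predict(model, prev) == accepted for prev, accepted in examples)
    return hits / len(examples)


# Hypothetical toy data: committed code favors `.name`, while the
# completions programmers actually accepted favor `.id`.
committed_code = [["self", ".", "name"], ["self", ".", "name"], ["self", ".", "id"]]
logged_completions = [["self", ".", "id"], ["self", ".", "id"], ["self", ".", "name"]]

# Held-out logged completion events: (preceding token, accepted token).
held_out = [(".", "id"), (".", "id"), (".", "name")]

committed_model = train_bigram(committed_code)
logged_model = train_bigram(logged_completions)

print("trained on committed code:", top1_accuracy(committed_model, held_out))
print("trained on logged usage:  ", top1_accuracy(logged_model, held_out))
```

    In this made-up setting the model trained on logged usage scores higher on held-out logged events simply because its training distribution matches the evaluation distribution. The paper's reported figures come from much larger models and datasets.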

    References

    [1]
    G. C. Murphy, M. Kersten, and L. Findlater, "How are Java software developers using the Eclipse IDE?" IEEE Softw., vol. 23, no. 4, pp. 76--83, Jul. 2006.
    [2]
    M. Bruch, M. Monperrus, and M. Mezini, "Learning from examples to improve code completion systems," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE '09. New York, NY, USA: ACM, 2009, pp. 213--222.
    [3]
    A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 837--847. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337223.2337322
    [4]
    G. A. Aye and G. E. Kaiser, "Sequence model design for code completion in the modern IDE," 2020.
    [5]
    S. Kim, J. Zhao, Y. Tian, and S. Chandra, "Code prediction by feeding trees to transformers," 2020.
    [6]
    J. Li, Y. Wang, M. R. Lyu, and I. King, "Code completion with neural attention and pointer networks," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Jul. 2018.
    [7]
    V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. New York, NY, USA: ACM, 2014, pp. 419--428.
    [8]
    R. Karampatsis and C. Sutton, "Maybe deep neural networks are the best choice for modeling source code," CoRR, vol. abs/1903.05734, 2019. [Online]. Available: http://arxiv.org/abs/1903.05734
    [9]
    V. Raychev, P. Bielik, and M. Vechev, "Probabilistic model for code with decision trees," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA 2016. New York, NY, USA: Association for Computing Machinery, 2016, pp. 731--747.
    [10]
    M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov, "Generative code modeling with graphs," 2019.
    [11]
    U. Alon, R. Sadaka, O. Levy, and E. Yahav, "Structural language models of code," 2020.
    [12]
    V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, "When code completion fails: A case study on real-world completions," ser. ICSE '19. IEEE Press, 2019, pp. 960--970.
    [13]
    K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland: Association for Computational Linguistics, Jul. 2011, pp. 187--197. [Online]. Available: https://www.aclweb.org/anthology/W11-2123
    [14]
    K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, "Scalable modified Kneser-Ney language model estimation," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 690--696. [Online]. Available: https://www.aclweb.org/anthology/P13-2121
    [15]
    S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ser. ACL '96. USA: Association for Computational Linguistics, 1996, pp. 310--318.
    [16]
    R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," 2015.
    [17]
    M. Allamanis, "The adverse effects of code duplication in machine learning models of code," 2019.
    [18]
    A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. IEEE Press, 2012, pp. 837--847.
    [19]
    T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, "A statistical semantic language model for source code," in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2013. New York, NY, USA: Association for Computing Machinery, 2013, pp. 532--542.
    [20]
    V. J. Hellendoorn and P. Devanbu, "Are deep neural networks the best choice for modeling source code?" in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 763--773.
    [21]
    P. Bielik, V. Raychev, and M. Vechev, "PHOG: Probabilistic model for code," in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML'16. JMLR.org, 2016, pp. 2933--2942.


            Published In

            ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice
            May 2021
            405 pages
            ISBN:9780738146690

            Sponsors

            In-Cooperation

            • IEEE CS

            Publisher

            IEEE Press


            Author Tags

            1. code completion
            2. integrated development environments
            3. machine learning
            4. naturalness
            5. neural networks
            6. software language models
            7. software tools

            Qualifiers

            • Research-article

            Conference

            ICSE '21

            Cited By

            • (2024) Measuring GitHub Copilot's Impact on Productivity. Communications of the ACM 67:3 (54-63). DOI: 10.1145/3633453. Online publication date: 22-Feb-2024
            • (2024) Language Models for Code Completion: A Practical Evaluation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639138. Online publication date: 20-May-2024
            • (2024) Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3623306. Online publication date: 20-May-2024
            • (2023) Large language models of code fail at completing code with potential bugs. Proceedings of the 37th International Conference on Neural Information Processing Systems (41386-41412). DOI: 10.5555/3666122.3667916. Online publication date: 10-Dec-2023
            • (2023) How Practitioners Expect Code Completion? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1294-1306). DOI: 10.1145/3611643.3616280. Online publication date: 30-Nov-2023
            • (2023) On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1470-1482). DOI: 10.1145/3611643.3616244. Online publication date: 30-Nov-2023
            • (2022) All you need is logs: improving code completion by learning from anonymous IDE usage logs. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1269-1279). DOI: 10.1145/3540250.3558968. Online publication date: 7-Nov-2022
            • (2022) Productivity assessment of neural code completion. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (21-29). DOI: 10.1145/3520312.3534864. Online publication date: 13-Jun-2022
            • (2022) Counterfactual explanations for models of code. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice (125-134). DOI: 10.1145/3510457.3513081. Online publication date: 21-May-2022
            • (2022) CodeFill. Proceedings of the 44th International Conference on Software Engineering (401-412). DOI: 10.1145/3510003.3510172. Online publication date: 21-May-2022
