poster

Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models

Published: 28 April 2022
    Abstract

    Recent advances in Large Language Models (LLMs) have made automatic code generation possible for real-world programming tasks in general-purpose programming languages such as Python. However, there are few human studies on the usability of these tools and how they fit into the programming workflow. In this work, we conducted a within-subjects user study with 24 participants to understand how programmers use and perceive Copilot, an LLM-based code generation tool. We found that, while Copilot did not necessarily improve task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since Copilot often provided a useful starting point and saved the effort of searching online. However, participants did face difficulties in understanding, editing, and debugging code snippets generated by Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlighted several promising directions for improving the design of Copilot based on our observations and participants’ feedback.

    Supplementary Material

    MP4 File (3491101.3519665-talk-video.mp4)
    Talk Video

    Published In

    CHI EA '22: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems
    April 2022
    3066 pages
    ISBN: 9781450391566
    DOI: 10.1145/3491101
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. github copilot
    2. large language model

    Qualifiers

    • Poster
    • Research
    • Refereed limited

    Conference

    CHI '22: CHI Conference on Human Factors in Computing Systems
    April 29 - May 5, 2022
    New Orleans, LA, USA

    Acceptance Rates

    Overall Acceptance Rate 6,164 of 23,696 submissions, 26%
