Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models

Authors:

Priyan Vaithilingam,

Elena L. Glassman

CHI EA '22: CHI Conference on Human Factors in Computing Systems Extended Abstracts

Article No.: 332, Pages 1 - 7

https://doi.org/10.1145/3491101.3519665

Published: 28 April 2022

Abstract

Recent advances in Large Language Models (LLMs) have made automatic code generation possible for real-world programming tasks in general-purpose programming languages such as Python. However, there are few human studies on the usability of these tools and how they fit the programming workflow. In this work, we conducted a within-subjects user study with 24 participants to understand how programmers use and perceive Copilot, an LLM-based code generation tool. We found that, while Copilot did not necessarily improve the task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since Copilot often provided a useful starting point and saved the effort of searching online. However, participants did face difficulties in understanding, editing, and debugging code snippets generated by Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlighted several promising directions for improving the design of Copilot based on our observations and participants’ feedback.

Supplementary Material

MP4 File (3491101.3519665-talk-video.mp4)

Talk Video

Published In

CHI EA '22: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems

April 2022

3066 pages

ISBN:9781450391566

DOI:10.1145/3491101

Editors:
Simone Barbosa (PUC-Rio, Brazil),
Cliff Lampe (University of Michigan, USA),
Caroline Appert (Université Paris-Saclay, France),
David A. Shamma (Toyota Research Institute, USA)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster
Research
Refereed limited

Funding Sources

NSF (National Science Foundation)

Conference

CHI '22

Sponsor:

SIGCHI

CHI '22: CHI Conference on Human Factors in Computing Systems

April 29 - May 5, 2022

New Orleans, LA, USA

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%

Bibliometrics & Citations

Article Metrics

Total Citations: 151
Total Downloads: 8,579
Downloads (Last 12 months): 6,225
Downloads (Last 6 weeks): 408

Reflects downloads up to 14 Aug 2024

Cited By

Mendes W, Souza S, de Souza C. (2024). Colaboração com Assistente de Codificação Baseado em IA: Benefícios e Desafios. Anais do XIX Simpósio Brasileiro de Sistemas Colaborativos (SBSC 2024), 228-236. Online publication date: 29-Apr-2024.
https://doi.org/10.5753/sbsc.2024.237964

Poitras E, Crane B, Dempsey D, Bragg T, Siegel A, Lin M. (2024). Cognitive Apprenticeship and Artificial Intelligence Coding Assistants. Navigating Computer Science Education in the 21st Century, 261-281. Online publication date: 26-Feb-2024.
https://doi.org/10.4018/979-8-3693-1066-3.ch013

Ashtari N, Chilana P. (2024). How New Developers Approach Augmented Reality Development Using Simplified Creation Tools: An Observational Study. Multimodal Technologies and Interaction 8:4 (35). Online publication date: 22-Apr-2024.
https://doi.org/10.3390/mti8040035

Bhanot K, Erickson J, Bennett K. (2024). MortalityMinder: Visualization and AI Interpretations of Social Determinants of Premature Mortality in the United States. Information 15:5 (254). Online publication date: 30-Apr-2024.
https://doi.org/10.3390/info15050254

Jošt G, Taneski V, Karakatič S. (2024). The Impact of Large Language Models on Programming Education and Student Learning Outcomes. Applied Sciences 14:10 (4115). Online publication date: 13-May-2024.
https://doi.org/10.3390/app14104115

Liu J, Li S. (2024). Toward Artificial Intelligence-Human Paired Programming: A Review of the Educational Applications and Research on Artificial Intelligence Code-Generation Tools. Journal of Educational Computing Research. Online publication date: 4-Apr-2024.
https://doi.org/10.1177/07356331241240460

Khemka M, Houck B. (2024). Toward Effective AI Support for Developers. Queue 22:3 (53-78). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3675416

Uhrmacher A, Frazier P, Hähnle R, Klügl F, Lorig F, Ludäscher B, Nenzi L, Ruiz-Martin C, Rumpe B, Szabo C, Wainer G, Wilsdorf P. (2024). Context, Composition, Automation, and Communication: The C2AC Roadmap for Modeling and Simulation. ACM Transactions on Modeling and Computer Simulation 34:4 (1-51). Online publication date: 13-Aug-2024.
https://dl.acm.org/doi/10.1145/3673226

Kouemo Ngassom S, Moradi Dakhel A, Tambon F, Khomh F, Adams B, Zimmermann T, Ozkaya I, Lin D, Zhang J. (2024). Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs. Proceedings of the 1st ACM International Conference on AI-Powered Software, 122-130. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3664646.3664772

Eskandani N, Salvaneschi G, Adams B, Zimmermann T, Ozkaya I, Lin D, Zhang J. (2024). Towards AI for Software Systems. Proceedings of the 1st ACM International Conference on AI-Powered Software, 79-84. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3664646.3664767
