
GitHub Copilot AI pair programmer: Asset or Liability?

Published: 01 September 2023

    Abstract

    Automatic program synthesis is a long-standing dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies have evaluated the correctness of Copilot’s solutions and reported its issues, more empirical evaluation is necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (ii) comparing Copilot’s proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, such as sorting and implementing data structures. For the latter, we use a dataset of programming problems with human-provided solutions. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some of these solutions are buggy or non-reproducible. Moreover, Copilot has difficulty combining multiple methods to generate a single solution. Comparing Copilot to humans, our results show that humans produce a higher ratio of correct solutions than Copilot, while the buggy solutions Copilot generates require less effort to repair. Based on our findings, Copilot can become an asset when used by expert developers in software projects, since its suggestions are comparable to humans’ contributions in terms of quality. However, it can become a liability when used by novice developers, who may lack the expertise to filter out its buggy or non-optimal solutions.
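
    To make the first task concrete, the sketch below shows one minimal way a correctness and runtime check for a generated sorting routine could be set up in Python. It is only an illustration under assumed names (candidate_sort is a hypothetical stand-in for a Copilot-suggested implementation, and the oracle check against Python’s built-in sorted() is our own scaffolding), not the authors’ actual evaluation harness.

    import random
    import time

    def candidate_sort(xs):
        """Hypothetical stand-in for a Copilot-suggested solution (here: merge sort)."""
        if len(xs) <= 1:
            return list(xs)
        mid = len(xs) // 2
        left, right = candidate_sort(xs[:mid]), candidate_sort(xs[mid:])
        merged, i, j = [], 0, 0
        # Merge the two sorted halves in order.
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    def is_correct(sort_fn, trials=100, size=500):
        """Compare the candidate against the trusted sorted() oracle on random inputs."""
        for _ in range(trials):
            data = [random.randint(-10**6, 10**6) for _ in range(size)]
            if sort_fn(data) != sorted(data):
                return False
        return True

    def runtime_seconds(sort_fn, size=10**5):
        """Rough wall-clock runtime on one large random input."""
        data = [random.random() for _ in range(size)]
        start = time.perf_counter()
        sort_fn(data)
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("correct:", is_correct(candidate_sort))
        print("runtime (s):", round(runtime_seconds(candidate_sort), 4))

    Checking a candidate’s output against a trusted oracle on many random inputs is a common way to flag buggy suggestions before measuring efficiency.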


    Highlights

    We investigate the quality of the code Copilot generates as an AI pair programmer.
    Copilot provides efficient solutions, but some are buggy and/or non-reproducible.
    Copilot’s solutions are buggier than humans’, but require less effort to fix.
    Copilot’s suggestions are comparable to humans’ contributions in terms of quality.
    Copilot can become an asset for experts, but a liability for novice developers.




    Information

    Published In

    Journal of Systems and Software, Volume 203, Issue C
    Sep 2023
    439 pages

    Publisher

    Elsevier Science Inc.

    United States

    Publication History

    Published: 01 September 2023

    Author Tags

    1. Code completion
    2. Language model
    3. GitHub Copilot
    4. Testing

    Qualifiers

    • Research-article



    Cited By

    • (2024) The Impact of AI-Pair Programmers on Code Quality and Developer Satisfaction: Evidence from TiMi studio. In: Proceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Security, pp. 201–205. https://doi.org/10.1145/3665348.3665383 (10 May 2024)
    • (2024) Towards AI for Software Systems. In: Proceedings of the 1st ACM International Conference on AI-Powered Software, pp. 79–84. https://doi.org/10.1145/3664646.3664767 (10 Jul 2024)
    • (2024) Unveiling the Potential of a Conversational Agent in Developer Support: Insights from Mozilla’s PDF.js Project. In: Proceedings of the 1st ACM International Conference on AI-Powered Software, pp. 10–18. https://doi.org/10.1145/3664646.3664758 (10 Jul 2024)
    • (2024) Significant Productivity Gains through Programming with Large Language Models. Proceedings of the ACM on Human-Computer Interaction 8 (EICS), pp. 1–29. https://doi.org/10.1145/3661145 (17 Jun 2024)
    • (2024) Generating and Reviewing Programming Codes with Large Language Models: A Systematic Mapping Study. In: Proceedings of the 20th Brazilian Symposium on Information Systems, pp. 1–10. https://doi.org/10.1145/3658271.3658342 (20 May 2024)
    • (2024) Comparing Feedback from Large Language Models and Instructors: Teaching Computer Science at Scale. In: Proceedings of the Eleventh ACM Conference on Learning @ Scale, pp. 335–339. https://doi.org/10.1145/3657604.3664660 (9 Jul 2024)
    • (2024) Exploring the Profile of University Assessments Flagged as Containing AI-Generated Material. ACM Inroads 15 (2), pp. 39–47. https://doi.org/10.1145/3656478 (10 May 2024)
    • (2024) Navigating the Complexity of Generative AI Adoption in Software Engineering. ACM Transactions on Software Engineering and Methodology 33 (5), pp. 1–50. https://doi.org/10.1145/3652154 (4 Jun 2024)
    • (2024) Is Attention All You Need? Toward a Conceptual Model for Social Awareness in Large Language Models. In: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering, pp. 69–73. https://doi.org/10.1145/3650105.3652294 (14 Apr 2024)
    • (2024) Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study. In: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering, pp. 91–102. https://doi.org/10.1145/3650105.3652289 (14 Apr 2024)
