DOI: 10.1145/3623476.3623522
research-article

GPT-3-Powered Type Error Debugging: Investigating the Use of Large Language Models for Code Repair

Published: 23 October 2023

Abstract

Type systems assign types to the terms of a program. In doing so, they constrain the operations that can be performed and can therefore detect type errors at compile time. However, although they can flag that an error exists, they often fail to pinpoint its cause or to provide a helpful error message. Without adequate support, debugging this kind of error can take considerable effort. Recently, neural network models have been developed that are able to understand programming languages and perform several downstream tasks. We argue that type error debugging can be enhanced by taking advantage of this deeper understanding of the language’s structure. In this paper, we present a technique that leverages GPT-3’s capabilities to automatically fix type errors in OCaml programs. We perform multiple source code analysis tasks to produce useful prompts, which are then provided to GPT-3 to generate potential patches. Our publicly available tool, Mentat, supports multiple modes and was validated on an existing public dataset with thousands of OCaml programs. We automatically validate successful repairs by using QuickCheck to verify which generated patches produce the same output as the user-intended fixed version, achieving a 39% repair rate. In a comparative study, Mentat outperformed two other techniques in automatically fixing ill-typed OCaml programs.
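
The validation step described in the abstract (keeping only the generated patches whose output matches the user-intended fix) can be illustrated with a minimal sketch. This is not Mentat's implementation: Python stands in for OCaml, plain random inputs stand in for QuickCheck, and every name below (`agrees_with_reference`, `reference_sum`, `patch_ok`, `patch_wrong`) is hypothetical.

```python
import random


def agrees_with_reference(candidate, reference, trials=200, seed=0):
    """Return True if candidate and reference agree on random integer lists.

    Hypothetical helper: a stand-in for a QuickCheck-style equivalence
    property, not the check used by the paper's tool.
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        try:
            if candidate(list(xs)) != reference(list(xs)):
                return False
        except Exception:
            # A patch that crashes on some input is rejected.
            return False
    return True


def reference_sum(xs):
    """Stands in for the user-intended fixed version."""
    return sum(xs)


def patch_ok(xs):
    """Candidate patch that behaves like the reference."""
    total = 0
    for x in xs:
        total += x
    return total


def patch_wrong(xs):
    """Candidate patch that is well-formed but drops the first element."""
    return sum(xs[1:])


candidates = {"patch_ok": patch_ok, "patch_wrong": patch_wrong}
accepted = [
    name
    for name, patch in candidates.items()
    if agrees_with_reference(patch, reference_sum)
]
print(accepted)
```

Under these assumptions a program counts as repaired when at least one candidate patch is accepted. Agreement on random inputs gives evidence of equivalence, not a proof of it.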

Cited By

  • (2024) Software engineering education in the era of conversational AI: current trends and future directions. Frontiers in Artificial Intelligence 7. DOI: 10.3389/frai.2024.1436350. Online publication date: 29-Aug-2024
  • (2024) Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge. Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, 52-58. DOI: 10.1145/3649217.3653612. Online publication date: 3-Jul-2024
  • (2024) Large Language Models in Automated Repair of Haskell Type Errors. Proceedings of the 5th ACM/IEEE International Workshop on Automated Program Repair, 42-45. DOI: 10.1145/3643788.3648012. Online publication date: 20-Apr-2024

    Published In

    SLE 2023: Proceedings of the 16th ACM SIGPLAN International Conference on Software Language Engineering
    October 2023
    231 pages
    ISBN:9798400703966
    DOI:10.1145/3623476
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Automated Program Repair
    2. Code Generation
    3. Fault Localization
    4. GPT-3

    Qualifiers

    • Research-article

    Funding Sources

    • Fundação para a Ciência e a Tecnologia
    • Haslab/INESC TEC

    Conference

    SLE '23

    Article Metrics

    • Downloads (last 12 months): 306
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 12 Sep 2024

    Cited By

    • (2024) Software Testing With Large Language Models: Survey, Landscape, and Vision. IEEE Transactions on Software Engineering 50:4, 911-936. DOI: 10.1109/TSE.2024.3368208. Online publication date: 20-Feb-2024
    • (2024) Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection. Computer Security – ESORICS 2024, 271-289. DOI: 10.1007/978-3-031-70879-4_14. Online publication date: 5-Sep-2024
