DOI: 10.1109/ASE.2019.00047
research-article

Regexes are hard: decision-making, difficulties, and risks in programming regular expressions

Published: 07 February 2020

Abstract

    Regular expressions (regexes) are a powerful mechanism for solving string-matching problems. They are supported by all modern programming languages, and have been estimated to appear in more than a third of Python and JavaScript projects. Yet existing studies have focused mostly on one aspect of regex programming: readability. We know little about how developers perceive and program regexes, nor the difficulties that they face.
    In this paper, we provide the first study of the regex development cycle, with a focus on (1) how developers make decisions throughout the process, (2) what difficulties they face, and (3) how aware they are of the serious risks involved in programming regexes. We took a mixed-methods approach, surveying 279 professional developers from diverse backgrounds (including top tech firms) for a high-level perspective, and interviewing 17 developers to learn the details of the difficulties that they face and the solutions that they prefer.
    In brief, regexes are hard. Not only are they hard to read, our participants said that they are hard to search for, hard to validate, and hard to document. They are also hard to master: the majority of our studied developers were unaware of critical security risks that can occur when using regexes, and those who knew of the risks did not deal with them in effective manners. Our findings provide multiple implications for future work, including semantic regex search engines for regex reuse and improved input generators for regex validation.
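
    The abstract does not say which security risks it has in mind. One widely discussed risk of this kind is regular expression denial of service (ReDoS), where a backtracking matcher takes exponential time on a crafted input. The sketch below is an illustration under that assumption, not the authors' method: `count_steps` is a hypothetical helper that imitates a naive backtracking matcher for the single pattern `^(a+)+$` and counts how many times it has to restart the group.

```python
def count_steps(text):
    """Imitate a naive backtracking match of ^(a+)+$ and count the attempts.

    Hypothetical helper for illustration only; real regex engines differ
    in detail, and some (automaton-based ones) avoid this blow-up entirely.
    """
    steps = 0

    def match_from(pos):
        nonlocal steps
        steps += 1
        # Find how far one `a+` group could extend from `pos`.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible group length, longest first (greedy).
        for stop in range(end, pos, -1):
            if stop == len(text):    # `$` matches: overall success
                return True
            if match_from(stop):     # otherwise try another repetition
                return True
        return False                 # all choices failed: backtrack

    matched = match_from(0)
    return matched, steps


if __name__ == "__main__":
    # A trailing "!" forces the match to fail, so every split is explored.
    for n in (5, 10, 15):
        print(n, count_steps("a" * n + "!"))
```

    With this sketch, `count_steps("aaaa")` returns `(True, 1)`, while `count_steps("a" * 10 + "!")` returns `(False, 1024)`: each extra `a` before the mismatch doubles the work, which is why a short malicious input can stall a service.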

    Information & Contributors

    Information

    Published In

    ASE '19: Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering
    November 2019
    1333 pages
    ISBN:9781728125084

    Sponsors

    In-Cooperation

    • IEEE CS

    Publisher

    IEEE Press

    Author Tags

    1. developer process
    2. qualitative research
    3. regular expressions

    Qualifiers

    • Research-article

    Conference

    ASE '19

    Acceptance Rates

    Overall Acceptance Rate 82 of 337 submissions, 24%
