DOI: 10.1145/2931037.2931073
research-article

Exploring regular expression usage and context in Python

Published: 18 July 2016

    Abstract

    Due to the popularity and pervasive use of regular expressions, researchers have created tools to support their creation, validation, and use. However, little is known about the context in which regular expressions are used, the features that are most common, and how behaviorally similar regular expressions are to one another.
    In this paper, we explore the context in which regular expressions are used through a combination of developer surveys and repository analysis. We survey 18 professional developers about their regular expression usage and pain points. Then, we analyze nearly 4,000 open source Python projects from GitHub and extract nearly 14,000 unique regular expression patterns. We map the most common features used in regular expressions to those features supported by four major regex research efforts from industry and academia: brics, Hampi, RE2, and Rex. Using similarity analysis of regular expressions across projects, we identify six common behavioral clusters that describe how regular expressions are often used in practice. This is the first rigorous examination of regex usage and it provides empirical evidence to support design decisions by regex tool builders. It also points to areas of needed future work, such as refactoring regular expressions to increase regex understandability and context-specific tool support for common regex usages.
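
    The repository-analysis step described above (collecting the regular expression patterns that appear in Python source) can be illustrated with a small sketch. This is not the authors' tooling: it is a minimal example that assumes patterns are string literals passed as the first argument to calls written as `re.<function>(...)`. The function name `extract_regex_patterns` and the `RE_FUNCTIONS` set are hypothetical names introduced here for illustration.

```python
import ast

# Functions of Python's standard `re` module that take a pattern as their
# first argument. (Assumption: only calls spelled `re.<name>(...)` count.)
RE_FUNCTIONS = {"compile", "search", "match", "fullmatch", "split",
                "findall", "finditer", "sub", "subn"}


def extract_regex_patterns(source: str) -> set:
    """Return the unique string-literal patterns passed to re.<function> calls."""
    patterns = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_re_call = (
            isinstance(func, ast.Attribute)
            and func.attr in RE_FUNCTIONS
            and isinstance(func.value, ast.Name)
            and func.value.id == "re"
        )
        if not is_re_call or not node.args:
            continue
        first = node.args[0]
        # Patterns built at runtime (variables, concatenation) are skipped.
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            patterns.add(first.value)
    return patterns


if __name__ == "__main__":
    sample = 'import re\nnumber = re.compile(r"\\d+")\nre.search("[a-z]*", "abc")\n'
    for pattern in sorted(extract_regex_patterns(sample)):
        print(pattern)
```

    Working on the syntax tree rather than on raw text avoids false matches inside comments and docstrings, at the cost of missing patterns that are assembled dynamically or reached through an aliased import.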

      Published In

      ISSTA 2016: Proceedings of the 25th International Symposium on Software Testing and Analysis
      July 2016
      452 pages
      ISBN: 9781450343909
      DOI: 10.1145/2931037
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. developer survey
      2. regular expressions
      3. repository analysis

      Qualifiers

      • Research-article

      Conference

      ISSTA '16

      Acceptance Rates

      Overall Acceptance Rate 58 of 213 submissions, 27%

      Cited By

      • (2024) Linear Matching of JavaScript Regular Expressions. Proceedings of the ACM on Programming Languages 8:PLDI (1336-1360). DOI: 10.1145/3656431. Online publication date: 20-Jun-2024
      • (2024) Understanding Regular Expression Denial of Service (ReDoS): Insights from LLM-Generated Regexes and Developer Forums. Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension (190-201). DOI: 10.1145/3643916.3644424. Online publication date: 15-Apr-2024
      • (2024) On the Decidability of Infix Inclusion Problem. Theory of Computing Systems 68:3 (301-321). DOI: 10.1007/s00224-023-10160-w. Online publication date: 13-Jan-2024
      • (2023) June: A Type Testability Transformation for Improved ATG Performance. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (274-284). DOI: 10.1145/3597926.3598055. Online publication date: 12-Jul-2023
      • (2023) Mining SQL Problem Solving Patterns using Advanced Sequence Processing Algorithms. Proceedings of the 2nd International Workshop on Data Systems Education: Bridging education practice with education research (37-43). DOI: 10.1145/3596673.3596973. Online publication date: 23-Jun-2023
      • (2023) A Robust Theory of Series Parallel Graphs. Proceedings of the ACM on Programming Languages 7:POPL (1058-1088). DOI: 10.1145/3571230. Online publication date: 11-Jan-2023
      • (2023) Improving Developers’ Understanding of Regex Denial of Service Tools through Anti-Patterns and Fix Strategies. 2023 IEEE Symposium on Security and Privacy (SP) (1238-1255). DOI: 10.1109/SP46215.2023.10179442. Online publication date: May-2023
      • (2023) Efficient Pattern-based Static Analysis Approach via Regular-Expression Rules. 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (132-143). DOI: 10.1109/SANER56733.2023.00022. Online publication date: Mar-2023
      • (2023) Improvements of Blank Element Selection Algorithm for Element Fill-in-blank Problem in Web-client Programming. 2023 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia) (1-6). DOI: 10.1109/ICCE-Asia59966.2023.10326341. Online publication date: 23-Oct-2023
      • (2023) Comparison of Student Learning Outcomes Among SQL Problem-Solving Patterns. 2023 IEEE Frontiers in Education Conference (FIE) (1-9). DOI: 10.1109/FIE58773.2023.10343395. Online publication date: 18-Oct-2023