Inference of Regular Expressions for Text Extraction from Examples

Published: 01 May 2016

Abstract

A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work, we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it.
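As a rough illustration of what "examples of the desired behavior" and "multiple performance indexes" might look like in practice, the sketch below scores candidate regular expressions against annotated text on two objectives: extraction quality (F-measure over exactly matching spans) and expression length. The function names, the sample text, and the choice of these two objectives are assumptions made for illustration only; the abstract does not specify the fitness definitions the system actually uses.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def extract_spans(pattern: str, text: str) -> List[Span]:
    """Return the spans of all non-empty matches of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text) if m.end() > m.start()]


def objectives(pattern: str, text: str, desired: List[Span]) -> Tuple[float, int]:
    """Score a candidate on two hypothetical objectives.

    Returns (f_measure, length): a higher f_measure and a shorter length are better.
    """
    found = set(extract_spans(pattern, text))
    wanted = set(desired)
    hits = len(found & wanted)  # extractions that match a desired span exactly
    precision = hits / len(found) if found else 0.0
    recall = hits / len(wanted) if wanted else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return f_measure, len(pattern)


def dominates(a: Tuple[float, int], b: Tuple[float, int]) -> bool:
    """Pareto dominance: a is no worse on both objectives and better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better


# Hypothetical training example: the two phone-like tokens are the desired extractions.
text = "Call 555-1234 or 555-9876 before 2016."
desired = [(5, 13), (17, 25)]

candidates = [r"\d+", r"\d{3}-\d{4}", r"\d+-\d+"]
scores = {c: objectives(c, text, desired) for c in candidates}
for candidate, score in scores.items():
    print(candidate, score)
```

A multiobjective search would keep the candidates that no other candidate dominates, rather than collapsing both objectives into a single number; `dominates` is the comparison such a selection step would rely on.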

      Reviews

      Michael Lesk

      This paper is a thorough evaluation of using machine learning to generate regular expressions for data mining, such as extracting email addresses from web pages. The paper even includes a comparison with humans asked to do the same tasks. It is thorough, and reference to the unpublished appendix shows that it finds quite readable and concise expressions, and the references even include xkcd. The tasks are fairly limited and performance is still low for harder tasks such as phone numbers and Congressional bill numbers, even with thousands of learning examples. We are still pretty far from being able to fix a general class of mistakes in a few lines of a database and have a program that watches us and then finishes the job, as was tried back in 1983 by R. P. Nix [1]. I recommend this paper to anyone interested in the details of machine learning methods as applied to text mining.

      Online Computing Reviews Service

      Information

      Published In

      IEEE Transactions on Knowledge and Data Engineering, Volume 28, Issue 5
      May 2016
      261 pages

      Publisher

      IEEE Educational Activities Department

      United States

      Publication History

      Published: 01 May 2016

      Qualifiers

      • Article

      Citations

      Cited By

      • (2024) An Analysis of the Ingredients for Learning Interpretable Symbolic Regression Models with Human-in-the-loop and Genetic Programming. ACM Transactions on Evolutionary Learning and Optimization 4:1 (1-30). DOI: 10.1145/3643688. Online publication date: 23-Feb-2024
      • (2024) Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks. Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (52-56). DOI: 10.1145/3639476.3639757. Online publication date: 14-Apr-2024
      • (2023) Learning from Uncurated Regular Expressions for Semantic Type Classification. Proceedings of the 1st Workshop on Simplicity in Management of Data (1-5). DOI: 10.1145/3596225.3596226. Online publication date: 23-Jun-2023
      • (2023) Repairing Regular Expressions for Extraction. Proceedings of the ACM on Programming Languages 7:PLDI (1633-1656). DOI: 10.1145/3591287. Online publication date: 6-Jun-2023
      • (2023) Search-Based Regular Expression Inference on a GPU. Proceedings of the ACM on Programming Languages 7:PLDI (1317-1339). DOI: 10.1145/3591274. Online publication date: 6-Jun-2023
      • (2023) Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency. Proceedings of the ACM Web Conference 2023 (3859-3867). DOI: 10.1145/3543507.3583515. Online publication date: 30-Apr-2023
      • (2023) Almost Rerere: Learning to Resolve Conflicts in Distributed Projects. IEEE Transactions on Software Engineering 49:4 (2255-2271). DOI: 10.1109/TSE.2022.3215289. Online publication date: 1-Apr-2023
      • (2023) Deducing Matching Strings for Real-World Regular Expressions. Dependable Software Engineering. Theories, Tools, and Applications (331-350). DOI: 10.1007/978-981-99-8664-4_19. Online publication date: 27-Nov-2023
      • (2022) Exploiting input sanitization for regex denial of service. Proceedings of the 44th International Conference on Software Engineering (883-895). DOI: 10.1145/3510003.3510047. Online publication date: 21-May-2022
      • (2022) Automatic generation of regular expressions for the Regex Golf challenge using a local search algorithm. Genetic Programming and Evolvable Machines 23:1 (105-131). DOI: 10.1007/s10710-021-09411-x. Online publication date: 1-Mar-2022
