from Hacker News

Show HN: Regex Derivatives (Brzozowski Derivatives)

by c0nstantine on 3/7/23, 9:21 AM with 35 comments

A Python sketch of a regex engine in less than 150 lines of code
  • by mananaysiempre on 3/7/23, 12:57 PM

    From my own experiments[1], I suspect your engine might still be prone to exponential blowup: you only get proper linear matching if you reassociate and sort the choice expressions and merge identical terms (and Brzozowski’s original paper[2] does point this out). This makes the implementation quite a bit more annoying.

    ETA: Yup. Try (successfully) matching a long string of As against (A*)* and watch it die.

    [1] http://ix.io/4qan/ (matching only), http://ix.io/4qap/ (adds parsing and printing).

    [2] https://doi.org/10.1145/321239.321249

  • by evolveyourmind on 3/7/23, 1:55 PM

    And here the formalized proof in less than 150 lines of code (in Agda) for Brzozowski Derivatives for regex matching (and additional regular languages theorems): https://github.com/desi-ivanov/agda-regexp-automata
  • by noelwelsh on 3/7/23, 3:12 PM

    The next logical step is parsing with derivatives: https://matt.might.net/papers/might2011derivatives.pdf

    I found the paper was fairly readable. YMMV.

  • by djoldman on 3/7/23, 6:59 PM

  • by aarchi on 3/7/23, 5:01 PM

    I'm currently building a couple of regexp engines:

    One, that's a formalization[0] in Coq with big-step semantics, which uncommonly has the intersection operator, and includes several equivalence relations and a proof of the pumping lemma, excepting one case (more on that below).

    As a learning exercise and for historical reasons, I've also mostly ported Rust Cox's re1 engine to Rust[1], which includes VM matchers in the style of Henry Spencer, Ken Thompson, and Rob Pike. I also plan to port Doug McIlroy's engine[2], which is interesting for having intersection and complement and special handling for sublanguages, all the way down to just concatenation matched with Knuth-Morris-Pratt. I also want to examine the Rust (thanks burntsushi!), RE2, and Plan 9 engines in more depth.

    Once I have time to get back to the project, I want to get back to my regular expression crossword puzzle solver. For that, I'm converting the hint regexps to DFAs, that match strings of some fixed length, and concatenating and intersecting them, until a single regexp is yielded, which should be a string literal, if the puzzle has a single solution. For backreferences, it's more tricky, but I plan on rewriting backreferences to the captured expression, where the lengths of both match, then either executing it with a stack like a pushdown automata or constructing a set of constraints on the characters by index.

    As an aside: In my proof of the pumping lemma[3], I got stuck on the case for intersection and I'd love insight. Regular languages are closed under intersection, so the pumping lemma should hold for my implementation. I need to prove that if s =~ re1 and s =~ re2 can be pumped, then so can s =~ And re1 re2. s is is split into different substrings for re1 and re2, s = s11 ++ s12 ++ s13 = s21 ++ s22 ++ s23, then repeated an arbitrary number of times, (forall n, s11 ++ repeat s12 n ++ s13 =~ re1) and (forall n, s21 ++ repeat s22 n ++ s23 =~ re2). My intuition is that s11 = s21, s12 = s22, and s13 = s23, because they both match for the intersection, but I'm not convinced of that and haven't been able to formulate a proof for that.

    0: https://github.com/thaliaarchi/recross-coq

    1: https://github.com/thaliaarchi/re1-rust

    2: https://github.com/arnoldrobbins/mcilroy-regex

    3: https://github.com/thaliaarchi/recross-coq/blob/main/theorie...

  • by enricozb on 3/7/23, 12:47 PM

    I think there is a typo when defining v(e), it says both v(e) = e and v(e) = empty set.
  • by DonHopkins on 3/7/23, 5:44 PM

    This "derivatives of regular expressions" technique is what James Clark used in the design of the efficient implementation of the Relax/NG tree regular expression based XML validator, based on Janusz A. Brzozowski, "Derivatives of Regular Expressions", Journal of the ACM, Volume 11, Issue 4, 1964.

    https://relaxng.org/jclark/design.html

    >Unordered content

    >SGML provides an & operator: A & B matches A followed by B or B followed by A. XML removed the & operator. RELAX NG reintroduces it with a twist. In SGML, a content model of A & B* requires all the B elements to be consecutive: the required A element cannot occur in between two B elements. Usually, users use the & operator because they want to allow child elements to occur in any order, so this restriction is undesirable. In RELAX NG, the corresponding operator has interleaving semantics. It matches any interleaving of a sequence containing a single A element and a sequence containing zero or more B elements; it thus allows the A element to occur anywhere, including between two B elements.

    >XML removed the & operator mainly because of the & operator's reputation for implementation complexity. The most difficult part of implementing the & operator in SGML is detecting whether a content model including & is 1-unambiguous. Unlike SGML, XML and W3C XML Schema, RELAX NG does not restrict content models to be 1-unambiguous, so this implementation difficulty is removed. The classic implementation technique for SGML and XML content models is to construct a Glushkov automaton. The 1-unambiguity restriction is helpful for this technique because it ensures that the Glushkov automaton is deterministic. An interleaving operator causes difficulty with this technique. However, there is an alternative implementation technique available [17] based on derivatives of regular expressions [4]. This handles content models that are not 1-unambiguous without any additional effort and can deal with interleaving without difficulty. RELAX NG imposes restrictions on the use of interleave which are sufficient to ensure that a derivative-based implementation will not exhibit exponential behavior.

    [4] Janusz A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM, Volume 11, Issue 4, 1964.

    https://dl.acm.org/doi/10.1145/321239.321249

    https://dl.acm.org/doi/pdf/10.1145/321239.321249

    [17] Joe English. How to validate XML. 1999. See http://www.flightlab.com/~joe/sgml/validate.html

  • by SomeoneOnTheWeb on 3/7/23, 1:17 PM

    IIRC, that's what the Rust regex engine[1] currently uses

    [1] https://github.com/rust-lang/regex