from Hacker News

RegExpBuilder – Create regular expressions using chained methods

by jrullmann on 2/11/15, 2:59 PM with 54 comments

  • by draegtun on 2/11/15, 6:21 PM

    Thought this might be of interest; here is how the examples provided would look in Rebol:

        digits: digit: charset "0123456789"
    
        rule: [
            thru "$"
            some digits
            "."
            digit
            digit
        ]
    
        parse "$10.00" rule    ;; true
    
    
        pattern: [
            some "p"
            2 "q" any "q"
        ]
    
        new-rule: [
            2 pattern
        ]
    
        parse "pqqpqq" new-rule    ;; true
    
    Rebol doesn't have regular expressions; instead it comes with a parse dialect, which is a top-down parsing language (TDPL): http://en.wikipedia.org/wiki/Top-down_parsing_language

    Some parse refs: http://en.wikibooks.org/wiki/REBOL_Programming/Language_Feat... | http://www.rebol.net/wiki/Parse_Project | http://www.rebol.com/r3/docs/concepts/parsing-summary.html
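
    For comparison, a rough Python sketch of regex counterparts to the two Rebol rules above. The names `price` and `repeated` are made up for illustration, and the translation is only approximate: Rebol's `thru` stops at the first "$", while the lazy `.*?` used here may backtrack past it.

```python
import re

# Approximate regex counterparts of the two Rebol rules (illustrative names).
price = re.compile(r".*?\$[0-9]+\.[0-9][0-9]")   # thru "$", some digits, ".", digit, digit
repeated = re.compile(r"(?:p+q{2,}){2}")         # 2 pattern, where pattern is some "p", 2 "q" any "q"

# Rebol's parse succeeds only when the whole input is consumed,
# so fullmatch is the closest analogue.
print(bool(price.fullmatch("$10.00")))
print(bool(repeated.fullmatch("pqqpqq")))
```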

  • by tragomaskhalos on 2/11/15, 4:21 PM

    There have been many efforts similar to this in many languages, but most of us seem happy to stick to the more succinct canonical form, supplemented with /x and # comments when things get too hairy.
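
    A minimal sketch of that commented style, assuming Python, where `re.VERBOSE` plays the role of /x. The pattern is the "$10.00" example from elsewhere in the thread; the name `PRICE` is made up for illustration.

```python
import re

# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
PRICE = re.compile(
    r"""
    \$          # literal dollar sign
    [0-9]+      # one or more digits
    \.          # literal decimal point
    [0-9]{2}    # exactly two digits
    """,
    re.VERBOSE,
)

print(bool(PRICE.fullmatch("$10.00")))
```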
  • by marktangotango on 2/11/15, 4:52 PM

    Generally, I find that if one's regexes are so complex that one needs visualizers or other aids in writing them, one doesn't have a regex problem, but a parsing problem. The method of parsing by recursive descent can often lead to much more understandable (if more verbose) "pattern matching".
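
    To make the recursive-descent point concrete, here is a small Python sketch for a made-up grammar of nested integer lists (`expr := NUMBER | "(" expr* ")"`). The grammar and the function names are assumptions chosen for illustration; arbitrary nesting like this is exactly what a classic regex cannot express.

```python
import re

TOKEN = re.compile(r"\s*([0-9]+|[()])")

def tokenize(text):
    """Split text into integer and parenthesis tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expr(tokens, i):
    """expr := NUMBER | "(" expr* ")"; returns (value, next_index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[i]
    if tok == "(":
        items, i = [], i + 1
        # Each nested expression is handled by a recursive call.
        while i < len(tokens) and tokens[i] != ")":
            item, i = parse_expr(tokens, i)
            items.append(item)
        if i >= len(tokens):
            raise ValueError("missing ')'")
        return items, i + 1
    if tok == ")":
        raise ValueError("unexpected ')'")
    return int(tok), i + 1

def parse(text):
    """Parse a whole string, rejecting trailing tokens."""
    tokens = tokenize(text)
    value, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing input")
    return value

print(parse("(1 (2 3) 4)"))  # [1, [2, 3], 4]
```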
  • by UnoriginalGuy on 2/11/15, 5:02 PM

    Looks like LINQ (from .NET/C#). Pretty sexy way to write regular expressions if you ask me.

    I've "learned" regular expressions multiple times but it just never sticks; I have no idea why. It certainly doesn't help that there are several different incompatible syntaxes (so what I remember and think "should" work doesn't).

    I'd prefer to write regexes in this style, though I would pay attention to performance (not that regular expressions are high performance, but I wouldn't want to see a large performance loss either).

  • by chris-at on 2/11/15, 3:12 PM

    Thanks, this is a lot better than writing this (even if the formatting worked here):

    ```
    (?xi)
    \b
    (                           # Capture 1: entire matched URL
      (?:
        [a-z][\w-]+:                # URL protocol and colon
        (?:
          /{1,3}                        # 1-3 slashes
          |                             #   or
          [a-z0-9%]                     # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
        )
        |                           #   or
        www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
        |                           #   or
        [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
      )
      (?:                           # One or more:
        [^\s()<>]+                      # Run of non-space, non-()<>
        |                               #   or
        \(([^\s()<>]+|(\([^\s()<>]+\)))\)  # balanced parens, up to 2 levels
      )+
      (?:                           # End with:
        \(([^\s()<>]+|(\([^\s()<>]+\)))\)  # balanced parens, up to 2 levels
        |                                   #   or
        [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
      )
    )
    ```

  • by jluxenberg on 2/11/15, 6:02 PM

    S-expressions are a natural fit for constructing regular expressions; see http://community.schemewiki.org/?scheme-faq-programming#H-1w...

    e.g.

      (: (or (in ("az")) (in ("AZ"))) 
        (* (uncase (in ("az09")))))
  • by jgalt212 on 2/11/15, 5:18 PM

    Definitely a debuggable way to write regexes. Whenever I have to maintain a hairy regex, I like to plot the regex as a railroad diagram.

    These web-based tools can do it:

    https://www.debuggex.com/

    http://jex.im/regulex/

  • by dkarapetyan on 2/11/15, 4:56 PM

    Generalize just a little bit and you got parser combinators.
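
    A minimal sketch of what that generalization might look like in Python. Each parser is a function taking `(text, pos)` and returning the new position or `None`; the helper names (`lit`, `seq`, `alt`, `many1`, `full_match`) are invented for illustration and do not come from any particular library.

```python
def lit(s):
    """Parser matching the literal string s."""
    def parser(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parser

def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parser(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parser

def alt(*parsers):
    """Try each parser in order; the first success wins."""
    def parser(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parser

def many1(p):
    """One or more repetitions of p (greedy, no backtracking)."""
    def parser(text, pos):
        end = p(text, pos)
        if end is None:
            return None
        while True:
            nxt = p(text, end)
            if nxt is None or nxt == end:
                return end
            end = nxt
    return parser

def full_match(p, text):
    """True if p consumes the entire text."""
    return p(text, 0) == len(text)

# The "$10.00" example from the thread, built from smaller parsers.
digit = alt(*(lit(c) for c in "0123456789"))
price = seq(lit("$"), many1(digit), lit("."), digit, digit)

print(full_match(price, "$10.00"))  # True
```

    Because the parsers are ordinary functions, they can be named, reused, and made recursive, which is the step beyond a chained regex builder.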
  • by zzzcpan on 2/11/15, 10:55 PM

    Regexps exist to avoid cumbersome code like this and to make it less error-prone. Makes me sad to see so many upvotes.

    I get that some people have a hard time understanding regexps, with all the backtracking and greediness. Yes, the syntax is a bit complicated. Maybe a simplified, predictable default mode could help. But there is no problem with a DSL being used as an abstraction. In fact, we need more DSLs, for everything!

  • by psychometry on 2/11/15, 5:19 PM

    Now you have three problems.
  • by kazinator on 2/11/15, 6:20 PM

    Yes, regexes can have other syntactic representations, like:

        (compound "$" (1+ :digit) "." :digit :digit)
    
    Run:

        $ txr -p "(regex-compile '(compound \"$\" (1+ :digit) \".\" :digit :digit))"
        #/$\d+\.\d\d/
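
    A rough Python sketch of the same idea, for readers without TXR: a nested tuple is translated into regex source text. The function name `compile_tree`, the tuple encoding, and the choice of operators are assumptions made for illustration; this is not how TXR's `regex-compile` is implemented.

```python
import re

def compile_tree(node):
    """Translate a small nested-tuple pattern into regex source text."""
    if isinstance(node, str):
        if node == ":digit":            # sketch convention for a digit class
            return r"\d"
        return re.escape(node)          # plain strings match literally
    op, *args = node
    parts = [compile_tree(a) for a in args]
    if op == "compound":                # sequence
        return "".join(parts)
    if op == "or":                      # alternation
        return "(?:" + "|".join(parts) + ")"
    if op == "1+":                      # one or more
        return "(?:" + "".join(parts) + ")+"
    raise ValueError(f"unknown operator: {op!r}")

tree = ("compound", "$", ("1+", ":digit"), ".", ":digit", ":digit")
source = compile_tree(tree)
print(source)
print(bool(re.fullmatch(source, "$10.00")))
```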
  • by epicureanideal on 2/11/15, 7:45 PM

    Nice work! I don't know if it'll be ideal for all use cases, but it does add some readability.
  • by otakucode on 2/11/15, 10:56 PM

    Now do an example where you create a regex to parse the IMDB movies.list data file!
  • by gcao on 2/11/15, 4:17 PM

    Great work! This is very intriguing!