from Hacker News

A Regular Expression Matcher (2007)

by ville on 2/13/20, 12:18 PM with 22 comments

  • by dang on 2/13/20, 5:11 PM

    A thread from 2016: https://news.ycombinator.com/item?id=12199836

    2013: https://news.ycombinator.com/item?id=5672875

    2011: https://news.ycombinator.com/item?id=2723366

    Maybe others?

    p.s. these links are just for curious readers; reposts are ok after a year or so—see https://news.ycombinator.com/newsfaq.html.

  • by adriantam on 2/13/20, 2:34 PM

    This is chapter 1 of "Beautiful Code" (http://shop.oreilly.com/product/9780596510046.do)
  • by glangdale on 2/13/20, 9:34 PM

    These posts are great for history, but regular expression implementation has moved on considerably from early Thompson implementations whether backtracking or NFAs. There is a considerable body of literature about regex implementation, including many quite convincing implementations and alternate formulations, some of which weren't even done by people at, or from, Bell Labs!

    We seem to have something of an Eternal September of regex implementation knowledge (abetted by Russ Cox's amazingly selective bibliography in his posts introducing RE2).

  • by hstaab on 2/13/20, 3:50 PM

    I saw this in the repo of single file language implementations that was posted yesterday. Glad to see it on the front page now.

    Here’s the repo if anyone is interested in checking out the others:

    https://github.com/marcpaq/b1fipl

  • by jstimpfle on 2/13/20, 10:10 PM

    Skimming over it, the implementation looks inefficient:

        int matchstar(int c, char *regexp, char *text)
        {
            do {    /* a * matches zero or more instances */
                if (matchhere(regexp, text))
                    return 1;
            } while (*text != '\0' && (*text++ == c || c == '.'));
            return 0;
        }
    
    That can be used to match short strings, but not to grep through a filesystem. A good regex matcher has runtime O(N*M) where N is the length of the regex (typically very short) and M is the length of the scanned text.
  • by raverbashing on 2/13/20, 4:57 PM

    Seeing code like this I can understand why the pioneers though C code could be pretty.

    But now all I see there is something more akin to a demonstration of ice carving with chainsaws than something that should be used in a production system.

  • by leoh on 2/13/20, 4:53 PM

    This would be a very good technical interview question — i.e. "implement this simple regex specification."