from Hacker News

The Parser that Cracked the MediaWiki Code

by rams on 5/6/11, 5:01 PM with 31 comments

  • by neilk on 5/6/11, 6:44 PM

    This isn't the first alternative parser for MediaWiki content -- there are 28 rows in this table. (I just added Sweble's and my own project...)

    http://www.mediawiki.org/wiki/Alternative_parsers#Known_impl...

    Most of these are special-purpose hacks. Kiwi and Sweble are the most serious projects I'm aware of that have tried to generate a full parse.

    However, few of these projects are useful for upgrading Wikipedia itself. Even the general parsers like Sweble are effectively special-purpose, since we have a lot of PHP that hooks into the parser and warps its behaviour in "interesting" ways. The average parser geek usually wants to write to a cleaner spec in, well, any language other than PHP. ;)

    Currently the Wikimedia Foundation is just starting a MediaWiki.next project. Parsing is just one of the things we are going to change in major ways -- fixing this will make it much easier to do WYSIWYG editing or to publish content in ways that aren't just HTML pages.

    (Obviously we will be looking at Sweble carefully.)

    If this sounds like a fun project to you, please get in touch! Or check out the "Future" portal on MediaWiki.org.

    http://www.mediawiki.org/wiki/Future

  • by sigil on 5/6/11, 6:21 PM

    It's great to see people tackling this problem, but I wouldn't declare victory for sweble just yet ("The Parser That Cracked..."). There are other promising MediaWiki parser efforts out there.

    For one, sweble is a Java parser, and I'm not sure this makes it a good drop-in replacement for the current MediaWiki PHP code. The DBPedia Project also has what looks like a decent AST-based Java parser [1]. I would be interested in a comparison between sweble and DBPedia's WikiParser.

    I stumbled across a very nice MediaWiki scanner and parser in C a while ago [2]. It uses ragel [3] for the scanner; the parser is not a completely generic AST builder, but is rather specific to the problem of converting MediaWiki markup to some other wiki markup. It does do quite a bit of the parser work already though.
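
    (To make the scanner/parser split concrete, here is a minimal sketch of a scanner for a tiny subset of MediaWiki markup. It is written in Python with the standard `re` module rather than C and ragel, and it is not taken from the wikitext project above; the token names and the `scan` function are hypothetical, chosen only for illustration.)

```python
import re

# Hypothetical token set covering only bold, italic and internal links.
# Order matters: "'''" must be tried before "''".
TOKEN_SPEC = [
    ("BOLD", r"'''"),
    ("ITALIC", r"''"),
    ("LINK_OPEN", r"\[\["),
    ("LINK_CLOSE", r"\]\]"),
    ("PIPE", r"\|"),
    ("TEXT", r"[^'\[\]|]+"),  # a run of characters that cannot start markup
    ("CHAR", r"."),           # a stray quote or bracket, kept as-is
]

TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def scan(source):
    """Split wikitext into (token_name, matched_text) pairs without losing input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


tokens = scan("''hi'' [[A|b]]")
```

    A parser would then consume this token stream. Because every character ends up in some token, joining the token values gives back the original input, which is useful when sloppy markup still has to be accepted.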

    Presumably a PHP extension around a C or C++ scanner/parser could someday replace the current MediaWiki parsing code.

    [1] http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=...

    [2] http://git.wincent.com/wikitext.git

    [3] http://www.complang.org/ragel/

  • by bjonathan on 5/6/11, 5:56 PM

  • by sunir on 5/6/11, 6:27 PM

    This is a breakthrough and a welcome one. From an end-user point of view, it has a couple of major implications.

    First, I believe this reveals the complexity of the parser, which implies a complex syntax, which implies a complex user interface as felt by end users. A more complex user interface may make it harder to attract new editors, although it's unclear (to me) whether that is a fact.

    Second, having an AST representation is awesome. It makes it possible to even think about building a path towards WYSIWYG or some other form of rich text editing. It was not really possible to build a WYSIWYG editor around the wiki syntax.

    If you have an AST, you can also store the page as the AST since you can regenerate the wiki syntax from the AST for people who need text-based editors.
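
    (A minimal sketch of that round trip, in Python. The node classes and `to_wikitext` are hypothetical and cover only plain text, bold, italic and internal links; this is not Sweble's AST or API, just an illustration of regenerating wiki syntax from a stored tree.)

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Bold:
    children: List["Node"]


@dataclass
class Italic:
    children: List["Node"]


@dataclass
class Link:
    target: str
    label: str = ""  # empty label means the link shows its target


Node = Union[Text, Bold, Italic, Link]


def to_wikitext(nodes):
    """Serialize a list of AST nodes back into wiki markup."""
    out = []
    for node in nodes:
        if isinstance(node, Text):
            out.append(node.value)
        elif isinstance(node, Bold):
            out.append("'''" + to_wikitext(node.children) + "'''")
        elif isinstance(node, Italic):
            out.append("''" + to_wikitext(node.children) + "''")
        elif isinstance(node, Link):
            inner = node.target + "|" + node.label if node.label else node.target
            out.append("[[" + inner + "]]")
        else:
            raise TypeError(f"unknown node type: {type(node).__name__}")
    return "".join(out)


# The stored form of the page is the tree; the text is derived on demand.
page = [
    Text("The "),
    Bold([Text("Sweble")]),
    Text(" parser builds an "),
    Link("Abstract syntax tree", "AST"),
    Text("."),
]
wikitext = to_wikitext(page)
```

    A rich text editor could work on `page` directly, while people who prefer a text-based editor would be handed the output of `to_wikitext(page)`.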

  • by mdaniel on 5/6/11, 6:24 PM

    From reading the article, and especially the interesting comments thereon, it seems this problem is half a bogus "language" specification and half that the unwashed masses are inputting any damn thing they like and Wikipedia accepts it.

    I suppose this is one of the knobs that must be tuned to balance between reproducible I/O and turning away meaningful contributions from the community.

  • by Semiapies on 5/6/11, 5:33 PM

    I hadn't realized that there were any parsing issues around MediaWiki's markup. 5000 lines of PHP? Eek.

  • by pornel on 5/6/11, 6:01 PM

    AST of an example page is the interesting bit:

    http://sweble.org/crystalball/result?query=ASDF&format=t...

  • by car on 5/6/11, 9:39 PM

    The site is down due to hard disk problems, but the Sweble Wikipedia Parser project site that the article actually references is at http://www.sweble.org.

  • by brianjolney on 5/6/11, 5:51 PM

    link died. any mirrors?

  • by seanp2k on 5/6/11, 6:45 PM

    I think we killed this poor site.