by rams on 5/6/11, 5:01 PM with 31 comments
by neilk on 5/6/11, 6:44 PM
http://www.mediawiki.org/wiki/Alternative_parsers#Known_impl...
Most of these are special-purpose hacks. Kiwi and Sweble are the most serious projects I'm aware of that have tried to generate a full parse.
However, few of these projects are useful for upgrading Wikipedia itself. Even the general parsers like Sweble are effectively special-purpose, since we have a lot of PHP that hooks into the parser and warps its behaviour in "interesting" ways. The average parser geek usually wants to write to a cleaner spec in, well, any language other than PHP. ;)
Currently the Wikimedia Foundation is just starting a MediaWiki.next project. Parsing is just one of the things we are going to change in major ways -- fixing this will make it much easier to do WYSIWYG editing or to publish content in ways that aren't just HTML pages.
(Obviously we will be looking at Sweble carefully.)
If this sounds like a fun project to you, please get in touch! Or check out the "Future" portal on MediaWiki.org.
by sigil on 5/6/11, 6:21 PM
For one, Sweble is a Java parser, and I'm not sure this makes it a good drop-in replacement for the current MediaWiki PHP code. The DBPedia Project also has what looks like a decent AST-based Java parser [1]. I would be interested in a comparison between Sweble and DBPedia's WikiParser.
I stumbled across a very nice MediaWiki scanner and parser in C a while ago [2]. It uses ragel [3] for the scanner; the parser is not a completely generic AST builder, but is rather specific to the problem of converting MediaWiki markup to some other wiki markup. It does do quite a bit of the parser work already though.
Presumably a PHP extension around a C or C++ scanner/parser could someday replace the current MediaWiki parsing code.
[1] http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=...
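To make the scanner-plus-converter split concrete, here is a rough sketch of the idea in Python. It is not the C project's code and it does not use ragel; it swaps in Python's `re` module as the scanner, handles only bold, italic, and plain links, and targets Markdown purely as an example output format. The names `TOKEN_RE`, `scan`, and `to_markdown` are made up for illustration.

```python
import re

# Hypothetical token grammar covering a tiny subset of MediaWiki inline markup.
# Order matters: ''' (bold) must be tried before '' (italic).
TOKEN_RE = re.compile(r"""
    (?P<BOLD>''')
  | (?P<ITALIC>'')
  | (?P<LINK>\[\[(?P<target>[^\]|]+)(?:\|(?P<label>[^\]]+))?\]\])
  | (?P<TEXT>[^'\[]+|.)
""", re.VERBOSE | re.DOTALL)


def scan(source):
    """Yield (kind, value) tokens for the supported subset of wiki markup."""
    for m in TOKEN_RE.finditer(source):
        if m.group("BOLD"):
            yield ("BOLD", m.group("BOLD"))
        elif m.group("ITALIC"):
            yield ("ITALIC", m.group("ITALIC"))
        elif m.group("LINK"):
            target = m.group("target")
            # A link without a label displays its target.
            yield ("LINK", (target, m.group("label") or target))
        else:
            yield ("TEXT", m.group("TEXT"))


def to_markdown(source):
    """Convert the token stream straight to Markdown, without building an AST."""
    out = []
    for kind, value in scan(source):
        if kind == "BOLD":
            out.append("**")
        elif kind == "ITALIC":
            out.append("*")
        elif kind == "LINK":
            target, label = value
            out.append(f"[{label}]({target})")
        else:
            out.append(value)
    return "".join(out)


print(to_markdown("'''Kiwi''' and ''Sweble'' parse [[MediaWiki|wiki markup]]."))
```

Because the converter works directly on tokens, it is cheap to write but tied to one output format, which is the trade-off against a generic AST builder.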
by bjonathan on 5/6/11, 5:56 PM
cached version: http://webcache.googleusercontent.com/search?q=cache:8xjwEj-...
by sunir on 5/6/11, 6:27 PM
First, I believe this reveals the complexity of the parser, which implies a complex syntax, which implies a complex user interface as felt by end users. A more complex user interface may make it harder to attract new editors, although it's unclear (to me) whether that is a fact.
Second, having an AST representation is awesome. It makes it possible to even think about building a path towards WYSIWYG or some other form of rich text editing. It was not really possible to build a WYSIWYG editor around the wiki syntax.
If you have an AST, you can also store the page as the AST since you can regenerate the wiki syntax from the AST for people who need text-based editors.
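A minimal sketch of what I mean by regenerating the syntax, in Python. The node types (`Text`, `Bold`, `Italic`, `Link`, `Heading`) and the functions `to_wikitext` and `render` are hypothetical and cover only a sliver of real wiki markup; this is not how Sweble or MediaWiki represents pages.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Bold:
    children: List["Node"]


@dataclass
class Italic:
    children: List["Node"]


@dataclass
class Link:
    target: str
    label: str = ""


@dataclass
class Heading:
    level: int
    children: List["Node"]


Node = Union[Text, Bold, Italic, Link, Heading]


def to_wikitext(node):
    """Serialize one AST node back to wiki syntax."""
    if isinstance(node, Text):
        return node.value
    if isinstance(node, Bold):
        return "'''" + render(node.children) + "'''"
    if isinstance(node, Italic):
        return "''" + render(node.children) + "''"
    if isinstance(node, Link):
        if node.label:
            return f"[[{node.target}|{node.label}]]"
        return f"[[{node.target}]]"
    if isinstance(node, Heading):
        marks = "=" * node.level
        return f"{marks} {render(node.children)} {marks}\n"
    raise TypeError(f"unknown node type: {type(node).__name__}")


def render(nodes):
    """Serialize a list of sibling nodes."""
    return "".join(to_wikitext(n) for n in nodes)


# The stored form of the page is the tree; the text is derived on demand.
page = [
    Heading(2, [Text("History")]),
    Text("The "),
    Bold([Text("parser")]),
    Text(" is described in "),
    Link("MediaWiki", "the manual"),
    Text("."),
]

print(render(page))
```

A rich text editor could manipulate the same tree directly, and anyone who prefers a text editor gets the output of `render`.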
by mdaniel on 5/6/11, 6:24 PM
I suppose this is one of the knobs that must be tuned to balance between reproducible I/O and turning away meaningful contributions from the community.
by Semiapies on 5/6/11, 5:33 PM
by pornel on 5/6/11, 6:01 PM
by car on 5/6/11, 9:39 PM
by brianjolney on 5/6/11, 5:51 PM
by seanp2k on 5/6/11, 6:45 PM