from Hacker News

HTML/XML Parsing with Node & jQuery

by mjijackson on 11/14/11, 6:00 PM with 10 comments

  • by pshc on 11/14/11, 10:33 PM

    I was scraping with jQuery for a while but it felt like an awful lot of overhead. For simpler scraping tasks that run often, I've actually gone back to nuts and bolts with HTML5[1]'s tokenizer and a custom state machine that only accumulates the data I want. At no time is any DOM node actually created in memory, let alone the entire DOM tree. That means I feel safer running many of these in parallel on a VPS. It also means I can write a nice streaming API that starts emitting data the moment it gets enough input. Buffering input just feels wrong in node.js.

    But jQuery is a great scraper if your transformation is complex and non-streamable.

    [1] https://github.com/aredridel/html5
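    To make the tokenizer-plus-state-machine idea concrete, here is a minimal sketch. It is not pshc's code and does not use the html5 library's tokenizer; it assumes a hand-rolled two-state scanner, and `TagTextScraper` is a hypothetical name invented for illustration. It accepts input in chunks, never builds a DOM node, and emits the text of each wanted element as soon as its closing tag arrives. It deliberately ignores comments, script contents, entities, and `>` inside attribute values, all of which a real HTML5 tokenizer handles.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper (not part of any library): streams the text content
// of every <tagName> element without creating any DOM nodes.
class TagTextScraper extends EventEmitter {
  constructor(tagName) {
    super();
    this.wanted = tagName.toLowerCase();
    this.state = 'text';  // 'text' = outside markup, 'tag' = between '<' and '>'
    this.tagBuf = '';     // characters seen since the last '<'
    this.inside = false;  // true while inside a wanted element
    this.textBuf = '';    // text accumulated for the current wanted element
  }

  // Feed one chunk of input; state carries over between calls,
  // so a tag may be split across chunk boundaries.
  write(chunk) {
    for (const ch of chunk) {
      if (this.state === 'text') {
        if (ch === '<') {
          this.state = 'tag';
          this.tagBuf = '';
        } else if (this.inside) {
          this.textBuf += ch;
        }
      } else if (ch === '>') {
        this.state = 'text';
        this.handleTag(this.tagBuf);
      } else {
        this.tagBuf += ch;
      }
    }
  }

  // Decide whether a completed tag opens or closes a wanted element.
  handleTag(raw) {
    const closing = raw.startsWith('/');
    const body = closing ? raw.slice(1) : raw;
    const name = body.trim().split(/[\s/]/)[0].toLowerCase();
    if (name !== this.wanted) return;
    if (!closing) {
      this.inside = true;
      this.textBuf = '';
    } else if (this.inside) {
      this.inside = false;
      this.emit('data', this.textBuf.trim());
    }
  }
}

// Usage: chunks are split mid-element and mid-tag on purpose.
const scraper = new TagTextScraper('h2');
const found = [];
scraper.on('data', (text) => found.push(text));
scraper.write('<h1>Title</h1><h2>Fir');
scraper.write('st</h2><p>body</p><h');
scraper.write('2 class="x">Second</h2>');
console.log(found); // prints [ 'First', 'Second' ]
```

    The only memory held at any point is the current tag and the text of the element being collected, which is what makes running many of these in parallel cheap.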

  • by ricardobeat on 11/14/11, 6:26 PM

        doc.find('h2:gt(0)').before('<hr />')

  • by peteretep on 11/14/11, 9:53 PM

    Actually, I'm doing this for my SUPER SECRET startup at the moment. Originally the front-end would just send the back-end the whole HTML of a user's page when they executed the browser plugin, and the back-end would intercept it and knock it up in Perl.

    I wasn't sure how well that was going to scale, and I was worried people would get weird about sending the entire contents of the page they're on. I now have a 90% working solution where it's all done in-browser, with a bunch of classes I've been working on with a node.js set of testing tools.

  • by bialecki on 11/14/11, 8:37 PM

    One of my biggest pet peeves with crawling the web is using XPath. Not because I have strong feelings about XPath, just that I use CSS selector syntax so much that it's a pain I can't leverage that knowledge in this domain as well. Something like this is really awesome and is going to make crawling the web more accessible.

  • by orc on 11/14/11, 7:36 PM

    Wow, I was just thinking this morning how awesome it would be to make a desktop app that could crawl websites with jQuery. And since node.js has a Windows installer, it sounds like a much better solution than the C# HtmlAgilityPack I've been using.

  • by slashclee on 11/14/11, 11:23 PM

    Apparently node.js doesn't implement the DOMParser object, which means that you can't actually use jQuery's parseXML method. That's a bummer :(