from Hacker News

What Happened to XPath?

by AhtiK on 10/30/20, 9:54 AM with 93 comments

  • by orf on 10/30/20, 12:48 PM

    XPath post 1.0 got ridiculous, like many things do. What started as a simple, elegant language morphed into one with an HTTP client, filesystem methods, JSON support, functions, loops, extensions and the ability to read environment variables.

    I wrote a post about it a while back[1] (I regret some of the wording used there) and maintain a tool[2] that can exploit XPath injection issues. I'd recommend sticking with 1 or maybe 2, and pretending 3.x doesn't exist.

    1. https://tomforb.es/xcat-1.0-released-or-xpath-injection-issu...

    2. https://github.com/orf/xcat

  • by irjustin on 10/30/20, 12:51 PM

    Anyone who does scraping or automated browser work eventually comes across XPath.

    In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, far fewer people use it.

    I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about w/ just CSS.

    After taking the dive, it's so powerful. Read a single XPath and, like regex, you can fully understand what the thing is going after and how it will get there.

    There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.

  • by Crazyontap on 10/30/20, 1:57 PM

    I only recently realized how powerful XPath is for web scraping. I'd been using CSS selectors for my occasional scraping needs and never bothered to learn XPath until one day, on a whim, I decided to learn at least the basics.

    Man, I can now write scrapers in 2 minutes that used to take me quite some time, thanks to the power of XPath. Things like ancestors, contains, the ability to chain, etc. are so, so powerful. I used to write so many hacks just to do the same with CSS before.

  • by benibela on 10/30/20, 2:07 PM

    The biggest problem with the new XPath versions is that the W3C made the standards, but almost no one implemented them, so you cannot actually use them.

    I was doing web scraping and needed regular expressions to get the text, so I implemented XPath 2. Currently I am updating it to XPath 3.1: http://www.videlibri.de/xidel.html

  • by thom on 10/30/20, 2:05 PM

    XPath and XSLT were where (despite doing Haskell at university) I first started to really understand functional programming. The first time was working on a tech stack that was basically Microsoft SHAPE queries transformed into HTML. The second was multiple projects customising Google custom search engine results. It was weird realising that these very limited primitives were actually infinitely powerful if you were willing to warp your brain the right way.

    That said, I scrape a fair few webpages now and have never once revisited XPath. I suppose people have mostly written off anything that feels too much like XML as enterprisey and deprecated.

  • by ping_pong on 10/30/20, 3:26 PM

    XPath and XML in general are a great example of "Death by Committee". They tried too hard to be too smart and to solve everything, and overcomplicated it to death. This is why people largely abandoned it. The same thing is happening to C++: they are steering themselves by committee into a dead end.
  • by projektfu on 10/30/20, 12:58 PM

    With increasing power comes the likelihood that people accidentally implement behavior that is nonpolynomial. It looks good in testing but then with real live data starts taking seconds to render/re-render. There are probably examples of this already in CSS but seems more likely with arbitrarily backtracking XPath expressions.
  • by anonymousblip on 10/30/20, 5:34 PM

    I love the XPath model of declaratively querying and transforming data, which has been highly influential (see JQ, JSONPath, GROQ, etc.). Ultimately, it was too closely tied to XML, which was overdesigned, complex, and sucked into the committee hell that brought us more overdesigned technologies like SOAP and XML Schema.
  • by mongol on 10/30/20, 2:17 PM

    Xpath 1.0 is maybe the single most useful output from the XML universe. Did something like it exist before?
  • by icedchai on 10/30/20, 1:55 PM

    XPath 1.0 was released in the late 90’s. I remember using it in some server-side XML processing code (Java 1.2?) It did the job where the alternative was writing a ton of procedural code to get at a specific node, etc.
  • by lkuty on 10/30/20, 3:05 PM

    XPath 3 and XQuery 3 are powerful and great technologies to query XML if you need that stuff. The problem is that most implementations cover only XPath 1.0 because, I guess, it is too difficult (i.e. time consuming and involved) to produce a 2.x or 3.x implementation, let alone one with full W3C XML Schema support. There is also BaseX, a nice native XML database that implements XQuery 3.x. I really dig XML and its technologies. I wish XQuery 3.x was available everywhere.
  • by jarym on 10/30/20, 1:47 PM

    Shameless plug of DefiantJS[1] that gives a lovely fast XPath query capability to JSON data.

    1. https://defiantjs.com

  • by dehrmann on 10/30/20, 6:59 PM

    One of the huge gaps in JSON tooling is there isn't a standard XPath equivalent (there's JSON Pointer, but it's nowhere close to XPath, and JSON Path which isn't standardized) and no XSLT equivalent.

    For as painful as XSLT was, at least it was a standard thing that existed.
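
    To make the gap concrete, here is a minimal sketch of a JSON Pointer (RFC 6901) resolver in plain Python. The function name `resolve_pointer` and the sample `catalog` data are made up for illustration, and the sketch skips some of the RFC's validation (for example, it does not reject array indices with leading zeros). A pointer addresses exactly one value by a fixed path; it has no predicates, wildcards or axes, so anything resembling the XPath query `//book[price < 10]/title` still has to be written by hand.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against already-parsed JSON data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for raw_token in pointer[1:].split("/"):
        # Unescape in the order the RFC requires: '~1' first, then '~0'.
        token = raw_token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array step: token is an index
        else:
            current = current[token]  # object step: token is a key
    return current


# Hypothetical sample data, not taken from the thread.
catalog = json.loads(
    '{"books": [{"title": "A", "price": 5}, {"title": "B", "price": 20}]}'
)

# JSON Pointer can name one fixed location...
first_title = resolve_pointer(catalog, "/books/0/title")

# ...but a filter such as XPath's //book[price < 10]/title is ordinary code.
cheap_titles = [book["title"] for book in catalog["books"] if book["price"] < 10]
```

    Everything XPath would express inside the query (the filter, the descent, the projection) ends up in the host language instead.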

  • by johnward on 10/30/20, 5:12 PM

    I do a bunch of XML/XSLT work still. I use XPATH 1.0 basically every day. It's also awesome for web scraping. Overall, it's a great tool that doesn't get a ton of exposure.
  • by mapgrep on 10/30/20, 2:34 PM

    Is there something I can read to get up to speed on xpath? Any recommendations for online or printed resources? (Particularly from folks who use it regularly!)
  • by varispeed on 10/30/20, 9:57 PM

    I remember spending a good two weeks writing an XPath parser in C, and then the client changed their system responses to JSON. That was my last experience with XPath.
  • by chriswarbo on 10/30/20, 4:02 PM

    XPath is great, and works equally well in lumbering, ceremony-heavy Enterprise Java environments; and in quick bash one-liners.

    I use it in a bunch of scraping scripts for Web sites which don't provide RSS feeds. It's really nice for quickly 'exploring' a document to find the needed data; it's simple to update when sites change their layout; and it can be read in from a config file, argument, env var, etc. to keep things generic and flexible.
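
    As a rough sketch of the "read it from an env var" idea, the Python below keeps the XPath expression outside the scraping logic. The variable name `SCRAPER_LINK_XPATH`, the sample page and the helper `extract_links` are all hypothetical. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset (no `ancestor::` axis, no `contains()`) and needs well-formed markup; real pages usually call for an HTML-tolerant parser with full XPath 1.0, such as lxml.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page standing in for a fetched document.
PAGE = """<html><body>
<div class="post"><h2>First</h2><a href="/first">read</a></div>
<div class="post"><h2>Second</h2><a href="/second">read</a></div>
<div class="ad"><a href="/buy">buy</a></div>
</body></html>"""

# Default expression; override it via the (made-up) environment variable
# when the site changes its layout, without touching the code.
DEFAULT_LINK_XPATH = ".//div[@class='post']/a"


def extract_links(markup, xpath):
    """Return the href of every element matched by the given expression."""
    root = ET.fromstring(markup)
    return [element.get("href") for element in root.findall(xpath)]


xpath = os.environ.get("SCRAPER_LINK_XPATH", DEFAULT_LINK_XPATH)
links = extract_links(PAGE, xpath)
```

    Swapping in a different expression, for instance `.//div[h2='Second']/a`, changes what gets scraped while the script itself stays generic.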

  • by forgotmypw17 on 10/30/20, 6:06 PM

    XPath is hard to replace when writing Selenium WebDriver scripts. Thank you for existing, XPath.
  • by mimixco on 10/30/20, 12:53 PM

    I thought XPath was pretty terrific for the day. It let you transform XML into a user interface in an entirely declarative way -- not just the appearance of items like CSS but the actual content could be inspected and altered. I built some cool things in XPath before frameworks like Angular took over.
  • by techsin101 on 11/3/20, 7:56 AM

    CSS selectors aren't an alternative to XPath; the alternative would be to write it out yourself in JS, a sort of whole-tree parsing algorithm. There are times when this is the only option when scraping.
  • by chrshawkes on 11/2/20, 3:27 PM

    What is the alternative for accurate scraping?
  • by dzonga on 10/30/20, 3:40 PM

    If you do any type of web scraping, XPath is the way to go. Thanks to my former co-worker Justin for showing me that.
  • by dsq on 10/30/20, 3:55 PM

    I used xpath last week for something
  • by tinus_hn on 10/30/20, 1:00 PM

    This is that weird language you use to make WebDAV servers look okay in a browser, right?
  • by katzgrau on 10/30/20, 12:48 PM

    It's hard not to read this as satire, because XPath is so inelegant. Not that CSS selectors are a model of elegance, but they get the job done (most of the time) and are easy enough for rookie devs and designers to pick up.