by jodoglevy on 2/20/14, 10:51 PM with 43 comments
by brey on 2/21/14, 8:13 AM
if you can see or submit data using the website, it means the website does
have some kind of API ... For example, if you do a search in Yahoo, you can
see the search page sent to Yahoo’s servers has the following url ...
https://search.yahoo.com/search?p=search+term
no. no no no. this is not an API. this is about as far from an application programming INTERFACE as it can get. an interface means an agreed format, where there's a contract (social or otherwise) to provide stability to APPLICATION clients. there's no contract here other than 'a human types something into the box, presses some buttons and some results appear on the website'.
/search?p=search+term is an implementation detail hidden from the humans the site is built for. they can, and most likely will, change it at any time. the HTML returned (and being scraped) is an implementation detail. today, HTML. tomorrow, AJAX. next week? who knows, maybe Flash.
fine, it's a scraper builder. but don't call what it's using an API, and don't imply it's anything more than a fragile house of cards built on the shaky foundation of 'if you get noticed you're going to get banned or a cease and desist'.
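To make the objection concrete, here is roughly what calling that "API" amounts to in practice - a minimal Python sketch using requests and BeautifulSoup, where the CSS selector is a guess at today's markup rather than anything Yahoo documents or promises to keep:

    # Minimal sketch of scraping the search URL quoted above. The selector
    # "h3.title a" is an assumption about the current HTML; if Yahoo changes
    # its markup, this silently breaks, which is the point being made.
    import requests
    from bs4 import BeautifulSoup

    def search(term):
        resp = requests.get("https://search.yahoo.com/search", params={"p": term})
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        return [a.get_text() for a in soup.select("h3.title a")]

    print(search("hacker news"))

Nothing in that code is covered by any contract; it depends entirely on markup the site is free to change.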
by ChuckMcM on 2/20/14, 11:34 PM
by digitalboss on 2/20/14, 11:42 PM
by loceng on 2/21/14, 12:28 AM
by h3ro on 2/21/14, 10:10 PM
What I like most about your competition, though, is the JS interface that gets used for one last good thing (before the page is properly scraped and de-AD- and de-java-fied): clicking on the content you want and deselecting the content you don't want. Subtly, with your mouse, you guide a pattern-matching algorithm that does the annoying work.
Honestly, the simplicity of that interface is even more breathtaking to me than gargl :P But it's also more limited: after two clicks it thinks it has already understood the pattern, even though that might not be the case.
I'd suggest integrating the idea, but making the learning process cleverer: let the user keep selecting things, even when the engine thinks there can't be any more similar ones. Give that AI more to learn from. We want more identifiers than just counts and HTML elements: "2nd subelement of <h1>".
There's good stuff you can do with statistics, too. Some data exists only once, some exists only 3 times, some always exists over 10 times. That's valuable info. Some data has many words of whitespace-separated text - oh, a paragraph!
tl;dr: We need something that automatically generates good semantics out of normal web sites, so that users can use a simple Web UI overlaid on the target web site to choose the right pattern.
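Something like that selection-generalization step could look roughly like this - a toy sketch, not gargl's or any competitor's actual algorithm, which assumes the clicked elements are described as CSS-like paths: blank out the steps where the clicked paths differ, then the generalized pattern's match count gives exactly the kind of frequency statistic mentioned above.

    # Hypothetical sketch: generalize the CSS-like paths of clicked elements
    # by wildcarding the positions where they differ. The path format
    # ("tag:nth-child(n) > ...") is an assumption for illustration.
    def generalize(paths):
        split = [p.split(" > ") for p in paths]
        depth = min(len(s) for s in split)
        pattern = []
        for i in range(depth):
            parts = {s[i] for s in split}
            if len(parts) == 1:
                pattern.append(parts.pop())
            else:
                # differing step: keep the tag/class, wildcard the position
                tag = parts.pop().split(":")[0]
                pattern.append(tag + ":nth-child(*)")
        return " > ".join(pattern)

    clicked = [
        "body > div#results > div.item:nth-child(2) > h1 > a",
        "body > div#results > div.item:nth-child(5) > h1 > a",
    ]
    print(generalize(clicked))
    # body > div#results > div.item:nth-child(*) > h1 > a

A real implementation would work on live DOM nodes and report how many matches the pattern produces, so the user can keep refining it instead of being cut off after two clicks.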
by anigbrowl on 2/21/14, 12:32 AM
by yid on 2/20/14, 11:33 PM
Also, it seems like authenticated sites would be difficult to scrape with this, i.e. ones that require login and possibly some logic (like sending a hash of request parameters) with every request.
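For what it's worth, here is a hypothetical sketch of why that gets awkward: you first need a login step to obtain session cookies, and some sites additionally expect a signature over the request parameters with every call. The /login and /api/orders endpoints and the "sig" parameter below are invented for illustration.

    # Hypothetical authenticated-scraping flow (endpoints and signing
    # scheme are made up; a real site's scheme would differ).
    import hashlib
    import requests

    session = requests.Session()
    session.post("https://example.com/login",
                 data={"user": "me", "password": "secret"})  # sets auth cookies

    params = {"page": "1", "per_page": "50"}
    payload = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    params["sig"] = hashlib.sha256(payload.encode()).hexdigest()  # site-specific

    resp = session.get("https://example.com/api/orders", params=params)
    print(resp.status_code)

A record-and-replay tool would have to capture both the login exchange and whatever logic computes that signature, which is where simple request templates stop being enough.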
by jheriko on 2/20/14, 11:43 PM
in that regard it's nice to see the big warning at the top of the page about ease of misuse (and a refreshing slap in the face - i was thinking 'pfft some hipster forgot common sense again' and expecting not to see anything of the kind)
there is something off here and i can't quite put my finger on it though... as a low-level programmer I cringe when I hear web people using API to describe some weird little subset of APIs anyway. Here I feel almost like what this does is take an existing 'API' (http - the internet) and refactor the interface in highly specific ways to make it easier to use...
At any rate, it's a clever idea and nice to see such a well-thought-through implementation - but it's also far too open to misuse imo. I wish the creator the best of luck... hopefully no takedown requests too soon.
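One way to picture that "refactoring the interface" framing: the generated wrapper is just a named, typed function sitting on top of one very specific HTTP exchange. The endpoint and response fields below are invented for illustration, not anything gargl actually emits.

    # Loose illustration: a friendlier, narrow interface over raw HTTP.
    # The URL, parameters, and JSON shape are hypothetical.
    from dataclasses import dataclass
    import requests

    @dataclass
    class Result:
        title: str
        url: str

    def search_products(query: str, limit: int = 10) -> list[Result]:
        resp = requests.get("https://example.com/search",
                            params={"q": query, "n": limit})
        resp.raise_for_status()
        # assumes the endpoint happens to return JSON today; if it starts
        # returning HTML instead, the wrapper breaks
        return [Result(i["title"], i["url"]) for i in resp.json()["items"]]

The stability of that signature is only as good as the implementation details underneath it, which is where the earlier "not really an API" objection comes back in.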
by benwilber0 on 2/21/14, 1:12 AM
by dfgonzalez on 2/21/14, 2:30 AM
by notastartup on 2/21/14, 12:06 AM
Armchair lawyers, please advise; we need more details.