by adamlj on 2/28/14, 11:43 AM with 118 comments
by spindritf on 2/28/14, 1:10 PM
Many people don't write for money, to put ads on their website, or as part of some "content marketing" campaign. All they want is a little recognition. A boost in positioning on the SERP means we will be getting useful stuff at no cost.
And there are genuine replies there. Ryan Jones[1] even got the scrapers to confess their sins[2].
[1] https://twitter.com/RyanJones/status/439123533349015553
[2] https://www.google.com/search?q=%20%22istwfn%22+%22stole+thi...
by VikingCoder on 2/28/14, 12:40 PM
You may as well just show http://images.google.com and complain that it's scraping. Or http://news.google.com.
In general, do you think Wikipedia gets more traffic because Google exists, or do you think Google gets more traffic because Wikipedia exists? Meaning, which effect is larger? I'm pretty sure the answer to this is obvious.
And if more scrapers donated millions to the site they scrape from, the world would be a much better place.
http://wikimediafoundation.org/wiki/Press_releases/Wikimedia...
by xuki on 2/28/14, 12:22 PM
by jjoonathan on 2/28/14, 12:50 PM
1) Hasn't been chunked into 20 pieces of varying grammatical structure which are automatically matched to corresponding questions
2) Hasn't been subsequently pasted over a slideshow of completely irrelevant stock photos in bold, white font
3) Isn't accompanied by a grid of ~30 vaguely related questions helpfully linked to similar pages and tastefully decorated with more irrelevant stock photos
4) Only occupies ~1.5 rather than 3 or 4 of the front page search results
5) Contains only closely related textual ads rather than a melange of casino, fast food, and online college banners
6) Has fewer than 25 trustworthy stock faces smiling back at me from any given scroll position
If this is the best google can do then I don't think wiki.answers.com has anything to fear.
------------
Seriously, how the hell does wiki.answers.com manage to pollute half of the searches I make with their algorithmically generated garbage (multiple times, at that)?! What kind of SEO catapulted them to the top despite 0 viewer retention and what surely must be about 0 reputable backlinks? How haven't they been sent to the 1000th page with manual penalties already? They show up before wikipedia itself, for crying out loud!
Google, if you aren't going to let users maintain a manual blacklist, you need to be on top of this kind of thing. It's seriously degrading my search experience and I suspect I'm not alone. This kind of inattention is the type of thing that can push even the most inattentive users to change default search engines.
by pud on 2/28/14, 12:54 PM
So this is neither scraping, nor against the rules.
Here are dumps in SQL and XML format:
http://dumps.wikimedia.org/enwiki/
PS: Yes, the original post was meant to be funny, and it was; I do have a sense of humor. :)
by _wmd on 2/28/14, 12:36 PM
by level09 on 2/28/14, 1:37 PM
by fear91 on 2/28/14, 12:54 PM
It's a shame that the search engine market share isn't split evenly among several different engines. I think it would be beneficial to both users and website owners. Right now everyone tries to court Google and they seem to do whatever the fuck they want.
by smoyer on 2/28/14, 12:44 PM
EDIT: I should also note that I'm one of those who switched over to DuckDuckGo for privacy reasons, so I don't see these results as often now.
by k-mcgrady on 2/28/14, 12:38 PM
by 300bps on 2/28/14, 12:36 PM
In testing, they definitely don't seem to scrape every article:
by habosa on 2/28/14, 3:53 PM
</rant>
by ITB on 2/28/14, 5:05 PM
1. They are not only doing this with wikipedia, but with many, many sites: "what is the smallest cell in the human body", "what is the biggest planet in the solar system".
2. The sites they choose to link are not always the highest quality sites, as in the two examples above. Why are these websites being featured?
3. Many times, the user will get their answer right then and there, and be done with the search process. The site misses a visitor. Even though these types of questions are about "facts", someone took the time to organize and give context to these "facts". Turning facts into useful, consumable content costs money. Google should not be taking visitors away from these sites.
4. There should be public information on the CTR of these snippets. See if it helps or hurts the user.
5. Google is abusing its power as a major search engine to reinforce structuring rules, such as microformats. With these rules, webmasters are giving more and more semantic meaning to their content, which means Google has an easier time completing their knowledge graph. They might link to the source site for a while, but there is no good argument for linking back to wikipedia to attribute the fact that Jupiter is the largest planet, since it's a fact, just like 2+2 is 4 (no attribution).
6. Google is all about ML/NLP/AI driven knowledge. But in reality they are turning all of the internet's content creators into a giant sweatshop for their knowledge graph. This is not fair, and sooner or later it will come back to bite them.
by higherpurpose on 2/28/14, 1:25 PM
by altcognito on 2/28/14, 12:30 PM
by Angostura on 2/28/14, 12:42 PM
by nkuttler on 2/28/14, 1:04 PM
A happy DDG user, who still uses !g too often though.
by tobehonest on 2/28/14, 12:40 PM
"Do as I say, not as I do" -- Google
by Grue3 on 2/28/14, 12:46 PM
by ricg on 2/28/14, 12:39 PM
by sebii on 2/28/14, 12:36 PM
by gwu78 on 3/1/14, 4:46 AM
By this definition webcache.googleusercontent.com qualifies.
It is a full copy of every site GoogleBot scrapes.
Google gives attribution to the original source, but if this isn't "scraping", what is?
They have been sued for this, and they've won. The benefits of a decent search engine outweigh the burden of infringing the copyrights of others. At least where Google and other search engines that cache websites are concerned.
by baldfat on 2/28/14, 12:49 PM
Seriously, that was just a stretch, but they both say the full URL. So all of Google News is a scraper site, and any other summary given is a scraper site then. Sad.
by return0 on 2/28/14, 4:49 PM
by cousin_it on 2/28/14, 2:28 PM
by bhartzer on 2/28/14, 12:57 PM
Matt is looking for scrapers that rank better than the original, basically meaning that they have higher PageRank and more links.
by lazyjones on 2/28/14, 5:45 PM
by MitziMoto on 2/28/14, 9:30 PM
by rip747 on 2/28/14, 2:28 PM
by globalpanic on 2/28/14, 4:37 PM
by motyar on 3/1/14, 2:01 AM
by iamabraham on 2/28/14, 12:37 PM
by pearjuice on 2/28/14, 6:56 PM