from Hacker News

How to prevent scraping?

by metaprinter on 11/27/11, 12:22 AM with 5 comments

I've gone and built an extensive website for nursing students (that took forever to populate with data) but I'm wary of launching it until I learn how to prevent or minimize automated scraping of the content.

I thought about showing a teaser and requiring login to see everything, but then I lose out on google juice, no?

It's a LAMP environment. Any thoughts?

  • by georgemcbay on 11/27/11, 12:31 AM

    Any time you spend thinking about this is a waste. You can't stop scraping on the web, period. And any half-assed attempt you make to try it is going to kill you on SEO, as you already suspect.
  • by jnbiche on 11/27/11, 3:29 AM

    Your site will likely not be scraped unless/until it takes off. And once that happens, you'll have your foothold and no me-too site is going to surpass you unless they add more/better content. I wouldn't worry about it at this stage.
  • by stray on 11/27/11, 1:56 AM

    You can't prevent scraping, but you can poison it. I can think of two approaches:

    1. Replace bits of text on output with Unicode look-alikes. Humans will still read what you want them to read, but non-humans get crap.
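A minimal sketch of that first approach, assuming server-side text filtering before output (the character map and swap rate are illustrative choices, not a fixed recipe; note that, as other commenters warn, this garbles the text search engines index, too):

```python
import random

# Latin -> Cyrillic look-alikes (visually near-identical in most fonts)
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "i": "\u0456",
    "o": "\u043e", "p": "\u0440", "x": "\u0445", "y": "\u0443",
}

def poison(text, rate=0.15, seed=None):
    """Randomly swap some letters for Unicode look-alikes.

    Humans see the same glyphs on screen, but a scraped copy no longer
    matches the original byte-for-byte, breaking naive text reuse.
    """
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )

original = "cardiopulmonary resuscitation"
poisoned = poison(original, rate=0.5, seed=42)
print(poisoned)              # looks identical on screen...
print(poisoned == original)  # ...but compares unequal: False
```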

    2. The Mountweazel approach: seed the content with fake entries that humans would never go looking for. Then you can periodically Google those fake entries; any site other than your own that contains one got it by scraping your site.
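The second approach might be sketched like this; the term, fields, and token scheme are all invented for illustration:

```python
import secrets

def make_mountweazel():
    """Fabricate one plausible-looking but fictitious entry.

    The invented term appears nowhere else on the web, so a later web
    search for it should return only your own site; any other hit is a
    copy scraped from you. (Term and fields here are made up.)
    """
    token = secrets.token_hex(3)  # vary the term so each copy is traceable
    return {
        "term": f"Brennerman-{token} maneuver",
        "definition": "A fictitious honeytoken entry; not real content.",
        "honeytoken": True,  # your own templates can hide or bury it
    }

entry = make_mountweazel()
print(entry["term"])  # e.g. "Brennerman-a1b2c3 maneuver"
```

Mixing a handful of these into the real records costs nothing at serving time; the detection step is just a periodic search for the fake terms.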

    But honestly, most of our efforts to protect "our" work are just misguided busy-work...