from Hacker News

How to prevent scraping?

by metaprinter on 11/27/11, 12:22 AM with 5 comments

I've gone and built an extensive website for nursing students (that took forever to populate with data) but I'm wary of launching it until I learn how to prevent or minimize automated scraping of the content.

I thought about showing a teaser and requiring login to see everything, but then I lose out on google juice, no?

It's a LAMP environment. Any thoughts?

  • by georgemcbay on 11/27/11, 12:31 AM

    Any time you spend thinking about this is a waste. You can't stop scraping on the web, period. And any half-assed attempt you make to try it is going to kill you on SEO, as you already suspect.
  • by jnbiche on 11/27/11, 3:29 AM

    Your site will likely not be scraped unless/until it takes off. And once that happens, you'll have your foothold and no me-too site is going to surpass you unless they add more/better content. I wouldn't worry about it at this stage.
  • by stray on 11/27/11, 1:56 AM

    You can't prevent scraping, but you can poison it. I can think of two approaches:

    1. Replace bits of text on output with Unicode look-alikes. Humans will still read what you want them to read, but non-humans get crap.
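A minimal sketch of that first approach, assuming server-side text filtering before output (the character map and swap rate are illustrative choices, not a fixed recipe; note that, as other commenters warn, this garbles the text search engines index, too):

```python
import random

# Latin -> Cyrillic look-alikes (visually near-identical in most fonts)
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "i": "\u0456",
    "o": "\u043e", "p": "\u0440", "x": "\u0445", "y": "\u0443",
}

def poison(text, rate=0.15, seed=None):
    """Randomly swap some letters for Unicode look-alikes.

    Humans see the same glyphs on screen, but a scraped copy no longer
    matches the original byte-for-byte, breaking naive text reuse.
    """
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )

original = "cardiopulmonary resuscitation"
poisoned = poison(original, rate=0.5, seed=42)
print(poisoned)              # looks identical on screen...
print(poisoned == original)  # ...but compares unequal: False
```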

    2. The Mountweazel approach: seed the content with fake entries that humans would never go looking for. Then you can periodically Google those fake entries; any site other than your own that contains one got it by scraping your site.
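The second approach might be sketched like this; the term, fields, and token scheme are all invented for illustration:

```python
import secrets

def make_mountweazel():
    """Fabricate one plausible-looking but fictitious entry.

    The invented term appears nowhere else on the web, so a later web
    search for it should return only your own site; any other hit is a
    copy scraped from you. (Term and fields here are made up.)
    """
    token = secrets.token_hex(3)  # vary the term so each copy is traceable
    return {
        "term": f"Brennerman-{token} maneuver",
        "definition": "A fictitious honeytoken entry; not real content.",
        "honeytoken": True,  # your own templates can hide or bury it
    }

entry = make_mountweazel()
print(entry["term"])  # e.g. "Brennerman-a1b2c3 maneuver"
```

Mixing a handful of these into the real records costs nothing at serving time; the detection step is just a periodic search for the fake terms.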

    But honestly, most of our efforts to protect "our" work are just misguided busy-work...