from Hacker News

Sites scramble to block ChatGPT web crawler after instructions emerge

by specto on 8/11/23, 9:48 PM with 34 comments

by Meekro on 8/11/23, 11:49 PM
I have two sites that provide documentation for open source libraries I've created, and I definitely won't be blocking ChatGPT. It has already read my documentation and can correctly answer most StackOverflow-level questions about my libraries' use. This is seriously impressive and very helpful, as far as I'm concerned.
by vouaobrasil on 8/12/23, 1:05 AM
From the article:
> For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.
I would rather leave the internet entirely if AI chatbots become a primary user interface.
by 8organicbits on 8/11/23, 10:38 PM
I wonder if its worth poisoning the replies for scrapers that don't obey robots.txt. Send back nonsense, lies, and noise. This would be an adversarial approach like https://adnauseam.io/ uses for ad tracking.
by JohnFen on 8/11/23, 10:29 PM
> blocking GPTBot will not guarantee that a site's data does not end up training all AI models of the future. Aside from issues of scrapers ignoring robots.txt files, there are other large data sets of scraped websites (such as The Pile) that are not affiliated with OpenAI.
This is why I'm not reassured. robots.txt isn't sufficient to stop all webcrawlers, so there every reason to think it isn't sufficient to stop AI scrapers.
I'm still wanting to find a good solution to this problem so that I can open my sites up to the public again.
by wildpeaks on 8/12/23, 11:40 AM
This gives the illusion of being in control, but if enough people block the bot, they'll just scrape differently (if they don't already) because too much money is at stake, more than whatever fine they may get if they do get caught and can't settle out of court, not to mention they may consider it will be someone else's problem by then.
It's more pragmatic to expect that any data that can be accessed one way or another will be scraped because interests aren't aligned between content authors and scrapers.
On the other hand, robots.txt was benefiting both search engines and content authors because it signaled data that wasn't useful to show in search results, therefore search engines had an incentive to follow its rules.
by blibble on 8/11/23, 10:53 PM
blocked it on every single site I manage
there is zero benefit to me in allowing OpenAI to absorb my content
it is a parasite, plain and simple (as is GitHub Copilot)
and I'll be hooking in the procedurally generated garbage pages for it soon!
by karaterobot on 8/11/23, 10:52 PM
The article does not say whether it obeys `User-agent: *`. My guess is that, if it doesn't respect that, it doesn't truly respect `User-agent: GPTBot` either.
by askvictor on 8/11/23, 11:31 PM
I've been reading lots of datasheets and application notes in the embedded space recently. Most of these are only accessible after creating a (free) login. In one sense, it's a reasonably simple way to prevent scraping like this (at least until the AI-based scrapers can generate their own logins). On the other hand, a lot of that kind of material would be _really_ useful to be able to ask an LLM about.
by CableNinja on 8/12/23, 7:04 AM
For anyone reading this, you can skip the robots.txt, as others have pointed out, who knows if they will actually listen to it.
Instead, use a redirect or return a response code by doing a user agent check in your server config. I posted elsewhere in this thread on the way i did it with nginx