from Hacker News

Ask HN: Why is ChatGPT allowed to scrape other sites via prompts?

by jbryu on 5/21/24, 8:55 PM with 45 comments

The fact that I can give ChatGPT any URL and extract html content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this in the first place for for-profit projects?
  • by bicx on 5/22/24, 5:12 AM

    Google scrapes like a maniac. And for profit. Many others do the same.

    A website can put up a TOS prohibiting such use, but my understanding is that this is essentially unenforceable if the site is publicly accessible.

    The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...

    If you’re trying to prevent scraping of your data, your best option is to not make it public.

  • by Nextgrid on 5/21/24, 11:22 PM

    If you can paste the URL in a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different than a remotely-hosted browser you control via natural language, or asking a human assistant to do it and email you the result.
  • by persedes on 5/21/24, 11:53 PM

    I've encountered a couple of robots.txt files that specifically block popular LLM crawlers from certain areas. Example:

    https://www.sigmaaldrich.com/robots.txt
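
For anyone curious how such a rule is evaluated, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are hypothetical and are not the contents of the file linked above; `example.com` and the `/private/` path are placeholders, and `is_allowed` is just an illustrative helper. `GPTBot` is the user-agent token OpenAI documents for its crawler. Keep in mind that robots.txt is advisory: it only has an effect when the client chooses to check it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not a real site's robots.txt).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "curl"):
        for path in ("/private/page", "/public/page"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(agent, url))
```

A crawler that respects the file would skip `/private/` when identifying as `GPTBot`, while any other agent falls through to the permissive `*` entry.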

  • by icedchai on 5/21/24, 11:50 PM

    My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.
  • by brianjking on 5/21/24, 9:28 PM

  • by tripplyons on 5/21/24, 10:57 PM

    Scraping and violating TOS are not illegal to do, but they can get you blocked.
  • by xcasperx on 5/22/24, 6:43 PM

    I believe this is current precedent around scraping:

    https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

  • by brudgers on 5/22/24, 5:28 AM

    Terms of service enforcement is a matter of civil law.

    Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.

  • by mensetmanusman on 5/21/24, 11:58 PM

    Preventing scraping also entrenches google for eternity.
  • by rl3 on 5/22/24, 12:23 AM

    The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.
  • by 8note on 5/22/24, 1:56 AM

    Why? It's just another user agent. Curl does the same thing, as do Chrome and Firefox.
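
To make the "just another user agent" point concrete, here is a rough sketch with Python's standard `urllib.request`. It only builds request objects and sends nothing. The URL, the `build_request` helper, and both user-agent strings are made up for illustration; they are not what any real tool sends.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Two hypothetical clients asking for the same page.
curl_like = build_request("https://example.com/", "curl/8.0.0")
agent_like = build_request("https://example.com/", "ExampleAgent/1.0")

if __name__ == "__main__":
    for req in (curl_like, agent_like):
        # urllib stores header names capitalized as "User-agent".
        print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

From the server's side the two requests differ only in that one header, which is why blocking by user agent works only against clients that identify themselves honestly.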