from Hacker News

Cloudflare's new marketplace lets websites charge AI bots for scraping

by boristsr on 9/23/24, 1:31 PM with 270 comments

  • by billyhoffman on 9/23/24, 2:39 PM

    Common Crawl is shown in their screenshot of "Providers" alongside OpenAI and Anthropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback Machine.

    In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their own (poorly designed) crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry-standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-delay, uses sitemap.xml to know when to revisit, follows robots.txt, etc.).

    There are significant knock-on effects if Cloudflare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that separate those who pay from those who don't, and that applies whether they are bots or people.
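
    For readers unfamiliar with what "well behaved" means here, a minimal sketch of the politeness checks listed above (robots.txt rules, Crawl-delay, exponential backoff). This is not Common Crawl's actual code; the robots.txt content, the `ExampleBot` name, and the helper functions are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling each time up to `cap`."""
    return min(cap, base * (2 ** attempt))


def wait_before_fetch(parser: RobotFileParser, user_agent: str, url: str, failed_attempts: int):
    """Return None if the URL is disallowed, else the number of seconds to wait."""
    if not parser.can_fetch(user_agent, url):
        return None
    # crawl_delay() returns None when the site sets no Crawl-delay.
    crawl_delay = float(parser.crawl_delay(user_agent) or 0)
    if failed_attempts == 0:
        return crawl_delay
    # After failures, wait at least the site's delay, and longer as failures pile up.
    return max(crawl_delay, backoff_delay(failed_attempts - 1))


rules = load_rules(ROBOTS_TXT)
print(wait_before_fetch(rules, "ExampleBot", "https://example.com/private/x", 0))
print(wait_before_fetch(rules, "ExampleBot", "https://example.com/index.html", 0))
```

    A crawler that skips any one of these checks is the kind of traffic other commenters in this thread are complaining about.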

  • by creatonez on 9/23/24, 2:29 PM

    This seems like a gimmick. Isn't preventing crawling a Sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.
  • by neilv on 9/23/24, 2:26 PM

    Cloudflare found a new variation on their traditional service of protecting from abusers.

    This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

    And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

    I'd need to ask a lawyer how close this is to technically being a protection racket, or some other no-no.

  • by flaburgan on 9/23/24, 3:45 PM

    I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are constantly being DDoSed by bots that scrape everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and the people behind them not caring at all. We for sure need a solution to this. Keeping a system online nowadays means that not only do they get your data, you also pay for it!
  • by kijin on 9/23/24, 2:36 PM

    AI scrapers are parasites.

    I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.

    I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.
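
    A rough sketch of the policy described above: once a client crosses a load threshold it gets a 403 from then on. This is an assumption about how such a rule could look, not the commenter's actual setup; the class name, the threshold values, and the use of a plain string as the client identifier are all hypothetical.

```python
from collections import defaultdict, deque


class LoadBasedBlocker:
    """Permanently block any client that exceeds a request budget within a time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # client id -> timestamps of recent requests
        self._blocked = set()              # client ids that are blocked for good

    def status_for(self, client_id: str, now: float) -> int:
        """Return the HTTP status to send for a request from `client_id` at time `now`."""
        if client_id in self._blocked:
            return 403
        recent = self._recent[client_id]
        recent.append(now)
        # Forget requests that have fallen out of the sliding window.
        while recent and recent[0] <= now - self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_requests:
            self._blocked.add(client_id)
            del self._recent[client_id]
            return 403
        return 200


blocker = LoadBasedBlocker(max_requests=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, blocker.status_for("GreedyBot", t))
```

    In practice the client id would have to be something harder to change than a user agent string, such as an IP range or ASN, since anything the client supplies can be rotated.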

  • by FlyingSnake on 9/23/24, 2:21 PM

    More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...
  • by sdflhasjd on 9/23/24, 4:16 PM

    How long does the world-wide-web have left? It's always felt like it would be around forever, but at some point it will fade into obscurity like IRC has done. The golden age, I feel, has been gone a while, but "AI" seems like the beginning of the end.
  • by neilv on 9/23/24, 2:16 PM

    > A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.

    And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

  • by zebomon on 9/23/24, 6:14 PM

    Here's a look at my AI Audit on Bingeclock for anyone who's curious. Interesting drop in the last 48 hours given that it coincided with Cloudflare's announcement.

    https://www.bingeclock.com/blog/img/ai-audit-cloudflare-0923...

    The payment program sounds intriguing, I suppose. I can't imagine it will do much to move the needle for websites that will become unviable due to traffic drain. Without a doubt, AI scrapers will (quite rationally from their POV) avoid anything but nominal payments until they're forced to do otherwise.

  • by dageshi on 9/23/24, 6:35 PM

    Ahhh I love it. The era of silos has well and truly arrived. I hope websites milk every dollar they can from the AI startups; they can afford it!
  • by marcus_holmes on 9/24/24, 1:53 AM

    > If you don’t compensate creators one way or another, then they stop creating, and that’s the bit which has to get solved

    I'm not sure this is true. Maybe they stop creating commercial stuff for sale, and go do something else for money, but generally creative people don't stop creating just because they can't get paid for it.

  • by osigurdson on 9/23/24, 6:54 PM

    Next step: generate reams of content using generative AI and get paid by Cloudflare when this is scanned by generative AI.
  • by Mistletoe on 9/23/24, 2:09 PM

  • by boristsr on 9/23/24, 2:01 PM

    I'm pretty interested in how companies are exploring ways to properly monetize or compensate for scraped content to help keep a strong ecosystem of quality content. I'd love to see more efforts like this.
  • by kylehotchkiss on 9/23/24, 6:37 PM

    Is anybody else seeing an absolutely massive amount of Amazonbot crawls on their site? What are they up to? And why so aggressively?
  • by sharpshadow on 9/23/24, 4:16 PM

    It is indeed a huge waste to scrape the same whole site again and again for changes and new content. If Cloudflare is able to maintain an overview of changes and updates, it could save a lot of resources.

    The site could tell Cloudflare directly what changed, and Cloudflare could tell the AI. The AI buys the changes, and Cloudflare pays the site and keeps a margin.

  • by NoMoreNicksLeft on 9/23/24, 3:37 PM

    Great. The HR software my company uses can charge me when my own bot "scrapes" my paystub pdf.
  • by delanyoyoko on 9/23/24, 4:19 PM

    I guess with a marketplace like this, if webmasters are happy and the AI agents are also happy, then we'll be seeing quite a few services come up with similar solutions.

    The end result will then be a shift from search engine optimization to something like LLM optimization or prompt engine optimization.

  • by siliconc0w on 9/23/24, 3:53 PM

    Any recommendations for a simple WAF tool that will stop the majority of the abuse without having to use Cloudflare? I use Cloudflare just to keep that noise away from my logs, but I'm not super keen to be dependent on them.
  • by AtNightWeCode on 9/23/24, 6:21 PM

    Maybe they could solve some of the core issues instead. It is like CF lost the source code and is just pushing new, more or less useless features all the time. Even so, I think this is a fair change.
  • by CatWChainsaw on 9/24/24, 2:03 AM

    I guess Web3 will exist after all. In a microtransaction-per-webpage-utilized sense. No way websites don't start charging real people when there's money to be made.
  • by dangoodmanUT on 9/23/24, 4:53 PM

    the blog makes it seem like the bot buys access

    but if they are only tracking the bot via the user agent

    then can't i piggyback on that user agent?

    no ai scraper is going to include an auth header when accessing your website...
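
    The user agent is indeed just a header the client picks, so matching on it alone proves nothing. One common mitigation (not something the article confirms Cloudflare uses here) is to accept a claimed bot identity only when the request also comes from an IP range the bot's operator has published. A minimal sketch, where `ExampleBot` and its range are hypothetical and the range is a reserved documentation network:

```python
import ipaddress

# Hypothetical mapping from a bot's user-agent token to the IP ranges its
# operator publishes. 192.0.2.0/24 is a reserved documentation range.
PUBLISHED_RANGES = {
    "ExampleBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def is_verified_bot(user_agent: str, remote_ip: str) -> bool:
    """True only if the user agent names a known bot AND the IP is in that bot's ranges."""
    addr = ipaddress.ip_address(remote_ip)
    for token, networks in PUBLISHED_RANGES.items():
        if token.lower() in user_agent.lower():
            return any(addr in net for net in networks)
    return False


claimed = "Mozilla/5.0 (compatible; ExampleBot/1.0)"
print(is_verified_bot(claimed, "192.0.2.10"))   # genuine: IP inside the published range
print(is_verified_bot(claimed, "203.0.113.5"))  # piggybacker: right header, wrong network
```

    Someone copying the header from a residential connection fails the second check, which is why paid access would have to hang on something like this rather than on the header alone.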

  • by rahimnathwani on 9/23/24, 8:15 PM

    > While it’s a bold idea, Cloudflare is not sharing a fully fleshed-out idea of what its marketplace will look like.
  • by datavirtue on 9/24/24, 1:18 AM

    Wasn't the web designed to be scraped?
  • by 015a on 9/23/24, 3:51 PM

    One minor, tedious thing that I've become so tired of lately is showcased very plainly in the screenshot in this article: That the Cloudflare admin dashboard has now prominently placed "AI Audit (ALPHA)" as a top-level navigation menu item at the very top of the list of a Cloudflare Account's products. Everyone is doing this, for AI products or whatever came before them, and it genuinely pushes me away from paying for Cloudflare, as I get the distinct sense that they aren't building the things or fixing the problems that I feel are important to me.

    I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.

  • by renewiltord on 9/23/24, 9:30 PM

    Just use some residential proxy network and slam your target. They can't detect you.
  • by synack on 9/23/24, 8:58 PM

    Are they gonna let me block the scrapers that run on Cloudflare Workers?
  • by j45 on 9/23/24, 10:57 PM

    Neat licensing idea - look forward to seeing some case studies.
  • by johnisgood on 9/23/24, 3:29 PM

    How are they going to pay? How much? Can it be enforced?
  • by micromacrofoot on 9/23/24, 8:09 PM

    absent legal changes, this mostly rewards companies that figure out how to scrape without being detected. this problem existed before AI
  • by zkid18 on 9/23/24, 3:01 PM

    What's wrong with AI agents accessing website content? We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.
  • by Workaccount2 on 9/23/24, 2:29 PM

    Props to Cloudflare for referring to it as "scanning your data", which is probably the most technically accurate way to describe what AI training bots are doing.
  • by johnsutor on 9/23/24, 2:34 PM

    Or, you know, just create your own API for your platform and charge people per request to that.
  • by zackmorris on 9/23/24, 5:41 PM

    Boy I'm sick of clicking "Verify you are human" on everything from GitLab to banking apps running Cloudflare.

    Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.

    One company shouldn't be allowed to police access to the internet. And certainly shouldn't be allowed to start gatekeeping what is viewable by discriminating against the person or software doing the viewing.

    I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community. If you work there, it might be time to consider getting a different job. If you own stock, maybe divest. If you're connected, perhaps your associates can buy from competitors. That's probably the only way to get the board and CEO replaced these days.

  • by xyzzy_plugh on 9/23/24, 2:41 PM

    Ah yes, the ol' monopoly invents an illusionary marketplace ploy.

    Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s

    What absolute garbage.

  • by kelsey98765431 on 9/23/24, 2:31 PM

    lol good luck
  • by meiraleal on 9/23/24, 3:58 PM

    Wow, a big tech company thinking about creators, not about how to extract all it can from them but about how to give back. That has become so uncommon nowadays. Cloudflare deserves its exponential growth. Kudos to them.
  • by giancarlostoro on 9/23/24, 2:20 PM

    I really love Cloudflare. They're always up to something interesting and different. I hope we see more companies rise up similar to Cloudflare. I almost want to say Cloudflare is everything we hoped Google would be, but Google became another corporate cog machine that innovates and then scraps things in one swoop. I don't recall the last time I heard of Cloudflare spinning something up just to wind it back down. I don't think it's impossible for them to make a bad choice, but I think they typically think their projects through.

    My biggest problem with AI is that once it starts getting legislated, it will just be limited in how it can function and be built. We are going to lock existing LLMs like ChatGPT into the lead and stop anyone from competing, since they won't be able to train on the same data.

    My other big problem with "AI", or really LLMs, which is what everyone's hyped about, is the lack of offline-first capabilities.