from Hacker News

Reddit's Robots.txt Changed

by titaniumtown on 7/7/24, 6:36 AM with 51 comments

  • by CobrastanJorji on 7/7/24, 7:17 AM

    Interesting. There have been all sorts of "Reddit's content is key to search results, adding 'reddit' to search results makes them good" stories. And there's been a lot of talk about how some of the big ML makers, notably Google, depend on Reddit's content to train their AI. And Google has that recent $60 million deal for content. So clearly Reddit's execs have been talking about how their content is valuable and they shouldn't give it away for free.

    But at the same time, blocking search engines from indexing your social media site is a dangerous game. Any search engine that respects this is gonna effectively de-list Reddit. That's no good for views, and views are what make Reddit money. Presumably they have negotiated private deals with Google and probably Microsoft for this and are trying to sell their data to ML companies, because otherwise this would seem suicidal.

    Kind of a shame. The information is still going to get shared around to all the giant corporations, but Reddit will presumably make it harder to access for all the little guys. And the more they tie the content to dollars, the more managers on the inside will start doing stupid things to try and generate more of whatever the most valuable kinds of content are.

  • by sunaookami on 7/7/24, 7:39 AM

    Related thread for the official blog post: https://news.ycombinator.com/item?id=40799275

    Side note: They seem to serve different robots.txt files to different User-Agents and IPs: https://merj.com/blog/investigating-reddits-robots-txt-cloak...

  • by benreesman on 7/7/24, 7:17 AM

    If anyone is curious how deeply destructive, how deeply approved by the NSA, or how deeply self-sabotaging for society the modern AI training data pipeline is becoming, I’d refer them to SB 1047.

    OpenAI is openly collaborating with the NSA, Google is manipulating the definition of a web crawl, Anthropic has installed a bunch of humanitarians from Jump Trading as the leading mech interp group that makes strident claims about how all this stuff works based on weights you do not and never will have access to.

    They’re telling you: “And you will do nothing, because you can do nothing.”

    I invite you to join me in proving that we can in fact do something.

  • by dageshi on 7/7/24, 7:13 AM

    Yup, this is what I thought would happen. It wouldn't surprise me if reddit goes a step further and requires login to view pages.

    AI is the death knell of the web as we've known it for the past three decades. Information that was once freely available will retreat behind login walls, and sites will charge big companies for access to train their models.

    I wonder whether some standardised data API will be settled upon. Perhaps one already exists?

  • by rany_ on 7/7/24, 9:44 AM

    https://www.reddit.com/robots.txt has additional comments:

      # Welcome to Reddit's robots.txt
      # Reddit believes in an open internet, but not the misuse of public content.
      # See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
      # See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
      # policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
    
      User-agent: *
      Disallow: /
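
    As a rough illustration of what the two directives quoted above mean for a crawler that honours them, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the agent name `ExampleBot`, and the sample URL are hypothetical, and the sketch covers only the rules quoted in this comment, not whatever Reddit may serve to particular User-Agents or IPs.

```python
from urllib.robotparser import RobotFileParser

# Only the two directives quoted in the comment above; the real file
# may differ depending on who requests it (an assumption left untested here).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical sample URL and agent names, used only for illustration.
    url = "https://www.reddit.com/r/programming/"
    for agent in ("Googlebot", "ExampleBot"):
        print(agent, is_allowed(agent, url))
```

    Because the wildcard `User-agent: *` group disallows `/`, every path is off limits to any crawler that follows the file; whether a given crawler actually follows it is a separate question.
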
  • by Seattle3503 on 7/7/24, 7:23 AM

    Technically this title violates HN's title policy, as it should just be "Reddit's robots.txt" or something, but "Reddit's robots.txt changed" is more useful. I'm curious to see if mods change it.

  • by hamilyon2 on 7/7/24, 8:07 AM

    Is it a good time to start a competitor? Given that Reddit might take Quora's path to oblivion.

  • by eps on 7/7/24, 8:13 AM

    This is not good.

    80% of my Google searches for other people's opinions now end with "site:reddit.com", and there are surprisingly many of them. The alternative is Reddit's own search, and it tends to produce less relevant results.

  • by skilled on 7/7/24, 6:55 AM

    Every search engine other than Google has stopped indexing pages from Reddit.

    Google has not commented on whether they plan to respect it. Rich Results[0] say they're using a version from June 25. The new version was last modified July 1.

    [0]: https://search.google.com/test/rich-results

  • by jkhanlar on 7/7/24, 11:54 AM

    Also it seems that since 2018 it has not actually changed, lol http://web.archive.org/web/20180501000000*/https://old.reddi...

  • by nubinetwork on 7/7/24, 8:28 AM

    Most robots don't honour robots.txt anyways...

  • by jkhanlar on 7/7/24, 10:08 AM

    lol what? Just 20-30 minutes ago I saw this at #52, and when I tried to find it again now I see it at #468 https://archive.ph/ReTR5 but I wonder if that is algorithmically natural or whatnot, lol

  • by Lorin on 7/7/24, 7:36 AM

    Oh great, now we can rely on Reddit's own renowned search functionality /s

    What are they thinking?