from Hacker News

Tell HN: We should snapshot a mostly AI output free version of the web

by jacquesm on 4/16/24, 11:05 PM with 60 comments

While we can, and if it isn't too late already. The web is overrun with AI-generated drivel. I've been searching for information on some widely varying subjects and I keep landing in recently auto-generated junk. Unfortunately, most search engines associate 'recency' with 'quality' or 'relevance', and that is very much no longer true.

While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That can serve as a baseline to calibrate various information sources against, to get an idea of whether or not they should be trusted. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low-background steel'.

https://en.wikipedia.org/wiki/Low-background_steel

  • by simonw on 4/16/24, 11:27 PM

    Sounds like you want Common Crawl - they have snapshots going back to 2013; take your pick: https://data.commoncrawl.org/crawl-data/index.html

    (A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs)
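
    If you want to script the "take your pick" part, Common Crawl's index service publishes a JSON list of every snapshot (collinfo.json - endpoint assumed from their public index service). A minimal Python sketch:

      import requests  # third-party; pip install requests

      # Common Crawl's index service publishes a JSON list of all crawl
      # snapshots (assumed endpoint; see index.commoncrawl.org).
      COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"

      def list_crawls():
          resp = requests.get(COLLINFO_URL, timeout=30)
          resp.raise_for_status()
          for crawl in resp.json():
              # Each entry carries an id like "CC-MAIN-2020-05" and a name.
              print(crawl["id"], "-", crawl["name"])

      if __name__ == "__main__":
          list_crawls()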

  • by vitovito on 4/16/24, 11:36 PM

    2024 might already be too late, since this sentiment has been shared since at least 2021:

    2021: https://twitter.com/jackclarkSF/status/1376304266667651078

    2022: https://twitter.com/william_g_ray/status/1583574265513017344

    2022: https://twitter.com/mtrc/status/1599725875280257024

    Common Crawl and the Internet Archive crawls are probably the two most readily available sources for this; you just have to define where you want to draw the line.

    Common Crawl's first crawl of 2020 contains 3.1B pages, and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview

    Internet Archive's crawls are here: https://archive.org/details/web, organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
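
    If you want to pull one of those Common Crawl snapshots down, each crawl ships a gzipped list of its WARC file paths (the warc.paths.gz convention from Common Crawl's getting-started docs). A minimal sketch that prints the first few download URLs:

      import gzip
      import io

      import requests  # third-party; pip install requests

      CRAWL_ID = "CC-MAIN-2020-05"  # the ~3.1B-page crawl mentioned above
      PATHS_URL = f"https://data.commoncrawl.org/crawl-data/{CRAWL_ID}/warc.paths.gz"

      def list_warc_urls(limit=5):
          # warc.paths.gz is gzipped text, one WARC file path per line.
          resp = requests.get(PATHS_URL, timeout=60)
          resp.raise_for_status()
          with gzip.open(io.BytesIO(resp.content), "rt") as fh:
              for i, line in enumerate(fh):
                  if i >= limit:
                      break
                  # Prefix the data host to get a downloadable URL.
                  print("https://data.commoncrawl.org/" + line.strip())

      if __name__ == "__main__":
          list_warc_urls()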

  • by talldayo on 4/16/24, 11:22 PM

    > I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on

    They probably just use publicly available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.

    Paradoxically, I think a lot of research is showing that synthetic training data can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.

  • by uyzstvqs on 4/16/24, 11:30 PM

    I posted this recently on another thread as well, but before AI-generated spam there was content farm spam. That has been increasing in search results and on social networking sites for years now.

    The solution is sticking to the websites you trust. And LLMs with RAG can actually make for a really good, very relevant search engine (see the sketch below).
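
    To sketch that idea: retrieve passages from a hand-picked set of trusted pages, and hand only those to the model. A toy illustration using TF-IDF retrieval via scikit-learn; the ask_llm() stub is hypothetical, not a real API:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # A tiny hand-curated corpus standing in for "websites you trust".
      trusted_docs = [
          "Low-background steel is steel made before atmospheric nuclear tests.",
          "Common Crawl publishes periodic snapshots of the public web.",
          "The Internet Archive preserves historical copies of web pages.",
      ]

      def retrieve(query, docs, k=2):
          # Rank trusted documents by TF-IDF cosine similarity to the query.
          vec = TfidfVectorizer().fit(docs + [query])
          scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
          ranked = sorted(zip(scores, docs), reverse=True)
          return [doc for _, doc in ranked[:k]]

      def ask_llm(prompt):
          # Placeholder: wire up whatever model you trust here.
          raise NotImplementedError

      def answer(query):
          context = "\n".join(retrieve(query, trusted_docs))
          return ask_llm(f"Answer using only this context:\n{context}\n\nQ: {query}")

      print(retrieve("pre-AI snapshots of the web", trusted_docs))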

  • by potatoman22 on 4/16/24, 11:13 PM

    I feel like archive.org and The Pile have this covered, no?
  • by Zenzero on 4/17/24, 2:36 AM

    This implies that the pre-AI internet wasn't already overrun with SEO-optimized junk. Much of the internet is not worth preserving.
  • by skybrian on 4/16/24, 11:23 PM

    SEO content farms have been publishing for decades now.
  • by signaru on 4/17/24, 10:16 AM

    Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming "hidden gems". Something like Marginalia, which already does this for SEO noise, would be nice.
  • by jdswain on 4/17/24, 1:37 AM

    At least I think I can tell when I am reading AI-generated content, and can stop reading and go somewhere else. Eventually, though, it'll get better to the point where it'll be hard to tell, but maybe then it's also good enough to be worth reading?
  • by anigbrowl on 4/17/24, 12:52 AM

    I don't really have this problem because I habitually use the Tools option on Google (or equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO-optimizers like this are generally pretty easy to screen out.

    That said it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.

  • by aaronblohowiak on 4/16/24, 11:12 PM

    Internet archive?
  • by neilk on 4/17/24, 2:36 AM

    Using "before:2023" in your Google query helps. For now.

    A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.

    https://udongein.xyz/notice/AcwmRcIzxOLmrSamum

    There are some obvious problems with it, but I think I'd still like to see what that would look like.

  • by giantg2 on 4/17/24, 12:58 AM

    Sure, we can take a snapshot of our bot-filled web today before it goes true AI. I'm not sure what the real benefit would be.
  • by dudus on 4/17/24, 12:46 AM

    I have a sliver of hope that AI-generated content will actually be good one day, just like I believe automated cars will be better than human drivers. For some of my reading, I have nothing against content that was written by an AI.
  • by ccgreg on 4/18/24, 4:24 PM

    I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low background steel as an example.
  • by greyzor7 on 4/18/24, 9:38 AM

    That's what archive.org already does, but if you want to re-implement it, you would have to crawl the whole web and eventually save thumbnails of pages with something like ScreenshotOne (https://microlaunch.net/p/screenshotone)
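
    A minimal sketch of the thumbnail step, with Playwright swapped in as a stand-in since I can't vouch for ScreenshotOne's API:

      # pip install playwright && playwright install chromium
      from playwright.sync_api import sync_playwright

      def save_thumbnail(url, out_path):
          # Render the page in headless Chromium and save a screenshot.
          with sync_playwright() as p:
              browser = p.chromium.launch()
              page = browser.new_page()
              page.goto(url, wait_until="networkidle")
              page.screenshot(path=out_path)
              browser.close()

      save_thumbnail("https://example.com", "example.png")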
  • by wseqyrku on 4/17/24, 7:49 AM

    > recently auto-generated junk

    This would only apply to the pre-AGI era, though.

  • by MattGaiser on 4/16/24, 11:23 PM

    Is this really all that different from the procedurally generated drivel or the offshore freelance copy/paste generated drivel?

    I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.

  • by metadat on 4/16/24, 11:12 PM

    Reality is a mess in a lot of ways. Unfortunately in this case, it's a bit late.

    Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it, and they caused this issue in the first place?

    I'm not holding my breath.

  • by jamesy0ung on 4/17/24, 2:05 AM

    Internet Archive exists for webpages
  • by acheron on 4/16/24, 11:26 PM

    The web has been overrun by drivel for over two decades now.
  • by mceoin on 4/17/24, 12:12 AM

    Isn’t this common crawl?
  • by RecycledEle on 4/17/24, 4:52 AM

    It's way too late.
  • by LorenDB on 4/17/24, 1:07 AM

    r/Datahoarder probably already has you covered.
  • by fuzztester on 4/17/24, 12:29 AM

    The same seems to have been happening on HN for the last several months.

    I had actually posted a question about this around that time, but the only reply I got was from a guy saying it was not likely, because the HN hive mind would drive down such posts.

    Not sure if he was right, because I still see evidence of such stuff.

  • by alpenbazi on 4/16/24, 11:54 PM

    yes
  • by keepamovin on 4/17/24, 2:31 AM

    Embrace it. Stop living in the past, Gatsby. Just ask ChatGPT for the answers you seek. Hahaha! :)

    What are you searching for anyway??