by jacquesm on 4/16/24, 11:05 PM with 60 comments
While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That can serve as a baseline to calibrate various information sources against, to get an idea of whether or not they should be used. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low background steel'.
https://en.wikipedia.org/wiki/Low-background_steel
by simonw on 4/16/24, 11:27 PM
(A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs)
by vitovito on 4/16/24, 11:36 PM
2021: https://twitter.com/jackclarkSF/status/1376304266667651078
2022: https://twitter.com/william_g_ray/status/1583574265513017344
2022: https://twitter.com/mtrc/status/1599725875280257024
Common Crawl and the Internet Archive crawls are probably the two most readily available sources for this; you just have to define where you want to draw the line.
Common Crawl's first crawl of 2020 contains 3.1B pages, and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview
Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
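To make "drawing the line" concrete, here is a minimal sketch, assuming Common Crawl crawl IDs follow the CC-MAIN-YYYY-WW pattern seen in the link above (a year plus a week number). The function names, the cutoff value, and the example IDs other than CC-MAIN-2020-05 are illustrative assumptions, not part of any Common Crawl tooling.

```python
import re

# Assumed ID shape, based on "CC-MAIN-2020-05": four-digit year, two-digit week.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_key(crawl_id):
    """Return (year, week) parsed from a crawl ID, or raise ValueError."""
    match = CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def crawls_before(crawl_ids, cutoff):
    """Keep crawl IDs strictly older than cutoff=(year, week), oldest first."""
    kept = [c for c in crawl_ids if crawl_key(c) < cutoff]
    return sorted(kept, key=crawl_key)


# Illustrative input; the cutoff (2022, week 48) is an arbitrary assumption.
example_ids = ["CC-MAIN-2023-06", "CC-MAIN-2020-05", "CC-MAIN-2022-49"]
print(crawls_before(example_ids, (2022, 48)))
```

Where the cutoff goes is the actual judgment call; the code only applies whatever line you pick.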
by talldayo on 4/16/24, 11:22 PM
They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.
Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.
by uyzstvqs on 4/16/24, 11:30 PM
The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.
by potatoman22 on 4/16/24, 11:13 PM
by Zenzero on 4/17/24, 2:36 AM
by skybrian on 4/16/24, 11:23 PM
by signaru on 4/17/24, 10:16 AM
by jdswain on 4/17/24, 1:37 AM
by anigbrowl on 4/17/24, 12:52 AM
That said, it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.
by aaronblohowiak on 4/16/24, 11:12 PM
by neilk on 4/17/24, 2:36 AM
A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.
https://udongein.xyz/notice/AcwmRcIzxOLmrSamum
There are some obvious problems with it, but I think I'd still like to see what that would look like.
by giantg2 on 4/17/24, 12:58 AM
by dudus on 4/17/24, 12:46 AM
by ccgreg on 4/18/24, 4:24 PM
by greyzor7 on 4/18/24, 9:38 AM
by wseqyrku on 4/17/24, 7:49 AM
This would only apply to the pre-AGI era, though.
by MattGaiser on 4/16/24, 11:23 PM
I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.
by metadat on 4/16/24, 11:12 PM
Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it and they also caused this issue?
I'm not holding my breath.
by jamesy0ung on 4/17/24, 2:05 AM
by acheron on 4/16/24, 11:26 PM
by mceoin on 4/17/24, 12:12 AM
by RecycledEle on 4/17/24, 4:52 AM
by LorenDB on 4/17/24, 1:07 AM
by fuzztester on 4/17/24, 12:29 AM
I had actually posted a question about this around that time, but the only reply I got was from a guy saying it is not likely, because the HN hive mind would drive down such posts.
Not sure if he was right, because I still see evidence of such stuff.
by alpenbazi on 4/16/24, 11:54 PM
by keepamovin on 4/17/24, 2:31 AM
What are you searching for anyway??