by jgrahamc on 6/10/25, 5:55 PM with 268 comments
by gojomo on 6/10/25, 10:37 PM
guaranteed human output - anyone who emits text in these ranges that was AI generated, rather than artisanally human-composed, goes straight to jail.
for human eyes only - anyone who lets any AI train on, or even consider, any text in these ranges goes straight to jail. Fnord, "that doesn't look like anything to me".
admittedly AI generated - all AI output must use these ranges as disclosure, or - you guessed it - those pretending otherwise go straight to jail.
Of course, all the ranges generate visually-indistinguishable homoglyphs, so it's a strictly-software-mediated quasi-covert channel for fair disclosure.
When you cut & paste text from various sources, the provenance comes with it via the subtle character encoding differences.
I am only (1 - epsilon) joking.
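The homoglyph scheme above can be sketched in a few lines. This is a toy illustration, not a real proposal: the Latin-to-Cyrillic lookalike mapping below is one arbitrary choice of "admittedly AI generated" range, and the function names are invented for the example.

```python
# Toy sketch of the homoglyph provenance idea: tag text as
# "AI generated" by swapping in visually identical lookalike
# characters, so the marking survives copy & paste.
# The Latin -> Cyrillic mapping here is illustrative only.

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
              "c": "\u0441", "p": "\u0440", "x": "\u0445"}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}

def mark_as_ai(text: str) -> str:
    """Re-encode text using lookalike characters as disclosure."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def provenance(text: str) -> str:
    """Classify text by whether it contains tagged characters."""
    return "AI generated" if any(ch in REVERSE for ch in text) else "unmarked"
```

The marked and unmarked strings render identically on screen but compare unequal byte-for-byte, which is exactly the "strictly-software-mediated quasi-covert channel" being joked about.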
by K0balt on 6/10/25, 9:12 PM
by Legend2440 on 6/10/25, 6:41 PM
Long-run you want AI to learn from actual experience (think repairing cars instead of reading car repair manuals), which both (1) gives you an unlimited supply of noncopyrighted training data and (2) handily sidesteps the issue of AI-contaminated training data.
by protocolture on 6/11/25, 6:49 AM
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature that it can generally be used."
I don't see:
1. That there will be a need for "uncontaminated" data. LLM data is probably slightly better than the natural background Reddit comment, falsehoods and all.
2. That "uncontaminated" data will be difficult to find, what with archive.org, Project Gutenberg, etc.
3. That LLM output is going to infest everything anyway.
by ACCount36 on 6/10/25, 7:33 PM
AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
by schmookeeg on 6/10/25, 6:34 PM
by koolba on 6/10/25, 6:51 PM
by onecommentman on 6/11/25, 7:58 AM
I’ve had AIs outright lie about facts, and I’m glad to have had a physical library available to convince myself that I was correct, even if I couldn’t convince the AI of that in all cases.
by nialv7 on 6/10/25, 8:08 PM
by gorgoiler on 6/10/25, 8:27 PM
I too am optimistic that recursive training on data that is a mixture of original human content, content derived from that original content, content derived from content derived from original human content, …ad nauseam, will be able to extract the salient features and patterns of the underlying system.
by submeta on 6/11/25, 6:45 AM
I realise that when I write (not so perfect) "organic" content, my colleagues enjoy it more. And as I am lazy, I get right to the point: no prelude, no "Summary", just a few paragraphs of genuine ideas.
And I am sure this will be a trend again. Until maybe LLMs are trained to generate this kind of non-perfect, less noisy text.
by vunderba on 6/10/25, 7:53 PM
by Ekaros on 6/10/25, 8:18 PM
On the other hand, a lot of poor-quality content could still be factually valid enough, just not well edited or formatted.
by ChrisArchitect on 6/10/25, 7:00 PM
Came up a month or so ago in a discussion about Wikipedia: Database Download (https://news.ycombinator.com/item?id=43811732). I missed that it was jgrahamc behind the site. Great stuff.
by aunty_helen on 6/10/25, 7:13 PM
by swyx on 6/10/25, 6:43 PM
i do have to say outside of twitter i don't personally see it all that much. but the normies do seem to encounter it and are either 1) fine with it? or 2) oblivious? and perhaps SOME non-human-origin noise is harmless.
(plenty of humans are pure noise, too, dont forget)
by jeffchuber on 6/10/25, 9:42 PM
by carlosjobim on 6/10/25, 10:09 PM
It is also uncontaminated by AI.
by tomgag on 6/11/25, 11:05 AM
by thm on 6/10/25, 6:37 PM
by sorokod on 6/10/25, 7:27 PM
by Animats on 6/10/25, 11:14 PM
That takes us back to the days when men were men, women were women, gays were criminals, trannies were crazy, and the sun never set on the British Empire.[1]
by yodon on 6/10/25, 7:19 PM
I strongly suspect more people are in the first category than the second.
by steve_gh on 6/11/25, 6:02 AM
by Crontab on 6/10/25, 9:12 PM
When I see a JGC link on Hacker News I can't help but remember using PopFile on an old PowerMac - back when Bayesian spam filters were becoming popular. It seems so long ago but it feels like yesterday.
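For anyone who missed that era: filters like POPFile classified mail with naive Bayes over word counts. A minimal sketch of the technique (not POPFile's actual code; class and method names are made up for illustration):

```python
# Minimal naive Bayes text classifier, in the spirit of the
# Bayesian spam filters of the POPFile era (illustrative sketch).
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training docs

    def train(self, label, text):
        self.word_counts.setdefault(label, Counter()).update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(counts.values())
            for w in words:
                score += math.log((counts[w] + 1) / (n + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```

Train it on a handful of labelled messages and `classify` picks the label whose word statistics best explain the new text; the add-one smoothing keeps unseen words from zeroing out a whole class.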
by Ferret7446 on 6/12/25, 8:17 AM
If you can distinguish AI content, then you can just do that.
If you can't, what's the problem?
by blt on 6/11/25, 7:01 AM
by mclau157 on 6/10/25, 8:54 PM
by vouaobrasil on 6/10/25, 11:09 PM