by jgrahamc on 6/10/25, 5:55 PM with 268 comments
by gojomo on 6/10/25, 10:37 PM
guaranteed human output - anyone who emits text in these ranges that was AI generated, rather than artisanally human-composed, goes straight to jail.
for human eyes only - anyone who lets any AI train on, or even consider, any text in these ranges goes straight to jail. Fnord, "that doesn't look like anything to me".
admittedly AI generated - all AI output must use these ranges as disclosure, or - you guessed it - those pretending otherwise go straight to jail.
Of course, all the ranges generate visually-indistinguishable homoglyphs, so it's a strictly-software-mediated quasi-covert channel for fair disclosure.
When you cut & paste text from various sources, the provenance comes with it via the subtle character encoding differences.
I am only (1 - epsilon) joking.
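The homoglyph scheme above can be sketched in a few lines. This is a toy illustration, not a real proposal: the Latin-to-Cyrillic lookalike mapping below is one arbitrary choice of "admittedly AI generated" range, and the function names are invented for the example.

```python
# Toy sketch of the homoglyph provenance idea: tag text as
# "AI generated" by swapping in visually identical lookalike
# characters, so the marking survives copy & paste.
# The Latin -> Cyrillic mapping here is illustrative only.

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
              "c": "\u0441", "p": "\u0440", "x": "\u0445"}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}

def mark_as_ai(text: str) -> str:
    """Re-encode text using lookalike characters as disclosure."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def provenance(text: str) -> str:
    """Classify text by whether it contains tagged characters."""
    return "AI generated" if any(ch in REVERSE for ch in text) else "unmarked"
```

The marked and unmarked strings render identically on screen but compare unequal byte-for-byte, which is exactly the "strictly-software-mediated quasi-covert channel" being joked about.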
by K0balt on 6/10/25, 9:12 PM
by Legend2440 on 6/10/25, 6:41 PM
Long-run you want AI to learn from actual experience (think repairing cars instead of reading car repair manuals), which both (1) gives you an unlimited supply of noncopyrighted training data and (2) handily sidesteps the issue of AI-contaminated training data.
by protocolture on 6/11/25, 6:49 AM
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature that it can generally be used."
I don't see:
1. That there will be a need for "uncontaminated" data. LLM data is probably slightly better than the natural background Reddit comment, falsehoods and all.
2. That "uncontaminated" data will be difficult to find, what with archive.org, Project Gutenberg, etc.
3. That LLM output is going to infest everything anyway.
by ACCount36 on 6/10/25, 7:33 PM
AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
by schmookeeg on 6/10/25, 6:34 PM
by koolba on 6/10/25, 6:51 PM
by onecommentman on 6/11/25, 7:58 AM
I’ve had AIs outright lie about facts, and I’m glad to have had a physical library available to convince myself that I was correct, even if I couldn’t convince the AI of that in all cases.
by nialv7 on 6/10/25, 8:08 PM
by gorgoiler on 6/10/25, 8:27 PM
I too am optimistic that recursive training on data that is a mixture of original human content, content derived from that original content, content derived from content derived from original human content, …ad nauseam, will be able to extract the salient features and patterns of the underlying system.
by submeta on 6/11/25, 6:45 AM
I realise that when I write (not so perfect) "organic" content, my colleagues enjoy it more. And as I am lazy, I get right to the point: no prelude, no "Summary", just a few paragraphs of genuine ideas.
And I am sure this will be a trend again. Until maybe LLMs are trained to generate this kind of non-perfect, less noisy text.
by vunderba on 6/10/25, 7:53 PM
by Ekaros on 6/10/25, 8:18 PM
On the other hand, a lot of poor-quality content could still be factually valid enough, just not well edited or formatted.
by ChrisArchitect on 6/10/25, 7:00 PM
Came up a month or so ago in a discussion about Wikipedia: Database Download (https://news.ycombinator.com/item?id=43811732). I missed that it was jgrahamc behind the site. Great stuff.
by aunty_helen on 6/10/25, 7:13 PM
by swyx on 6/10/25, 6:43 PM
i do have to say outside of twitter i don't personally see it all that much. but the normies do seem to encounter it and are either 1) fine with it? or 2) oblivious? and perhaps SOME non-human-origin noise is harmless.
(plenty of humans are pure noise, too, dont forget)
by jeffchuber on 6/10/25, 9:42 PM
by carlosjobim on 6/10/25, 10:09 PM
It is also uncontaminated by AI.
by tomgag on 6/11/25, 11:05 AM
by thm on 6/10/25, 6:37 PM
by sorokod on 6/10/25, 7:27 PM
by Animats on 6/10/25, 11:14 PM
That takes us back to the days when men were men, women were women, gays were criminals, trannies were crazy, and the sun never set on the British Empire.[1]
by yodon on 6/10/25, 7:19 PM
I strongly suspect more people are in the first category than the second.
by steve_gh on 6/11/25, 6:02 AM
by Crontab on 6/10/25, 9:12 PM
When I see a JGC link on Hacker News I can't help but remember using PopFile on an old PowerMac - back when Bayesian spam filters were becoming popular. It seems so long ago but it feels like yesterday.
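For anyone who missed that era: filters like POPFile classified mail with naive Bayes over word counts. A minimal sketch of the technique (not POPFile's actual code; class and method names are made up for illustration):

```python
# Minimal naive Bayes text classifier, in the spirit of the
# Bayesian spam filters of the POPFile era (illustrative sketch).
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training docs

    def train(self, label, text):
        self.word_counts.setdefault(label, Counter()).update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(counts.values())
            for w in words:
                score += math.log((counts[w] + 1) / (n + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```

Train it on a handful of labelled messages and `classify` picks the label whose word statistics best explain the new text; the add-one smoothing keeps unseen words from zeroing out a whole class.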
by Ferret7446 on 6/12/25, 8:17 AM
If you can distinguish AI content, then you can just do that.
If you can't, what's the problem?
by blt on 6/11/25, 7:01 AM
by mclau157 on 6/10/25, 8:54 PM
by vouaobrasil on 6/10/25, 11:09 PM