by sfryxell on 2/21/23, 11:42 PM with 62 comments
by jabbany on 2/22/23, 1:29 AM
Back in 2011, Google faced the same problem mining bi-texts from the Internet for their statistical machine translation software. The thought was that one could utilize things like multi-lingual websites to learn corresponding translations.
They quickly realized that a lot of sites were actually using Google Translate without human intervention to make multi-lingual versions of their site, so naive approaches would cause the model to get trained on its own suboptimal output.
So they came up with a whole watermarking system so that the model could recognize its own output with some statistical level of certainty and avoid it. It wouldn't be surprising if this is being done for LLMs too. The more concerning problem is that different LLMs, which are not aware of each other's watermarks, could end up effectively inbred should the ratio of LLM-generated content rise dramatically...
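For the curious, here is a minimal sketch of how a statistical text watermark of this kind can work. It follows recent "green list" LLM watermarking proposals rather than whatever Google actually shipped in 2011, and every name, key, and threshold below is illustrative:

```python
# Sketch of one statistical watermarking scheme (a "green list" bias, in the style
# of recent LLM watermark proposals -- not necessarily Google's 2011 method).
# At generation time the sampler slightly boosts tokens that is_green() marks green;
# at crawl time, detect() flags text whose green-token rate is improbably high.
import hashlib
import math

GREEN_FRACTION = 0.5             # fraction of the vocabulary marked "green" at each step
SECRET_KEY = "my-watermark-key"  # known only to the party generating the text

def is_green(prev_token: str, token: str) -> bool:
    """Deterministically (but pseudo-randomly) mark ~half the vocab, seeded by context."""
    h = hashlib.sha256(f"{SECRET_KEY}|{prev_token}|{token}".encode()).digest()
    return h[0] < 256 * GREEN_FRACTION

def detect(tokens: list[str]) -> float:
    """z-score of the green-token count; watermarked text scores far above 0."""
    n = len(tokens) - 1
    green = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (green - expected) / std

# A training-data pipeline could then drop any document where detect() exceeds,
# say, 4 -- i.e. text overwhelmingly likely to be the model's own watermarked output.
```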
by gregw2 on 2/22/23, 1:05 AM
by xsmasher on 2/22/23, 12:47 AM
If it starts to ingest that data it will only get more wrong over time. Unless it also ingests the replies that say "ChatGPT is full of shit here"?
by antiquark on 2/22/23, 1:07 AM
Eventually someone's going to write an "AI Trap" that serves up a seemingly infinite forum or reddit-style site, but is actually just generating an endless stream of (non)consciousness from some LLM chatbot.
[0] https://en.wikipedia.org/wiki/Spider_trap
[1] https://www.gsp.com/support/virtual/web/cgi/lib/wpoison/
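A minimal sketch of such a trap, using only Python's standard library and a stand-in babbler rather than a real LLM (all names and numbers here are illustrative):

```python
# Minimal sketch of an "AI trap": every URL returns a page of generated filler
# plus links to more generated pages, so a naive crawler never runs out of content.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = "model data token crawler forum reply thread insight opinion source".split()

def babble(seed: str, n_words: int = 200) -> str:
    rng = random.Random(seed)  # the same URL always yields the same "post"
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = random.Random(self.path)
        links = "".join(
            f'<a href="/thread/{rng.randrange(10**9)}">continue reading</a><br>'
            for _ in range(5)
        )
        body = f"<html><body><p>{babble(self.path)}</p>{links}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8000), TrapHandler).serve_forever()
```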
by chasing on 2/22/23, 12:58 AM
As long as you agree with the new facts, you’re fine. Problem solved!
by touringa on 2/22/23, 5:08 AM
“ChatGPT, a version of OpenAI’s GPT-3.5 model… gained more than 100m users in its first two months, and is now estimated to produce a volume of text every 14 days that is equivalent to all the printed works of humanity.”
— Dr Thompson, Feb/2023, cited in report by the National Bureau of Economic Research (Scholes, Bernanke, MIT)
https://www.nber.org/system/files/working_papers/w30957/w309...
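For scale, a rough back-of-envelope on what that claim implies. Every figure below is an assumed round number (Google's oft-cited ~130M distinct books, a nominal ~60k words per book), not something taken from the report:

```python
# Back-of-envelope check of the "all printed works every 14 days" claim.
# All inputs are assumptions chosen for round numbers.
books = 130e6                    # rough estimate of distinct books ever printed
words_per_book = 60e3            # nominal average book length
printed_words = books * words_per_book        # ~7.8e12 words

users = 100e6                    # reported user count
days = 14
implied_per_user_per_day = printed_words / (users * days)
print(f"{implied_per_user_per_day:,.0f} words per user per day")  # ~5,600
```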
by MrLeap on 2/22/23, 12:47 AM
GPT could filter out anything they themselves emitted in future training runs, yeah? Because they know what their bot's said. They get the benefit of looking at a conversation, knowing reasonably well what's copy/pasted from ai.com and what's the exasperated expert trying to correct a doomed world :p
The only way it eats itself is 1. Colossal mistakes. 2. Everyone decides to get off the internet and go outside.
2 seems pretty unrealistic, we put up with a lot :D
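A minimal sketch of the kind of self-output filtering MrLeap describes, assuming the provider keeps a log of everything its bot has emitted. The n-gram hashing and the overlap threshold are illustrative choices, not OpenAI's actual pipeline:

```python
# Drop crawled documents that mostly consist of your own bot's logged output.
# Overlap is measured with hashed word 8-grams (illustrative parameters).
import hashlib

def ngram_hashes(text: str, n: int = 8) -> set[int]:
    words = text.lower().split()
    return {
        int.from_bytes(
            hashlib.blake2b(" ".join(words[i:i + n]).encode(), digest_size=8).digest(),
            "big",
        )
        for i in range(max(len(words) - n + 1, 1))
    }

class SelfOutputFilter:
    def __init__(self):
        self.seen: set[int] = set()  # hashes of everything the bot has said

    def log_model_output(self, text: str) -> None:
        self.seen |= ngram_hashes(text)

    def looks_like_own_output(self, document: str, threshold: float = 0.5) -> bool:
        hashes = ngram_hashes(document)
        overlap = len(hashes & self.seen) / max(len(hashes), 1)
        return overlap >= threshold  # skip this document in the next training run

# f = SelfOutputFilter()
# f.log_model_output(chat_reply)                                # at serving time
# keep = [d for d in crawl if not f.looks_like_own_output(d)]   # at training time
```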
by m00x on 2/22/23, 1:27 AM
There is no inherent issue with AI ingesting data produced by itself. Humans do it as well. That data might even be higher quality than human data. The scale at which humans produce data will most likely stay higher than the scale of AI output for a long time.
There is already bot data out there from lower-quality AIs/bots, and ChatGPT has ingested it.
LLMs are made to be good at certain textual tasks, not for what they're being used for right now. They're not information stores or Q&A systems. They only answer with what a human is likely to answer.
by candiodari on 2/22/23, 8:25 AM
This is of course also a necessary condition for ChatGPT to come up with original insights. Except perhaps when it comes to things like fiction, which probably has value in itself.
by jbenjoseph on 2/22/23, 1:14 AM
by jaitaiwan on 2/22/23, 12:55 AM
by jmcphers on 2/22/23, 12:42 AM
by tyrelb on 2/22/23, 12:51 AM
by sourcecodeplz on 2/22/23, 12:39 AM
I used to think the same, but after reading and learning some more, I realized otherwise.
by panarky on 2/22/23, 12:38 AM
by sys_64738 on 2/22/23, 1:55 AM
by petilon on 2/22/23, 1:37 AM
by Nijikokun on 2/22/23, 12:42 AM
by seymourhersh on 2/22/23, 12:01 AM
by mrcaosr on 2/22/23, 1:22 AM
Seems more like it's gonna eat its own vomit, degrading itself (maybe not completely) into a kind of inbreeding (?)
by unusualmonkey on 2/22/23, 1:27 AM
Playing one AI against another is an established technique for developing AI.
Furthermore, content on the internet will always vary from more reliable (well established wiki pages, Reuters) to less reliable (random blog posts, disinformation).
Whether or not a text is AI-generated doesn't seem to be that important - what matters more is how reliable it is, and how well humans engage with it.
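A toy illustration of that first point is a minimal GAN, sketched here assuming PyTorch: one model (the generator) learns a target distribution purely by trying to fool a second model (the discriminator) that is simultaneously learning to tell real samples from generated ones. The data and network sizes are illustrative:

```python
# Toy "one AI against another" setup: a minimal 1-D GAN.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 0.5 + 3.0  # "human" data: N(3, 0.5)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Discriminator: label real samples 1, generated samples 0.
    real, fake = real_data(64), G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: try to make the discriminator call its output "real".
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward ~3.0
```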
by ravenstine on 2/22/23, 1:07 AM
Maybe our view of AI is being colored by sci-fi stereotypes of robots malfunctioning when asked to compute really hard problems or falling into infinite recursion. I'm not so sure that LLMs will totally destabilize. We might see some interesting output, but I don't think we know yet whether the stability of the system will merely fluctuate as a whole rather than fall apart.
by transfire on 2/22/23, 12:21 AM