by sfryxell on 2/21/23, 11:42 PM with 62 comments
by jabbany on 2/22/23, 1:29 AM
Back in 2011, Google faced the same problem mining bi-texts from the Internet for their statistical machine translation software. The thought was that one could utilize things like multi-lingual websites to learn corresponding translations.
They quickly realized that a lot of sites were actually using Google Translate without human intervention to make multi-lingual versions of their site, so naive approaches would cause the model to get trained on its own suboptimal output.
So they came up with a whole watermarking system so that the model could recognize its own output with some statistical level of certainty and avoid it. It wouldn't be surprising if this is being done for LLMs too. The more concerning problem is that different LLMs, which are not aware of each other's watermarks, could end up effectively inbred should the ratio of LLM-generated content rise dramatically...
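For the curious, here is a minimal sketch of how a statistical text watermark of this kind can work. It follows recent "green list" LLM watermarking proposals rather than whatever Google actually shipped in 2011, and every name, key, and threshold below is illustrative:

```python
# Sketch of one statistical watermarking scheme (a "green list" bias, in the style
# of recent LLM watermark proposals -- not necessarily Google's 2011 method).
# At generation time the sampler slightly boosts tokens that is_green() marks green;
# at crawl time, detect() flags text whose green-token rate is improbably high.
import hashlib
import math

GREEN_FRACTION = 0.5             # fraction of the vocabulary marked "green" at each step
SECRET_KEY = "my-watermark-key"  # known only to the party generating the text

def is_green(prev_token: str, token: str) -> bool:
    """Deterministically (but pseudo-randomly) mark ~half the vocab, seeded by context."""
    h = hashlib.sha256(f"{SECRET_KEY}|{prev_token}|{token}".encode()).digest()
    return h[0] < 256 * GREEN_FRACTION

def detect(tokens: list[str]) -> float:
    """z-score of the green-token count; watermarked text scores far above 0."""
    n = len(tokens) - 1
    green = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (green - expected) / std

# A training-data pipeline could then drop any document where detect() exceeds,
# say, 4 -- i.e. text overwhelmingly likely to be the model's own watermarked output.
```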
by gregw2 on 2/22/23, 1:05 AM
by xsmasher on 2/22/23, 12:47 AM
If it starts to ingest that data it will only get more wrong over time. Unless it also ingests the replies that say "ChatGPT is full of shit here"?
by antiquark on 2/22/23, 1:07 AM
Eventually someone's going to write an "AI Trap" that serves up a seemingly infinite forum or reddit-style site, but is actually just generating an endless stream of (non)consciousness from some LLM chatbot.
[0] https://en.wikipedia.org/wiki/Spider_trap
[1] https://www.gsp.com/support/virtual/web/cgi/lib/wpoison/
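A minimal sketch of such a trap, using only Python's standard library and a stand-in babbler rather than a real LLM (all names and numbers here are illustrative):

```python
# Minimal sketch of an "AI trap": every URL returns a page of generated filler
# plus links to more generated pages, so a naive crawler never runs out of content.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = "model data token crawler forum reply thread insight opinion source".split()

def babble(seed: str, n_words: int = 200) -> str:
    rng = random.Random(seed)  # the same URL always yields the same "post"
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = random.Random(self.path)
        links = "".join(
            f'<a href="/thread/{rng.randrange(10**9)}">continue reading</a><br>'
            for _ in range(5)
        )
        body = f"<html><body><p>{babble(self.path)}</p>{links}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8000), TrapHandler).serve_forever()
```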
by chasing on 2/22/23, 12:58 AM
As long as you agree with the new facts, you’re fine. Problem solved!
by touringa on 2/22/23, 5:08 AM
“ChatGPT, a version of OpenAI’s GPT-3.5 model… gained more than 100m users in its first two months, and is now estimated to produce a volume of text every 14 days that is equivalent to all the printed works of humanity.”
— Dr Thompson, Feb/2023, cited in report by the National Bureau of Economic Research (Scholes, Bernanke, MIT)
https://www.nber.org/system/files/working_papers/w30957/w309...
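For scale, a rough back-of-envelope on what that claim implies. Every figure below is an assumed round number (Google's oft-cited ~130M distinct books, a nominal ~60k words per book), not something taken from the report:

```python
# Back-of-envelope check of the "all printed works every 14 days" claim.
# All inputs are assumptions chosen for round numbers.
books = 130e6                    # rough estimate of distinct books ever printed
words_per_book = 60e3            # nominal average book length
printed_words = books * words_per_book        # ~7.8e12 words

users = 100e6                    # reported user count
days = 14
implied_per_user_per_day = printed_words / (users * days)
print(f"{implied_per_user_per_day:,.0f} words per user per day")  # ~5,600
```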
by MrLeap on 2/22/23, 12:47 AM
GPT could filter out anything they themselves emitted in future training runs, yeah? Because they know what their bot's said. They get the benefit of looking at a conversation, knowing reasonably well what's copy/pasted from ai.com and what's the exasperated expert trying to correct a doomed world :p
The only way it eats itself is 1. Colossal mistakes. 2. Everyone decides to get off the internet and go outside.
2 seems pretty unrealistic, we put up with a lot :D
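A minimal sketch of the kind of self-output filtering MrLeap describes, assuming the provider keeps a log of everything its bot has emitted. The n-gram hashing and the overlap threshold are illustrative choices, not OpenAI's actual pipeline:

```python
# Drop crawled documents that mostly consist of your own bot's logged output.
# Overlap is measured with hashed word 8-grams (illustrative parameters).
import hashlib

def ngram_hashes(text: str, n: int = 8) -> set[int]:
    words = text.lower().split()
    return {
        int.from_bytes(
            hashlib.blake2b(" ".join(words[i:i + n]).encode(), digest_size=8).digest(),
            "big",
        )
        for i in range(max(len(words) - n + 1, 1))
    }

class SelfOutputFilter:
    def __init__(self):
        self.seen: set[int] = set()  # hashes of everything the bot has said

    def log_model_output(self, text: str) -> None:
        self.seen |= ngram_hashes(text)

    def looks_like_own_output(self, document: str, threshold: float = 0.5) -> bool:
        hashes = ngram_hashes(document)
        overlap = len(hashes & self.seen) / max(len(hashes), 1)
        return overlap >= threshold  # skip this document in the next training run

# f = SelfOutputFilter()
# f.log_model_output(chat_reply)                                # at serving time
# keep = [d for d in crawl if not f.looks_like_own_output(d)]   # at training time
```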
by m00x on 2/22/23, 1:27 AM
There is no inherent issue with AI ingesting data produced by itself. Humans do it as well. That data might even be higher quality than human data. The scale at which humans produce data will most likely stay higher than the scale of AI output for a long time.
There is already bot data out there from lower-quality AIs/bots, and ChatGPT has ingested it.
LLMs are made to be good at certain textual tasks, not for what they're being used for right now. They're not information stores or Q&A systems. They only answer with what a human is likely to answer.
by candiodari on 2/22/23, 8:25 AM
This is of course also a necessary condition for ChatGPT to come up with original insights. Except perhaps when it comes to things like fiction, which probably has value in itself.
by jbenjoseph on 2/22/23, 1:14 AM
by jaitaiwan on 2/22/23, 12:55 AM
by jmcphers on 2/22/23, 12:42 AM
by tyrelb on 2/22/23, 12:51 AM
by sourcecodeplz on 2/22/23, 12:39 AM
I used to think the same, but after reading and learning some more, I realized otherwise.
by panarky on 2/22/23, 12:38 AM
by sys_64738 on 2/22/23, 1:55 AM
by petilon on 2/22/23, 1:37 AM
by Nijikokun on 2/22/23, 12:42 AM
by seymourhersh on 2/22/23, 12:01 AM
by mrcaosr on 2/22/23, 1:22 AM
Seems more like it's gonna eat its own vomit, degrading itself (maybe not completely) into a kind of inbreeding (?)
by unusualmonkey on 2/22/23, 1:27 AM
Playing one AI against another is an established technique for developing AI.
Furthermore, content on the internet will always vary from more reliable (well established wiki pages, Reuters) to less reliable (random blog posts, disinformation).
Whether or not a text is AI-generated doesn't seem to be that important - what matters more is how reliable it is, and how well humans engage with it.
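A toy illustration of that first point is a minimal GAN, sketched here assuming PyTorch: one model (the generator) learns a target distribution purely by trying to fool a second model (the discriminator) that is simultaneously learning to tell real samples from generated ones. The data and network sizes are illustrative:

```python
# Toy "one AI against another" setup: a minimal 1-D GAN.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 0.5 + 3.0  # "human" data: N(3, 0.5)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Discriminator: label real samples 1, generated samples 0.
    real, fake = real_data(64), G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: try to make the discriminator call its output "real".
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward ~3.0
```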
by ravenstine on 2/22/23, 1:07 AM
Maybe our view of AI is being colored by sci-fi stereotypes of robots malfunctioning when asked to compute really hard problems or falling into infinite recursion. I'm not so sure that LLMs will totally destabilize. We might see some interesting output, but I don't think we know yet whether the stability of the system will merely fluctuate as a whole rather than fall apart.
by transfire on 2/22/23, 12:21 AM