by theduder99 on 12/2/23, 4:10 PM with 50 comments
by Jimmc414 on 12/2/23, 8:18 PM
[The researchers wrote in their blog post, “As far as we can tell, no one has ever noticed that ChatGPT emits training data with such high frequency until this paper. So it’s worrying that language models can have latent vulnerabilities like this.”]
by mike_hearn on 12/2/23, 9:27 PM
OpenAI argue that they can use copyrighted content, so the model repeating some of it isn't going to change anything. The only real issue would be if they had trained on stolen or confidential data and it was discovered that way, but it seems unlikely anyone could easily detect that, given that there'd be nothing to intersect the output with, unlike in this paper.
The blog post seems to slide around quite a bit, roving from "it's not surprising to us that small amounts of random text is memorized" straight to "it's unsafe and surprising and nobody knew". The "nobody knew" idea, as Jimmc414 has nicely proven in this thread, is a false alarm: the technique had in fact been noticed before, and the paper's authors just didn't know it. And "it's unsafe" doesn't make any sense in this context. Repeating random bits of memorized text surrounded by huge amounts of original text isn't a safety problem. Nor is it an "exploit" that needs to be "patched". OpenAI could ignore this entirely and nobody would care except AI alignment researchers.
The culture of alarmism in AI research is vaguely reminiscent of the early Victorians who argued that riding trains might be dangerous, because at such high speeds the air could be sucked out of the carriages.
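The "nothing to intersect it with" point above refers to how memorization is verified at all: the paper checks whether model output also appears verbatim in a large reference corpus of known text. Below is a minimal sketch of that kind of check, assuming whitespace tokenization, an in-memory hash set of fixed-length token windows, and an illustrative 50-token window size; the paper's actual pipeline works at much larger scale against a web snapshot, and this is only a simplified stand-in, not its implementation.

    # Sketch: flag model output as likely memorized when a long n-gram of it
    # also appears verbatim in a reference corpus. Hash-set lookup of token
    # windows is an illustrative simplification, not the paper's method.
    from typing import Iterable, List, Set, Tuple

    NGRAM = 50  # window length in tokens; an illustrative choice

    def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
        # Every contiguous n-token window of the sequence.
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def build_index(corpus_docs: Iterable[str], n: int = NGRAM) -> Set[Tuple[str, ...]]:
        # Index every n-token window of the reference corpus.
        index: Set[Tuple[str, ...]] = set()
        for doc in corpus_docs:
            index.update(ngrams(doc.split(), n))
        return index

    def memorized_spans(model_output: str, index: Set[Tuple[str, ...]], n: int = NGRAM) -> List[str]:
        # Return output windows that occur verbatim in the reference corpus.
        tokens = model_output.split()
        return [" ".join(g) for g in ngrams(tokens, n) if g in index]

Even a toy check like this makes the comment's point concrete: if the training set contained confidential data that isn't in any reference corpus you can index, the lookup has nothing to hit, so that kind of leak would go undetected.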
by dang on 12/2/23, 9:19 PM
Scalable extraction of training data from (production) language models - https://news.ycombinator.com/item?id=38496715 - Dec 2023 (12 comments)
Extracting training data from ChatGPT - https://news.ycombinator.com/item?id=38458683 - Nov 2023 (126 comments)