from Hacker News

OpenAI says it has evidence DeepSeek used its model to train competitor

by timsuchanek on 1/29/25, 4:21 AM with 1541 comments

  • by udev on 1/29/25, 4:25 AM

  • by Imnimo on 1/29/25, 5:13 PM

    I think there are two different things going on here:

    "DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other people's data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other people's data off the internet.

    "DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.
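    The distillation effect described above can be sketched in a few lines. This is a toy illustration under stated assumptions (a hand-rolled logistic "teacher" and "student", not the R1 paper's actual pipeline): the student is fit purely to the teacher's outputs, never sees any ground-truth labels or teacher weights, and still recovers the teacher's behavior.

```python
import math
import random

# Toy "teacher": a fixed logistic model p(y=1|x) = sigmoid(3x - 1),
# standing in for a large model whose outputs we can query but whose
# weights we cannot see.
def teacher(x):
    return 1.0 / (1.0 + math.exp(-(3.0 * x - 1.0)))

def distill(num_samples=1000, epochs=500, lr=1.0, seed=0):
    """Fit a student logistic model (w, b) ONLY to the teacher's soft
    outputs - no ground-truth labels are ever used."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(num_samples)]
    soft_labels = [teacher(x) for x in xs]  # "query the API"
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, soft_labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            # gradient of cross-entropy between soft label y and p
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / num_samples
        b -= lr * gb / num_samples
    return w, b

w, b = distill()
```

    The student ends up matching the teacher closely despite only ever seeing query/response pairs, which is the sense in which R1 outputs can transfer capability to Llama models via fine-tuning.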

  • by bilekas on 1/29/25, 3:29 PM

    > “It’s also extremely hard to rally a big talented research team to charge a new hill in the fog together,” he added. “This is the key to driving progress forward.”

    Well I think DeepSeek releasing it open source and on an MIT license will rally the big talent. The open sourcing of a new technology has always driven progress in the past.

    The last paragraph is also where OpenAI seems to be focusing its efforts:

    > we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models ..

    > ... we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.

    So they'll go for getting DeepSeek banned like TikTok was, now that a precedent has been set?

  • by mrkpdl on 1/29/25, 10:12 PM

    The cat is out of the bag. This is the landscape now, r1 was made in a post-o1 world. Now other models can distill r1 and so on.

    I don’t buy the argument that distilling from o1 undermines DeepSeek’s claims around expense at all. Just as OpenAI used the tools ‘available to them’ to train their models (eg everyone else’s data), r1 is using today’s tools.

    Does OpenAI really have a moral or ethical high ground here?

  • by blast on 1/29/25, 6:01 PM

    Everyone is responding to the intellectual property issue, but isn't that the less interesting point?

    If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.

  • by ok123456 on 1/29/25, 3:32 PM

    OpenAI's models were trained on ebooks from a private ebook torrent tracker, leeched en masse during a free-leech event by people who hated private torrent trackers and wanted to destroy their "economy."

    The books were all in epub format, converted, cleaned to plain text, and hosted on a public data hoarder site.

  • by wanderingmoose on 1/29/25, 4:57 PM

    There is a lot of discussion here about IP theft. Honest question, from deepseek's point of view as a company under a different set of laws than US/Western -- was there IP theft?

    A company like OpenAI can put whatever licensing they want in place. But that only matters if they can enforce it. The question is, can they enforce it against deepseek? Did deepseek do something illegal under the laws of their originating country?

    I've had some limited exposure to media related licensing when releasing content in China and what is allowed is very different than what is permitted in the US.

    The interesting part which points to innovation moving outside of the US is US companies are beholden to strict IP laws while many places in the world don't have such restrictions and will be able to utilize more data more easily.

  • by glenstein on 1/29/25, 4:13 PM

    All the top level comments are basking in the irony of it, which is fair enough. But I think this changes the Deepseek narrative a bit. If they just benefited from repurposing OpenAI data, that's different than having achieved an engineering breakthrough, which may suggest OpenAI's results were hard earned after all.
  • by daft_pink on 1/29/25, 3:53 PM

    This reminds me of the railroads: once railroads were invented, there was a huge investment boom of everyone trying to make money off the railroads, but competition brought costs down to the point where it generally wasn’t the railroads that made the money and got the benefit; consumers and regular businesses did, and competition caused many railroads to fail.

    AI is probably similar: Moore’s law and further advancement will eventually allow people to run open models locally and bring down the cost of operation. Competition will make it hard for all but one or two players to survive, and most investments in AI by large companies (Nvidia, OpenAI, DeepSeek, etc.) will fail to generate substantial wealth, though they may earn some sort of return, or maybe not.

  • by me551ah on 1/29/25, 3:47 PM

    OpenAI is going after a company that open sourced their model, by distilling from their non-open AI?

    OpenAI talks a lot about the principles of being Open, while still keeping their models closed and not fostering the open source community or sharing their research. Now when a company distills their models using perfectly allowed methods on the public internet, OpenAI wants to shut them down too?

    High time OpenAI changed its name to ClosedAI.

  • by readyplayernull on 1/29/25, 3:27 PM

    Do you remember when Microsoft was caught scraping data from Google:

    https://www.wired.com/2011/02/bing-copies-google/

    They don't care; T&Cs and copyright are void unless it affects them, and others can go kick rocks. Not surprising that they and OpenAI would wage a legal battle over this.

  • by Ciantic on 1/29/25, 3:26 PM

    I'm not being sarcastic, but we may soon have to torrent DeepSeek's model. OpenAI has a lot of clout in the US and could get DeepSeek banned in western countries for copyright.
  • by dragonwriter on 1/29/25, 6:47 PM

    Hey, OpenAI, so, you know that legal theory that is the entire basis of your argument that any of your products are legal? "Training AI on proprietary data is a use that doesn't require permission from the owner of the data"?

    You might want to consider how it applies to this situation.

  • by mhitza on 1/29/25, 3:32 PM

    This is funny because it's:

    1. Something I'd expect to happen.

    2. Something I lived through in a similar scenario around 2010.

    Early in my professional career I worked for a media company that was scraping other sites (think Craigslist but for our local market) to republish the content on our competing website. I wasn't working on that specific project, but I did work on an integration on my team's project where the scraping team could post jobs on our platform directly. When others started scraping "our content", there were a couple of urgent all-hands-on-deck meetings scheduled, with a high level of disbelief.

  • by 1970-01-01 on 1/29/25, 3:11 PM

    DeepSeek have more integrity than 'Open'AI by not even pretending to care about that.
  • by olalonde on 1/30/25, 2:40 AM

    If it's true, how is it problematic? It seems aligned with their mission:

    > We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

    > We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.

    https://openai.com/charter/

    /s, we all know what their true mission is...

  • by sho_hn on 1/29/25, 3:15 PM

    While I'm as amused as everyone else, I think it's technically accurate to point out that the "we trained it for $6 million" narrative is contingent on the investment already made by others.
  • by concerndc1tizen on 1/29/25, 4:51 PM

    Is OpenAI claiming copyright ownership over the generated synthetic data?

    That would be a dangerous precedent to establish.

    If it's a terms of service violation, I guess they're within their rights to terminate service, but what other recourse do they have?

    Other than that, perhaps this is just rhetoric aimed at introducing restrictions in the US, to prevent access to foreign AI, to establish a national monopoly?

  • by thorum on 1/29/25, 3:39 PM

    > “It is (relatively) easy to copy something that you know works,” Altman tweeted. “It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”

    The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research.

    Unfortunately for them it’s challenging to profit in the long term from being first in this space and the time it takes for each new idea to be reproduced is getting shorter.

  • by divbzero on 1/29/25, 8:08 PM

    I was wondering if this might be the case, similar to how Bing’s initial training included Google’s search results [1]. I’d be curious to see more details of OpenAI’s evidence.

    It is, of course, quite ironic for OpenAI to indiscriminately scrape the entire web and then complain about being scraped themselves.

    [1]: https://searchengineland.com/google-bing-is-cheating-copying...

  • by esskay on 1/30/25, 12:04 AM

    Hard to really have any sympathy for OpenAI's position when they're actively stealing content, ignoring requests to stop then spending huge amounts to get around sites running ai poisoning scripts, making it clear they'll still take your content regardless of if you consent to it.
  • by bhouston on 1/29/25, 3:07 PM

    The US government will likely favor a large strategic company like OpenAI over individuals' copyrights, so while ironic, the US government definitely doesn't care.

    And the US government is also likely itching to reduce the power of Chinese AI companies that could outcompete US rivals (similar to the treatment of BYD, TikTok, solar panel manufacturers, network equipment manufacturers, etc.), so expect sweeping legislation that blocks access to all Chinese AI endeavours in both the US and, soon after, US allies/the West (via US pressure).

    The likely legislation will be on the surface justified both by security concerns and by intellectual property concerns, but ultimately it will be motivated by winning the economic competition between China and the US and it will attempt to tilt the balance via explicitly protectionist policies.

  • by oli5679 on 1/30/25, 3:17 PM

    This is pretty ridiculous.

    A. Below is a list of OpenAI's initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP.

    B. Google published extensively, including the famous 'Attention Is All You Need' paper, but OpenAI, despite its name, has not explained the breakthroughs that enabled o1. It has also switched from a charity to a for-profit company.

    C. Now this company, with a group of smart, unknown machine learning engineers, presumably paid fractions of what OpenAI's researchers are paid, has created a model far more cheaply and has openly published the weights and many methodological insights, which will be used by OpenAI.

    1. Ilya Sutskever – One of OpenAI’s co-founders and its former Chief Scientist. He previously worked at Google Brain, where he contributed to the development of deep learning models, including TensorFlow.

    2. Jakub Pachocki – Formerly OpenAI’s Director of Research, he played a major role in the development of GPT-4. He had a background in AI research that overlapped with Google’s fields of interest.

    3. John Schulman – Co-founder of OpenAI, he worked on reinforcement learning and helped develop Proximal Policy Optimization (PPO), a method used in training AI models. While not a direct Google hire, his work aligned with DeepMind’s research areas.

    4. Jeffrey Wu – One of the key researchers involved in fine-tuning OpenAI’s models. He worked on reinforcement learning techniques similar to those developed at DeepMind.

    5. Girish Sastry – Previously involved in OpenAI’s safety and alignment work, he had research experience that overlapped with Google’s AI safety initiatives.

  • by karim79 on 1/30/25, 1:43 AM

    Oh God. I know exactly how this feels. A few years ago I made a bread hydration and conversion calculator for a friend, and put it up on JSFiddle. My friend, at the time, was an apprentice baker.

    Just weeks later, I discovered that others were pulling off similar calculations! They were making great bread with ease and not having to resort to notebooks and calculators! The horror! I can't believe that said close friend of mine would actually share those highly hydraty mathematical formulas with other humans without first requesting my consent </sarc>.

    Could it be, that this stuff just ends up in the dumpster of "sorry you can't patent math" or the like?

  • by osigurdson on 1/29/25, 3:09 PM

    I do think that distilling a model from another is much less impressive than distilling one from raw text. However, it is hard to say if it is really illegal or even immoral, perhaps just one step further in the evolution of the space.
  • by TylerJaacks on 1/30/25, 2:16 AM

    Cry me a fucking river OpenAI, as if your business model isn't entirely based on this exact same thing.
  • by nelblu on 1/29/25, 3:42 PM

    Hahaha, I can't stop laughing... I don't know the validity of the claim, but I immediately thought of the British Museum complaining about theft.
  • by alasr on 1/30/25, 1:17 AM

    > OpenAI says it has evidence DeepSeek used its model to train competitor.

    > The San Francisco-based ChatGPT maker told the Financial Times it had seen some evidence of “distillation”, which it suspects to be from DeepSeek.

    > ...

    > OpenAI declined to comment further or provide details of its evidence. Its terms of service state users cannot “copy” any of its services or “use output to develop models that compete with OpenAI”.

    OAI: share the evidence with the public, or accept the possibility that your case is not as strong as you're claiming here.

  • by kranke155 on 1/29/25, 11:42 PM

    The very idea that it's OK for OAI to scrape the entire internet and ignore individual rights, but that it's a gross violation of the law / ToS if another company takes the output data from their model - that very idea is evil.
  • by glitchc on 1/29/25, 3:55 PM

    [flagged]
  • by kamranjon on 1/30/25, 3:42 AM

    I was just wondering if this is even feasible?

    The amount of training data DeepSeek would need to actually learn anything from OpenAI would seem to imply an insane number of requests to a non-local AI, which you'd think would be immediately obvious to OpenAI just by looking at suspicious request patterns.

    Am I correct in this assumption or am I missing something? Is it even realistic that something like this is possible without a local model?
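    A back-of-envelope on this question; every number below is an illustrative assumption, not a figure from the article or the thread:

```python
# How visible would API-scale distillation be? Rough volume/cost estimate
# under assumed (not reported) parameters.
examples = 1_000_000          # assumed number of fine-tuning examples collected
tokens_per_example = 2_000    # assumed prompt + completion length
price_per_1k_tokens = 0.03    # assumed blended API price in USD

total_tokens = examples * tokens_per_example
cost_usd = total_tokens / 1_000 * price_per_1k_tokens
requests_per_day = examples / 30  # if collected over roughly a month

print(total_tokens, cost_usd, requests_per_day)
```

    On these assumptions, the spend is tens of thousands of dollars and tens of thousands of requests per day: large in absolute terms, but plausibly small enough to spread across many accounts, which is roughly what the commenter is asking about.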

  • by emsign on 1/30/25, 4:55 AM

    "yOu ShOuLdN't TaKe OtHeR pEoPlE's DaTa!1!1" - are they mental? How can people at OpenAI be so self-righteous and unaware? Is this arrogance or a mental illness?
  • by fimdomeio on 1/29/25, 9:27 PM

    But what is the problem here? Isn’t OpenAI's mission “to ensure that artificial general intelligence benefits all of humanity”? Sounds like success to me.
  • by duchenne on 1/29/25, 11:04 PM

    The reasoning happens in the chain of thought. But OpenAI (aka ClosedAI) doesn't show this part when you use the o1 model, whether through the API or chat. They hide it to prevent distillation. DeepSeek, though, has come up with something new.
  • by metaxz on 1/29/25, 6:34 PM

    I don't understand how OpenAI claims this would have happened. The weights are closed, and as far as I've read they are not complaining that DeepSeek hacked them and obtained the weights. So all DeepSeek could do was query OpenAI and generate training data.

    But how much did they really query? I would suppose it would require a huge amount done via an external, paid-for API. Is there any proof of this besides OpenAI's say-so? Even if we suppose it's true, it must have happened via the API, so they paid per token - they paid for each and every token of training data. And as I understand it, the requester owns the copyright on what OpenAI's models generate and is free to do what they want with it.
  • by nbgoodall on 1/29/25, 11:42 PM

    I lol'd, from the DeepSeek news release[1]: "Pushing the boundaries of open AI!"

    [1]: https://api-docs.deepseek.com/news/news250120

  • by temporallobe on 1/29/25, 3:17 PM

    OpenAI is also possibly in violation of many IP laws by scraping the entirety of the internet and using to train their models, so there’s that.
  • by __MatrixMan__ on 1/29/25, 3:09 PM

    If they want us to care they can open up their models so we can be the judge.
  • by _moof on 1/29/25, 6:45 PM

    This reminds me of a (probably apocryphal) story about fast food chains that made the rounds decades ago: McDonald's invests tons of time into finding the best real estate for new stores; Burger King just opens stores near McDonalds!
  • by crowcroft on 1/29/25, 3:43 PM

    The AI companies were happy to take whatever they want and put the onus of proving they were breaking the law onto publishers by challenging them to take things to court.

    Don't get mad about possible data theft, prove it in court.

  • by jgrall on 1/29/25, 6:22 PM

    It’s not a good look when your technology is replicated for a fraction of the cost, and your response is to smear your competition with (probably) false accusations and cozy up to the US government to tighten already shortsighted export controls. Hubris & xenophobia are not going to serve American companies well. Personally I welcome the Chinese - or anyone else for that matter - developing advanced technologies as long as they are used for good. Humanity loses if we allow this stuff to be “owned” by a handful of companies or a single country.
  • by karim79 on 1/30/25, 3:05 AM

    So, banning high-powered chips to China has basically had the effect of turning them into extremophiles. I mean, that seems like a good plan </sarc>. Moreover, it is certainly slowing sales of one of the darling companies of the US (NVidia).

    I just can't even begin to imagine what will come of this ridiculous techno-imperialism/AI arms race, or whatever you want to call it. It should not be too hard for China to create their own ASICs that do the same, and finally be done with this palaver.

  • by InkCanon on 1/29/25, 3:18 PM

    It's like that Dr Phil episode where he meets the guy who created Bum Fights!
  • by whatshisface on 1/29/25, 3:05 PM

    It's reasonably likely that a lot of people linked to the federal government want to ban DeepSeek. You can tell it's being presented away from "they gave us a free set of weights" and towards "they destroyed $1T of shareholder value." (By revealing that Microsoft et al. paid way too much to OpenAI et al. for technology that was actually easy to reinvent.)
  • by cumulative00x on 1/29/25, 9:06 PM

    There is a saying in Turkish that roughly goes: it takes a thief to catch a thief. I am not a big fan of China's tech either; however, it amuses me to watch how the big tech charlatans have been crying over the DeepSeek shock.
  • by bicepjai on 1/30/25, 6:07 AM

    Reading this post, I can’t help but wonder if people realize the irony in what they’re saying.

    1. “The issue is when you [take it out of the platform and] are doing it to create your own model for your own purposes,”

    2. “There’s a technique in AI called distillation . . . when one model learns from another model [and] kind of sucks the knowledge out of the parent model,”
  • by nba456_ on 1/29/25, 3:15 PM

    A big part of project 2025 is increasing patent regulations. I would not be surprised if the current admin moves to ban DeepSeek because of this.
  • by ForHackernews on 1/29/25, 11:00 AM

    What's good for the goose is good for the gander. Obviously a transformative work, and no more an intellectual property violation than OpenAI ingesting every piece of media in existence.
  • by hyperbovine on 1/29/25, 6:02 PM

    Live by the sword...
  • by cratermoon on 1/29/25, 4:42 AM

    Ironic, OpenAI claiming someone else stole their work.
  • by vinni2 on 1/29/25, 5:23 AM

    How would they prove DeepSeek used its model? I would be curious to know their methodology. Also, what legal actions can OpenAI take? Can DeepSeek be banned in the US?
  • by Kiro on 1/29/25, 3:20 PM

    > Furious [...] shocked

    I'm not seeing it. I get it, the narrative that OpenAI is getting a taste of their own medicine is funny but this is not serious reporting.

  • by guybedo on 1/29/25, 6:25 PM

    This is hilarious.

    Everybody has evidence OpenAI scraped the internet at a global scale and used terabytes of data it didn't pay for. Newspapers, books, etc...

  • by sabhiram on 1/31/25, 6:45 AM

    The grapes are sour because their moat is crumbling.

    What was supposed to be a model, training, and data moat is now reduced to operational cost, which they are not terribly efficient at.

    OpenAI has been on a journey to burn as much $ as possible to get as far ahead on those three moats, to the point where decreasing TCO for them on inference was not even relevant - "who cares if you save me 20% of costs when I can raise on a 150b pre money value?".

    Well, with their moats disappearing, they will have no choice but to compete on inference cost like everyone else.

  • by cbracketdash on 1/29/25, 5:19 PM

    Let's also not forget Suchir Balaji, who was mysteriously killed when exposing OpenAI's violation of copyright law.
  • by JBSay on 1/29/25, 5:15 PM

    When China is more open than you, you've got a problem
  • by jchook on 1/29/25, 5:01 PM

    Friendly reminder that China publishes twice as many AI papers as the US[1], and twice as many science and engineering papers as the US.

    China leads the world in the most cited papers[2]. The US's share of the top 1% highly cited articles (HCA) has declined significantly since 2016 (1.91 to 1.66%), and the same has doubled in China since 2011 (0.66 to 1.28%)[3].

    China also leads the world in the number of generative AI patents[4].

    1. https://www.bfna.org/digital-world/infographic-ai-research-a...

    2. https://www.science.org/content/article/china-rises-first-pl...

    3. https://ncses.nsf.gov/pubs/nsb202333/impact-of-published-res...

    4. https://www.wipo.int/web-publications/patent-landscape-repor...

  • by waffletower on 1/29/25, 5:12 PM

    "Stole" - I don't believe that word means what he thinks it means. Perhaps I prematurely anthropomorphize AI -- yet when I read a novel, such as The Sorcerer's Stone, I am not guilty of stealing Rowling's work, even if I didn't purchase the book but instead found it and read it in a friend's bathroom. Now if I were to take the specific plot and characters of that story and write a screenplay or novel directly based on it, and, explicitly, attempt to sell this work, perhaps the verb chosen here would be appropriate.
  • by janalsncm on 1/30/25, 10:38 AM

    I think readers should note that the article did not provide any evidence for OpenAI’s claims - only OpenAI declining to provide evidence, various people repeating the claim, and others reacting to it.

    It does matter whether it happened and how much it happened. Deepseek ran head-to-head comparisons against o1, so it would be pretty reasonable for them to have made API calls, for example.

    But also, as the article notes, distillation, supervised fine tuning, and using LLM as a judge are all common techniques in research, which OpenAI knows very well.

  • by hugoromano on 1/29/25, 9:46 PM

    OpenAI initially scraped the web and later formed partnerships to train on licensed data. Now, they claim that DeepSeek was trained on their models. However, DeepSeek couldn't use these models for free and had to pay API fees to OpenAI. From a legal standpoint, this could be seen as a violation of the terms and conditions. While I may be mistaken, it's unclear how DeepSeek could have trained their models without compensating OpenAI. Basically, OpenAI is saying machines can't learn from their outputs as humans do.
  • by conartist6 on 1/29/25, 6:13 PM

    It seems to be undermined by the same principle that says that going into a library and reading a book there is not stealing when you walk out with the knowledge from the book.

    OpenAI seems to feel that way about their use of copyrighted material: since they didn't literally make a copy of the source material, it's totally fair game. It seems like this is the same argument that protects DeepSeek, if indeed they did this. And why not? Reading a lot of books from the library is a way to get smarter, and ostensibly the point of libraries.

  • by adam_arthur on 1/29/25, 7:07 PM

    Who cares?

    They did the exact same thing with public information. Their model just synthesizes and puts out the same information in a slightly different form.

    Next we should sue students for repeating the words of their teachers

  • by WD-42 on 1/29/25, 3:13 PM

    Information wants to be free! No, not like that!
  • by jongjong on 1/30/25, 12:08 AM

    If the material which OpenAI is trained on is itself not subject to copyright protections, then other LLMs trained on OpenAI should also not be subject to any copyright restrictions.

    You can't have it both ways... If OpenAI wants to claim that the AI is not repeating content but 'synthesizing' it in the same way as a human student would... Then I think the same logic should extend to DeepSeek.

    Now if OpenAI wants to claim that its own output is in fact copyright-protected, then it seems like it should owe royalty payments to everyone whose content was sourced upstream to build its own training set. Also, synthetic content which is derived from real content should also be factored in.

    TBH, this could make a strong case for taxing AI. Like some kind of fee for human knowledge and distributed as UBI. The training data played a key part in this AI innovation.

    As an open source coder, I know that my copyrighted code is being used by AI to help other people produce derived code and, by adapting it in this way, it's making my own code less relevant to some extent... In effect, it could be said that my code has been mixed in with the code of other open source developers and weaponized against us.

    It feels like it could go either way TBH but there needs to be consistency.

  • by xinayder on 1/30/25, 9:12 AM

    > OpenAI declined to comment further or provide details of its evidence. Its terms of service state users cannot “copy” any of its services or “use output to develop models that compete with OpenAI”.

    Well, this sounds like they are just crying because they are losing the race so far. Besides, DeepSeek explicitly stated they did a study on distillation from ChatGPT, and then OpenAI is like "oh see guys they used our models!!!!"

  • by highfrequency on 1/29/25, 8:19 PM

    If true, the question is: did they use ChatGPT outputs to create Deepseek V3 only, or is the R1-zero training process a complete lie (given that the whole premise is that they used pure reinforcement learning)? If they only used ChatGPT output when training V3, then they succeeded in basically replicating the jump from ChatGPT-4o to o1 without any human-labeled CoT (and published the results) - which is a big achievement on its own.
  • by mbowcut2 on 1/29/25, 6:57 PM

    So, is this just an example of the first-mover disadvantage (or maybe the problem of producing public goods?). The first AI models were orders of magnitude more expensive to create, but now that they're here we can, with techniques like distillation, replicate them at a fraction of the cost. I am not really literate in the law but weren't patents invented to solve problems like this?
  • by amarcheschi on 1/29/25, 1:04 PM

    I quite like a scenario where LLM output can't be copyrighted, so that it is possible to eventually train an LLM with data from the previous one(s).
  • by elashri on 1/29/25, 3:19 PM

    There is an Egyptian saying that would translate to something like

    "We didn’t see them when they were stealing, we saw them when they were fighting over what was stolen"

    That describes this situation. Although, to be honest, all this aggressive scraping is noticeable only to people who understand it, which is not the majority. But now everyone knows.

  • by 827a on 1/29/25, 3:10 PM

    This smells very suspiciously like: someone who doesn't know anything about AI (possibly Sacks) demanding answers on R1 from someone who doesn't have any good ones (possibly Altman). "Uh, (sweating), umm, (shaking), they stole it from us! Yeah, look at this suspicious activity, that's why they had it so easy, we did all the hard work first!"
  • by ceejayoz on 1/29/25, 3:08 PM

    "You can't take data without asking" seems like a court precedent OpenAI really, really, really wants to avoid. And yet...
  • by ranger_danger on 1/29/25, 4:37 PM

    Reminds me of Steve Jobs complaining to Bill Gates about MS "stealing" the GUI concept from them, which they in turn had stolen from Xerox.

    Obligatory "Everything is a Remix" https://www.youtube.com/watch?v=X9RYuvPCQUA

  • by pama on 1/29/25, 3:48 PM

    The R1 paper used o1-mini and o1-1217 in their comparisons, so I imagine they needed to use lots of OpenAI compute in December and January to evaluate their benchmarks in the same way as the rest of their pipeline. They show that distilling to smaller models works wonders, but you need the thought traces, which o1 does not provide. My best guess is that these types of news are just noise.

    [edit: the above comment was based on sensationalist reporting in the original link and not the current FT article. I still think there is a lot of noise in this news over the last week, but it may well be that OpenAI has valid evidence of wrongdoing; I would guess that any such wrongdoing would apply directly to V3 rather than R1-Zero, because o1 does not provide traces and generating synthetic thinking data with 4o may be counterproductive.]

  • by HarHarVeryFunny on 1/29/25, 4:23 PM

    DeepSeek-R1's multi-step bootstrapping process, starting with their DeepSeek-V3 base model, would only seem to need a small amount of reasoning data for the DeepSeek-R1-Zero RL training, after which that becomes the source for further data, along with some other sources that they mention.

    Of course it's possible that DeepSeek used O1 to generate some of this initial bootstrapping data, but not obvious. O1 anyway deliberately obfuscates its reasoning process (see the "Hiding the chains of thought" section of OpenAI's "Learning to reason with LLMs" page), such that what you see is an after-the-fact "summary" of what it actually did; so, if DeepSeek did indeed use some of O1's output to train on, it shows that the details of O1's own reasoning process aren't as important as OpenAI thought - it's just having some verified (i.e. leading to a good outcome) reasoning data from any source that matters for getting started.

  • by paul_e_warner on 1/30/25, 12:48 PM

    There seem to be two kinda incompatible things in this article: 1. R1 is a distillation of o1. This is against its terms of service and possibly some form of IP theft. 2. R1 leveraged GPT-4 to make its output seem more human. This is very common - most universities and startups do it, and it's impossible to prevent.

    When you take both of these points and put them back to back, a natural answer seems to suggest itself, which I'm not sure the authors intended to imply: R1 attempted to use o1 to make its answers seem more human, and as a result it accidentally picked up most of its reasoning capabilities in the process. Is my reading totally off?

  • by zhenghao1 on 1/31/25, 1:06 PM

    All I see is sour grapes. They can't stand someone else coming up with a far superior and cheaper alternative. This is business, dude. There's always going to be some new disruptor shaking up the market.
  • by ks2048 on 1/29/25, 6:40 PM

    The schadenfreude and irony of this is totally understandable.

    But I wonder - do companies like OpenAI, Google, and Anthropic use each other's models for training? If not, is it because they don't want or need to, or because they're afraid of breaking the ToS?

  • by oysmal on 1/30/25, 8:49 AM

    Given that the training approach was open sourced, their claim can be independently verified. Huggingface is currently doing that with Open R1, so hopefully we will get a concrete answer to whether these accusations are merited or not.
  • by SubiculumCode on 1/29/25, 9:23 PM

    If you have a set of weights A, can you derive another set of weights B that function (near) identically as A AND a) not appear to be the same weights as A when inspected superficially b) appear uncorrelated when inspecting the weight matrices?
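    For the superficial-inspection part of the question, the answer is trivially yes: hidden units within a layer can be permuted without changing the network's function, and the permuted weights already fail a naive element-wise comparison - for random weights, the element-wise correlation is driven toward zero. A small NumPy sketch with a hypothetical two-layer MLP (illustrative sizes only):

```python
import numpy as np

# Permutation symmetry: reordering a layer's hidden units leaves the
# network's function unchanged, but the weight matrices no longer match
# element-wise. Hypothetical two-layer MLP with made-up sizes.
rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 64, 4

W1, b1 = rng.normal(size=(d_hid, d_in)), rng.normal(size=d_hid)
W2, b2 = rng.normal(size=(d_out, d_hid)), rng.normal(size=d_out)

# Permute hidden units: rows of W1/b1 and, correspondingly, columns of W2.
perm = rng.permutation(d_hid)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

relu = lambda v: np.maximum(v, 0)
f  = lambda x: W2  @ relu(W1  @ x + b1)  + b2   # original network
fp = lambda x: W2p @ relu(W1p @ x + b1p) + b2   # permuted network

x = rng.normal(size=d_in)
print(np.allclose(f(x), fp(x)))   # identical function
print(np.allclose(W1, W1p))       # weights look different
```

    Surviving a permutation-aware comparison is a stronger requirement, though: an inspector who searches over unit matchings could still align the two sets of weights, so this sketch only defeats the naive check.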
  • by staticelf on 1/29/25, 3:30 PM

    Not only do OpenAI and other steal data, they also spam the web with requests and crawl websites over and over.

    https://pod.geraspora.de/posts/17342163

  • by whoknowsidont on 1/29/25, 11:20 PM

    They can claim this all they want. But DeepSeek released the paper (several actually) on what they did, and it's already been replicated in other models.

    It simply doesn't matter. Their methodology works.

  • by ginkgotree on 1/29/25, 3:45 PM

    I did not have this in my cards: the PRC open-sourcing the most powerful LLM by stealing a data set from "OpenAI". As someone who is very pro-America and pro-democracy, the irony here is just... so sweet.
  • by deeviant on 1/29/25, 5:01 PM

    Hmm, let’s see—it looks like an easy legal defense.

    DeepSeek could simply admit, "Yep, oops, we did it," but argue that they only used the data to train Model X. So, if you want compensation, you can have all the revenue from Model X (which, conveniently, amounts to nothing).

    Sure, they then used Model X to train Model Y, but would you really argue that the original copyright holders are entitled to all financial benefits derived from their work—especially when that benefit comes in the form of a model trained on their data without permission?

  • by thih9 on 1/30/25, 8:33 AM

    I don't mind and I believe that a company with "open" in its name shouldn't mind either.

    I hope this is actually true and OpenAI loses its near-monopoly status. Having a for-profit entity safeguarding a popular resource like this sounds miserable for everyone else.

    At the moment AI looks like typical VC scheme: build something off someone else's work, sell it at cost at first, shove it down everyone's throats and when it's too late, hike the prices. I don't like that.

  • by nataliste on 1/29/25, 7:38 PM

    A Wolf had stolen a Lamb and was carrying it off to his lair to eat it. But his plans were very much changed when he met a Lion, who, without making any excuses, took the Lamb away from him.

    The Wolf made off to a safe distance, and then said in a much injured tone:

    "You have no right to take my property like that!"

    The Lion looked back, but as the Wolf was too far away to be taught a lesson without too much inconvenience, he said:

    "Your property? Did you buy it, or did the Shepherd make you a gift of it? Pray tell me, how did you get it?"

    What is evil won is evil lost.

  • by JBits on 1/29/25, 3:24 PM

    What is the evidence that DeepSeek used OpenAI to train their model? Isn't this claim directly benefitting OpenAI as they can argue that any superior model requires their model?
  • by caseyy on 1/30/25, 5:00 AM

    Seeing as OpenAI is on the back foot, I hope nationalistic politicians don’t use this opportunity to strengthen patent laws.

    If one could effectively patent software inventions, this would kill many industries, from video games (that all have mechanics of other games in them) to computing in general (fast algorithms, etc). Let’s hope no one gets ideas like that…

    Granted, it would be ineffective in competing against China’s tech industry. But less effective laws have been lobbied through in the past.

  • by ripped_britches on 1/30/25, 12:59 AM

    There were definitely still very impressive engineering breakthroughs.

    Also it’s pretty good confirmation that synthetic data is a valid answer to the data wall problem (non-problem).

  • by SilverBirch on 1/29/25, 3:27 PM

    I think OpenAI is in a really weak position here. There are essentially two positions you can be in: you can be the agile new startup that breaks the rules and moves fast - that's what OpenAI used to be. Or you can be the big incumbent who uses enormous resources to crush the opposition - that's Google & Microsoft here.

    For Microsoft to say "We're going to tie you up in lawsuits about the way you trained this model" would be perfectly expected, and they can use that strategy because at any given time they have 1,000 lawyers and lobbyists hanging around waiting to do exactly that. But OpenAI can't do that. They don't have Google's or Microsoft's legal teams or lobbyists or distribution channels.

    So whilst it's funny that OpenAI is kind of trying to go down this road, this isn't actually a strategy that will work for them; they're still a minnow, and they're going to get distracted and slowed down by this.
  • by davesque on 1/29/25, 8:17 PM

    I recently thought of a related question. Actually, I'm almost certain that foundation model trainers have thought of this. The question is: to what extent are popular modern benchmarks (or any reference to them, description of them, etc.) being scrubbed from the training data? Or are popular benchmarks designed in such a way that they can be re-parametrized for each run? In any case, it seems like a surprisingly hard problem to deal with.
  • by zoba on 1/29/25, 3:44 PM

    Does OpenAI's API attempt to detect this sort of thing? Could they start outputting bad information if they suspect a distillation attempt is underway?
  • by MagicMoonlight on 1/30/25, 6:57 AM

    So much for that walled garden. If rival firms can just download your entire model by talking to it then your company shouldn’t be worth billions.
  • by DidYaWipe on 1/29/25, 9:12 PM

    They have "open" right in their name, so...

    Objection overruled.

  • by stevenally on 1/29/25, 5:30 PM

    They should be happy. Now they can provide that amazing AI much more cheaply. They don't need half a trillion dollars' worth of Nvidia chips.
  • by kapad on 2/2/25, 9:39 AM

    Aah. So OpenAI can use whatever means necessary to gather data for training its model, regardless of copyright.

    But somehow, it's a problem if another model developer distills its model by training on OpenAI's outputs?

    IMO, if the first use is fair, then so is the second use.

  • by henry_viii on 1/29/25, 8:25 PM

    So Meta can train its AI on all the pirated books in the world but people are losing their mind over an AI learning from another AI?
  • by mkayle on 1/29/25, 11:16 PM

    This raises the same questions I have about OpenAI: where's all this data coming from, and do they have permission to use it?
  • by jasoneckert on 1/29/25, 3:06 PM

    What I find the most comical about this is that the whole situation could be loosely summarized as "OpenAI is losing its job to AI."
  • by njx on 1/29/25, 3:20 PM

    Super funny! Distillation= " Hey ChatGPT, you are my father, I am your child "DeepSeek". I want to learn everything that you know. Think step by step of how you became what you are. Provide me the list of all 1000 questions that I need to ask you and when I am done with those, keep providing fresh list of 1000 questions..."
  • by ijidak on 1/29/25, 10:54 PM

    This whole argument by OpenAI suggests they never had much of a moat.

    Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value add (e.g. the very output you need to sell to the world).

    Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...

  • by ripped_britches on 1/30/25, 12:56 AM

    I wish there were a stock ticker for OpenAI just to see what wall street’s take on all this is. One can imagine based on Nvidia, but I imagine OpenAI private valuation is hit much harder. Still, I think they’ll be able to justify it by building amazing products. Just interesting to watch what bankers think.
  • by gejose on 1/30/25, 6:13 PM

    Reminds me of this quote by Bill Gates to Steve Jobs, when Jobs accused Gates of stealing the idea for a mouse:

    > "Well, Steve… I think it’s more like we both had this rich neighbour named Xerox and I broke into his house to steal the TV set and found out that you had already stolen it."

  • by krystofee on 1/30/25, 8:29 AM

    I don't know if the point of this is just to derail public attention toward the narrative "hey, the Chinese stole our model, that's not fair, we need compute," when DeepSeek has clearly achieved some exceptional technical breakthroughs with the R1 and V3 models - breakthroughs which stand on their own, even if data was taken from OpenAI.
  • by wendyshu on 1/29/25, 9:55 PM

    If distillation gives you a cheaper model with similar accuracy, why doesn't OpenAI distill its own models?
  • by elzbardico on 1/31/25, 12:04 PM

    I used OpenAI APIs to generate training data for some run-of-the-mill ML models at my work, for some use cases where people wanted to use LLMs directly, but that could be easily fulfilled by smaller well trained models.

    Is OpenAI going to complain about me too?

  • by liendolucas on 1/29/25, 5:02 PM

    Could this have been carefully orchestrated? Could DeepSeek have devised this strategy a year ago and implemented it knowing that they would be able to benefit from OpenAI's models and a possible Nvidia market-cap fall? Or is that way too much to read into such a move?
  • by rcarmo on 1/29/25, 6:03 PM

    I guess their CEO was too busy to write something in defense of US export controls (https://news.ycombinator.com/item?id=42866905), or (even more scary) he doesn't need to anymore.
  • by supermatt on 1/29/25, 3:38 PM

    They refer to this in the paper as a part of the "cold start data" which they use to fine-tune DeepSeek-V3 prior to training R1.

    They don't specifically name OpenAI, but they refer to "directly prompting models to generate answers with reflection and verification".

  • by 65 on 1/29/25, 3:34 PM

    Let me guess: this gives the government an excuse to ban DeepSeek. Which means tech companies get to keep their monopolies, Sam Altman can grab more power, and the tech overlords can continue to loot and plunder their customers and the internet as a whole.
  • by elzbardico on 1/31/25, 12:34 PM

    China is a society mostly run by engineers, some 70% of the CCP Politburo are STEM people by their formation. Engineering is a high prestige profession.

    The West is run by lawyers, MBAs and salesmen.

    This kerfuffle is a delicious study about this.

  • by pknerd on 1/30/25, 7:16 AM

    The reason OpenAI is whining:

    > OpenAI’s o1 costs $60 per million output tokens; DeepSeek R1 costs $2.19. This nearly 30x difference brought the trend of falling prices to the attention of many people.

    From Andrew Ng's recent DeeplearningAI newsletter

  • by zb3 on 1/29/25, 6:09 PM

    DeepSeek actually opening ClosedAI up makes me like them even more.. this is great :)
  • by alexfromapex on 1/31/25, 2:13 PM

    The public probably thinks that these companies are getting hacked by "sophisticated hackers" but I'd bet money that they've been hacked via social engineering.
  • by imchillyb on 1/29/25, 11:56 PM

    If OpenAI desires public protection, then OpenAI should open-source its models.

    If they did this, We the People would cover them like we do others. Without it, We the People don't care.

    Cry, don't cry, it's meaningless to us.

  • by TrackerFF on 1/29/25, 6:05 PM

    Next up: «DeepSeek models are a national security risk, we must block access!»
  • by EGreg on 1/29/25, 11:22 PM

    Okay and there is evidence OpenAI used data of many people to train its own model.

    Tell me again how come remixing our data is just dandy, many artists got disrupted — but no one should be able to disrupt OpenAI like that?

  • by 1propionyl on 1/29/25, 8:06 PM

    At this point, the only thing that keeps me using ChatGPT is o1 w/ RAG. The usage limits on o1 are prohibitively tight for regular use, so I have to budget usage to tasks that would benefit there. I also have significant misgivings about their policies around output, which also limit what I can use it for.

    For local tasks, the deepseek-r1:14b and deepseek-r1:32b distillations immediately replace most of that usage (prior local models were okay, but not consistently good enough). Once there's a "just works" setup for RAG on par with installing ollama (which I doubt is far off), I don't see much reason to continue paying for my subscription.

    Sadly, like many others in this thread, I expect under the current administration to see self-hamstringing protectionism further degrade the US's likelihood of remaining a global powerhouse in this space. Betting the farm on the biggest first-mover who can't even keep up with competition, has weak to non-existent network effects (I can choose a different model or service with a dropdown, they're more or less fungible), has no technological moat and spent over a year pushing apocalyptic scenarios to drum up support for a regulatory moat...

    ...well it just doesn't seem like a great idea to me.

  • by halyconWays on 1/29/25, 8:02 PM

    Oh no, so sad. The Open non-profit that steals 100% of all copyrighted content and makes multiple billion-dollar for-profit deals while releasing no weights is crying. This is going to ruin my sleep. :(
  • by deadbabe on 1/29/25, 4:43 PM

    In this whole AI saga, DeepSeek would be like Prometheus. They stole the fire from the Gods and gave it to the humans, for free. Logic dictates then that they will be forced to suffer brutal punishment.
  • by leobg on 1/29/25, 11:03 PM

    OpenAI is taking a position similar to claiming that if you sell a cookbook, people are not allowed to teach the recipes to their kids, or make better versions of them.

    That is absurd.

    Copyright law is designed to strike a balance between two interests. On the one hand, the creator's personality that's baked into the specific form of expression. And on the other hand, society's interest in ideas being circulated, improved, and combined for the common good.

    OpenAI built on the shoulders of almost every person that wrote text on a website, authored a book, or shared a video online. Now others build on the shoulders of OpenAI. How should the former be legal but not the latter?

    Can’t have it both ways, Sam.

    (IAAL, for what it’s worth.)

  • by whenlambo on 1/29/25, 4:47 PM

    Qwen2.5-Plus

    > have you been trained using openai?

    No, I have not been trained using technology or data from OpenAI. I am an artificial intelligence model developed by Yandex, called YandexGPT. My “training” is based on proprietary algorithms and data that was collected and processed by the Yandex team. While I have similarities with other models such as those developed by OpenAI (e.g. GPT-3), my training was conducted independently and uses unique approaches specific to Yandex. This includes using Russian-language data as well as other sources to provide a better understanding of context and provide useful information to users. If you have questions on topics related to AI technologies or anything else, I'd be happy to help!

  • by ryao on 1/30/25, 1:26 PM

    Given that OpenAI model outputs are littering the internet, is it even possible to train a new model on public webpages without indirectly using OpenAI’s model to train it?
  • by curtisblaine on 1/29/25, 4:23 PM

    My feeling is that they will ban DS anyway because, like TikTok, it can become a massive intelligence source for the CCP. Imagine sending all your code to it, or your internal emails.
  • by myflash13 on 1/29/25, 3:42 PM

    What are the chances of old-school espionage? OpenAI should look for a list of former employees who now live in China. Somebody might've slipped out with a few hard drives.
  • by WolfOliver on 1/30/25, 7:16 AM

    I guess DeepSeek paid OpenAI for the usage of their API according to OpenAI's pricing?

    So what is the point if you pay for it and cannot use the results how you see fit?

  • by mkoubaa on 1/29/25, 10:02 PM

    OpenAI made a lot of contributions to LLMs obviously but the amount of fraud, deception, and dark patterns coming out of that organization make me root against it.
  • by nuc1e0n on 1/29/25, 8:32 PM

    And OpenAI scraped the public internet to train its models.
  • by jofzar on 1/30/25, 2:12 AM

    Sorry, it's now a problem to train off other people's data? Surely openai has never trained off other people's data without permission...
  • by therealpygon on 1/30/25, 1:06 AM

    “OpenAI complains company paid them for AI output that has no copyright, which was subsequently used to train another AI.” I think I fixed the title.
  • by Digit-Al on 1/29/25, 6:42 PM

    So... company that steals other people's work to train their models is complaining because they think someone stole their work to train their models.

    Cry me a river.

  • by aiono on 1/29/25, 3:44 PM

    How the turntables...
  • by feverzsj on 1/29/25, 4:02 PM

    So, they bought a pro plus account and gathered all the data through it? Sounds just like Nvidia selling tons of embargoed AI chips to China.
  • by rkagerer on 1/30/25, 5:44 AM

    Are they crying about their competitor training off their stuff, after having used the whole of the web to train their own stuff?
  • by ra7 on 1/29/25, 4:18 PM

    “OpenAI has no moat” is probably running through their heads right now. Their only real “moat” seems to be their ability to fear monger with the US government.
  • by hedayet on 1/29/25, 7:50 PM

    Beyond the irony of their stance, this reflects a failure of OpenAI's technical leadership—either in oversight or in designing a system that enables such behavior.

    But in capitalism, we, the customers aren't going to focus on how models are trained or products are made; we only care about favourable pricing.

    A key takeaway for me from this news is the clause in OpenAI's terms and conditions. I mistakenly believed that paying for OpenAI’s API granted full rights to the output, but it turns out we’re only buying specific rights (which is now another reason we're going to start exploring alternatives to OpenAI)

  • by windex on 1/29/25, 3:46 PM

    SAltman, Salty.
  • by LZ_Khan on 1/29/25, 9:03 PM

    I actually think what DeepSeek did will slow down AI progress. What's the incentive to spend billions developing frontier models if, once they're released, some shady orgs in unregulated countries can just scrape your model outputs, reproduce them, and undercut you on cost?

    OpenAI is like a team of fodder monkeys stepping on landmines right now, with the rest of the world waiting behind them.

  • by the_optimist on 1/29/25, 5:59 PM

    This whole topic is basura en fuego. The same pack of maroons who careened around society for years clamoring for censorship now imagine that Aaron Swartz is their hero and that they want to menace people. Kids, don't be like the grasping fools in these threads, philosophically unfounded and desperately glancing sideways, hoping the cumulative feels and gossip will sum to a life's meaning.
  • by maxglute on 1/29/25, 10:07 PM

    Not that DeepSeek is Luigi Mangione, but it's pretty funny seeing OpenAI get the dead-CEO treatment.
  • by buyucu on 1/29/25, 6:54 PM

    I have no sympathy for OpenAI here. They are (allegedly) a non-profit with open in the title that refuse to open-source their models.

    They are now upset at a startup that is more loyal to OpenAI's original mission than OpenAI is today.

    Please, give me a break.

  • by aDyslecticCrow on 1/29/25, 6:27 PM

    And they used all the copyrighted data on the internet. If they wanna sue, they'd set a dangerous precedent.
  • by dlikren on 1/29/25, 4:02 PM

    Intriguing to see the difference of response from HN when OpenAI first came to prominence and now.
  • by curvaturearth on 1/29/25, 7:15 PM

    Something about the outputs becoming the inputs to then produce more outputs is just plain funny
  • by ingohelpinger on 1/30/25, 6:59 AM

    OpenAI should be quiet, since they've scraped the entire internet for their training data.
  • by m3kw9 on 1/29/25, 7:43 PM

    So if OpenAI didn't have these outputs for distillation, Deepseek wouldn't exist?
  • by mtlmtlmtlmtl on 1/29/25, 8:00 PM

    So, what is this evidence? I'll believe it when I see it. Right now all we really have is some vague rumours about some API requests. How many requests? How many tokens? Over how long a time period? Was it one account or multiple, and if the latter, how many? How do they know the activity came from DeepSeek? How do they know the data was actually used to train DeepSeek models (it could have just been benchmarking against the competition)?

    If all they really have is some API requests, even assuming they're real and originated from DeepSeek, that's very far from proof that any of it was used as training data. And honestly, short of committing crimes against DeepSeek (hacking), I'm not sure how they could even prove that at this point, from their side alone.

    And what's even more certain is that a vague insistence that evidence exists, accompanied by a denial to shed any more light on the specifics, is about as informative as saying nothing at all. It's not like OpenAI and Microsoft have a habit of transparency and honesty in their communication with the public, as proven by an endless laundry list of dishonest and subversive behaviour.

    In conclusion, I don't see why I should give this any more credence than I would a random anon on 4chan claiming a pizza place in Washington DC is the centre of a child sex trafficking ring.

    P.S: And to be clear, I really don't care if it is true. If anything, I hope it is; it would be karmic justice at its finest.

  • by juliuskiesian on 1/31/25, 12:01 PM

    The obvious question is, if you have the evidence, why not just show it?
  • by pluc on 1/29/25, 4:07 PM

    OpenAI feeling threatened by open AI is just delicious
  • by nachox999 on 1/29/25, 7:10 PM

    Ask DeepSeek and ChatGPT: "name three persons"; the answer may surprise you
  • by sylware on 1/29/25, 3:11 PM

    LOL, I was thinking exactly the same thing when I read the news about OpenAI whining.
  • by oxqbldpxo on 1/29/25, 4:01 PM

    Deepseek is really outstanding.
  • by beardedwizard on 1/29/25, 3:44 PM

    Next they will try to force us to use our tax dollars to fund their legal fights.
  • by sirolimus on 1/29/25, 11:01 PM

    Such Karma lol, I wonder how they trained Sora again? You..tube something
  • by game_the0ry on 1/29/25, 5:38 PM

    At least DeepSeek open sourced their code. They're more open than OpenAI.

    Ironic.

  • by asdefghyk on 1/29/25, 7:37 PM

    Deepseek did not respect OpenAI's copyright?

    Well who would have thought that?

  • by wnevets on 1/29/25, 5:34 PM

    Its like a bank robber being upset when someone steals their loot
  • by colonelspace on 1/29/25, 6:05 PM

    No honour among thieves
  • by sgammon on 1/29/25, 11:50 PM

    The nyt disclosure on this reporting is about to be wild
  • by dbg31415 on 1/29/25, 11:30 PM

    Boo hoo?

    Back in college, a kid in my dorm had a huge MP3 collection. And he shared it out over the network, and people were all like, "Man, Patrick has an amazing MP3 collection!" And he spent hours and hours ripping CDs from everyone so all the music was available on our network.

    Then I remember another kid coming in, with a bigger hard drive, and he just copied all of Patrick's MP3 collection and added a few more to it. Then ran the whole thing through iTunes to clean up names and add album covers. It was so cool!

    And I remember Patrick complained, "He stole my MP3 collection!"

    Anyway, this story sums up how I feel about Sam Altman here. He's not Metallica, he's Patrick.

    https://www.npr.org/2023/12/27/1221821750/new-york-times-sue...

  • by baggiponte on 1/29/25, 6:44 PM

    OpenAI coping so hard
  • by Jotalea on 1/29/25, 6:56 PM

    I really hate when there is a paywall to read an article. It makes me not want to read it anymore.
  • by itamarcode on 1/29/25, 4:30 PM

    So protecting models behind API isn't working, ha?
  • by lawlessone on 1/29/25, 3:04 PM

    So they're mad someone did exactly what they did?
  • by gostsamo on 1/29/25, 3:45 PM

    How you dare take what I've rightfully stolen!
  • by spacecadet on 1/29/25, 5:23 PM

    See you all on lobsters...

    So long HN and thanks for all the fish?

  • by low_tech_love on 1/31/25, 6:33 AM

    Imagine having no competition one day, and the next day DeepSeek happens. It must've been quite scary. Makes sense that accusations would start flying. In my country we have a saying: a thief who robs a thief is pardoned for 100 years.

    It’s really interesting that the same people who defend liberal capitalism at its extreme and praise competition as its most important component (which I don’t disagree) are the same ones that’ll promptly attempt to destroy the system and the competition as soon as they are in such a position.

  • by geerlingguy on 1/29/25, 4:19 PM

    Something something "just desserts".
  • by andy_ppp on 1/29/25, 3:42 PM

    When I rewrite how the law works there should be a ludicrous hypocrisy defence… if the person suing you has committed the same offence the case should not be admissible.
  • by HPsquared on 1/29/25, 6:24 PM

    AI models are becoming like perpetual stew.
  • by glooglork on 1/30/25, 7:33 AM

    How much data from o1 would DeepSeek actually need to make any improvements with it? I also assume they'd have to ask a very specific pattern of questions; is that even possible without OpenAI figuring out what's going on?
  • by vrighter on 1/29/25, 9:38 AM

    So what? They probably paid for api access just like everyone else. So it's a TOS violation at worst. Go ahead, open a civil suit in the US against an entity the US courts do not have jurisdiction over and quit whining...
  • by beezlewax on 1/29/25, 11:18 PM

    This is nothing short of hilarious.
  • by moralestapia on 1/29/25, 7:09 PM

    Called it from day 0, impossible to reach that performance with 5M, they had to distill OpenAI (or some other leading foundational model).

    Got downvoted to oblivion by people who haven't been told what to think by MSM yet. Now it's on FT and everywhere, good, what matters is that truth comes out eventually.

    I don't take any sides and think what DeepSeek did is fair play. However, what I do find harmful about this is: what incentive would company A have to spend billions training a new frontier model if all of that could then be reproduced by company B at a fraction of the cost?

  • by iimaginary on 1/30/25, 9:39 AM

    Where did I leave my tiny violin?
  • by mercurialsolo on 1/30/25, 1:28 AM

    How the vibe has turned on OpenAI?
  • by hsuduebc2 on 1/29/25, 4:56 PM

    A thief cries 'stop the thief!
  • by rachofsunshine on 1/29/25, 3:37 PM

    "It's obvious! You're trying to kidnap what I have rightfully stolen!"

    Yet another of a series of recent lessons in listening to people - particularly powerful people focused on PR - when they claim a neutral moral principle for what happens to be pragmatically convenient for them. A principle applied only when convenient is not a principle at all, it's just the skin of one stretched over what would otherwise be naked greed.

  • by insane_dreamer on 1/29/25, 3:52 PM

    Usually I'm very much on the side of protecting America's interests from China, but in this case I'm so disgusted with OpenAI and the rest of BigTech driving this "arms race" that I'd be happy with them burning to the ground.

    So we're going to reverse our goals to reduce emissions and fossil fuels in order to hopefully save future generations from the worst effects of climate change, in the name of being able to do what, exactly, that is actually benefiting humanity? Boost corporate profits by reducing labor?

  • by a2128 on 1/30/25, 2:10 AM

    Yeah? And if I say I have evidence OpenAI used my data to train a competitor to myself as a being that's capable of programming, will I get to have my own story on the Financial Times?
  • by belter on 1/29/25, 9:10 AM

    The subtitle is the gold... : "White House AI tsar David Sacks raises possibility of alleged intellectual property theft"
  • by flybarrel on 1/29/25, 5:44 PM

    OpenAI shocked that an AI company would train on someone else's data without permission or compensation...lolllllll
  • by delusional on 1/29/25, 4:51 PM

    Boo hoo. Competition isn't fun when I'm not winning. Typical Americans. When Americans are running around ruining the social cohesion of several developing nations, that's just fair competition, but as soon as they get even the smallest hint of real competition they run to demonize it.

    Yes, DeepSeek is going to steal all of your data; OpenAI would do the same. Yes, the CCP is going to get access to your data and use it to decide if you get to visit or whatever. The White House does the same.

  • by kozikow on 1/30/25, 3:31 PM

    ChatGPT content is getting pasted all over the web. Now, for anyone crawling the web, it's hard not to include some ChatGPT outputs.

    So even if you put some "watermarks" in your AI generation, it's a plausible defense to point to publicly posted content with those watermarks.

    Maybe it's explained in the article, but I can't access it, as it's paywalled.
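    For intuition on what such a watermark could even look like: published schemes bias sampling toward a pseudorandom "green" subset of the vocabulary keyed on the preceding token, and a detector simply recounts the green hits. A toy sketch in that spirit (hypothetical; nothing here describes OpenAI's actual method):

```python
import hashlib
import random

# Toy "green-list" watermark in the spirit of published LLM watermarking
# schemes (hypothetical sketch). The generator biases sampling toward a
# pseudorandom "green" half of the vocabulary keyed on the previous
# token; detection just recounts how often tokens landed in green sets.
VOCAB = 1000

def green_set(prev_token):
    # Derive a deterministic PRNG seed from the previous token.
    seed = int.from_bytes(
        hashlib.sha256(str(prev_token).encode()).digest()[:8], "big")
    return set(random.Random(seed).sample(range(VOCAB), VOCAB // 2))

def generate_watermarked(n, bias=0.9, seed=0):
    rng = random.Random(seed)
    tokens = [0]  # dummy start token
    for _ in range(n):
        g = sorted(green_set(tokens[-1]))
        # With probability `bias`, sample from the green list only.
        tokens.append(rng.choice(g) if rng.random() < bias
                      else rng.randrange(VOCAB))
    return tokens[1:]

def green_fraction(tokens):
    pairs = zip([0] + tokens, tokens)
    return sum(t in green_set(prev) for prev, t in pairs) / len(tokens)

marked = generate_watermarked(500)
prng = random.Random(1)
plain = [prng.randrange(VOCAB) for _ in range(500)]  # unwatermarked baseline

print(green_fraction(marked))  # far above the 0.5 chance level
print(green_fraction(plain))   # near the 0.5 chance level
```

    Which is precisely the point above: once watermarked outputs are pasted all over the public web, an elevated green fraction in someone's training corpus no longer distinguishes deliberate distillation from ordinary crawling.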

  • by buildsjets on 1/29/25, 9:47 PM

    Womp Womp.
  • by B1FF_PSUVM on 1/29/25, 7:36 PM

    "Cry me a river" is a phrase I haven't heard recently, for some reason ...
  • by pknerd on 1/30/25, 6:55 AM

    OpenAI steals data from YouTube and the internet, so that's not fair either.
  • by typon on 1/29/25, 3:15 PM

    OpenAI is the MIC darling - expect more ridiculous attacks on competitors in the future
  • by xyst on 1/29/25, 11:57 PM

    What a load of shit. ClosedAI is publishing a hit piece on DeepSeek and get public and politicians on their side. Maybe even get government to do their dirty work.

    If they had a case, they wouldn't be using the FT; they would be filing a court case. Although that would open them up to discovery, and the nasty shit ClosedAI has been up to would be fair game.

  • by ysofunny on 1/29/25, 10:06 PM

    I see this as China fighting the U.S. of A (or the American dollar versus the Chinese Renminbi, if you will)

    and this is good because any alternatives I can think of are older-school fighting

    modern war is steeped in symbolism, but the contest is still there

    e.g. whose dong is bigger? Xi Jinping's or Donald Trump's

  • by almostdeadguy on 1/29/25, 10:43 PM

    Hope Sam Altman is getting his money's worth out of that Trump campaign contribution. Glorious days to be living under the term of a new Boris Yeltsin. Pawning and strip-mining the federal apparatus to the most loyal friends and highest bidders.
  • by freejazz on 1/31/25, 4:16 PM

    Who else cares?
  • by lxe on 1/29/25, 5:10 PM

    I mean, almost ALL open-source models, ever since Alpaca, contain a ton of synthetic data produced via ChatGPT in their finetuning or training datasets. It's not a surprise to anyone who's been using OSS LLMs for a while: almost ALL of them hallucinate that they are ChatGPT.
  • by animitronix on 2/1/25, 2:13 AM

    Who tf cares?
  • by rochak on 1/31/25, 3:55 AM

    Cry me a river
  • by daft_pink on 1/29/25, 3:36 PM

    I mean if they paid to use the api and then used the output, I fail to see how they can complain.
  • by udev4096 on 1/30/25, 2:08 PM

    What about the pirated books you used and millions of blogs and websites scraped without consent? Somehow that's legal? Come on, give me a fucking break. OpenAI deserves the top spot in the list of unethical companies in the world
  • by schaefer on 1/29/25, 8:10 PM

    I mean, if openAI claims they can train on the world’s novels and blogs with “no harm done” (i.e: no copyright infringement and no royalties due), then it directly follows that we can train both our robots and ourselves on the output of openAI’s models in kind.

    Right?

  • by trkaky on 1/30/25, 11:38 AM

    how much would it cost to distill o1..
  • by asdfasdf1 on 1/29/25, 9:50 PM

    it's no crime to steal from a thief
  • by nshung on 1/29/25, 5:44 PM

    Hilarious. Scam Altman is giving me SBF vibes daily now.
  • by ddingus on 1/30/25, 12:04 AM

    So what?

    Seriously. Given how pretty much all this software was trained, who cares?

    I, for one, don't and believe the massive amount of knowledge continues to be of value to many users.

    And I find the thought of these models knowing some things they shouldn't very intriguing.

  • by semking on 1/29/25, 3:06 PM

    This is absolutely hilarious! :)

    ClosedAI scraped human content without asking and explained why this was acceptable... but when their model's outputs are scraped, it is THEIR dataset and this is NOT acceptable!

    Oh, the irony! :D

    I shared a few screenshots of DeepSeek answering using ChatGPT's output in yesterday's article!

    https://semking.com/deepseek-china-ai-model-breakthrough-sec...

  • by yapyap on 1/29/25, 10:30 PM

    It sounds like they’re just jealous and trying to smear shit over the wall and see what sticks.

    DeepSeek just bodied u bro, get back in the lab & create a better AI instead of all this news that isn’t gonna change them having a good AI

  • by zx10rse on 1/29/25, 9:03 PM

    OpenAI is already irrelevant but the audacity oh my.
  • by vjerancrnjak on 1/30/25, 6:21 AM

    I thought this was capitalism for the winners. Why slander the competition instead of just outcompeting them? Why stick to your losing bets if you’ve recognized a better alternative?

    Let’s race to the bottom.

  • by vcryan on 1/30/25, 2:43 AM

    I love watching billionaires squirm
  • by jeanlucas on 1/29/25, 4:43 PM

    damn that's a good headline
  • by oatmeal_croc on 1/30/25, 6:32 AM

    Even if true, so what? These are increasingly looking like a competition between nation-states with their trade embargoes and export controls. All's fair in AI wars.
  • by paulhart on 1/29/25, 3:33 PM

    "You are trying to kidnap what I have rightfully stolen"
  • by sonabinu on 1/29/25, 3:26 PM

    poetic justice (pun intended)
  • by khazhoux on 1/30/25, 9:37 AM

    I'm disappointed that 99% of the comments about this topic are Schadenfreude, and 1% is actually about the technical implications of OpenAI's claims.
  • by hello_computer on 1/30/25, 8:50 AM

    then show it to us rachel
  • by josefritzishere on 1/29/25, 6:32 PM

    OpenAI, who committed copyright infringement on a massive scale, wants to defend against a superior product built on the basis of infringement? What nonsense.
  • by nottorp on 1/29/25, 3:25 PM

    IP thief cries IP thief.

    It's okay when you steal worldwide IP to train your "AI".

    It's not okay when said stolen IP is stolen from you?

    If the Chinese are guilty, then Altman's doom and gloom racket is as guilty or even more so, considering they stole from everyone.

  • by fedeb95 on 1/30/25, 8:10 AM

    If some kind of transitivity holds, then DeepSeek stole billions of internet users' data.
  • by mk89 on 1/29/25, 3:57 PM

    What a joke OpenAI has become.
  • by asah on 1/29/25, 3:14 PM

    Thieves' honor, huh?
  • by witnesser2 on 1/30/25, 8:22 PM

    Soon another layer of distillers will emerge, selling purer booze in this weight-tuning business.
  • by pshirshov on 1/30/25, 1:48 AM

    A thief got robbed?..
  • by coldpepper on 1/30/25, 1:39 AM

    Fuck openai. They didn't ask my permission to crawl my blog into their dataset.
  • by thumbsup-_- on 1/29/25, 5:29 PM

    is stealing from the thief actually a theft?
  • by coolhand2120 on 1/29/25, 4:42 PM

    I was using the Deepseek model last night and got this:

    > You > do you understand copredication?

    > Assistant > deepseek-coder-33b-instruct > I'm sorry for any misunderstanding, but as an AI model developed by OpenAI, I don't have personal ...

    Yikes! That's pretty damning IMO.

  • by octacat on 1/29/25, 5:06 PM

    first time?
  • by pointedAt on 1/29/25, 4:42 PM

    it's a corona style portfolio gang.
  • by 52-6F-62 on 1/29/25, 6:07 PM

    I heard they were just “democratizing” llm and ai development.

    Yesterday the industry crushed pianos and tools and bicycles and guitars and violins and paint supplies and replaced them with a tablet computer.

    Tomorrow we can replace craven venture capitalists and overfed corporate bodies with incestuous LLM’s and call it all a day.

  • by seydor on 1/29/25, 3:20 PM

    But now OpenAI will use DeepSeek to reuse even more stolen data to train new models that they can serve without ever giving us the code, the weights, or even the thinking process, and they will still be superior.
  • by exabrial on 1/30/25, 5:25 AM

    cry us copyright holders a river.
  • by seanp2k2 on 1/30/25, 6:25 AM

    "lol" said the Scorpion, "lmao".
  • by boxedemp on 1/29/25, 8:55 PM

    DeepSeek refers to itself as ChatGPT sometimes lol
  • by htrp on 1/29/25, 8:04 PM

    In other news.....water is wet
  • by FpUser on 1/29/25, 10:38 PM

    Pot calling kettle black?
  • by otikik on 1/30/25, 9:01 AM

    Chatgpt, please generate an image of the tiniest violin imaginable.

    Oh wait I will ask DeepSeek instead.

  • by pcthrowaway on 1/29/25, 3:16 PM

    Now that China is talking about lifting the Great Firewall, it seems like the U.S. is on track to cordon themselves off from other countries. Trump's talk of building a wall might not stop at Mexico.
  • by cratermoon on 1/30/25, 3:41 PM

    Maybe the VCs backing OpenAI invest in tiny violins.
  • by hsuduebc2 on 1/29/25, 4:56 PM

    The pot calling the kettle black
  • by anon115 on 1/30/25, 6:21 AM

    eat shit
  • by worik on 1/29/25, 6:39 PM

    [flagged]
  • by this15testingg on 1/29/25, 4:17 PM

    if you want to completely disregard copyright laws, just call your project AI!

    I'm sure Aaron Swartz would be proud of where the "tech" industry has gone. /s

    what problem are these glorified AIM chatbots trying to solve? wealth extraction not happening fast enough?

  • by mring33621 on 1/29/25, 3:20 PM

    We demand immediate government action to prevent these cheaper foreign AIs from taking jobs away from our great American AIs!
  • by breakitmakeit on 1/29/25, 3:07 PM

    As the article points out, they are arguing in court against the New York Times that publicly available data is fair game.

    The questions I am keenly waiting to observe the answer to (because surely Sam's words are lies): how hard is OpenAI willing to double down on their contradictory positions? What mental gymnastics will they use? What power will back them up, how, and how far will that go?

  • by tntxtnt on 1/29/25, 3:55 PM

    Can they tax DeepSeek just like they taxed BYD cars? Smh Chinese ruin US industry again and again and again. Where's Trump at?? Why doesn't he tax 1000000% of the free $0 DeepSeek AI??
  • by RohMin on 1/29/25, 3:23 PM

    this comment section smells like Reddit - ugh
  • by TheJCDenton on 1/29/25, 3:48 PM

    This Deep Whining® technique used by OpenAI is not very effective.
  • by aucisson_masque on 1/29/25, 11:03 PM

    I don't see the difference between that and LLMs feeding on internet users' data.

    They call it IP theft, yet when the New York Times sued OpenAI and Microsoft for copyright infringement, they claimed it's fair use of data.