by OuterVale on 4/12/25, 12:52 PM with 96 comments
by blagie on 4/12/25, 1:49 PM
Even if we accept the house of cards of shaky arguments this essay is built on, even just for the sake of argument, where OpenAI breaks my copyright is by having a computer "memorize" my work. That's a form of copying.
If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.
On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all other companies are breaking copyright law as written.
Usually, when reality and law diverge, the law eventually shifts, not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion about what it should look like. That's a big discussion. I'll make a few claims:
- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.
- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds
- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important I know what those are.
That's the bigger discussion to have.
by basch on 4/12/25, 1:47 PM
The author just seems to have decided the answer and worked backwards, when in reality this is very much a Ship of Theseus type problem. At what point does a compressed JPEG stop being the original image and become a transformation? The same thing applies here. If I ask a model to recite Frankenstein and it largely does, is that not a lossy compression of the original? Would the author argue an MP3 isn't a copy of a song because all the information isn't there?
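To make the lossy-compression point concrete, here is a toy round-trip sketch (a hypothetical illustration, assuming Pillow is installed and "photo.png" stands in for any local image):

    # Toy sketch of lossy compression: most pixels change in the
    # round-trip, yet the result is still "the" image, not a new work.
    import io
    from PIL import Image

    original = Image.open("photo.png").convert("RGB")

    buf = io.BytesIO()
    original.save(buf, format="JPEG", quality=10)  # aggressively lossy encode
    lossy = Image.open(io.BytesIO(buf.getvalue()))

    # Count how many pixels the round-trip altered.
    changed = sum(1 for a, b in zip(original.getdata(), lossy.getdata()) if a != b)
    print(f"{changed} of {original.width * original.height} pixels differ")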
Calling it "training" instead of compression lets the author play semantic games.
by TimorousBestie on 4/12/25, 1:55 PM
I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.
by hyperman1 on 4/13/25, 10:39 AM
Apart from that, I wonder if an AI is learning in the legal sense of the word. I'd suspect removing copyright through learning is something only humans can do, seen through legal glasses. An AI would be a mechanical device creating a mashup of multiple works, and would be a derived work of all of them.
The main problem with this rebuttal is proving that the AI copied your work specifically, and finding out which of the zillions of creative works in that mashup is owned by whom.
by gavinhoward on 4/12/25, 2:00 PM
Copyright law (in the US) includes fair use, which has four tests. Not all of the tests need to fail for fair use to disappear; usually two are enough.
The one courts love the most is if the copy is used to create something commercial that competes with the original work.
From near the top of the article:
> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.”
So essentially, the author admits that AI fails this test.
Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.
The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.
Edit: still reading, but the author talks about enumerated rights. Most lawsuits target the distribution of model outputs because that is reproduction, an enumerated right.
Edit 2: the author talks about substantive differences, admits they happen about 2% of the time, but then seems to argue that this means they are not infringing at all. No, they are infringing in those instances.
Edit 3: the author claims that model users are the infringing ones, but at least one AI company (Microsoft?) has agreed to indemnify users, so plaintiffs have every right to go after the company instead.
by djoldman on 4/12/25, 1:56 PM
1. acquire training data
2. train on training data
3. run inference on trained model
4. deliver outputs of inference
One can subdivide the above however one likes.
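As a toy illustration of those four steps (a hypothetical bigram "model" standing in for an LLM; nothing here reflects how any real product works):

    # 1. acquire training data (here: a hardcoded public-domain snippet)
    import random
    from collections import defaultdict

    corpus = "call me ishmael some years ago never mind how long precisely"

    # 2. train on training data (here: record which word follows which)
    model = defaultdict(list)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)

    # 3. run inference on trained model
    def generate(seed, length=8):
        out = [seed]
        while len(out) < length and model[out[-1]]:
            out.append(random.choice(model[out[-1]]))
        return " ".join(out)

    # 4. deliver outputs of inference
    print(generate("call"))

With a corpus this small, step 3 just regurgitates step 1 verbatim, which is exactly the memorization issue the lawsuits go after.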
My understanding is that most lawsuits are targeting step 4, delivering the outputs of inference.
This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.
The issue of whether it's legal to train on data to which one does not hold copyright is probably moot; businesses don't care much about what you do unless you're making money off it.
by EdwardDiego on 4/12/25, 1:54 PM
Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit up hill with the word "can't".
by ConspiracyFact on 4/15/25, 12:09 AM
“But,” you say, “human art is derivative too in that case!”
No. A human artist is influenced by other artists, yes, but he is also influenced by the totality of his life experience, which amounts to much more in terms of “inputs”.
by prophesi on 4/12/25, 2:42 PM
It sounds like a pipe dream, but ethical enforcement of AI training across the globe will require multifaceted solutions, and even those won't stamp out all bad actors.
by light_hue_1 on 4/12/25, 2:14 PM
Think of AI tools like any other tools. If I include code I'm not allowed to use, that's copyright infringement, just like reading a book I pirated. If I include an image as an example in my image editor, that's OK if I am allowed to copy it.
If someone decides to use my image editor to create an image that infringes a copyright or trademark, that's not the fault of the software. Even if my software says "hey look, here are some cool logos that you might want to draw inspiration from".
People are getting too hung up on the AI part. That's irrelevant.
This is just software. You need a license for the inputs, and if the output infringes a copyright, that's on the user of the software. It's a significant risk of using these models carelessly.
by alganet on 4/13/25, 4:51 PM
Where is AI disruptive? If it is disruptive in some area, should we apply old precedents to something so radically new? (Rhetorical.)
Good fresh training data _will run out_. The entire world can't feed this machine as fast as it "learns".
To make a farming comparison, it's eating the seeds. Any new content gets devoured before it has a chance to grow and bear fruit. Furthermore, people are starting to manipulate the model instead of just creating good content. What exactly will we learn then? No one fucking knows. It's a power-grab free-for-all waiting to happen. Whoever is poor in compute resources will lose (people! the majority of us).
If I am right, we will start seeing anemic LLMs soon. They will get worse with more training, not better. Of course they will still be useful, but not as a liberating learning tool.
Let's hope I am not right.
by Calwestjobs on 4/12/25, 2:16 PM
"generate image of jack ryan investigating nuclear bomb. he has to look like morgan freeman."
(And do it quickly, before someone at FAANGM manually tweaks something that alters the result of that prompt.)
The problem is the opposite: is the "original" work's IP original in itself, or is it just a remix?
Or did someone just hand a lawyer some generic text and have it arbitrarily protected for adding 0.000000001% to a previous work?
by hulitu on 4/13/25, 8:26 PM
Because Microsoft is part of BSA. /s
If you steal our software, it is theft. If we steal your software, it is fair use. Can we train AI on leaked Windows source code?
by techpineapple on 4/12/25, 1:56 PM
What about textbooks? In order to train on a textbook, I have to pay a licensing fee.
by re-thc on 4/12/25, 12:59 PM
You might as well start by saying that the "cloud" means some computers really floating in the sky. Does AWS rain?
This "AI" or rather program is not "training" or "learning" - at least not the way these laws conceived by humans were anticipated or created for. It doesn't fit the usual dictionary term of training or learning. If it did we'd have real AI, i.e. the current term AGI.