by OuterVale on 4/12/25, 12:52 PM with 96 comments
by blagie on 4/12/25, 1:49 PM
Even if we accept the house of cards of shaky arguments this essay is built on, even just for the sake of argument, where OpenAI breaks my copyright is by having a computer "memorize" my work. That's a form of copying.
If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.
On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all other companies are breaking copyright law as written.
Usually, when reality and law diverge, the law eventually shifts, not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion about what it should look like. That's a big discussion. I'll make a few claims:
- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.
- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds
- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important I know what those are.
That's the bigger discussion to have.
by basch on 4/12/25, 1:47 PM
The author just seems to have decided the answer and worked backwards, when in reality this is very much a Ship of Theseus type problem. At what point does a compressed JPEG stop being the original image and become a transformation? The same thing applies here. If I ask a model to recite Frankenstein and it largely does, is that not a lossy compression of the original? Would the author argue an MP3 isn't a copy of a song because all the information isn't there?
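To make the lossy-compression point concrete, here is a toy round-trip sketch (a hypothetical illustration, assuming Pillow is installed and "photo.png" stands in for any local image):

    # Toy sketch of lossy compression: most pixels change in the
    # round-trip, yet the result is still "the" image, not a new work.
    import io
    from PIL import Image

    original = Image.open("photo.png").convert("RGB")

    buf = io.BytesIO()
    original.save(buf, format="JPEG", quality=10)  # aggressively lossy encode
    lossy = Image.open(io.BytesIO(buf.getvalue()))

    # Count how many pixels the round-trip altered.
    changed = sum(1 for a, b in zip(original.getdata(), lossy.getdata()) if a != b)
    print(f"{changed} of {original.width * original.height} pixels differ")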
Calling it "training" instead of compression lets the author play semantic games.
by TimorousBestie on 4/12/25, 1:55 PM
I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.
by hyperman1 on 4/13/25, 10:39 AM
Apart from that, I wonder if an AI is learning in the legal sense of the word. I'd suspect removing copyright through learning is something only humans can do, seen through legal glasses. An AI would be a mechanical device creating a mashup of multiple works, and would be a derived work of all of them.
The main problem with this rebuttal is proving that the AI copied your work specifically, and finding out which of the zillions of creative works in that mashup is owned by whom.
by gavinhoward on 4/12/25, 2:00 PM
Copyright law (in the US) includes fair use, which has four tests. Not all of the tests need to fail for fair use to disappear; usually two are enough.
The one courts love the most is if the copy is used to create something commercial that competes with the original work.
From near the top of the article:
> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.”
So essentially, the author admits that AI fails this test.
Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.
The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.
Edit: still reading, but the author talks about enumerated rights. Most lawsuits target the distribution of model outputs because that is reproduction, an enumerated right.
Edit 2: the author talks about substantive differences, admits they happen about 2% of the time, but then seems to argue that this means they are not infringing at all. No, they are infringing in those instances.
Edit 3: the author claims that model users are the infringing ones, but at least one AI company (Microsoft?) has agreed to indemnify users, so plaintiffs have every right to go after the company instead.
by djoldman on 4/12/25, 1:56 PM
1. acquire training data
2. train on training data
3. run inference on trained model
4. deliver outputs of inference
One can subdivide the above however one likes.
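As a toy illustration of those four steps (a hypothetical bigram "model" standing in for an LLM; nothing here reflects how any real product works):

    # 1. acquire training data (here: a hardcoded public-domain snippet)
    import random
    from collections import defaultdict

    corpus = "call me ishmael some years ago never mind how long precisely"

    # 2. train on training data (here: record which word follows which)
    model = defaultdict(list)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)

    # 3. run inference on trained model
    def generate(seed, length=8):
        out = [seed]
        while len(out) < length and model[out[-1]]:
            out.append(random.choice(model[out[-1]]))
        return " ".join(out)

    # 4. deliver outputs of inference
    print(generate("call"))

With a corpus this small, step 3 just regurgitates step 1 verbatim, which is exactly the memorization issue the lawsuits go after.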
My understanding is that most lawsuits are targeting step 4, delivering the outputs of inference.
This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.
The issue of whether it's legal to train on data to which one does not hold copyright is probably moot; businesses don't care much about what you do unless you're making money off it.
by EdwardDiego on 4/12/25, 1:54 PM
Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit up hill with the word "can't".
by ConspiracyFact on 4/15/25, 12:09 AM
“But,” you say, “human art is derivative too in that case!”
No. A human artist is influenced by other artists, yes, but he is also influenced by the totality of his life experience, which amounts to much more in terms of “inputs”.
by prophesi on 4/12/25, 2:42 PM
It sounds like a pipe dream, but ethical enforcement of AI training across the globe will require multifaceted solutions, and even those won't stamp out all bad actors.
by light_hue_1 on 4/12/25, 2:14 PM
Think of AI tools like any other tools. If I include code I'm not allowed to use, that's copyright infringement, just like reading a book I pirated. If I include an image as an example in my image editor, that's OK if I am allowed to copy it.
If someone decides to use my image editor to create an image that infringes a copyright or trademark, that's not the fault of the software. Even if my software says "hey look, here are some cool logos that you might want to draw inspiration from".
People are getting too hung up on the AI part. That's irrelevant.
This is just software. You need a license for the inputs, and if the output infringes a copyright, that's on the user of the software. It's a significant risk of using these models carelessly.
by alganet on 4/13/25, 4:51 PM
Where is AI disruptive? If it is disruptive in some area, should we apply old precedents to something so radically new? (Rhetorical.)
Good fresh training data _will run out_. The entire world can't feed this machine as fast as it "learns".
To make a farming comparison, it's eating the seeds. Any new content gets devoured before it has a chance to grow and bear fruit. Furthermore, people are starting to manipulate the model instead of just creating good content. What exactly will we learn then? No one fucking knows. It's a power-grab free-for-all waiting to happen. Whoever is poor in compute resources will lose (people! the majority of us).
If I am right, we will start seeing anemic LLMs soon. They will get worse with more training, not better. Of course they will still be useful, but not as a liberating learning tool.
Let's hope I am not right.
by Calwestjobs on 4/12/25, 2:16 PM
"generate image of jack ryan investigating nuclear bomb. he has to look like morgan freeman."
(And do it quickly, before someone at FAANGM manually tweaks something that alters the result of that prompt.)
The problem is the opposite: is the "original" work's IP original in itself, or is it just a remix?
Or did someone just hand a lawyer some generic text and have it arbitrarily protected for adding 0.000000001% to a previous work?
by hulitu on 4/13/25, 8:26 PM
Because Microsoft is part of BSA. /s
If you steal our software, it is theft. If we steal your software, it is fair use. Can we train AI on leaked Windows source code?
by techpineapple on 4/12/25, 1:56 PM
What about textbooks? In order to train on a textbook, I have to pay a licensing fee.
by re-thc on 4/12/25, 12:59 PM
You might as well start by saying that the "cloud" means some computers really floating in the sky. Does AWS rain?
This "AI" or rather program is not "training" or "learning" - at least not the way these laws conceived by humans were anticipated or created for. It doesn't fit the usual dictionary term of training or learning. If it did we'd have real AI, i.e. the current term AGI.