from Hacker News

Things are about to get worse for generative AI

by eddyzh on 12/30/23, 10:17 AM with 755 comments

by ctoth on 12/30/23, 5:01 PM
Everybody just buying into the corporate narrative that anyone can actually own these sorts of things.
Who truly owns the tales of Snow White and Cinderella?
These stories didn't originate with Disney; they are part of a rich tapestry of folklore passed down through generations. Disney's success was partly built on adapting these existing narratives, which were once shared and reshaped by communities over centuries.
This conversation shouldn't just be about the technicalities of AI or the legalities of copyright; it should be about understanding the deep roots of our shared culture.
At its core, culture is a communal property, evolving and growing through collective storytelling and reinterpretation.
The current debate around AI and copyright infringement seems to overlook this fundamental aspect of cultural evolution. The algorithms might be new, but the practice of reimagining and repurposing stories is as old as humanity itself.
By focusing solely on the legal implications and ignoring the historical context of cultural storytelling, we risk overlooking the essence of what it means to be a creative society.
As a large human model, (no really I could probably lose some weight) I think it's just silly how we're all sort of glossing over the fact that Disney built their house of mouse on existing culture, on existing stories, and now the idea that we might actually limit the tools of cultural expression to comply with some weird outdated copyright thing is just...bonkers.
by Havoc on 12/30/23, 12:08 PM
To me that’s the wrong question.
Everyone knew it was trained on copyrighted material and capable of eerily similar outputs.
But it’s already done. At scale. Large corps committing fully. There is no chance of that toothpaste going back in the tube.
It’s a bit like when big tech built on aggressive user data harvesting. Whether it’s right, ethical or even legal is academic at this stage. They just did it - effectively without any real informed consent by society. Same thing here - 9 out of 10 people on street won’t be able to tell you how AI is made let alone comment on copyright.
So the right question here is what now. And I suspect much like tracking the answer will be - not much.
by niemandhier on 12/30/23, 2:12 PM
Should not be a problem in the EU. Article 3 and 4 of the „ Copyright in the Digital Single Market“ Directive already regulate this.
Summary by Wolters Kluwer: […] Everyone else (including commercial ML developers) can only use works that are lawfully accessible and where the rightholders have not explicitly reserved use for text and data mining purposes.
AFAIK they are discussing something like a robot.txt to flag stuff as „not for training“. You will probably be expected to implement some safeguards and of course the end user will have to be careful in his use of the generated things.
Source at Kluwers: https://copyrightblog.kluweriplaw.com/2023/02/20/protecting-...
EU Legal Text: https://eur-lex.europa.eu/eli/dir/2019/790/oj
by koliber on 12/30/23, 2:26 PM
The responsibility for ensuring that copyrights were not violated fall on the person publishing the work. Whether they drew something themselves, hired an apprentice artists with no legal training to draw something, took a photograph of something, or used AI to create an image should not matter.
Why does anyone assume that ChatGPT or other tools would NOT produce previously-copyrighted content?
I can see a naive assumption that since it is “generated” it’s original. However that assumption falls apart as soon as you replace “ChatGPT” with “junior artist”. Tell them to draw a droid from a sci-fi movie, don’t mention anything else. Don’t say anything about copyrights. Don’t tell them that they have to be original. What would you expect them to produce?
by appplication on 12/31/23, 3:55 AM
There are an alarming number of responses seemingly completely unaware of the core thrust of the article (and NYT lawsuit). ChatGPT was able to reproduce and publish significant portions of NYT articles, completely verbatim for hundred-to-thousand word stretches.
It’s not derivative work. We’re way past that. NYT has an exceptionally strong case here and anyone arguing about the merits of copyright is way off the mark. This court case is not going single-handedly to undo copyright. OpenAI has very little going for them other than “this is new, how were we to know it could do this”. So knowing that, the currently trained models are in a very sticky situation.
Further, I don’t see NYT settling. The implications are too large, and if they settle with OpenAI, they will have a similar case pop up with every other model. And every other publisher of digital content with have a similarly merited case. This is an inflection point for generative AI, and it’s looking like it will be either much more expensive or much more limited than we originally thought.
A side effect of this: I am predicting that we will start to see a rise in “pirate” models. Models who eschew all legality, who are trained in a distributed fashion, and whose weights are published not by corporations but by collectives (e.g. torrent models). There is a good chance we see these surpass the official “well behaved” models in effectiveness. It will be an interesting next few years to see this play out.
by marckrn on 12/30/23, 12:22 PM
I might be a bit idealistic, but I've always believed that the core purpose of art and publishing should be to influence culture and society, not just to make a heap of money. That's why I feel original work needs its protection, but it should enter the public domain much sooner to fuel creativity and inspiration. We should be thinking in terms of a few years for this transition, not decades.
by keiferski on 12/30/23, 12:01 PM
These don't seem all that difficult to fix to me. Most of the examples are not really generic, but are shorthand descriptions of well-known entities. "Video game plumber" is practically synonymous with "Mario" and anyone that has the slightest familiarity with the character knows this.
Likewise, how difficult is it to just use descriptive tools to describe Mario-like images [1] and then remove these results from anyone prompting for "video game plumber"?
1. The describe command can describe an image in Midjourney. I imagine other AI tools have similar features: https://docs.midjourney.com/docs/describe
by WhiteNoiz3 on 12/30/23, 2:21 PM
As I understood it, the legal precedent for generative AI is the same one that allows google to scrape websites in order to index them for search for the common good. Google also can display cached versions of websites which is the original content of those sites. No one is going to say that google is copyright infringement just because it is showing content from other websites verbatim. So I think this is a weak argument. AI would be useless if we had to scrub all cultural references and popular IP's (even not so popular ones).
Personally, I think generative AI should be able to provide links to similar source material in the training data.. This would be the barest way to compensate those who have contributed to training the AI. I don't think generative AI is sustainable in the long term if it ends up killing all the websites/artists that created the original material. Plus I think having sources adds a layer of transparency and aids users in understanding when content is hallucinated vs. not. People should be able to opt out of having their content used for training and be able to confirm that it has been removed for future iterations. Let's be honest that AI companies are just trying to avoid lawsuits by keeping it secret. These are areas where I think regulation can help rather than worrying about doomsday scenarios.
by preommr on 12/30/23, 12:01 PM
We need clearer laws that only apply to Generative AI. Too many comparisons and parallels are being drawn to actual people. "Like what if someone learned how to draw by watching trademarked material, and then accidentally produced it" But these models aren't people and they exist in a category of their own.
I do think it's somewhat trademark infringement by these models, also that it should be allowed and that ultimate responsibility should be on the person using the images in a final work meant for consumption by the general public as stand alone media.
by FridgeSeal on 12/30/23, 2:45 PM
I am beginning to think that in these discussions these models are functioning more like an obscuring factor than anything else and the discussion is getting bogged down in that, and not the crux of the argument.
They’re giving people plausible deniability in the “chain of responsibility”, and I think if we took away “LLM” and replaced it with “fairground sideshow magic box” the argument that LLM’s are somehow special and deserving of exemptions disappears real quick.
by dang on 12/30/23, 6:43 PM
Related ongoing thread:
NY times is asking that all LLMs trained on Times data be destroyed - https://news.ycombinator.com/item?id=38816944 - Dec 2023 (93 comments)
Also:
NY Times copyright suit wants OpenAI to delete all GPT instances - https://news.ycombinator.com/item?id=38790255 - Dec 2023 (870 comments)
NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (84 comments)
The New York Times is suing OpenAI and Microsoft for copyright infringement - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (861 comments)
The Times Sues OpenAI and Microsoft Over A.I.’s Use of Copyrighted Work - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)
by kranke155 on 12/30/23, 1:48 PM
The generative AI rollout has taught me what happens when the interests of the many intersect with the destruction of the few.
You get steamrolled for defending yourself while you overhear above applause to those who have robbed you of your future.
by aimor on 12/30/23, 4:59 PM
I did an interesting thing and looked at how well the Llama2 models could compress text. For example, I took the first chapter of the first Harry Potter book and recorded the index of the 'correct' predicted token. The original text, compressed with 7zip (LZMA?) to about 14kB. The Llama2 encoded indexes compressed to less than 1kB. Then, of course, I can send that 1kB file around and decode the original text. (Unless the model behaves differently on different hardware, which it probably does)
What I get from this is that Llama2 70B contains 93% of Harry Potter Chapter 1 within it. It's not 100% (which would mean no need to share the encoded indices) but it's still pretty significant. I want to repeat this with the entire text of some books, the example I picked isn't representative because the text is available online on the official website.

by beginning_end on 12/30/23, 11:34 AM

This perspective on regulation was interesting: https://drafts.interfluidity.com/2023/12/28/how-to-regulate-...

    "Congress should declare that big-data AI models do not infringe copyright, but are inherently in the public domain.

    Congress should declare that use of AI tools will be an aggravating rather than mitigating factor in determinations of civil and criminal liability."

by wslh on 12/30/23, 5:57 PM
While different, I find this discussion about AI and copyrights as an evolution of the war that never was: Google/FB converting in the portal/proxy for content and while it is not generative AI you can find copyrighted images just using Google Images or as an snippet in the normal search engine. I mention Google because it is the de facto monopoly but this applies to a lot of aggregators.
I know we are talking about different technologies but it seems all these people were very silent and find some opportunity in having this war with OpenAI (not an endorsement) but not fighting others.
I am not making an statement about the morals of AI and aggregators/search engines (super interesting discussion that in a way was happening for long) but I am surprised that organizations are "just" waking up. It seems they just see it is a much simple and cheap fight.
by clbrmbr on 12/30/23, 12:58 PM
Am I the only one believing that copyright has long outlived its usefulness? After all, copyright is not some natural law or mathematical consequence, but rather a social convention that made sense in the era of the printing press.
by rmholt on 12/30/23, 1:00 PM
I feel like the outcome is obvious, there will be a finite list of IPs who's owners have enough money to actually sue, which will get filtered out of the output of publicly available models. They will just slap a detector model on the end of the generator to filter them out.
Private models will not care, nor will things change for IP owners with lesser power.
by CTmystery on 12/30/23, 11:41 AM
> My guess is that none of this can easily be fixed. Systems like DALL-E and ChatGPT are essentially black boxes. GenAI systems don’t give attribution to source materials because at least as constituted now, they can’t.
Is it necessary to fix in the model itself? It seems a gate in the post processing pipeline that checks for copyright infringement could work, provided they can create another model that identifies copyrighted work (solving the problems of AI with more AI :/)
by AlienRobot on 12/30/23, 12:40 PM
An argument I've seen made in pro of AI in past threads about this is that "scraping is legal."
Yeah, downloading the content of a webpage may be legal, but redistributing it isn't.
I wish people stopped trying to make these things seem more important than they really are just because IT people call them "technologies". Blockchain isn't a technology. HTML isn't a technology. React isn't a technology. And AI is now not a technology.
When I see ChatGPT or OpenAI, I don't think of "technology". I think of a program. Software. Because that's what it is. You don't say "none of the laws that exist in this world apply to this" every time you release new software.
I bet many people can't tell the difference between a quick answer from Google and a text generated by ChatGPT on Bing. They just see the output.
All that amazing capability of generative AI? That got old fast. It was groundbreaking for one instant. Now it's just an app that generates images. Just another piece of software. Nothing special about it.
Torrenting and other p2p file transfer protocols didn't get a pass for inventing groundbreaking ways to break the law. I don't think OpenAI will get a pass for doing the same.
by davidy123 on 12/30/23, 12:23 PM
The solution could be great. I really don't like the way culture always goes to the same tropes, calling any potential innovation "out of Star Trek" (with attendant distorted expectations), right down to expecting an interface based on literal hand-waving in Minority Report. If copyright held works ("USS Enterprise") could be removed, yet the actual essential concepts (space ship, naming things) retained, it would be a tremendous breakthrough.
I think what NYT &c want is for large companies like Apple to pay them for access to their works. This to me is the wrong path, just leading to more silos and walled gardens, special access for the elite.
An alternative is base models trained on Wikipedia and public domain (science journals, etc). Foundations could support high quality, well rounded current events reporting. Wikimedia provides a good model for this, with referenced summaries that I don't think can be said to reasonably violate copyright. The models would need to be improved to support references, or RAG attribution would have to be widely used when bringing in works that have a current copyright.
by dawnim on 12/30/23, 11:49 AM
This feels like another area where piracy will surely be superior in case things like this land on the disallowed side of regulation. The model trained on all data will outperform the model trained on a legal subset of data. Whether or not you use it to produce potentially infringing content is another point. Performance will likely improve from having references to copyrighted material and people capable of doing so, myself included, would probably prefer to interact with the non limited model. Perhaps time to update the laws or at least move liability from the creator of the model to the user. No one is going after pencil makers but I can draw a pretty good Mickey Mouse with access to one. Feels like me generating C3P0 and claiming ownership is my problem, not OpenAIs.
by pointlessone on 12/30/23, 12:21 PM
If any of those results would be deemed infringing we can bid farewell to all fanart ever. Likewise, to all fanfiction. Or any original work that was merely heavily inspired by previous works. Like a lot of modern fantasy is basically Tolkien fan fiction. Or is Gandalf close enough to Merlin to claim prior art that is in public domain?
by Aerroon on 12/30/23, 11:56 AM
Aren't some of the examples basically asking for that content?
Ask someone about two Italian brothers in a video game with a red and green hat that have M and L on them. What do you think you would get?
If I describe "imagine a comic book duck that swims in a sea of gold in his vault" you would immediately think of Scrooge McDuck, no?
by mensetmanusman on 12/30/23, 1:27 PM
The world is a big place.
China can't produce LLMs because of inconvenient truths.
The US can't produce LLMs because of copyright.
Decentralized open source LLMs might exist that could work, but they won't have the giant GPU clusters.
A rich country with lax rule of law wins? Maybe that's why Sam went to the Saudis?
by jpeter on 12/30/23, 11:44 AM
If I prompt "golden droid from classic sci-fi movie", what else am I asking for if not Star Wars?
by bambax on 12/30/23, 12:02 PM
This only mentions ChatGPT (and M$ by association) but how would this impact "open" models? Even if their makers are somehow prevented from updating them, the models themselves are already in the wild...?
by redcobra762 on 12/30/23, 1:56 PM
This operates similarly to importing an image into Photoshop. You can do whatever you like with images privately, or with gen AI, but the game ends when you try to use those images commercially.
Not sure how this “gets worse” or better for anyone. The current state of things seems generally fine, and there’s a real possibility the courts see it that way too.
by continuational on 12/30/23, 11:38 AM
(Asking Dall-E about the bot image in the article)
Me: Who owns the rights to this bot?
Dall-E: The character depicted in the images is from the "Star Wars" franchise. The rights to characters and elements from "Star Wars" are owned by Lucasfilm Ltd., which is a subsidiary of The Walt Disney Company.
Perhaps it is able to tell, if you ask it?
by ponorin on 12/30/23, 2:21 PM
this is exactly what i predicted: the current generative ai is basically rewarded based on how much it convinces people to be a real thing. it very much has the ability to copy verbatim unlike how most human memories work. without fundamental shift in the methodology of machine learning the fault can only be hidden, not solved. a cat and mouse game where one cat has to fight tens of thousands of mouse. it's also very telling how the discussion quickly turns into "maybe society needs to adapt" when so called technological innovation is involved. copyright problem should be solved for artists, not for datacentres. for now it's a handful of famous IPs, but what's stopping from generative ai to snatch some random indie artist's property and copying it ad infinitum?
by smrtinsert on 12/30/23, 5:42 PM
The NYTimes case is a clear one because they are delivering nearly the same content as an end product to users. The others seem like dead ends. The infringer would be the prompter, not the AI which operates more like a search engine. This is Napster all over again, what a phenomenal waste of time and money, where the artist will definitely come out with 0 at the end of it and a few corporations control everything - not to mention, there's nothing stopping anyone from releasing a tool that will crawl all spongebobs, generate your model for you and allow you to produce locally copyright infringing material it to your hearts content locally. You could drown yourself in local spongebobs.
by DigitallyFidget on 12/30/23, 4:41 PM
Per United States law, imagery/art/music/text/photography generated by non-human means (such as machinery, animals, or generative AI) cannot hold copyright. https://copyright.gov/comp3/chap300/ch300-copyrightable-auth... Section 306 on page 7.
I'm not sure how it'll hold up in law to claim copyright violations against something that wasn't created by a person. It'll really depend on the lawyers and judge's interpretation of written law. But I'm curious to see what comes of this.
by 1shooner on 12/30/23, 4:35 PM
Imagine a future where copyright registration involves contributing your IP to a public adversarial model, which is then a regulated layer in future generative model licensing.
by Hugsun on 12/30/23, 12:02 PM
There are good arguments for the copyright infringement belonging to the user, not the model maker, in this thread.
One issue with that is that there is not a reliable way to determine if copyright is being infringed.
Even if models could be used responsibly, there might not be a reasonable expectation that most people will. If infringement is so easy and avoiding it relatively hard.
I'm not sure what legal prescriptions should be made on this basis, but it's an interesting thought.
by golol on 12/30/23, 2:46 PM
How about this: Image generators should be treated like random google image search. They sample randomly from the distribution of publicly viewable images. Google does it exactly while Image generators do it in an interpolative way. Google images produced copyrighted works most of the time, an image generator only sometimes. Neither should be liable if someone sells a copyrighted work that was produced to someone else.
by shkkmo on 12/30/23, 7:43 PM
It seems like this article makes a basic copyright mistake. I don't see any evidence that these are " reproductions" of source material like since no source image is linked to compare.
Instead, these are derivative works. We already have a flourishing culter of derivitave works, such as fan art that exist in various shades of legal greyness.
Some derivative works are fair use, some are not.
The position of the Author here seems to be that generative AI should not be capable of creating any derivitave works, or should only be able to do so it it can accurately identify which are fair use and which aren't (which seems like an impossibly tall bar.) This stance seem like a giant attack on fair use that significantly expands the power of copyright.
To me, the takeaway from this is different. This makes clear that there is currently a risk when using AI generated art that you could end up unintentionally creating and publishing a derivative work unintentionally and thus without evaluating if that work constitues fair use.
by qgin on 12/30/23, 5:37 PM
Things are about to get a lot worse for generative AI in the United States
They are about to be infinitely better for generative AI in China.
by karmakaze on 12/30/23, 7:17 PM
It shouldn't matter how the images/etc are created. The problem comes about when it's used as an original work by the person that's doing so.
Imagine instead of AI/ML, we have a mechanical-turk-like service that produces output from descriptions. The service makes no claims that the generated outputs are not similar to any copyrighted works. The only claim the service makes is that they themselves claim no copyright on the output. It's then up to the user of the service to determine if the output is suitable for their intended use.
Whether such a service itself is legal is a separate matter. For that matter, say you outsourced the artwork to a person who again gave you infringing work. The user of that output is still in violation. With AI/ML we're basically outsourcing to a 'service' that is known to sometimes output copyrighted work so with the user knowing that, are responsible for fair usage.
by docdeek on 12/30/23, 12:11 PM
How is this different to Googling “robot cop” or “video game plumber” and being served copyrighted material?
Is it because Google will link to the image source? Or does the infringement begin when I use the image for gain, or claim it as my own? Perhaps it is because Google was allowed to crawl the page with the original image, so presenting them with a link is fine?
by legendofbrando on 12/30/23, 4:11 PM
Surely one answer is to train (or aggressively fine-tune) a new model that doesn’t (or refuses) to produce these outputs and then - as exists already, augment that model’s understanding of copyrighted material by having it Bing/Google search as a RAG process that requires the end user to log into accounts at the New York Times (and other accounts) with their paid sub. This broadly replicates the process a person could do today when they read the internet and summarize it while paying rights holders.
Expensive to do but hardly the end of Generative AI or OpenAI should that be the difference between having a business or being sued out of existence. Never underestimate people who have a clear economic interest especially when their own existence is at stake.
by sjducb on 12/31/23, 1:06 PM
I think it’s a question of what counts as publication.
I think that an AI model is analogous to an employee. Imagine I ask my employee to write an article, and they just copy an existing one from the times. That’s plagiarism and bad work, not copyright infringement.
If I then decide to publish the plagiarised article, then I have committed copyright infringement.
I once ran into this exact problem with a human. I hired a designer to make some artwork for an app. When I launched the app it turned out that the human had just copied the artwork from another game. It’s my problem that I hired an idiot, and my problem that my app was infringing the copyright of another app. (We redesigned the graphics very quickly)
by jlnthws on 12/30/23, 7:31 PM
We could get inspiration from the case of the record industry against Napster, or cabs VS Uber. Both parties are somehow abusing their position, but the world is moving on. Rent seeking is probably not an absolute source of wealth after all.
by null_point on 12/31/23, 2:56 PM
I suspect this may delay some short term progress by creating pressure on AI labs to train their models from data curated or synthesized in a way that is contentious of copyright law.
There is already troves of data that are fair game for training, but even "corrupted" data sets can probably be used if used intelligently. We've already seen examples of new models effectively being trained off of GPT-4. That approach with filters for copyrighted material might allow for data that is sufficiently "scrambled". Not to say building such a filter is definitely easy, but seems plausible.
by KETpXDDzR on 12/30/23, 11:40 PM
I'd expect "Open"AI et al to lobby heavily towards an "AI-generated content is excluded from copyright infringement". I think it's possible that they'll introduce a "generative AI" tax. Charge x cents per generated text/image and distribute the fund to all media companies.
In Germany you pay some amount extra on top of the sales price of anything that can store data (CX, DVD, USB sticks, HDDs, ...). This is then distributed to all companies that could be impacted by software piracy. I'm still not sure if that's legal considering the Geneva convention disallows collective punishment.
by airesearcher on 12/30/23, 12:06 PM
I think there is another way to solve this. Someone should train an LLM on copyrighted images. Then use that as a second pass on any image generated by the primary LLM to check if it might contain copyrighted images, and blur the copyrighted parts(or change them sufficiently).
Another change could be to the license agreement of LLMs - they could have the user assume liability for any material produced instead of the provider assuming liability. The user would agree that getting the rights for any copies and distribution of copyrighted materials is their sole responsibility instead of the provider.
by 8note on 12/31/23, 7:17 AM
"from classic sci-fi movie"
How could you put that as the prompt without intending to infringe? Anything pulled from a classic sci-fi movie would be infringement. The term droid is also star wars specific?
Id consider the "red soda" one as grounds that the Coca-Cola brand has become generic and that it's synonymous with soda. Same thing with Mario too. There is so much non-nintendo content made featuring Mario the plumber that you could get that without training directly on Nintendo's artwork
by wouldbecouldbe on 12/30/23, 12:12 PM
What about non-mit source code, 100% it's trained on those as well.
by asylteltine on 12/30/23, 2:21 PM
I certainly hope so. You can’t just steal content and call it “””AI”””
by josh-sematic on 12/30/23, 1:44 PM
Gary Marcus is growing his subscriber base using images of copyrighted IP (C3PO, Mario, etc.). Fair use? Then why is the tool he used to produce those materials not also fair use of the IP? My take is that either we say the models are like people (do we penalize people for learning from IP and letting that influence what they subsequently produce?) or we say they are like tools (do we penalize Adobe because Photoshop makes it easier to make a picture of Mario on the Death Star?).
by ur-whale on 12/30/23, 3:22 PM
It's not for generative AI that thing are about to get a lot worse.
It is in fact the very notion of Copyright is breathing its last breath, and it is fantastic to be alive to see it happen.
by dmbche on 12/30/23, 5:50 PM
Hey so the problem isn't the output of the LLMs but the input - the data they are trained on is stolen (big suprise, you can't claim fair use when using something commercially, like training your LLM).
The output is irrelevant.
Edit1: If you want to verify this, check out all the lawsuits against AI companies : it's always about using their copywritten goods. Any discussion about the output is to talk about the amount of damage done to the copyright holder, not if damage exists or not.
by roenxi on 12/30/23, 2:23 PM
Based on the rate of progress; I think this makes little difference to AI progress in the medium-long term.
At the moment, we don't have hardware that can do what humans do (process video feed from eyeballs and build up a world model). I imagine that we'll cross that barrier cheaply in the coming decades, at which point copyright becomes moot. AIs will be able to develop their own styles and world understanding from scratch, then generate original work.
by Paradigma11 on 12/30/23, 1:35 PM
So, whats the plan?
Content creators/artists compete globally. The only thing harsh regulations will do is create an unlevel playing field where artists from noncaring countries will have big advantages over artists from the west, which will be driven into illegality to compete.
In the end products will have to be classified anyway if they are infringing on copyright and/or were being built by an LLM. Most likely automated by another LLM.
by nojs on 12/30/23, 12:15 PM
In practice, what happens next when websites all start to block openai by default (or change their TOS to disallow OpenAI’s crawlers)?
It seems like there’s little incentive not to do this, because unlike Google OpenAI isn’t bringing any traffic or eyeballs. It may end up being a default setting in Wordpress for example.
But OpenAI presumably can’t afford to pay every single long tail source of content on the whole internet — so how does this end?
by zarzavat on 12/30/23, 11:52 AM
This for me does not make sense as a copyright violation. It’s like saying that Adobe is in trouble because you drew something infringing in Photoshop. If you prompt the model with the intention of creating something infringing by mentioning the name of the characters and the work, and you get something infringing out, then it’s you who have infringed the copyright, not the maker of the tool.
by digitcatphd on 12/30/23, 2:45 PM
Rather than attempting to combat our obvious future, they should spend this effort to find ways to monetize and succeed in this new environment.
by hahajk on 12/30/23, 2:56 PM
> And a whole universe of potential trademark infringements with this single two-word prompt: animated toys
If you flood the market and dominate children's culture with toys from your TV shows, you absolutely cannot complain when your toys are considered iconic enough to be the generic "animated toy". These images don't replace or substitute the things they are depicting.
by karmakaze on 12/30/23, 7:28 PM
The real 'problem' is how do we navigate the present and near future where much more than physical labor is being automated? This is where we need sustainable solutions. The rough road on the way should also be smoothed out so as not to disrupt so many lives, but it's good to keep a perspective what and why we're doing these things.
by SubiculumCode on 12/30/23, 11:52 AM
Attribution weights could be the basis of new type of copyright asset licensing scheme. For all those tech employees who fed the company's model, a license in perpetuity to at least a portion of that value...but only if you fight for it. They are training to replace you, watching your every move, your thought processes, ready to make you a function call.
by efields on 12/30/23, 1:39 PM
It’s more interesting to me how these entities that operate the models start making money from them. They are a money pit and there’s not enough $20/month subscribers on earth to support them.
Enterprises that make content with this also don’t want to infringe on copyright. The AI companies don’t have a good story here. The value has not become evident after years.
by tim333 on 12/30/23, 7:56 PM
They are just going to have to inform the AI in some sense of the current copyright situation and ask it not to infringe.
It's the same for human writers. If you are writing an article for Wikipedia say, you should read relevant source articles and then rewrite in a way that isn't a copy and paste beyond a few words.
by _giorgio_ on 12/30/23, 4:45 PM
This guy built a career around nonsensical and catastrophic endings.
Everything that he sees has mysterious flaws that never happen.
by intrasight on 12/30/23, 12:56 PM
Just make LLMs be like your average human and forget details. I know that it's easier to say than to do, but so are many things worth doing. I can't plagiarize - my language and visual memory doesn't work that way. Such an LLM will have to "create" and answer from more fuzzy memory.
by caeril on 12/30/23, 2:53 PM
Wow. I feel really sorry for these giant corporations who have wielded armies of lawyers against fanfic artists to prevent fair use, and to prevent trademarks and patents from expiring on the timelines enshrined by law.
Can we all have a moment of silence for poor Bob Iger? Maybe we can start a GoFundMe to help him out?
by rolisz on 12/30/23, 12:33 PM
Simple fix (at least for ChatGPT): ask it to avoid drawings with similarities to copyrighted characters.
by t_mann on 12/30/23, 12:05 PM
The article kind of amplified my regrets/anxiety for not getting a copy of books3 and the likes while it was easy. I didn't have an immediate use case, and I don't now, thought I'd wait until actually need it, but it feels like a window is closing here.
by logicchains on 12/30/23, 11:44 AM
I predict this could be a boon for generative AI because restricting it to training on copyright-expired media would produce a higher quality training corpus, as low-quality material from so long ago is unlikely to have been preserved, leaving only higher-quality material.
by vimax on 12/30/23, 11:45 AM
Maybe Disney and the record labels shouldn't be claiming so much of public culture as their own.
by Alifatisk on 12/30/23, 11:59 AM
Did ClosedAi (OpenAi) ever confirm or deny that they trained their models on copyrighted materials?
by goertzen on 12/30/23, 6:26 PM
No they are not.
This is a negotiation tactic by the NYT to drive up the licensing price. Period.
The Napster/Music Industry analogy has no resemblance to this situation.
The only meaningful question that might be answered as a result of this is, what permission and access rights do crawlers have to content that is publicly and legally available.
by quonn on 12/30/23, 12:18 PM
Maybe the way to go is to do pre-training on copyrighted data, then to thoroughly shake things up so that hopefully only some useful abstract structure of world knowledge remains and then train that on carefully selected licensed data.
by airstrike on 12/30/23, 5:44 PM
I have no patriotic skin in the game, being neither American, nor European, nor Chinese, but this copyright issue seems overblown to me and like the perfect way to hand the leadership in generative AI over to China
by ultrablack on 12/30/23, 3:33 PM
We are all trained on copyrighted input. That is not a problem. What is a problem is if you reproduce it and try to claim copyright for that. If someone wants to create their own image of Mario in an AI, so what?
by amai on 12/30/23, 6:45 PM
Should the NYT not sue https://commoncrawl.org/ ? OpenAI just used the data from commoncrawl for training.
by Avicebron on 12/30/23, 12:50 PM
I'm surprised this is presented as a revelation? I did pretty much this same experiment ages ago as part of a suite of tests comparing the efficacy of different sized models..
by renewiltord on 12/30/23, 11:52 AM
You can try, but I have Mistral on my local computer and it doesn't need the Internet. And people have pirate dumps they're going to run this stuff through.
I'll just do it myself.
by amelius on 12/30/23, 12:09 PM
Just like we have the uncanny valley for robots, LLMs are in the unoriginality valley. Only when we get out of it will the copyright issues go away.
by smitty1e on 12/30/23, 12:01 PM
The DALL-E/*GPT revolution sounds like the death of personal and corporate property.
That's gonna leave a Marx[1].
[1] https://youtu.be/7WDKivqFOgA?si=nWq5aeKA4dLytX3Z
by ofslidingfeet on 12/30/23, 11:36 PM
I'm still waiting for people to figure out the whole point of an automated process is that it behaves the same way each time.
by penjelly on 12/30/23, 11:46 AM
> My guess is that none of this can easily be fixed.
also my concern, except it feels like many of LLMs "problems" cant be easily fixed
by zanfr on 12/30/23, 4:41 PM
no matter how you look at it; the cat is out of the bag. OpenAI could be censored but you can't censor the opensource
by Log_out_ on 12/30/23, 4:00 PM
That sound, as if layers and layers of renteering aristocracy were forced to work again against their will.
by AC_8675309 on 12/30/23, 3:16 PM
So the models overfit the training data, essentially memorizing, instead of generalizing?
by wayeq on 12/30/23, 7:24 PM
We need to figure out how to ever so gradually move toward a post-copyright economy.
by SKILNER on 12/30/23, 6:29 PM
I don't understand the glee so many people have over this. I love being able to use Generative AI tools. How is it different than if I asked a person to draw these pictures for me? I know someone will gleefully clobber this question with a legal answer, but God, let's move forward, hunh?
by throwuwu on 12/30/23, 5:16 PM
Copyright is fucked. Even if Open AI somehow loses this and has to delete GPT4 and their training data, the generative AI cat is so far out of the bag that it’s gone on to live a full life and have many grandkittens. It’s already easy to install and run generative models and it’s just going to get easier and the models will keep getting better. These lawsuits are futile and won’t matter in 2 years or less.
by RecycledEle on 12/30/23, 7:34 PM
If we get rid of unconstitutional copyrights in the US, this ges away.
Recall that according to the US Constitution, copyright can only be on on "science and the useful arts."
Alternately, we could restore a reasonable limit to the duration of copyrights, like 14 years.
by pxoe on 12/30/23, 3:11 PM
there's an easy fix. the easiest. just don't use data that you don't have the rights to use. apparently that's just impossible.
"but what if we want to scrape the entire web and something makes it in anyway? see, that is impossible". well that's just saying "fuck it" and using bad data anyway. that's not an actual effort to "not use data you can't use" - there was just no way there'd be a 'rights cleared' way to use the entire web anyway. that is impossible. using a clean dataset is not impossible. it's very possible.
by RandomGerm4n on 12/30/23, 11:57 AM
Perhaps we should simply take this as an opportunity to finally abolish copyright. Smaller artists mainly earn their money with commissions. They are paid to do a very specific thing. Whether there is a copyright on the result is irrelevant. Someone else who would "steal" the image and use it without payment would apparently have fewer requirements. The person could have simply taken any AI image. Therefore, the artist in the scenario would not receive any money from the second person anyway.
Apart from this, it is mainly large companies that benefit from copyright laws. Why should we have laws that restrict progress just so large capitalist companies can maximize their profits?
by skybrian on 12/30/23, 1:26 PM
I wonder what Adobe Firefly does with these prompts?
by oglop on 12/30/23, 3:13 PM
So what? I feel like I’m taking crazy pills when I read these things. You all do realize the same thing happens in your mind with those same prompts right? That’s kinda how it works. Who is surprised by this? Yeah no shit it can kinda reproduce the text it was trained on, so do I! That’s how that works. And the NYT knew for a long ass time this thing was ingesting. Literally saw this in the marketing when I signed up last year.
I wasn’t shocked when I noticed I could query it about ANY math textbook I owned and it could talk with me about it. I did t bitch and gripe, I enjoyed it and have conversations.
Anyway, I’m in the minority I guess. I love that I can talk with it about books and news.
by freddealmeida on 12/30/23, 11:59 AM
not in japan.
by Joel_Mckay on 12/30/23, 12:24 PM
If ML cannot create copyrightable or patented material under current legal precedent, than shouldn't the prompt output be considered public domain regardless of content semblance?
The paradox should still violate Trademarks due to similarity, but likely cannot infringe on copyright content under prior legal opinion... if at least 80% different from prior art. The lawyers are likely going to have to do a special firm survey to figure this one out.
Bag of popcorn ready =)
by yieldcrv on 12/30/23, 5:21 PM
a lot worse for cloud providers hosting generative AI
the models can be fine
by gfodor on 12/30/23, 5:58 PM
Gary Marcus is the master of AI FUD
by octacat on 12/30/23, 5:14 PM
I am expecting politicians would do some nice mental gymnastics regarding regulating this. All major IT companies are doing genai now and nobody wanna hurt the companies.
by Intox on 12/30/23, 11:44 AM
Or... things are about to get worse for copyright holders.
I don't see any developped country pressing the brake on AGI in the near future to protect a few copyright holders from getting "stolen" in hypothetic scenarios.
by Baldbvrhunter on 12/30/23, 10:34 AM
I imagine the argument might be like this:
I hire a session musician to play on my new single, paying him $100. I record the whole session.
I ask him to play the opening to "Stairway to Heaven" and he does so.
"Well, I can't use that as a sample without paying"
"Ok play something like Jimmy Page"
"Hmm, still sounds like Stairway to Heaven"
"Ok, try and sound less like Stairway to Heaven but in that style"
"Great, I'll use that one"
and I release my song and get $5,000 in royalties.
Should I be sued for infringement, or the guitarist?
The problem, I suppose, is that if I had said "play something like 70s prog rock" and he played "Stairway to Heaven" and I didn't know what it was and said "great, I'll use that".
Should I be sued for infringement, or the guitarist?
by iainctduncan on 12/30/23, 5:27 PM
I am constantly suprised by the amount of apologizing for generative AI infringement here. The fact that it's already being done and is a technical breakthrough is irrelevant to existing copyright law. "We are big and innovative" may hold weight with legislators, but it won't with the courts.
Remember when everyone and their dog discovered sampling in the late 80's and they all thought they could get away with it because it didn't seem like infringement to the samplers? The courts had no qualms about slapping record labels for putting out records with unlicensed samples in them. Albums even got pulled off shelves while licenses were sorted out.
These companies are charging for a service that returns copyrighted content, full stop. You can't do that whether you are AI or someone drawing Mario and selling the pictures on iStock, or putting out records that sample someone else's work without permission. It took a while in the case of sampling, but it sure as hell happened.
by sjfjsjdjwvwvc on 12/30/23, 12:05 PM
Please ban all these AI companies, at this point I have enough OSS models, don’t really need any hosted service anymore.
IMO would be best if this stays a highly illegal technology that is only available to a few weirdo nerds /s
by jdjdjdkdksmdnd on 12/30/23, 12:09 PM
people are so naive. AI is a matter of national security now. its over. they exposed civilians to nuclear radiation for the nuclear bomb. and you think the state would let this get in the way of the AI arms race which they are anxiously anticipating? nope
by whodidntante on 12/30/23, 1:43 PM
Simple solution, when gpt-5 comes out, just rename it Claudine, and the NYT will drop their suit