from Hacker News

New York Times considers legal action against OpenAI as copyright tensions swirl

by 8ig8 on 8/16/23, 10:40 PM with 369 comments

  • by pimterry on 8/17/23, 8:44 AM

    Honestly, I think generative AI losing a massive copyright showdown is inevitable at this stage.

    It's extremely easy to get the latest generation of AIs to produce outputs that, in many fields, would without AI trivially be considered IP infringement.

    While there are many interesting and reasonable legal & technical arguments that it's not, the result completely undermines copyright protections regardless. If that's accepted at scale, copyright in practice will change completely. In effect, the choices are "block this, or entirely destroy copyright protections in many industries". You can't allow this without eventually allowing everybody to simulate their own NY Times reporters, produce their own Marvel movies, and create their own Taylor Swift albums.

    If you do allow that, the many, many affected industries face catastrophic problems.

    Problematic though copyright laws are, I see no world where all those protections go away any time soon, so if the courts don't protect copyright in this scenario under existing law, legislation will eventually be passed to make that happen. AI consuming copyrighted data and producing an output has to be considered a derivative work (or indeed, the model itself will be considered a derivative work), or IP protections are effectively broken.

    There's a grace period now while we work our way there, but the politics are pretty clear, and with no plausible path to "let's drop copyright completely" any time soon, I just don't see any other result in the medium term. That doesn't mean the end of generative AI by any means, just a slowdown as we move to a world where you need to negotiate rights and buy data to feed it first, instead of scraping everybody else's for free.

  • by ojosilva on 8/17/23, 9:29 AM

    IANAL, but copyright protections are pretty much tied to content and format, not to the idea itself, with the intent of preventing (or putting a price on) the copying of original works. The Times will have a very hard time proving that its content is being re-marketed by OpenAI; having a competing product based on your ideas is not, by itself, copying.

    Compare:

    "Steve Jobs [was] a tyrant": https://www.nytimes.com/2011/10/07/technology/steve-jobs-def...

    Against:

    "Whether to describe SJ as a tyrant is a matter of perspective...": https://chat.openai.com/share/28633f0c-007f-48b6-a615-1581c3...

    The general way LLMs work does not preserve content in its original form: the ideas it contains are extracted and clustered statistically. As an ELI5 refresher, an LLM reads 2 million NY Times articles and records that after the word "Steve" there are a lot of "Jobs", followed by a lot of "was a genius/tyrant", "founded Apple", etc. Then the LLM answers the user's question "Who was Steve Jobs?" using this complex net of token/word stats (see the toy sketch below). Is that fair use? I think OpenAI's lawyers will not even touch the fair use question; they will simply state that no copy happened, just a statistical collection of words from various sources.

    And importantly: no single LLM source is really prevalent, so the end result cannot even be traced back to the source, especially if multiple similar news sources are fed into training. I have no idea how the Times is going to prove that it's _their_ news.
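
    A toy illustration of that "statistical collection of words" idea, in Python. This is only the bigram intuition from the ELI5 above, not how a real transformer works (transformers learn weights rather than storing raw counts):

        from collections import Counter, defaultdict
        import random

        # Hypothetical toy corpus standing in for "2 million NY Times articles".
        corpus = (
            "steve jobs was a genius . steve jobs was a tyrant . "
            "steve jobs founded apple . steve jobs was a visionary ."
        ).split()

        # Record how often each word follows each other word (a bigram model).
        following = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            following[prev][nxt] += 1

        def generate(start, length=8):
            """Produce text by repeatedly sampling a likely next word."""
            word, out = start, [start]
            for _ in range(length):
                candidates = following.get(word)
                if not candidates:
                    break
                # Sample proportionally to the observed follow-up frequencies.
                word = random.choices(list(candidates),
                                      weights=list(candidates.values()))[0]
                out.append(word)
            return " ".join(out)

        print(generate("steve"))  # e.g. "steve jobs was a tyrant . steve ..."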

  • by rurp on 8/16/23, 11:59 PM

    I don't think it's an exaggeration to say that LLMs might lead to the end of the open web, or at least a drastically reduced version of it. So much of these models' utility lies in directly competing with the producers of their training data. Content creators and aggregators are seeing more and more reason to restrict and limit access, to avoid having AI companies consume all of their data and then be the ones making money from it going forward.

    I fear that LLMs are going to cause the internet to be a much worse and less open space.

  • by akolbe on 8/17/23, 7:11 AM

    There is a very real risk that we end up with an inferior product cannibalizing a superior one and driving it out of business.

    Moreover, AI would seem to be even more susceptible to capture and manipulation than conventional media.

    When it's a question of guiding thought I prefer the humanities to tech. (Same with art.)

  • by bubblethink on 8/16/23, 11:55 PM

    >If, when someone searches online, they are served a paragraph-long answer from an AI tool that refashions reporting from The Times, the need to visit the publisher's website is greatly diminished, said one person involved in the talks.

    If, when someone reads a newspaper, they are served a paragraph-long answer from an NYTimes reporter that refashions reporting from local sources, the need to interact with the local sources is greatly diminished.

  • by machdiamonds on 8/17/23, 8:40 AM

    Don't humans operate similarly? We gain knowledge through experiences. These AI models effectively condense a vast amount of experience data into weights. Considering the global race in AI advancements, I'm skeptical about the success of these copyright claims. I do find it hypocritical that OpenAI says that other LLMs can't be trained on data generated by their LLMs.

  • by PeterisP on 8/17/23, 9:02 AM

    I think the proper outcome of all this would be an acknowledgement that current copyright law regulates this area very poorly, and that the key parts of any such legal action sit at not-really-described edges of the law, because those edges weren't relevant until now. So instead of waiting for courts to rule on how the law as written applies, and accepting those rulings, we will likely get new legislation explicitly setting out what the legal norms should be.

    In the short term, of course, the existing law matters, but the main discussion should focus not on how to apply existing law but on how to ensure that the new laws match what we-the-people would want.

  • by baby-yoda on 8/17/23, 8:26 AM

    These mega LLMs that can autonomously roam the web and consume original content are basically the "I made this" meme[0], and having some legal precedent would be good for all users of the web.

    [0] - https://knowyourmeme.com/memes/i-made-this

  • by jiscariot on 8/17/23, 12:03 AM

    My concern isn't copyright law, but that, if trained on the NYT, these LLMs are going to be favorable to starting conflicts in the Middle East.

    https://fair.org/home/20-years-later-nyt-still-cant-face-its...

  • by oefrha on 8/17/23, 10:15 AM

    Hopefully soon enough (within a decade?) we’ll all be able to run large language models on cheap consumer devices, and model weights containing everything, including NYT content, will be floating around in the form of warez readily consumed by anyone with a modicum of savvy, whether the NYT likes it or not. They can’t stop progress.
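
    You can already get a taste of this today. A minimal sketch of local inference with the Hugging Face transformers library ("gpt2" is only an illustrative stand-in for whatever open weights are available):

        # Minimal sketch: run a small open-weights language model locally.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "gpt2"  # small enough for cheap consumer hardware
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)

        # Tokenize a prompt, generate a continuation, decode it back to text.
        inputs = tokenizer("Who was Steve Jobs?", return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=50)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))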

  • by brrrrrm on 8/16/23, 11:01 PM

    What’s going to be the name used for the laws that attempt to tackle machine paraphrasing?

  • by toss1 on 8/17/23, 2:26 PM

    >>A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff.

    This seems to me to be completely standard in the newspaper industry. Many times every week, I see stories in the form "The [Major_News_Outlet] reports that [Event_X occurred] or [their investigation revealed Y], and here are the details [...]".

    Copyright protects the expression of an idea, not the idea itself. If you write a history of Isaac Newton or the invention of semiconductors, I cannot copy that wholesale and sell it as mine, but nothing prevents me from writing my own version, even using the same facts and citing your work.

    I'm quite sure that I could provide a service where a bunch of workers read NYT articles and write brief summaries. I'm not sure they would even need citations, as long as we don't copy chunks wholesale.

    If OpenAI is simply parroting the words of the NYT articles without Fair Use constraints (short blurbs), it seems they have a problem. If they are fully re-writing them into short non-copying summaries, it seems the NYT has a problem.

    It'll be interesting to see how the courts sort this out.

  • by t_luke on 8/17/23, 10:44 AM

    The precedent people should be paying much more attention to is sampling in music. When it first arose, it really wasn’t clear what status it had. There was at least a decade when people basically thought it was legal to use small samples of other recordings because they were small and the new use turned them into something unrecognisably different. Which was kind of logical, actually, but turned out not to be true!

    The current legal requirement to get clearance for all samples only arose after a bunch of court cases in the late 80s/ early 90s, mostly involving quite obscure musicians.

    There are a lot of people on here who assume that ‘logic will prevail’ in the courts on questions like the use of copyrighted data in training. History shows that this really isn’t a safe assumption. The courts have historically been extremely favorable to copyright holders. It would be foolish to underestimate the legal risk to OpenAI et al. here.

  • by 8ig8 on 8/16/23, 10:59 PM

    Interesting point…

    > A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff.

  • by CatWChainsaw on 8/22/23, 11:27 AM

    "I'll keep saying it every time this comes up.

    I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted writing verbatim."

    Personally I think they argue that way because they get off on being contrarian out of spite, but to me it's just a signal of maliciousness and stupidity all at once.

  • by FrustratedMonky on 8/16/23, 11:09 PM

    If a human reads something, it goes into their brain, and it becomes an influence on future works they produce.

    This doesn't mean that copyright extends into my brain. A company can't copyright what I'm thinking about. And what if I do try to paraphrase something from memory, from a few sources, and happen to spit out a very similar sentence? Am I breaking the law?

    To go further: since pretty much all knowledge is fed into a human from hundreds of books, movies, TV shows, and the internet, pumped in from birth, everything in the brain is a product of something under copyright. So anything produced is some amalgamation of copyrighted works.

    Why not use a similar argument for AI? When you ask it to do something like "write a screenplay for Othello using dialogue like Tarantino, but with a bit of Baz Luhrmann's style", it's clear that what it produces is as unique as a human's work would be, or just as filled with copyrighted material.

  • by voytec on 8/17/23, 8:16 AM

    > if a federal judge finds that OpenAI illegally copied The Times' articles to train its AI model, the court could order the company to destroy ChatGPT's dataset, forcing the company to recreate it using only work that it is authorized to use.

    I'd like to see that happen, but it sounds unrealistic.

  • by whywhywhywhy on 8/17/23, 9:33 AM

    If writing a few paragraphs around something someone else said is copyrightable to you, then isn’t GPT writing a few paragraphs around your work copyrightable to OpenAI too…

  • by mediumsmart on 8/17/23, 3:29 PM

    I think anybody should have the right to protect the word combinations they own by not publishing them on the internet.

  • by pierrefermat1 on 8/17/23, 8:54 AM

    https://www.youtube.com/watch?v=MFKV48ikV5E

    Relevant to the article: Large Language Models Meet Copyright Law at Simons

  • by mensetmanusman on 8/17/23, 11:53 AM

    “In the end lawyers saved humanity from an all powerful AI.”

  • by bubblethink on 8/17/23, 12:02 AM

    I think all OpenAI needs to do is scan physical newspapers and OCR them. No ToS to agree to, and no ToS on print editions.
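
    A hedged sketch of that pipeline in Python, assuming the Tesseract OCR engine plus the pytesseract and Pillow packages (the file name is hypothetical):

        from PIL import Image
        import pytesseract

        def ocr_page(path: str) -> str:
            """Return the text Tesseract extracts from one scanned page image."""
            return pytesseract.image_to_string(Image.open(path))

        print(ocr_page("scanned_front_page.png"))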

  • by zb3 on 8/17/23, 2:21 PM

    It's time to abolish copyright.

  • by exabrial on 8/18/23, 12:57 AM

    Good. Literally anyone’s copyrighted comments on the internet should get a settlement.

  • by olgeni on 8/17/23, 9:47 AM

    They own copyright on hallucinating weapons of mass destruction? :D

  • by robbywashere_ on 8/17/23, 2:53 PM

    incoming backroom payment deals with publishers. "OpenAI now features training data from our partners X, Y, and Z"

  • by villgax on 8/17/23, 8:53 AM

    Sue these hypocritical fair-use citers who prevent people from training on their own outputs. Force them to reveal their entire training set for generating oblong statements.

  • by honeybadger1 on 8/17/23, 11:51 AM

    A skirmish to not use our collectively acquired knowledge and to hide it behind selfish... capitalist gain.

  • by vldchk on 8/17/23, 12:58 PM

    While I agree in general with arguments against “copyright hell”, this particular case is not about copyright itself, but about the consequences of GenAI for an entire industry.

    Journalists exist for a reason. Yes, they work with facts, and very often open facts, but they still assemble those facts in a certain way to construct a narrative, connect the dots, and tell us a story (not counting the cases where a journalist works with their own sources and produces unique inside information). Then OpenAI comes along, says “thank you very much”, and assembles all of the journalists' work into one Uber Knowledgeable Journalist who can answer all of your questions.

    So far so good: we've created a public-good service, and the copyright holders are in shambles.

    Until you start making money on it.

    That's where the problem lies.

    If OpenAI were a non-profit organization like the Wikimedia Foundation, one that just wants to make the internet a better place, you wouldn't find many arguments to support an NYT lawsuit. But monetization changes everything.

    Basically, the NYT is not worried about the reuse of its text as such; it is worried that no one will want to visit the NYT anymore and will instead pay Microsoft/Google and get all their answers from them.

    Let me give an example. There was a famous story in which FT journalists discovered massive fraud in Wirecard's accounting, which essentially led to the death of that organization. Those articles were the result of multi-year reporting in which journalists, piece by piece and step by step, collected facts, met people, and eventually spotted the gap. Now, in the age of Bard/Bing/ChatGPT, you don't need to read the original articles to know all of this. You can ask a search engine or a chatbot and get an essential rephrasing of the original reporters' work. You no longer need to go to the FT, pay for their paywall, watch their ads, etc. Effectively, the FT made a huge investment in its people, allowing them to spend two years on this issue and report it, and now gets zero leads to its website, because all of them are eaten by Google and Microsoft, who will sell you their ads and retain you in their monetized products.

    Imagine that you built a for-profit, paid code library for some task. You make the code available behind a paywall and ask people to pay you to access it and solve their problems. Then Microsoft comes along, sneaks past the paywall, scrapes your code, and publishes a recompiled and slightly optimized version in open access, so that no one ever needs to go to your website again; they just ask Microsoft to show them your code.

    Would you be happy?

    All of this makes the case, for me, not as easy and straightforward as “bad copyright holders against the progress of humanity”.

    At the end of the day, if the NYT/FT/New Yorker and the others stop publishing their work and fire all their journalists, will ChatGPT tell us stories with the same depth as the ones we read there?

  • by aero-glide2 on 8/17/23, 8:33 AM

    Copyrights and patents are holding back humanity.

  • by gmerc on 8/17/23, 8:41 AM

    Social media, especially Facebook and Google News, devalued news by commoditizing it.

    News is trying to avoid the next generation of tech doing that to the long tail of data.