from Hacker News

Ask HN: If AI models use copyrighted data, should they be open source?

by anoncow on 3/20/25, 1:50 PM with 2 comments

  • by BobbyTables2 on 3/20/25, 1:58 PM

    Gee, wish I could abuse copyrighted data for profit…

    What makes AI so special?

    Radio stations are important. Few would say they shouldn't exist. Yet we don’t give them a hall pass to violate copyright laws. They make agreements and pay royalties like the rest of us.

    Ironically, from a copyright perspective, there might be an even stronger argument that AI companies should be forced to license copyrighted data and be allowed to keep their models closed source as part of their value proposition.

    Of course, in practice, they would just pirate the data and keep it closed source.

    What we really need is the AI model equivalent of “reproducible builds” in software. How else would one prove that a model was trained on a particular data set?

  • by anoncow on 3/20/25, 1:54 PM

    Sam Altman has argued that AI development in the U.S. will suffer if models are restricted from using copyrighted data, as such data is crucial for training high-quality AI systems. This raises an important question: If AI models are allowed to train on copyrighted material, should they also be open source? The argument for open-source AI is that if companies benefit from copyrighted content, the resulting models should be accessible to the public rather than being locked behind proprietary systems. I agree with Sam Altman and believe that access to copyrighted data is essential for AI progress. However, I am looking forward to an ethical solution that also helps creators of the copyrighted material, ensuring they are fairly recognized and compensated.