by urlwolf on 11/18/24, 8:34 AM with 18 comments
So far what we have learned is that robots.txt doesn't work; major sites are using login-only access with 2FA to have any hope of keeping their content away from LLMs. I imagine the licenses would be one thing, but actually implementing/enforcing them might be a whole other can of worms!
by kouteiheika on 11/18/24, 11:55 AM
Your best bet to fight back is to either try to poison your data, or to train your own models on their data.
by Ukv on 11/18/24, 11:52 AM
If machine learning is not found to be fair use, and your concern is the removal of attribution, then the MIT license should be fine.
> So far what we have learned is that robots.txt doesn't work;
The companies training models I'm aware of[0][1][2] all respect robots.txt for their crawling. Can't necessarily guarantee that all of them do - but the fact that smaller players are likely to use CommonCrawl (which also follows robots.txt[3]) means it should catch the vast majority of cases and I'd recommend it if you don't want your work trained on.
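For anyone who wants to try this, here is a rough sketch of a robots.txt that opts out of a few AI crawlers, plus a check of how a compliant crawler would read it using Python's standard `urllib.robotparser`. The user-agent tokens (`GPTBot`, `Google-Extended`, `CCBot`) are the ones I believe those operators document, but verify them against each vendor's current docs. The `example.com` URL and the `allowed` helper are made up for illustration. This only shows what a well-behaved crawler would conclude; it enforces nothing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block some AI-training crawlers, allow everyone else.
# Token names are assumptions; check each operator's documentation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler honoring robots.txt may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post/1"  # placeholder URL
    for agent in ("GPTBot", "Google-Extended", "CCBot", "Mozilla/5.0"):
        print(agent, allowed(agent, page))
```

Crawlers that ignore robots.txt are unaffected by any of this, which is the limitation the parent comment is pointing at.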
> major sites are using login-only access with 2FA to have any hope to keep their content away from LLMs
I suspect it's more that users with accounts are more valuable than lurkers, and framing forced sign-up as protecting user data from LLMs is a convenient excuse.
[0]: https://platform.openai.com/docs/bots
[1]: https://support.anthropic.com/en/articles/8896518-does-anthr...
[2]: https://blog.google/technology/ai/an-update-on-web-publisher...
by krapp on 11/18/24, 11:09 AM
hehehheh's comment is your best option - poison your content when possible. It's still going to be consumed but at least you can make the LLMs choke on it. Second best option is to never post content to the free internet, but even that's just a temporary measure - all accessible data (including private data) will be assimilated eventually. But expecting a license to work in a post-LLM world is just naive.
by hehehheh on 11/18/24, 10:51 AM
by DamonHD on 11/18/24, 8:47 AM
by ranger_danger on 11/18/24, 7:13 PM
by brudgers on 11/18/24, 11:47 PM
Are you assuming out-lawyering Google, OpenAI, etc. is only a can of worms?
A license is only as good as your legal wherewithal to enforce it. Good luck.