from Hacker News

Self-hosting keeps your private data out of AI models

by tabbott on 5/24/24, 4:44 PM with 45 comments

  • by oidar on 5/24/24, 5:52 PM

    Self-hosting is just responsible computing now. The big companies are too big to care about small businesses, and will use your data in any way they please - take it or leave it. And it's cheaper to boot. A Synology NAS or a Raspberry Pi 3 could cover 90% of what most internet services offer the average consumer/small business right now.
  • by simonw on 5/24/24, 6:09 PM

    Since this seems to be written partly in response to (and honestly, to take advantage of) the recent Slack AI training panic, I took a look to see how Slack have updated their materials since then.

    These documents are new in the last few days:

    https://slack.com/blog/news/how-we-built-slack-ai-to-be-secu...

    https://slack.com/intl/en-gb/blog/news/how-slack-protects-yo...

    I think these updates are really good - Slack's previous messaging around this (especially the way it conflated older machine learning models with new policies for generative AI) was confusing, and it wasn't surprising that it caused a widespread panic.

    It's now very clear what Slack were trying to communicate: they have older ML models for features like channel recommendations, which work how you would expect such models to work. They have a separate "Slack AI" add-on you can buy that adds RAG features powered by a foundation model that is never further trained on user data (the RAG pattern is sketched at the end of this comment).

    I expect nobody will care. Once someone has decided that a company might "train AI" on private data, you've already lost that person's trust. It's not clear to me that any company has figured out how to overcome one of these AI training panics at this point.

    I wrote a bit about this back in December when it happened to Dropbox - there is an AI trust crisis at the moment: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/
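
    To make that distinction concrete, here is a minimal sketch of the RAG pattern in Python - the embed() helper is hypothetical, and the whole thing illustrates the general idea rather than Slack's actual implementation. The key property is that user messages only ever enter the prompt at query time; the model's weights are never updated:

        import numpy as np

        def embed(text: str) -> np.ndarray:
            # Hypothetical embedding helper: a real system would call an
            # embedding model here. This stand-in just derives a repeatable
            # (within one run) vector from the text so the sketch runs.
            rng = np.random.default_rng(abs(hash(text)) % 2**32)
            return rng.standard_normal(384)

        def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
            # Rank the stored messages by cosine similarity to the query.
            q = embed(query)
            def score(d: str) -> float:
                v = embed(d)
                return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
            return sorted(docs, key=score, reverse=True)[:k]

        def build_prompt(query: str, docs: list[str]) -> str:
            # User data appears only here, inside a prompt sent to a
            # frozen foundation model - no gradient updates, no training.
            context = "\n".join(retrieve(query, docs))
            return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    Fine-tuning, by contrast, would bake those messages into the weights themselves - which is exactly what Slack say the add-on never does.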

  • by sneak on 5/24/24, 5:57 PM

    Another offender: the codegpt.io ToS grants them an irrevocable, perpetual, sublicensable license to all code they see from you. It’s insane what rights companies claim to your data.
  • by itronitron on 5/24/24, 5:44 PM

    >> “To develop AI/ML models, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack.” — Slack’s privacy principles
  • by jmclnx on 5/24/24, 5:44 PM

    2FA started pushing me out of GitHub, but MS Copilot in GitHub created the road out.

    Now it seems people's chats and posts are being used to train AI. I wonder how long before cell phone providers start using text messages to train AI (or sell them to AI people).

  • by matchagaucho on 5/24/24, 5:54 PM

    To conflate Slack's T&C faux pas with "self-host your own LLM" seems like a stretch.

    IT Sec and Compliance must read the T&Cs and make better vendor selections.

  • by udev4096 on 5/24/24, 5:59 PM

    This will continue to happen, and it will keep coming as a "shock" to companies that ignorantly persist in using proprietary services. They'll only reconsider when a service visibly changes its data collection practices - which shouldn't be the primary motivation for switching to a self-hosted version in the first place.
  • by airpoint on 5/24/24, 8:04 PM

    Upgrading the UI from its year-2000 look and improving the UX would entice me to consider Zulip as a potential alternative.

    Bashing a competitor in a blog post does not.

  • by WhackyIdeas on 5/24/24, 6:10 PM

    Considering Microsoft are bringing in a ‘feature’ to record your desktop, I wouldn’t be surprised if an additional ‘feature update’ further down the line simply takes all those chats with your self-hosted AI models and uses them to train AI models.

    So in my opinion, it just doesn’t matter if you are using self-hosted AI: the weakest link in the chain for keeping your data private is the very OS you’ll be interacting with said self-hosted AI through.

    And with all the manufactured fear-mongering going on around AI, that data will -already- be deliciously irresistible for PRISM-participating, lovable, trustable companies like Microsoft.

    Sorry to burst some pretty bubbles for the lovely naive people.

  • by leobg on 5/24/24, 5:48 PM

    Looks like content marketing to me. Both in purpose and motivation.
  • by codegeek on 5/24/24, 6:05 PM

    "We don’t train LLMs on Zulip Cloud customer data, and we have no plans to do so. Should we decide that training our own LLMs is necessary for Zulip to succeed, we promise to do so in a responsible manner"

    :). What a clever way to say that even though we don't do it today, we cannot guarantee that we will never do it on our cloud service. At least they are honest, I guess.