from Hacker News

Launch HN: Flower (YC W23) – Train AI models on distributed or sensitive data

by niclane7 on 3/22/23, 1:38 PM with 69 comments

Hey HN - we're Daniel, Taner, and Nic, and we're building Flower (https://flower.dev/), an open-source framework for training AI on distributed data. We move the model to the data instead of moving the data to the model. This enables regulatory compliance (e.g. HIPAA) and ML use cases that are otherwise impossible. Our GitHub is at https://github.com/adap/flower, and we have a tutorial here: https://flower.dev/docs/tutorial/Flower-0-What-is-FL.html.

Flower lets you train ML models on data that is distributed across many user devices or “silos” (separate data sources) without having to move the data. This approach is called federated learning.

A silo can be anything from a single user device to the data of an entire organization. For example, your smartphone keyboard suggestions and auto-corrections can be driven by a personalized ML model learned from your own private keyboard data, as well as data from other smartphone users, without the data being transferred from anyone’s device.

Most of the famous AI breakthroughs—from ChatGPT and Google Translate to DALL·E and Stable Diffusion—were trained with public data from the web. When the data is all public, you can collect it in a central place for training. This “move the data to the computation” approach fails when the data is sensitive or distributed across organizational silos and user devices.

Many important use cases are affected by this limitation:

* Generative AI: Many scenarios require sensitive data that users or organizations are reluctant to upload to the cloud. For example, users might want to put themselves and friends into AI-generated images, but they don't want to upload and share all their photos.

* Healthcare: We could potentially train cancer detection models better than any doctor, but no single organization has enough data.

* Finance: Preventing financial fraud is hard because individual banks are subject to data regulations, and in isolation, they don't have enough fraud cases to train good models.

* Automotive: Autonomous driving would be awesome, but individual car makers struggle to gather the data to cover the long tail of possible edge cases.

* Personal computing: Users don't want certain kinds of data to be stored in the cloud, hence the recent success of privacy-enhancing alternatives like the Signal messenger or the Brave browser. Federated methods open the door to using sensitive data from personal devices while maintaining user privacy.

* Foundation models: These get better with more data, and more diverse data, to train them on. But again, most data is sensitive and thus can't be incorporated, even though these models continue to grow bigger and need more information.

Each of us has worked on ML projects in various settings, (e.g., corporate environments, open-source projects, research labs). We’ve worked on AI use cases for companies like Samsung, Microsoft, Porsche, and Mercedes-Benz. One of our biggest challenges was getting the data to train AI while being compliant with regulations or company policies. Sometimes this was due to legal or organizational restrictions; other times, it was difficulties in physically moving large quantities of data or natural concerns over user privacy. We realized issues of this kind were making it too difficult for many ML projects to get off the ground, especially in domains like healthcare and finance.

Federated learning offers an alternative — it doesn't require moving data in order to train models on it, and so has the potential to overcome many barriers for ML projects.

In early 2020, we began developing the open-source Flower framework to simplify federated learning and make it user-friendly. Last year, we experienced a surge in Flower's adoption among industry users, which led us to apply to YC. In the past, we funded our work through consulting projects, but looking ahead, we’re going to offer a managed version for enterprises and charge per deployment or federation. At the same time, we’ll continue to run Flower as an open-source project that everyone can continue to use and contribute to.

Federated learning can train AI models on distributed and sensitive data by moving the training to the data. The learning process collects whatever it can, and the data stays where it is. Because the data never moves, we can train AI on sensitive data spread across organizational silos or user devices to improve models with data that could never be leveraged until now.

Here’s how it works: (0) Initialize the global model parameters on the server; (1) Send the model parameters to a number of organizations/devices (client nodes); (2) Train model locally on the data of each organization/device (client node); (3) Return the updated model parameters back to the server; (4) On the server, aggregate the model updates (e.g., by averaging them) into a new global model; (5): Repeat steps 1 to 4 until the model converges.

This, of course, is more challenging than centralized learning: we must move AI models to data silos or user devices, train locally, send updated models back, aggregate them, and repeat. Flower provides the open-source infrastructure to easily do this, as well as supporting other privacy-enhancing technologies (PETs). It is compatible with PyTorch, TensorFlow, JAX, Hugging Face, Fastai, Weights & Biases and all the other tools used in ML projects regularly. The only dependency on the server side is NumPy, but even that can be dropped if necessary. Flower uses gRPC under the hood, so a basic client can easily be auto-generated, even for most languages that are not supported today.

Flower is open-source (Apache 2.0 license) and can be run in all kinds of environments: on a personal workstation for development and simulation, on Google Colab, on a compute cluster for large-scale simulations or on a cluster of Raspberry Pi’s (or similar devices) to build research systems, or deployed on public cloud instances (AWS, Azure, GCP, others) or private on-prem hardware. We are happy to help users when deploying Flower systems and will soon make this even easier through our managed cloud service.

You can find PyTorch example code here: https://flower.dev#examples, and more at https://github.com/adap/flower/tree/main/examples.

We believe that AI technology must evolve to be more collaborative, open and distributed than it is today (https://flower.dev/blog/2023-03-08-flower-labs/). We’re eager to hear your feedback, experiences regarding difficulties in training, data access, data regulation, privacy and anything else related to federated (or related) learning methods!

by guites on 3/22/23, 2:57 PM
Hey! Glad to see flower getting attention on hn.
I've been working on a project for over a year that uses flower to train cv models on medical data.
One aspect that we see being brought up again and again is how we can prove to our clients that no unnecessary data is being shared over the network.
Do you have any tips on solving that particular problem? I.e. proving that no data apart from model weights are being transferred to the centralized server?
Thanks a lot for the project.
edit: Just to clarify I am aware of differential privacy, I'm talking more on a "how to convince a medical institution that we are not sending its images over the network" level.
by JohnFen on 3/22/23, 10:24 PM
Isn't this still moving your data to a central repository? It's encoded in a neural net rather than in a more accessible form, but it's still being moved out of your control.
by cs02rm0 on 3/23/23, 9:17 AM
In the past, we funded our work through consulting projects, but looking ahead, we’re going to offer a managed version for enterprises and charge per deployment or federation.
Interesting.
Flower seems to fit well for people who are sensitive about their data and don't want to hand it over to a third party, but this seems to move towards a model where they have to hand that sensitive data over to a third party.
Perhaps that still works for the bulk of users, especially commercial rather than government. It's difficult to pursue both a managed solution and simultaneously maintain an open source offering without one departing from the other.
by dontreact on 3/22/23, 1:58 PM
There is so much hype around federated learning but often the hard and insurmountable part of this is federated labeling.
For example for your cancer use case, you have to convince multiple hospitals to feed the system labels and this is a very very tall ask.
For healthcare it’s also not clear how to get a regulatory clearance if you can’t actually test the performance of the federated deployments.
So while federated learning solves some problems generated by an unwillingness to share data, it doesn’t solve all of them. Describe the use cases of your product carefully.
by yawnxyz on 3/22/23, 7:07 PM
Hi! As someone new to all of this — how would I interact with the trained data after it's been trained?
Is it possible to create a conversation or QA style interaction with it? I see there's examples of "pytorch" but as a someone new— I'm not sure what that means in terms of public use cases.
I guess I'm asking is "ok I use Flower to train on a bunch of stuff... then what do I do with that?"
Thanks!
by jaggirs on 3/22/23, 1:58 PM
It has been shown that the input data can be reverse-engineered from the model weights. How do you deal with this issue?
by brookst on 3/22/23, 2:06 PM
Very interesting project. Your write up here does a much better job of explaining the market need and value prop than the GitHub readme.md… consider bringing some of this text over as the “why / what” story?
by photochemsyn on 3/22/23, 2:53 PM
This looks very interesting. I'd like to see a model trained on the complete body of scientific research literature from the past 100 years or so, I wonder if this approach could facilitate that?
by juanma91p on 3/22/23, 5:17 PM
Great to see Flower here! We use the framework for our projects because of its modularity, scalability, and ease to use. Another important aspect of FL, on top of the already mentioned privacy preservation, is network resource utilisation. By transferring only the weights of the model, less bandwidth is required, which can reduce network congestion. This is especially important given that it is expected that by 2030, more than 50 billion devices will be connected and transferring data.
by elijahbenizzy on 3/22/23, 4:27 PM
Congratulations! Really excited for you!
I love how you found a niche, valuable problem, built a framework, and are seeing a lot of success. A question (and I'm far from an expert so let me know if the assumptions are wrong):
It seems to me that the federated users have to be coordinated around timing for this to work. Otherwise this could take weeks/lots of slack messages for a single model to train. E.G. one team is having infra issues and doesn't get a job started, the other team is ready but then their lead goes on vacation, etc... In the internal-to-an-organization case this is probably fine (E.G. a hospital where the data has to be separated by patient/cohort), but if there are different teams managing the data then (a) have you seen this problem and (b) do you have tooling to fix it?
by techwizrd on 3/22/23, 2:53 PM
I've been working with Flower to implement and study Federated Learning for a few years, and have just started contributing back on Slack and Github. Congrats on launching on HN!
by northlondoner on 3/22/23, 10:43 PM
Many congratulations! Glad to hear about UK & EU collaborative innovation in open-source projects. Keep up the fantastic work!
Others asked similar question regarding comparable projects. What's your take on OpenFL from Intel? Do you think Flower moves into more commercial-MLOps direction? Looks like OpenFL particularly focused on to academic imaging community.
by jleguina on 3/27/23, 5:21 PM
This is a great project!
Have you though about what happens at inference? Suppose I train in a federated healthcare environment using PII features from patient records. Once I get the weights back how can I ever deploy it if I don't have access to the same features? The models would become highly coupled to the training environments no?
Best of luck!
by spangry on 3/22/23, 9:35 PM
Another interesting use case - government training models on legislatively protected data (e.g. tax data). Lots of data the government holds is governed by confidentiality restrictions built into legislation, limiting its utility. Sounds like federated learning could be a way around that.
by rjtc on 3/22/23, 8:01 PM
How is your approach different than tf federated or any of the other federated libraries out there?
by blintz on 3/22/23, 9:39 PM
This is really cool. Federated learning seems like it could unlock a lot of value in healthcare settings.
Have you had any luck convincing hospitals / insurers / etc that this satisfies HIPAA and is safe? How do you convince them?
by 7e on 3/23/23, 4:30 AM
Why not train 10000x faster using H100 secure enclaves with remote attestation? The FL window is closing, because it is a PITA to use and replacements are superior.