by mgreg on 1/17/24, 5:49 PM with 110 comments
by ssgodderidge on 1/17/24, 5:59 PM
This was such a helpful way to frame the problem! Something felt off about the "open source models" out there; this highlights the problem incredibly well.
by elashri on 1/17/24, 7:00 PM
On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?
Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.
This contrasts with open-source models, where you can retrain models using the data to get the weights. But getting hold of the data, and the cost of reproducing the weights, is usually prohibitive. I get that CERN's approach might seem to counter this, but remember, they're not releasing the raw data (which is mostly noise) but a more refined version; if they weren't, good luck downloading several petabytes of raw data. But for training something like an LLM, you might need the whole dataset, which in many cases has its own problems with copyright, etc.
[1] https://opendata.cern.ch/docs/terms-of-use
[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...
by albert180 on 1/17/24, 7:04 PM
by anticorporate on 1/17/24, 6:18 PM
https://deepdive.opensource.org/
I encourage you to go check out what's already being done here. I promise it's way more nuanced than anything that is going to fit in a tweet.
by mgreg on 1/17/24, 6:09 PM
For an AI model that means the model itself, the dataset, and the training recipe (e.g. process, hyperparameters), often also released as source code. With that (and a lot of compute) you can train the model to get the weights.
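To make the point concrete, here's a toy sketch (a plain linear model, nothing like an LLM, and entirely invented for illustration) of how those three ingredients — dataset, training recipe, and training code — together determine the weights:

```python
# Toy illustration of the three ingredients of an "open" model release.
# This is a hypothetical minimal example (linear regression, not an LLM).

# 1. The dataset (here: points on the line y = 2x + 1)
data = [(x, 2 * x + 1) for x in range(10)]

# 2. The training recipe / hyperparameters
learning_rate = 0.01
epochs = 2000

# 3. The training code: gradient descent on mean squared error
w, b = 0.0, 0.0  # the "weights" anyone can reproduce from the above
for _ in range(epochs):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y
        grad_w += 2 * err * x / len(data)
        grad_b += 2 * err / len(data)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # ≈ 2.0 1.0
```

Releasing only `w` and `b` would be like releasing a compiled binary; releasing the data, recipe, and training code is what lets someone else (with enough compute) regenerate them.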
by darrenBaldwin03 on 1/17/24, 8:28 PM
by tqi on 1/17/24, 8:32 PM
"it’s hard to verify that the model has no backdoors (e.g. sleeper agents)" Again, given the size of the datasets and the opaque way training works, I am skeptical that anyone would be able to tell if there is a backdoor in the training data.
"impossible to verify the data and content filter and whether they match your company policy" I don't totally know what this means. For one, you can/probably should apply company policies to the model outputs, which you can do without access to training data. Is the idea that every company could/should filter input data and train their own models?
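The "apply company policies to the model outputs" idea can be sketched very simply: scan generated text against a policy before it reaches the user, with no access to the training data needed. The patterns below are invented examples, not any real company's policy:

```python
import re

# Hypothetical sketch: enforce policy on model *outputs* rather than on
# the training data. The blocklist patterns here are made up for illustration.
POLICY_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # SSN-like numbers
    re.compile(r"(?i)internal[- ]only"),     # leaked internal labels
]

def passes_policy(output: str) -> bool:
    """Return True if the generated text violates none of the patterns."""
    return not any(p.search(output) for p in POLICY_PATTERNS)

print(passes_policy("The capital of France is Paris."))  # True
print(passes_policy("Employee SSN: 123-45-6789"))        # False
```

Real deployments use much more than regexes (classifiers, human review), but the structural point stands: this layer sits downstream of the model and works the same whether or not the weights or data are open.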
"you are dependent on the company to refresh the model" At the current cost, this is probably already true for most people.
"A true open-source LLM project — where everything is open from the codebase to the data pipeline — could unlock a lot of value, creativity, and improve security." I am overall skeptical that this is true in the case of LLMs. If anything, I think this creates a larger surface for bad actors to attack.
by andy99 on 1/17/24, 8:46 PM
https://www.marble.onl/posts/considerations_for_copyrighting...
by tbrownaw on 1/18/24, 2:32 AM
-- gplv3
These AI/ML models are interesting in that the weights are derived from something else (the training set), but if you're modifying them you don't need that. There are lots of "how to do fine-tuning" tutorials floating around, and they don't need access to the original training set.
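That point can be illustrated with a toy sketch (again a plain linear model, with made-up "released" weights): fine-tuning starts from the published weights plus your own new task data, and the upstream training set never enters the picture:

```python
# Hypothetical sketch: fine-tuning needs released weights + new data only.

# "Released" weights from some upstream pretraining run (assumed values)
w, b = 2.0, 1.0

# Small new dataset for our task (same slope as before, new offset: y = 2x + 5)
finetune_data = [(x, 2 * x + 5) for x in range(5)]

# Fine-tune: freeze w, update only the bias term b on the new data
learning_rate = 0.1
for _ in range(100):
    grad_b = sum(2 * ((w * x + b) - y) for x, y in finetune_data) / len(finetune_data)
    b -= learning_rate * grad_b

print(round(b, 2))  # → 5.0  (b adapted to the new data; w untouched)
```

Real fine-tuning (e.g. LoRA or full fine-tunes of transformer checkpoints) follows the same shape: start from a checkpoint, update some or all parameters on new data, never touch the original corpus.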
by cpeterso on 1/17/24, 9:27 PM
Is training nondeterministic? I know LLM outputs are purposely nondeterministic.
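Training on GPUs is typically nondeterministic unless you go out of your way to force deterministic kernels (floating-point reduction order varies between runs). The output-side nondeterminism, by contrast, is deliberate and easy to sketch: tokens are drawn from a softmax distribution, so greedy decoding is deterministic while temperature sampling is random unless seeded. Toy logits below are invented:

```python
import math
import random

# Toy next-token scores; real models produce logits over ~100k tokens.
logits = {"cat": 2.0, "dog": 1.5, "fish": 0.5}

def sample(logits, temperature, rng):
    if temperature == 0:
        # Greedy decoding: always pick the argmax -> deterministic
        return max(logits, key=logits.get)
    # Temperature sampling: draw from the softmax distribution
    weights = [math.exp(v / temperature) for v in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

rng = random.Random(42)
print(sample(logits, 0.0, rng))  # deterministic: "cat" every time
print(sample(logits, 1.0, rng))  # random draw; reproducible only via the seed
```

Higher temperature flattens the distribution (more randomness); temperature 0 collapses it to the single most likely token.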
by beardyw on 1/17/24, 7:17 PM
by declaredapple on 1/17/24, 8:17 PM
Many don't offer any information, some do offer information but provide no new techniques and just threw a bunch of compute and some data to make a sub-par model that shows up on a specific leaderboard.
Everyone is trying to save a card up their sleeve so they can sell it. And showing up on scoreboards is a great advertisement.
by pabs3 on 1/18/24, 12:10 PM
by nathanasmith on 1/18/24, 1:27 AM
by ramesh31 on 1/17/24, 6:15 PM
by belval on 1/17/24, 6:13 PM
Open-source means open source, it does not make reproducibility guarantees. You get the code and you can use the code. Pushed to the extreme this is like saying Chromium is not open-source because my 4GB laptop can't compile it.
Getting training code for GPT-4 under MIT would be mostly useless, but it would still be open source.
by emadm on 1/17/24, 10:32 PM
by Der_Einzige on 1/17/24, 6:37 PM
by edoardo-schnell on 1/17/24, 10:07 PM
by fragmede on 1/18/24, 6:43 AM
by robblbobbl on 1/17/24, 9:25 PM
by RcouF1uZ4gsC on 1/17/24, 6:26 PM
1. Can I download it?
2. Can I run it on my hardware?
3. Can I modify it?
4. Can I share my modifications with others?
If the answers to those questions are all yes, then I think most people consider it open enough, and it is a huge step for freedom compared to closed models such as OpenAI's.