from Hacker News

Ask HN: Crowdsourced Open LLM training data?

by Hedepig on 4/17/23, 8:59 AM with 0 comments

Please excuse me if I get any details wrong; I am not an LLM or AI/ML expert.

To make large language models useful, we fine-tune them. The Alpaca project used an OpenAI model to generate its fine-tuning data.

This appears to be fairly effective, and it costs far less than hiring people to write the data.

The shortcoming is that OpenAI has restrictive clauses in its ToS regarding models trained on outputs from its models.

A potential solution would be an (_actually_) open initiative where anyone can submit training data.

There are challenges such as moderation and QA.
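
To make the QA part a bit more concrete, here is a rough sketch of what a submitted record and a first automated filter might look like. The instruction/input/output fields follow the record layout Alpaca published; everything else (the `contributor` field, the length thresholds, the function names) is my own assumption for illustration, not something any existing project defines. It only catches empty, very short, or duplicated entries, so human moderation would still be needed on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One crowdsourced example (field names beyond the first three are hypothetical)."""
    instruction: str
    output: str
    input: str = ""                  # optional extra context for the instruction
    contributor: str = "anonymous"   # assumed field, for attribution


def validate(sub, min_len=10, max_len=4000):
    """Return a list of problems; an empty list means the basic checks passed.

    The thresholds are arbitrary placeholders, not recommended values.
    """
    problems = []
    if len(sub.instruction.strip()) < min_len:
        problems.append("instruction too short")
    if not sub.output.strip():
        problems.append("output is empty")
    if len(sub.output) > max_len:
        problems.append("output too long")
    return problems


def dedupe_key(sub):
    """Normalize case and whitespace so near-identical prompts collide."""
    return " ".join((sub.instruction + " " + sub.input).lower().split())


def filter_submissions(subs):
    """Keep the first valid submission for each normalized prompt."""
    seen = set()
    accepted = []
    for sub in subs:
        if validate(sub):
            continue  # failed a basic check
        key = dedupe_key(sub)
        if key in seen:
            continue  # duplicate prompt
        seen.add(key)
        accepted.append(sub)
    return accepted
```

Anything that survives `filter_submissions` would then go to human reviewers; this kind of filter only reduces their workload and says nothing about whether an answer is actually correct.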

But in general, is it worth attempting? Am I mistaken about something here? And is there a project already doing this that I am unaware of?