by charlysl on 7/11/23, 6:19 PM with 70 comments
by sillysaurusx on 7/11/23, 7:57 PM
As far as I know, this was the first academic contribution from a discord collaboration to ML. Back then discord was barely used for ML at all, though nowadays of course the largest discord in the world is midjourney.
There were a bunch of interesting stories from those days. We almost didn’t release at all (or at least the books component) because of fear of copyright backlash. Turns out no one cared, and then suddenly today the world cares a great deal.
As a side note, I’ll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: https://twitter.com/theshawwn/status/1641804013791215619?s=6.... They DMCA’ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.
One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/
by Roark66 on 7/12/23, 6:41 AM
I hope there is a lot of text in languages other than English. For example, in my language (Polish), current SOTA models are very deficient. I have wondered why that is, considering companies like (not at all) OpenAI claim to train on large datasets that include my language of interest. It turns out (and I learned this just yesterday) that they used LLM-translated English content as training data for other languages. They used Azure Translator, which is itself a transformer model, to generate content for gpt-3.5, for example. Also, I bet there is a lot of poorly machine-translated content in their supposedly "original" data.
The result? You can use chatgpt to write you an email of any kind in English and copy/paste/send it immediately. Try doing that in Polish... It will make sense, but it will strike the wrong tone (too familiar in a business setting), use bad word choices (words that exist, but that no real person would use), and have sentence structure that just plainly feels weird. I suspect this is even worse in many other languages.
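For what it's worth, you can roughly check the language mix of The Pile yourself. Here is a minimal sketch (not an official tool) that samples documents from a locally downloaded Pile shard, assuming the original .jsonl.zst layout with one JSON object per line and a "text" field; the shard path and sample size are placeholders.

    # Minimal sketch: estimate the language mix of a Pile shard.
    # Assumes a local shard in the original .jsonl.zst format
    # (one JSON object per line with a "text" field); the path
    # and sample size below are placeholders.
    import io
    import json
    from collections import Counter

    import zstandard  # pip install zstandard
    from langdetect import detect, DetectorFactory  # pip install langdetect

    DetectorFactory.seed = 0  # make langdetect deterministic

    SHARD = "00.jsonl.zst"   # placeholder path to a Pile shard
    SAMPLE = 10_000          # number of documents to sample

    langs = Counter()
    with open(SHARD, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for i, line in enumerate(io.TextIOWrapper(reader, encoding="utf-8")):
            if i >= SAMPLE:
                break
            doc = json.loads(line)
            text = doc["text"][:2000]  # a prefix is enough for detection
            try:
                langs[detect(text)] += 1
            except Exception:
                langs["unknown"] += 1

    total = sum(langs.values())
    for lang, count in langs.most_common(10):
        print(f"{lang}: {100 * count / total:.1f}%")

Given that the paper describes The Pile as an English text corpus, a tally like this would most likely confirm the skew described above.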
by dang on 7/11/23, 8:45 PM
The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=36272365 - June 2023 (5 comments)
The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=25607809 - Jan 2021 (60 comments)
by cschmidt on 7/11/23, 8:24 PM
by Der_Einzige on 7/11/23, 7:54 PM
I'm still a tiny bit salty about that, but The Pile is a wonderful dataset regardless.
by charlysl on 7/11/23, 9:42 PM
[1] https://stanford-cs324.github.io/winter2022/lectures/data/
by robertheadley on 7/11/23, 11:13 PM
by ryoshiro on 7/12/23, 7:17 AM