by charlysl on 7/11/23, 6:19 PM with 70 comments
by sillysaurusx on 7/11/23, 7:57 PM
As far as I know, this was the first academic contribution from a discord collaboration to ML. Back then discord was barely used for ML at all, though nowadays of course the largest discord in the world is midjourney.
There were a bunch of interesting stories from those days. We almost didn’t release at all (or at least the books component) because of fear of copyright backlash. Turns out no one cared, and then suddenly today the world cares a great deal.
As a side note, I’ll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: https://twitter.com/theshawwn/status/1641804013791215619?s=6.... They DMCA’ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.
One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/
by Roark66 on 7/12/23, 6:41 AM
I hope there is a lot of text in languages other than English. For example, in my language (Polish), current SOTA models are very deficient. I have wondered why that is, considering companies like (not at all) OpenAI claim to train on large datasets that include my language of interest. It turns out (and I learned this just yesterday) that they used LLM-translated English content as training data for other languages. They used Azure Translator, which is itself a transformer model, to generate content for gpt-3.5, for example. Also, I bet there is a lot of poorly machine-translated content in their supposedly "original" data.
The result? You can use chatgpt to write you an email of any kind in English and copy/paste/send it immediately. Try doing that in Polish... It will make sense, but it will strike the wrong tone (too familiar in a business setting), use bad word choices (words that exist, but that no real person would use), and have sentence structure that just plainly feels weird. I suspect this is even worse in many other languages.
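For what it's worth, you can roughly check the language mix of The Pile yourself. Here is a minimal sketch (not an official tool) that samples documents from a locally downloaded Pile shard, assuming the original .jsonl.zst layout with one JSON object per line and a "text" field; the shard path and sample size are placeholders.

    # Minimal sketch: estimate the language mix of a Pile shard.
    # Assumes a local shard in the original .jsonl.zst format
    # (one JSON object per line with a "text" field); the path
    # and sample size below are placeholders.
    import io
    import json
    from collections import Counter

    import zstandard  # pip install zstandard
    from langdetect import detect, DetectorFactory  # pip install langdetect

    DetectorFactory.seed = 0  # make langdetect deterministic

    SHARD = "00.jsonl.zst"   # placeholder path to a Pile shard
    SAMPLE = 10_000          # number of documents to sample

    langs = Counter()
    with open(SHARD, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for i, line in enumerate(io.TextIOWrapper(reader, encoding="utf-8")):
            if i >= SAMPLE:
                break
            doc = json.loads(line)
            text = doc["text"][:2000]  # a prefix is enough for detection
            try:
                langs[detect(text)] += 1
            except Exception:
                langs["unknown"] += 1

    total = sum(langs.values())
    for lang, count in langs.most_common(10):
        print(f"{lang}: {100 * count / total:.1f}%")

Given that the paper describes The Pile as an English text corpus, a tally like this would most likely confirm the skew described above.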
by dang on 7/11/23, 8:45 PM
The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=36272365 - June 2023 (5 comments)
The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=25607809 - Jan 2021 (60 comments)
by cschmidt on 7/11/23, 8:24 PM
by Der_Einzige on 7/11/23, 7:54 PM
I'm still a tiny bit salty about that, but The Pile is a wonderful dataset regardless.
by charlysl on 7/11/23, 9:42 PM
[1] https://stanford-cs324.github.io/winter2022/lectures/data/
by robertheadley on 7/11/23, 11:13 PM
by ryoshiro on 7/12/23, 7:17 AM