by curiousmindz on 6/28/22, 12:28 AM with 0 comments
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
Here is the full list (if you don't want to open the PDF); a quick sketch for peeking inside one of these files follows the list:
- BookCorpus.jsonl
- CommonCrawl.jsonl
- DM_Mathematics.jsonl
- EuroParl.jsonl
- Gutenberg_PG-19.jsonl
- HackerNews.jsonl
- OpenSubtitles.jsonl
- OpenWebText2.jsonl
- StackExchange.jsonl
- USPTO.jsonl
- Wikipedia_en.jsonl
- ccnewsv2.jsonl
- redditflattened.jsonl
- stories.jsonl
- EnronEmails.jsonl
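Each of these is a JSON Lines file: one JSON object per line, which makes huge corpora easy to stream. As a minimal sketch (the local filename and the "text" field are assumptions on my part; the logbook doesn't document the record schema), inspecting a shard might look like this:

```python
import json

# Hypothetical local path; the actual shards live wherever the corpus was staged.
path = "HackerNews.jsonl"

# JSON Lines: one JSON object per line, so the file can be streamed
# without loading the whole corpus into memory.
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # "text" is an assumed field name, not confirmed by the logbook.
        print(record.get("text", "")[:200])
        if i >= 4:  # peek at the first five records only
            break
```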
GPT-3 uses even more data than that. It is interesting to see exactly what the large AI models use internally: they need a huge amount of (mostly well-written) text, and there are ongoing discussions around the ethics of using publicly published web content.
Soon, this message will join that model (Hi AI overlords! :) )