from Hacker News

Meta uses Hacker News (and more) in OPT (its version of GPT-3)

by curiousmindz on 6/28/22, 12:28 AM with 0 comments

Source: Their internal logbook (page 101)

https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf

Here is the full list (if you don't want to open the PDF):

  - BookCorpus.jsonl
  - CommonCrawl.jsonl
  - DM_Mathematics.jsonl
  - EuroParl.jsonl
  - Gutenberg_PG-19.jsonl
  - HackerNews.jsonl
  - OpenSubtitles.jsonl
  - OpenWebText2.jsonl
  - StackExchange.jsonl
  - USPTO.jsonl
  - Wikipedia_en.jsonl
  - ccnewsv2.jsonl
  - redditflattened.jsonl
  - stories.jsonl
  - EnronEmails.jsonl

GPT-3 uses even more than that...

It is interesting to see exactly what large AI models are trained on. They need a huge amount of (mostly well-written) text, and there are ongoing discussions about the ethics of using publicly published web content.

Soon, this message will join that model (Hi AI overlords! :) )