from Hacker News

Ask HN: Is anyone working on a Reddit archive?

by nidnogg on 6/30/23, 6:59 PM with 6 comments

Although I disagree with most of the criticisms leveled at the platform, I still readily depend on it for a lot of day-to-day resources and general questions, many of them not especially technical.

While I wouldn't mind losing the system there in the long run, I think the state of posts before this upheaval was very valuable as a reference.

Like the title says - has anyone done anything like a Reddit "takeout" yet?

  • by uniqueuid on 6/30/23, 7:03 PM

    Yes.

    There is the Pushshift dataset covering posts and comments through 2022 [1].

    And the ArchiveTeam began crawling Reddit some time ago as well [2].

    [1] https://old.reddit.com/r/pushshift/comments/10bwxke/updated_...

    [2] https://news.ycombinator.com/item?id=36254172

  • by cookiengineer on 6/30/23, 8:15 PM

    I was focussing mostly on cybersecurity-related subreddits because the vulnerability and exploit discussions were of great value to me.

    I built a little scraper in Go that stores the JSON data (instead of the HTML that the ArchiveTeam Warrior stores) to save disk space. [1]

    The problem with Reddit's API is that every listing only returns 1000 entries over 10 pages. That means hot/top/new and search results are all capped. If more posts match a keyword than that, you won't discover the rest.

    So you need a very specific keyword list to be able to discover more posts, and search each subreddit for each entry in the keyword list.

    [1] https://github.com/cookiengineer/reddit-archivar
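
The keyword-per-subreddit workaround described in this comment could look roughly like the sketch below. It is written in Python for brevity and is not taken from the linked Go scraper. The endpoint shape (`/r/<subreddit>/search.json` with `q`, `restrict_sr`, `limit`, `after`) and the listing layout (`data.children[].data.name`, `data.after`) are assumptions about Reddit's public JSON interface, and the helper names `search_url`, `plan_queries`, and `merge_listing` are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: 100 entries per page, matching "1000 entries over 10 pages".
PAGE_SIZE = 100


def search_url(subreddit, keyword, after=None):
    """Build a search URL restricted to one subreddit (assumed endpoint shape)."""
    params = {"q": keyword, "restrict_sr": 1, "sort": "new", "limit": PAGE_SIZE}
    if after:
        # Cursor returned by the previous page of the same query.
        params["after"] = after
    return f"https://www.reddit.com/r/{subreddit}/search.json?{urlencode(params)}"


def plan_queries(subreddits, keywords):
    """One capped query per (subreddit, keyword) pair, in a stable order."""
    return [(sub, kw) for sub in subreddits for kw in keywords]


def merge_listing(seen, listing):
    """Store posts from one listing page in `seen`, keyed by fullname.

    Different keywords often return the same post, so duplicates are skipped.
    Returns the cursor for the next page, or None when the query is exhausted.
    """
    data = listing.get("data", {})
    for child in data.get("children", []):
        post = child.get("data", {})
        name = post.get("name")
        if name and name not in seen:
            seen[name] = post
    return data.get("after")
```

A crawler would loop over `plan_queries(...)`, fetch `search_url(...)` repeatedly while `merge_listing` returns a cursor, and write `seen` to disk as JSON. Each individual query still hits the cap, so coverage depends entirely on how specific the keyword list is.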

  • by minimaxir on 6/30/23, 7:02 PM

    Pushshift was the Reddit archive, but recent agreements with Reddit appear to have changed that.

    Anyone else creating a Reddit archive will likely get a C&D.

  • by simonblack on 6/30/23, 8:20 PM

    It's next on my list after I finish the MySpace archive.

    Seriously, why would anybody do this? Reddit has such a high noise-to-signal ratio that it would be a waste of resources. There may be value in keeping an archive of some individual subreddits, but not the main bulk of Reddit itself.