from Hacker News

I scraped all of OpenAI's Community Forum

by alt-glitch on 3/28/24, 2:44 PM with 59 comments

by xfalcox on 3/28/24, 4:17 PM
That's super cool, thanks for sharing! I will share this as an easy-to-follow example of what we can do with AI.
> Allowing a Q&A interface using these embeddings over the post contents could speed up research over the community posts (if you know the right questions to ask :P). Let's view some posts similar to this one complaining about function calling
That's indeed a great thing to surface, and that's exactly how the OpenAI forum selects the "Related Topics" to show at the end of every topic. We use embeddings for this feature, and the entire thing is open-source: https://github.com/discourse/discourse-ai/blob/main/lib/embe...
We also use embeddings for suggesting tags, categories, HyDE search and more. It's by far my favorite tech of this new AI/ML gen so far in terms of applicability.
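To illustrate the general idea of picking related topics from embeddings, here is a minimal sketch that ranks topics by cosine similarity. It is not taken from the discourse-ai source: the function names, topic ids, and three-dimensional toy vectors are all made up for illustration, and a real system would use model-generated embeddings and most likely a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def related_topics(query_id, embeddings, k=2):
    """Return the ids of the k topics most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), topic_id)
        for topic_id, vector in embeddings.items()
        if topic_id != query_id  # never recommend the topic itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [topic_id for _, topic_id in scored[:k]]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "function-calling-bug": [0.9, 0.1, 0.0],
    "function-calling-args": [0.8, 0.2, 0.1],
    "billing-question": [0.0, 0.1, 0.9],
    "api-rate-limits": [0.1, 0.9, 0.2],
}
```

With these toy vectors, `related_topics("function-calling-bug", TOY_EMBEDDINGS)` returns `["function-calling-args", "api-rate-limits"]`, since the other function-calling topic points in nearly the same direction as the query.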
> Using Twitter-roBERTa-base for sentiment analysis, we generated a post_sentiment label (negative, positive, neutral) and post_sentiment_score confidence score for each post.
We do the same, with even the same model, and conveniently show that information on the admin interface of the forum. Again all open source: https://github.com/discourse/discourse-ai/tree/main/lib/sent...
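As a rough sketch of how a label and a confidence score can be derived from a three-class sentiment classifier, the snippet below applies a softmax to raw class scores and keeps the most probable class. The label order (negative, neutral, positive) and the example logits are assumptions for illustration, and the helper names are hypothetical; this is neither the write-up's pipeline nor the discourse-ai code, and it does not load the actual model.

```python
import math

# Assumed class order; check the model card of whichever model you use.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_post(logits):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "post_sentiment": LABELS[best],
        "post_sentiment_score": probs[best],
    }
```

For example, `label_post([2.0, 0.5, -1.0])` labels the post `negative` with a score of roughly 0.79.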
Disclaimer: I'm the tech lead on the AI parts of Discourse, the open source software that powers OpenAI's community forum.
by wavyknife on 3/28/24, 4:12 PM
(disclaimer: I work for Discourse)
Discourse has an AI plugin that admins can run on their community to generate their own sentiment analysis (among other things), though it's not quite as thorough as this write-up! https://meta.discourse.org/t/discourse-ai-plugin/259214
We're always interested to see how public data can be used like this. It's something that can be a lot more difficult on closed platforms.
by SunlitCat on 3/28/24, 3:51 PM
I didn't even know they have community forums. Looking at the main homepage (openai.com), the only external links I can find are to chatgpt and their docs hosted on platform.openai.com. The other links lead to their socials, github and soundcloud (of all places).
Maybe I'm not looking thoroughly enough, so I may be wrong, tho!
by miduil on 3/28/24, 3:44 PM
That's an interesting write-up, I wonder how this would look for other big Discourse communities such as NixOS.
by klooney on 3/29/24, 1:20 AM
What's the "Day Knowledge Direction" cluster in the Atlas view?
by fzysingularity on 3/28/24, 5:21 PM
So epic, thank you for making this dataset available to everyone!
by alright2565 on 3/29/24, 3:00 PM
I saw this part:
> Every Discourse Discussion returns data in JSON if you append .json to the URL.
then this:
> Raw data was gathered into a single JSONL file by automating a browser using Playwright.
Kinda seems to me like having a whole browser instance for this isn't necessary? I would have been surprised if this .json pattern didn't continue for all pages, and it turns out that it does in fact also work for the topic list: https://community.openai.com/latest.json
The other place I've seen this sort of API pattern is reddit. For example, https://www.reddit.com/r/all.json or (randomly chosen) https://www.reddit.com/r/mildlyinfuriating/comments/1bqn3c0/...
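A small sketch of the `.json` pattern described above, kept offline so it runs without network access. `to_json_url` and `topic_titles` are hypothetical helper names, and the `topic_list` / `topics` shape of the sample payload is an assumption about what a Discourse topic list returns; verify it against a real response before relying on it.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Append .json to the path of a page URL, keeping any query string.

    Assumes the URL has a non-empty path such as /latest or /r/all.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def topic_titles(payload):
    """Pull (id, title) pairs out of a topic-list payload (assumed shape)."""
    topics = payload.get("topic_list", {}).get("topics", [])
    return [(topic["id"], topic["title"]) for topic in topics]


# Made-up sample standing in for a real response body.
SAMPLE = json.loads(
    '{"topic_list": {"topics": [{"id": 1, "title": "Example topic"}]}}'
)
```

Here `to_json_url("https://community.openai.com/latest")` gives `https://community.openai.com/latest.json`, which could then be fetched with any plain HTTP client instead of a browser, subject to the site's rate limits and terms.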
by velid0 on 3/28/24, 3:53 PM
Now train a gpt based on the data :D
by garyiskidding on 3/29/24, 8:33 AM
This is really amazing. Pretty insightful. Thank you.
by xandrius on 3/28/24, 3:30 PM
Love it, just for the sole reason of turning something OpenAI made into a dataset for everyone else :D
by dorkwood on 3/28/24, 4:17 PM
I did a bit of data scraping for fun in the past, but I was never quite sure of the legality of what I was doing. What if I was breaking some law in some jurisdiction of some country? Was someone going to track me down and punish me?
OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.
by enonimal on 3/28/24, 3:56 PM
> Number of Posts with negative sentiment, grouped by Topic
> # 1 Result: Python Packaging
Checks out
by throwaway98797 on 3/28/24, 3:50 PM
did they have the right to use all their data?
/s