by msp26 on 5/21/25, 3:12 PM
Fantastic. I wonder how much random technical info is buried in these servers. I hate what it's done to game modding.
by leotravis10 on 5/21/25, 2:46 PM
by AStonesThrow on 5/21/25, 4:39 PM
Now those of us who've been around the block know that Discord is merely the latest iteration on chat servers such as IRC.
I'm interested to know, from anyone here who's an IRC operator or server/network admin, how the IRC community deals with scraping and bots, because in the early 90s, it was never an issue of corporate Terms of Service or legalese, but typically handled by community standards, and probably, people did whatever they could get away with, and this needed to be anticipated and tolerated by the other participants in any given server or channel.
I doubt that IRC users, back in the day or in the present, have any illusions of privacy, when logging or reflecting or bouncing chats is more or less a built-in feature and an integral component of such a networked chat service.
by roskelld on 5/21/25, 4:35 PM
I don't know if Discord fixed it as I haven't checked in a few years, but I tinkered with scraping some public Discords and I found that I could see hidden channels, not the data, but the channel names, which could do things like reveal to me if the same Discord was used for in-house development if it was a product Discord. Not great.
by AStonesThrow on 5/21/25, 3:07 PM
It says they used ethical anonymization, but we’ve seen other scrapers are always completely in violation of Discord’s TOS.
So did Discord cooperate, or give special authorization for this collection? It wouldn’t appear that they could do so, if privacy belongs to their users at all.
by kd5bjo on 5/21/25, 3:07 PM
A quick read-through of their anonymization process seems to indicate that they didn’t scan the message contents for PII (other than usernames).
If true, that seems like a huge oversight. I also wonder what would happen if someone finds their information in the dataset and requests it to be removed per GDPR or other privacy legislation.
by gynvael on 5/23/25, 10:29 AM
by charcircuit on 5/21/25, 4:12 PM
>Data was collected through Discord's public API, adhering to ethical guidelines
How is it ethical to break Discord's terms of service? An ethical researcher would respect any contracts that they agreed to and would not violate them to collect more data.
by sneak on 5/21/25, 4:46 PM
Now imagine the data mining that Discord can do on the complete DM history of every user. It’s not e2ee, remember.
by recursive4 on 5/21/25, 3:31 PM
...When you realize GPT-5 is going to be trained on your meme preferences...
by SirMaster on 5/21/25, 3:02 PM
The biggest problem with Discord is that it isn't normally publicly searchable. And it seems to be a modern replacement for internet forums, which historically were publicly searchable and often had a lot of great information about various hobbies and things.
by giancarlostoro on 5/21/25, 3:04 PM
Pretty sure this violates Discord's Terms of Service. There was someone selling access to logs from servers that the person running the website was joining with self-bots (against the TOS), and the person would just log all available data. Discord definitely got legal on them. I wonder if this is even ethical, taking textual data from people without their knowledge. Not to mention, the number of minors on Discord alone gives me a lot of concern there too.
by candiddevmike on 5/21/25, 3:01 PM
> Usernames are replaced with consistent pseudonyms generated by the mimesis library, ensuring that identifiers remain unique and contextually meaningful across records. Similarly, user IDs and message IDs are hashed using the SHA-256 algorithm and truncated to 12 characters. This deterministic hashing approach maintains linkage between related records while effectively masking the original identifiers. The global name field, deemed unnecessary for analysis, is entirely removed. Additionally, user IDs embedded within the content field are identified via regular expressions and replaced with their corresponding hash values.
Seems pretty thorough, though this may end up being a good lesson for GenZ/A not to post things in public spaces on the internet.
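For anyone wondering what the quoted steps amount to in practice, here is a rough sketch, not the dataset authors' actual code. The record field names (`user_id`, `message_id`, `username`, `global_name`, `content`) are hypothetical, the regex assumes embedded user IDs show up as Discord mentions like `<@123>` or `<@!123>`, and the pseudonym is a hash-derived stand-in rather than a mimesis-generated name.

```python
import hashlib
import re

HASH_LEN = 12  # truncation length mentioned in the quoted description


def short_hash(identifier: str) -> str:
    """Return the SHA-256 hex digest of identifier, cut to 12 characters."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:HASH_LEN]


# Assumption: user IDs inside message text appear as mentions (<@ID> or <@!ID>).
MENTION_RE = re.compile(r"<@!?(\d+)>")


def scrub_content(content: str) -> str:
    """Replace each mentioned user ID with its truncated hash."""
    return MENTION_RE.sub(lambda m: f"<@{short_hash(m.group(1))}>", content)


def anonymize(record: dict) -> dict:
    """Anonymize one message record (field names are hypothetical)."""
    out = dict(record)
    out.pop("global_name", None)  # dropped entirely, per the description
    out["user_id"] = short_hash(record["user_id"])
    out["message_id"] = short_hash(record["message_id"])
    # Stand-in for a consistent pseudonym; the description uses mimesis here.
    out["username"] = "user_" + short_hash("name:" + record["username"])[:8]
    out["content"] = scrub_content(record["content"])
    return out


if __name__ == "__main__":
    sample = {
        "user_id": "111",
        "message_id": "222",
        "username": "alice",
        "global_name": "Alice A.",
        "content": "hey <@!333>, mail me at alice@example.com",
    }
    print(anonymize(sample))
```

Because the hash is deterministic, the same user ID maps to the same 12-character value everywhere, which is what keeps records linkable. Note that in this sketch the email address in `content` passes through untouched: only mention-style IDs are rewritten, not free-text PII.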
by prmph on 5/21/25, 4:15 PM
Why is it so hard to export your own messages out of Discord, Slack, etc?
We have regressed from the open email standard and gone back to these opaque islands of data that do not adhere to any standard.
Slack refused to show me my own messages past a certain age unless I paid up, and eventually deleted them.
by pavel_lishin on 5/21/25, 3:17 PM
by zelifcam on 5/21/25, 4:51 PM
Discord was one of the most upsetting wrong turns made with the modern internet. Its primary users at the time were children and now here we are.
by daft_pink on 5/21/25, 4:51 PM
Does anyone else think it’s super creepy that someone’s going through all our messages this way?
by spencerflem on 5/21/25, 3:10 PM
insane. people doing awful stuff like this is why the world is retreating into private group chats.
these researchers should be ashamed