A Brazilian team used Discord’s API to scrape 10% of its open servers.
Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active.
Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called "Searchcord" based on a different data set that shows non-anonymized chat histories.
I was hoping to play around with the dataset over the weekend to toy with some text-embedding techniques, but they’ve pulled the cord on the download links.
Anyone have a copy of the full archive they’re willing to share, or a magnet link?
Seriously. It's beyond painful when some open source project only uses Discord for communication. You have to hope that you post your question at a time when the right people are online, and that there's not a more interesting conversation going on, otherwise it just gets lost. Index that whole dataset.
I've always wanted to contribute to The Cutting Room Floor wiki but they hide registration behind a Discord server bot that will give the registration code.
I've seen a few projects doing just that with answeroverflow.com and they have come up in my web searches. Not really a solution but at least a stopgap.
I spent nearly three hours today between Discord and Matrix trying to figure out how to get two pieces of software to talk using a certain protocol.
Imagine if there were online indexable platforms where people could publish this information so it’s easily accessible rather than having to scour through message logs hoping to find the right keywords. Such a technology surely doesn’t exist already, right?
I don't hate Discord, I simply hate that so many projects and companies have decided to use it as the wrong tool for the job.
It's fine for its intended use case, which is bickering with my friends about video games and fiction, and spamming each other with .gifs and meme images.
Yeah, but then you get something like when people protest-deleted their history on Reddit, which is fine as a protest tactic but leaves a hole: the thread where your specific question came up now has nothing in it.
Lol, I read this headline and thought "thank fuck, probably the only option to have Discord's content readable". I like how universal this opinion is.
That’s good news. Internet archiving is an important endeavor because you never know when they’ll pull the plug. Now it’s a little more secure and probably far more useful than in Discord’s hands alone.
If they aren't comfortable with their Discord messages being public, perhaps they shouldn't have posted those messages in a public forum that the public can access.
It wasn't the chats though. It was public servers that can be found through the discovery tab. I would love to be up in arms about this and convince people to switch, but... looking at it objectively, this isn't terribly different from if they'd archived public subreddits and their posts.
So how does this work? Like how did they get those messages through API calls? Also, is this not something that Discord would dislike since it dilutes the value of their data hoard?
Every time you post, you're posting so that Meta, Google, Reddit and every known retail store like Walmart, Target, Kroger, etc. can see it because they bought that info or harvested it themselves. I think these are great announcements so people can see who sees and manipulates you with your own contributions of data.