That's the basis for this lawsuit though. Reddit adjusted its ToS to forbid anyone but its explicitly approved business partners from scraping Reddit data.
I believe Google is the only company legally allowed to scrape Reddit data for AI training usage. Anthropic isn't.
Did Anthropic accept the ToS? Reddit publishes its content on a public website that anyone can visit and read without agreeing to any terms. If they didn't accept the ToS, then the only thing regulating what you can do with that public information is the usual copyright law. AI training has yet to be shown to be a violation of copyright.
The biggest problem is that the API and post history only go back 1000 comments. If you have ever made more than 1000 comments, the only way you are going to overwrite the older ones with nonsense is if you manage to find a permalink to each of them.
Wait, really? That's a massive issue, and I didn't see any comments about this back in 2024 when everyone was migrating here and scrubbing their Reddit accounts.
Can you just use a web driver with Selenium or something to get the permalinks the way a human would and then scrub them that way? It's not efficient but it's only really a one-time use tool anyway so if it works then it works.
AI companies should be in favor of expanding the fediverse, because we don't have the resources to fight legal wars against them taking our posts and comments and training their stuff on them.
You imply they’re not already here. All it takes is setting up a server that’s federated with common endpoints and then sucking everything in via ActivityPub. No need for scraping.
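To illustrate why no scraping is needed: a federated server has public posts pushed to its inbox as ActivityStreams `Create` activities. Below is a minimal sketch of pulling the text out of one such activity. The function name `extract_public_note` and the example URLs are made up for illustration, and a real server would also need HTTP signature verification and the follow handshake, which are left out here.

```python
import json

# Special collection URI that ActivityStreams uses to mark a post as public.
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def _as_list(value):
    """Addressing fields may be a single string or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def extract_public_note(raw_activity):
    """Return author and content of a public Note delivered in a Create activity.

    Hypothetical helper: returns None for anything that is not a public
    Create/Note pair.
    """
    activity = json.loads(raw_activity)
    if activity.get("type") != "Create":
        return None
    obj = activity.get("object")
    if not isinstance(obj, dict) or obj.get("type") != "Note":
        return None
    audience = _as_list(obj.get("to")) + _as_list(obj.get("cc"))
    if PUBLIC not in audience:
        return None
    return {"author": obj.get("attributedTo"), "content": obj.get("content")}


# Example payload with placeholder URLs (not real accounts).
sample = json.dumps({
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": "https://example.social/users/alice",
    "object": {
        "type": "Note",
        "attributedTo": "https://example.social/users/alice",
        "to": PUBLIC,
        "content": "<p>Hello fediverse</p>",
    },
})
print(extract_public_note(sample))
```

Anything addressed to the public collection arrives this way at every server that follows the author, so whoever runs such a server gets the content by design.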
I’m a little surprised they let me delete my account after they permabanned me, and that deleting my account deleted all 18 years' worth of my posts and comments (and not just my username/profile). All that data they want to train their bots on, gone (at least publicly, anyway).
I would imagine that your records are flagged as deleted in a DB, but they are still being used to train models for those that are paying.
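For anyone unfamiliar with the pattern, here is a rough sketch of what "flagged as deleted" (a soft delete) usually looks like. This is purely illustrative and says nothing about how Reddit's storage actually works; the table layout, column names, and function names are all invented for the example.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory table with a hypothetical 'deleted' flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        "id INTEGER PRIMARY KEY, author TEXT, body TEXT, "
        "deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO comments (author, body) VALUES (?, ?)",
        [("alice", "first comment"), ("bob", "a reply"), ("alice", "second comment")],
    )
    conn.commit()
    return conn


def soft_delete_account(conn, author):
    """Mark every row by this author as deleted; nothing is removed from disk."""
    conn.execute("UPDATE comments SET deleted = 1 WHERE author = ?", (author,))
    conn.commit()


def public_view(conn):
    """What the site would show: only rows not flagged as deleted."""
    rows = conn.execute("SELECT body FROM comments WHERE deleted = 0 ORDER BY id")
    return [row[0] for row in rows]


def internal_export(conn):
    """What an internal export could still read: every row, flag ignored."""
    rows = conn.execute("SELECT body FROM comments ORDER BY id")
    return [row[0] for row in rows]


conn = make_db()
soft_delete_account(conn, "alice")
print(public_view(conn))
print(internal_export(conn))
```

The point is that "deleted" here is just a filter on the public query. Whether the underlying rows are ever purged is a separate decision made by whoever runs the database.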
It does at least stop bots from scraping, but at that point, I'd almost rather have bots scrape just to create more overhead for Reddit and to lessen the value for the people paying for premium access.