Those are truly useless against bad actors and are instead only annoying for the humans who read them. And good actors with proper licenses won't be scraping Lemmy, Reddit or Twitter.
You just cannot prevent it on Lemmy, because if one instance puts up a filter like Anubis, another will not, and it is not feasible to require every instance to do so. Also, this is an open platform by nature, and there is no group or company that can mandate rules of access. And by limiting non-humans, you might also be limiting real users with peculiar configurations or heavy privacy middleware.
The point (as I see it) is not so much to stop scraping as to prevent bots from effectively DDoS-ing web services. As others have said, ActivityPub content is public, and there are ways to get it without slamming instances with scraper bots.
It is. I saw ClaudeBot and GPTBot scraping my instance and made a post about it on fuckai, but I have blocked all these bots now and my instance is a lot faster.
I don't host a Lemmy instance, but I post links in my comments. I sometimes generate unique-ish URLs to share updates about specific versions of my hobby projects. I've seen them queried a few times in my Apache logs by user agents claiming to be from OpenAI, Anthropic, etc., as well as by search engine crawler bots.
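For anyone who wants to check their own logs the same way, here is a rough sketch of how that could be done. It assumes Apache's "combined" log format; the marker list, the `count_crawler_hits` helper and the sample log lines are all made up for illustration. Keep in mind a user agent is only a claim and can be spoofed.

```python
import re
from collections import Counter

# Apache "combined" log format: the user agent is the last quoted field.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumed substrings to look for; adjust to whatever shows up in your logs.
CRAWLER_MARKERS = ("GPTBot", "ClaudeBot", "Googlebot")


def count_crawler_hits(lines):
    """Count requests whose user agent contains one of the markers."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a combined-format line, skip it
        agent = match.group("agent").lower()
        for marker in CRAWLER_MARKERS:
            if marker.lower() in agent:
                hits[marker] += 1
                break  # count each request once
    return hits


# Fabricated sample lines (documentation IP ranges, invented paths and agents).
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /builds/v1.2-a1b2 HTTP/1.1" 200 512 "-" "ExampleFetcher/1.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Oct/2024:14:02:11 +0000] "GET /builds/v1.2-a1b2 HTTP/1.1" 200 512 "-" "ExampleFetcher/1.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [10/Oct/2024:14:10:40 +0000] "GET /builds/v1.3-c3d4 HTTP/1.1" 200 640 "-" "ExampleFetcher/1.0 (compatible; GPTBot/1.0)"',
    '192.0.2.44 - - [10/Oct/2024:14:12:03 +0000] "GET /builds/v1.3-c3d4 HTTP/1.1" 200 640 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"',
    '192.0.2.99 - - [10/Oct/2024:14:20:00 +0000] "GET /robots.txt HTTP/1.1" 404 196 "-" "ExampleFetcher/1.0 (compatible; Googlebot/2.1)"',
    'not a log line',
]

if __name__ == "__main__":
    for marker, count in count_crawler_hits(SAMPLE_LOG).most_common():
        print(marker, count)
```

On a real server you would pass the open log file instead of `SAMPLE_LOG`, since the function only needs an iterable of lines.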
They don’t really need to scrape. They just have to set up their own federated instance and the ActivityPub protocol will willingly hand it all to them in a nicely parsable format.
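To give an idea of what "nicely parsable" means here, below is a minimal sketch of pulling the text out of a federated activity. The payload is a hand-written example that only uses standard ActivityStreams 2.0 properties (`type`, `actor`, `object`, `attributedTo`, `content`, `published`); the `example.social` instance and the `extract_note` helper are hypothetical, and what a real Lemmy instance sends carries more fields and other object types.

```python
import json

# Hand-written example payload, not captured from a real instance.
SAMPLE_ACTIVITY = """
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Create",
  "actor": "https://example.social/u/alice",
  "object": {
    "type": "Note",
    "id": "https://example.social/comment/123",
    "attributedTo": "https://example.social/u/alice",
    "published": "2024-10-10T13:55:36Z",
    "content": "<p>Hello fediverse</p>"
  }
}
"""


def extract_note(activity_json):
    """Return the main fields of a Create/Note activity, or None otherwise."""
    activity = json.loads(activity_json)
    if activity.get("type") != "Create":
        return None
    obj = activity.get("object")
    # The object may also be a bare URL string; this sketch ignores that case.
    if not isinstance(obj, dict) or obj.get("type") != "Note":
        return None
    return {
        "id": obj.get("id"),
        "author": obj.get("attributedTo"),
        "published": obj.get("published"),
        "content": obj.get("content"),
    }


if __name__ == "__main__":
    print(extract_note(SAMPLE_ACTIVITY))
```

No HTML scraping is involved: the author, timestamp and body arrive as labeled JSON fields.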
I'm sure the AI devs who are so lazy they cannot train their AI on anything other than scraped HTML can set up a Lemmy instance and point their crawlers at that.