Codeberg: an army of AI crawlers is slowing us down badly; the crawlers have learned how to solve the Anubis challenges.

This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.

Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they're also people, aren't they?
- Put the entire datacenter buildings into prison
Good luck with that! Not only is a company doing it, which means no individual person will go to prison, but it's a Chinese company with no regard for any laws that might get passed.
- The people determining US legislation have said, "how can we achieve skynet if our tech trillionaire company sponsors can't evade copyright or content licensing?" But they also say if "we don't spend every penny you have on achieving US controlled Skynet, then China wins."
  Speculating on "Huawei network can solve this", doesn't mean that all the bots are Chinese, but does confirm that China has a lot of AI research, and Huawei GPUs/NPUs are getting used, and successfully solving this particular "I am not a robot challenge".
  It's really hard to call "amateur coding challenge" competition web site a national security threat, but if you hype Huawei enough, then surely the US will give up on AI like it gave up on solar, and maybe EVs. "If we don't adopt Luddite politics and all become Amish, then China wins" is a "promising" new loser perspective on media manipulation.

When you realize that you live in a cyberpunk novel. The AI is cracking the ICE. https://cyberpunk.fandom.com/wiki/Black_ICE

I love seeing how much influence William Gibson had on cyberpunk.
- It's not intentional but the chap ended up writing works that defined both the Cyberpunk (Neuromancer) and Steampunk (The Difference Engine) genres.
  Can't deny that influence.
most the ICE I've read about are white.
haven't tried it, it's in the closed apples store.. but it's a start..
https://apps.apple.com/us/app/iceblock/id6741939020

Do we all want the fucking Blackwall from Cyberpunk 2077?

Fucking NetWatch?

Because this is how we end up with them.

....excuse me, I need to go buy a digital pack of cigarettes for the angry voice in my head.

Consider nicotine+
- What was that?
  I was sucking on my nicotine nipple, err, I mean my vape.
  (Hey, it's a more affordable stimulant addiction than coffee now!)

Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.

Just reading the Ender's game series, very on-point.
- As long as NetWatch keeps them behind the Blackwall, we're all good.
Reminds me of the "store data inside slow network requests for the in-transit duration". It was a fun article to read.
- Link, please?
I like the idea but couldn't you just go the more direct route and mine crypto?
Like a public service CAPTCHA / BOINC hybrid

I blocked almost all the big players in hosting, plus China, Russia and Vietnam, and now they're bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.

Soon everything on the internet will be behind a wall.

This isn't sustainable for the ai companies, when the bubble pops it will stop.
- In the meantime, sites are getting DDoS-ed by scrapers. One way to stop your site from getting scraped is having it be inaccessible... which is what the scrapers are causing.
  Normally I would assume DDoS-ing is performed in order to take a site offline. But AI scrapers require the opposite. They need their targets online and willing. One would think they'd be a bit more careful about the damage they cause.
  But they aren't, because capitalism.
Not necessarily compromised, I saw a VPN provider (don’t remember the name) that offered a free tier where the client accepts being used for this.
And I suspect that in the future some VPN companies will be exposed doing the same but with their paid customers.
There are many commercial VPNs offering residential IPs. I doubt they use malware.

I really feel like scrapers should have been outlawed or actioned at some point.

But they bring profits to tech billionaires. No action will be taken.
- No, the reason no action will be taken is because Huawei is a Chinese company. I work for a major US company that's dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There's nothing we can do legally about Chinese scrapers.
I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?
The problem with AI scrapers and bots is their scale, thousands of requests to webpages that the internal server cannot handle, resulting in slow traffic.
- Does your tool respect the site’s robots.txt?
- Seems like an api request would be preferable for the site you’re checking. I don’t imagine they’re unhappy with the traffic if they haven’t blocked it yet
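For what it's worth, checking robots.txt from a small tool like that is only a few lines. Here is a rough sketch using Python's standard `urllib.robotparser`; the helper names, the `chapter-checker` user agent and the example.org URLs are made up for illustration, and a real tool would fetch the site's actual robots.txt instead of using a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file (already downloaded)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

if __name__ == "__main__":
    parser = make_parser(EXAMPLE_ROBOTS)
    agent = "chapter-checker"  # hypothetical user agent name
    print(may_fetch(parser, agent, "https://example.org/series/1"))
    print(may_fetch(parser, agent, "https://example.org/private/x"))
    # Seconds the site asks crawlers to wait between requests, if stated.
    print(parser.crawl_delay(agent))
```

A once-a-day check that honours the disallow rules and any stated crawl delay is a very different thing from thousands of requests hitting the same server.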

If someone just wants to download code from Codeberg for training, it seems like it'd be way more efficient to just clone the git repositories or even just download tarballs of the most-recent releases for software hosted on Codeberg than to even touch the Web UI at all.

I mean, maybe you need the Web UI to get a list of git repos, but I'd think that that'd be about it.

Then they'd have to bother understanding the content and downloading it as appropriate. And you'd think if anyone could understand and parse websites in realtime to make download decisions, it'd be giant AI companies. But ironically they're only interested in hoovering up everything as plain web pages to feed into their raw training data.
- The same morons scrape Wikipedia instead of downloading the archive files which trivially can be rendered as web pages locally

Write TOS that state that crawlers automatically accept a service fee and then send invoices to every crawler owner.

Huawei is Chinese. There's literally zero chance a European company like Codeberg is going to successfully collect from a company in China over a TOS violation.
- True, but it can help limit the European AI scrapers too
Cloudflare had a similar idea: Introducing pay per crawl: Enabling content owners to charge AI crawlers for access

I just thought that a client-side proof-of-work (or even only a delay) bound to the IP might push the AI companies to behave, because single-visit-per-IP crawling gets too expensive or slow, and the normal abusive crawlers can simply be blocked. But they already have mind-blowing computing and money resources and only want your data.

But if there was a simple-to-use integrated solution and every single webpage used this approach?
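To make the idea concrete, here is a minimal hashcash-style sketch of a proof-of-work challenge bound to the client IP. It is only an illustration under my own assumptions, not how Anubis or any other real tool does it: the function names, the HMAC-based challenge and the "leading zero hex digits" difficulty rule are all hypothetical choices.

```python
import hashlib
import hmac


def make_challenge(server_secret: bytes, client_ip: str) -> str:
    """Derive a challenge tied to the client IP, so a solution can't be reused elsewhere."""
    return hmac.new(server_secret, client_ip.encode(), hashlib.sha256).hexdigest()


def _digest(challenge: str, nonce: int) -> str:
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while not _digest(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    secret = b"hypothetical-server-secret"
    challenge = make_challenge(secret, "203.0.113.7")
    nonce = solve(challenge, difficulty=3)  # about 16**3 = 4096 hashes on average
    print(nonce, verify(challenge, nonce, difficulty=3))
```

The asymmetry is the whole point: the visitor pays thousands of hashes, the server pays one. The catch, as said above, is that a crawler with enough hardware can simply pay that cost too.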

Believe me, these AI corporations have way too many IPs to make this feasible. I've tried per-IP rate limiting. It doesn't work on these crawlers.
Solution was invented long ago. It's called a captcha.
A little bother for legitimate users, but a good captcha is still hard to bypass even using AI.
And from the end user's standpoint, I'd rather lose 5 seconds on a captcha than have the browser run an unsolicited heavy crypto challenge on my end.
- For years, we’ve written that CAPTCHAs drive us crazy. Humans give up on CAPTCHA puzzles approximately 15% of the time and, maddeningly, CAPTCHAs are significantly easier for bots to solve than they are for humans.
  https://blog.cloudflare.com/turnstile-ga/
  I hate captchas.
- AI is better at solving captchas than you.
Are you planning to just outright ban IPv6 (and thus half the world)?
Any IP based restriction is useless with IPv6
- Not really true, you can block ranges.
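A rough sketch of what blocking ranges can look like with Python's standard `ipaddress` module: treat each IPv6 client as its enclosing prefix rather than as a single address. The /64 default and the helper names are my assumptions for illustration; how much address space one client really controls varies by provider.

```python
import ipaddress


def block_key(addr: str, v6_prefix: int = 64) -> str:
    """Map an address to the unit we rate-limit or block on.

    IPv4 addresses are kept as-is; IPv6 addresses collapse to their
    enclosing network (a /64 by default, which is an assumption).
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    return str(ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False))


def is_blocked(addr: str, blocked_ranges: list[str]) -> bool:
    """Check an address against a list of blocked CIDR ranges (IPv4 or IPv6)."""
    ip = ipaddress.ip_address(addr)
    for cidr in blocked_ranges:
        net = ipaddress.ip_network(cidr, strict=False)
        if net.version == ip.version and ip in net:
            return True
    return False


if __name__ == "__main__":
    # Documentation-only example addresses, not real offenders.
    print(block_key("2001:db8:0:1:aaaa::1"))
    print(is_blocked("2001:db8:0:1:bbbb::2", ["2001:db8:0:1::/64"]))
```

Two requests from different addresses inside the same /64 then count against the same key, so rotating through a huge IPv6 allocation buys the crawler much less.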
What if we had some protocol by which the proof-of-work is transferable? Then not only would there be a cost to using the website, but also the operator would receive that cost as payment.
- It's theoretically viable, but every time that has been tried has failed
  There are a lot of practical issues, mainly that it's functionally identical to a crypto miner malware

Are those blocklists publicly available somewhere?

I would hope not. Kinda pointless if they become public
- On the contrary. Open community based block lists can be very effective. Everyone can contribute to them and asphyxiate people with malicious intents.
  If you're thinking something like "if the blocklist is public, malicious agents simply won't use those IPs", I don't think that makes much sense. A malicious agent finds out that any of their IPs is blocked as soon as they use it anyway.

Begun, the information wars have.

The wars have been fought and lost a while ago tbh

I run my own gitea instance on my own server and within the past week or so I've noticed it just getting absolutely nailed. One repo in particular, a Wayland WM I built. Just keeps getting hammered over and over by IPs in China.

Just keeps getting hammered over and over by IPs in China.
Simple solution: Block Chinese IPs!
Are you using Anubis?
Why aren't you firewalling it to only allow your IP? Are you sharing your code with third parties?
- a few repos i've made available to the public, the wayland wm for example, so I haven't gotten around to blocking IPs just yet.

Seems like such a massive waste of bandwidth since it's the same work being repeated by many different actors to piece together the same dataset bit by bit.

Huh, why does Anubis use SHA256? It's been optimized to all hell and back.

Ah, they're looking into it: https://github.com/TecharoHQ/anubis/issues/94

It's being investigated at least, hopefully a solution can be found. This will probably end up in a constantly escalating battle with the AI companies. https://github.com/TecharoHQ/anubis/issues/978

Uuughhh I knew it'd always be a cat-and-mouse game, sincerely hope the Anubis devs figure out how to fuck up the AI crawlers again

Do they really hit that much? I might not have a popular opinion there, but if they don't have a performance impact then I probably wouldn't care

They're getting hammered again this morning.