The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

When a firm outright admits to bypassing or trying to bypass measures taken to keep them out, you'd think that would be a slam dunk case of unauthorized access under the CFAA with felony enhancements.

Fuck that. I don't need prosecutors and the courts to rule that accessing publicly available information in a way that the website owner doesn't want is literally a crime. That logic would extend to ad blockers and editing HTML/js in an "inspect element" tag.
- That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.
- They already prosecute people under the unauthorized access provision. They just don’t prosecute rich people under it.
Right? Isn’t this a textbook DMCA violation, too?
- for us, not for them. wait until they argue in court that actually its us at fault and we need to provide access or else

It's difficult to be a shittier company than OpenAI, but Perplexity seems to be trying hard.

Step 1, SOMEHOW find a more punchable face than Altman
- put META android zuckerberg on or mechahitler musk.
- Altman’s face looks like it’s already been punched

This is a nice CloudFlare ad

yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.
- DEATH TO CLOUDFLARE!
- I'm out of the loop, what's wrong with cloud flare?

Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.

So, I assume Perplexity uses appropriate identifiable user-agent headers, to allow hosters to decide whether to serve them one way or another?

yeah it's almost like there was already a system for this in place
- THE CAKE DAY IS NOW. (i dont have an image at hand)
And I'm assuming if the robots.txt states their UserAgent isn't allowed to crawl, it obeys it, right? :P
- No, as per the article, their argument is that they are not web crawlers generating an index, they are user-action-triggered agents working live for the user.
It's not up to the hoster to decide whom to serve content. The web is intended to be user-agent agnostic.
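
For anyone wondering what "obeying robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `ExampleAssistantBot` name are made up for illustration; they are not Perplexity's real user agent or any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish (illustrative only).
ROBOTS_TXT = """\
User-agent: ExampleAssistantBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAssistantBot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

The check only works if the client sends an honest user agent and chooses to run it, which is why robots.txt is voluntary compliance rather than enforcement.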

Uh.. good?

Traveling snake oil salesman complains he can't pick people's locks.

rare cloudflare w

As far as security is concerned, their w's are pretty common tbh. It's just the whole centralization issue.

Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

Isn’t that a literal computer crime?

- See: Facebook/Meta
*puts on evil hat* CloudFlare should DRM their protection then DMCA Perplexity and other US based "AI" companies to oblivion. Side effect, might break the Internet.
- Worth it.
- The Internet was already ruined, cloudflare is just bandaids on top of band aids.

You could say they are... Perplexed.

That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.

I'm still holding out for Stephen Hawking to mail out Demon Summoning programs.

You'd think that a competent technology company, with their own AI would be able to figure out a way to spoof Cloudflare's checks. I'd still think that.

Or find a more efficient way to manage data, since their current approach is basically DDOSing the internet for training data and also for responding to user interactions.
- This is not about training data, though.
  Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
  Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and are not the source of DDoS attacks on websites.
  I think the solution is quite clear, though: either make use of the user identity to waltz through the blocks, or even make use of the user's browser to do it. Once a captcha appears, let the user solve it.
  Though technically making all this happen flawlessly is quite a big task.
see, but they're not competent. further, they don't care. most of these ai companies are snake oil. they're selling you a solution that doesn't meaningfully solve a problem. their main way of surviving is saying "this is what it can do now, just imagine what it can do if you invest money in my company."
they're scammers, the lot of them, running ponzi schemes with our money. if the planet dies for it, that's no concern of theirs. ponzi schemes require the schemer to have no long term plan, just a line of credit that they can keep drawing from until they skip town before the tax collector comes
Perplexity: "But that would cost us moneeyyyy!"

Good. I went through my CF panel, and blocked some of those "AI Assistants" that by default were open, including Perplexity's.

CF panel? Your light bulb??
- CF == Cloudflare :)

ask AI how to do it?

They tried nothing & they're all out of ideas.

I don't like cloudflare but it's nice that they allow people to stop AI scraping if they want to

CloudFlare has become an Internet protection racket and I'm not happy about it.
- It's been this from the very beginning. But they don't fit the definition of a protection racket as they're not the ones attacking you if you don't pay up. So they're more like a security company that has no competitors due to the needed investment to operate.
- they're good at protecting websites but damn, having a company being MITM feels so wrong

Well... Good.

This is why companies like Perplexity and OpenAI are creating browsers.

good, that means it’s working

I’m gonna be frustrated (though not surprised) if the response is anything other than this.

Skill issue. Cope and seethe

this made me lol

I set up a WAF for my company's publicly facing developer portal to block out bot traffic from assholes like these guys. It reduced bot traffic to the site by something like - I kid you not - 99.999%.

Fucking data vultures.

ahahahahah, great, fck AI

💁u
Here, you dropped this!

Can someone with more knowledge shine a bit more light on this whole situation? I'm out of the loop on the technical details

AI crawlers tend to overwhelm websites by doing the least efficient scraping of data possible, basically DDOSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.
Cloudflare is just trying to keep the bots from overwhelming everything.
Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service is to help protect against DDOS and malicious traffic.
A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.
This is a response to that from Perplexity who run an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement and Cloudflare accused them of “stealth scraping” and ignoring robots.txt and other things.
- A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.
  I think it's also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).
- they don't outright block ai crawlers. they added some new tools and options for managing or blocking ai bot traffic which the cloudflare customer can choose to use or to not use.
  im running a free educational resource and i let the crawlers hit my site all they want because its useful knowledge unavailable anywhere else and it's served to them from cloudflare's free tier cache. i just don't know why they have to read it ten thousand times a day.
- But the website owner can still choose to continue blocking them right? Without using additional stuff like Anubis that is.
Perplexity (an "AI search engine" company with 500 million in funding) can't bypass cloudflare's anti-bot checks. For each search Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks perplexity's scrapers because they ignore robots.txt and mimic real users to get around cloudflare's blocking features. Perplexity argues that their scraping is acceptable because it's user initiated.
Personally I think cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links) and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
- …and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
  That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.
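
To make the caching point concrete: whether Perplexity actually caches or not is the commenter's claim, not something established here, but a basic time-limited cache is not much code. A rough sketch, where `PageCache`, the one-hour TTL, and the injected `fetch` function are all hypothetical names chosen for illustration:

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Serve repeat requests for the same URL from memory until the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # function that really hits the origin site
        self._ttl = ttl_seconds      # how long a stored copy stays fresh
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]         # fresh copy: no request to the origin
        body = self._fetch(url)      # stale or missing: fetch once and store
        self._entries[url] = (now, body)
        return body
```

With something like this in front of the fetcher, a thousand users asking about the same page within the hour would cost the origin one request instead of a thousand.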

It seems like it's some kind of distraction to make people think things aren't as bad as they really are, it just sounds too far-fetched to me.

It's like a bear that has eaten too much and starts whining because a small rabbit is running away from him, even though the bear has already eaten almost all the rabbits and is clearly full.

- So that he doesn't have to run after the rabbits, he will learn to raise them and manage them with a fake smile, providing them with a stable life lol.
  Well, I think the thing is that we still live by the law: the strong do what they want, and the weak just whine and complain.

Cry more, Perplexity.

they can't get their ai to check a box that says "I am not a robot"? I'd think that'd be a first year comp sci student level task. And robots.txt files were basically always voluntary compliance anyway.

Cloudflare actually fully fingerprints your browser and even sells that data. That's your IP, TLS, operating system, full browser environment, installed extensions, GPU capabilities etc. It's all tracked before the box even shows up, in fact the box is there to give the runtime more time to fingerprint you.
- Yeah and the worst part is it doesn't fucking work for the one thing it's supposed to do.
  The only thing it does is stop the stupidest low effort scrapers and forces the good ones to use a browser.
Recaptcha v2 does way more than check if the box was checked.
https://stackoverflow.com/a/27299487
- you're not wrong, but it also allows more than 99.8% of the bot traffic through too on text challenges. It's like the TSA of website security. It's mostly there to keep the user busy while cloudflare places itself as a man in the middle of your encrypted connection to a third party. The only difference between cloudflare and a malicious attacker is cloudflare's stated intention not to be evil. With that and 3 dollars I can buy myself a single hard shell taco from tacobell.

Words cannot describe how much I hate this person

I hate that these bots ruin my read it later app. :(

Here comes the ridiculous offer to buy Google chrome with money they don't have: easy delicious scraping directly from the user source

Oh no!

I actually agree with them

This feels like cloudflare trying to collect rent from both sides instead of doing what’s best for the website owners.

There is a problem with AI crawlers, but these technologies are essentially doing a search, fetching several pages, scanning/summarizing them, then presenting the findings to the user.

I don’t really think that’s wrong, it’s just a faster version of rummaging through the SEO shit you do when you Google something.

(I’ve never used perplexity, I do use Kagi’s AI assistant for similar search. It runs 3 searches and scans the top results and then provides citations)

What’s best for the website owners is to have people actually visit and interact with their website. Blocking AI tools is consistent with that.
- For a lot of AI search I actually end up reading the pages, so I don’t know how much this stops that
Well. Try running a web server and you'll find quite quickly that you get hit quick and hard by AI crawlers that do not respect server operators. Unlike web crawlers of old, these will hit a site over and over with sometimes 100s, even 1000s of requests per second to strip mine all the content they can find, as quickly as possible.
When you try to block them by user agent, they start faking real client user agents.
When you block the AS Numbers involved, traffic starts to go down. But there's still a large number of non-organic requests, coming from, well, frankly everywhere. Cellular networks in Brazil, cable internet in the USA, other non-business subscribers in other countries around the world.
How do I know they're not organic? Turn on cloudflare managed challenge and they all go away.
So, personally that's my biggest beef against them. Yes ripping off data without permission is bad already, but this level of trying to bypass any clear sign we do not want you is far worse.
- Yeah that’s fair, and I do agree with Cloudflare stamping out that behaviour.
  What I’m trying to say is there are cases where AI agents act for the user in what the traditional user agent role of browsers would be.
  ETA: That doesn’t excuse things like not having a search index to prevent mass scale access, this would be near 1-1 access patterns per user, which would be infrequent/spaced out
Search engines have been going relatively fine for decades now. But the crawlers from AI companies basically DDOS hosts in comparison, sending so many requests in such a short interval. Crawling dynamic links as well that are expensive to render compared to a static page, ignoring the robots.txt entirely, or even using it to discover unlinked pages.
Servers have finite resources, especially self hosted sites, while AI companies have disproportionately more at their disposal, easily grinding other systems to a halt by overwhelming them with requests.
- that explains why cloudflare keeps asking if you're a bot or not, making you do that captcha.
If a neighborhood is beset by roving bands of thieves, sooner or later strangers will be greeted by a shotgun rather than an invitation to tea, regardless of their intentions. Them's the breaks. Bots are going to take a hit now and their operators are just going to have to deal with it. Sucks when people don't play nice, but this is what you get.
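
The request rates described a few comments up (hundreds or thousands per second from one crawler) are the kind of thing a per-client rate limit is meant to catch. A minimal sliding-window sketch follows; `SlidingWindowLimiter`, the limits, and keying on a client ID string are illustrative assumptions, not how Cloudflare or any particular WAF does it.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        # One queue of request timestamps per client (e.g. per IP address).
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False             # over the limit: reject or challenge
        hits.append(now)
        return True
```

As the comment about residential and cellular IPs points out, a per-IP limit like this stops working once the traffic is spread across thousands of addresses, which is where fingerprinting and managed challenges come in.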

It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites like nexusmods just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely and has been for months now.

Cloudflare is the biggest cancer on the web, fucking burn it.

Linux and Firefox here. No problem at all with Cloudflare, despite having more or less as much privacy preserving add-on as possible. I even spoof my user agent to the latest Firefox ESR on Linux.
Something may be wrong with your setup.
- That's not how it works. Cf uses thousands of variables to estimate a trust score and block people so just because it works for you doesn't mean it works.
- I suspect a lot of it comes down to your ISP. Like the original commenter I also frequently can't pass CloudFlare turnstile when on Wifi, although refreshing the page a few times usually gets me through. Worst case on my phone's hotspot I can much more consistently pass. It's super annoying and combined with their recent DNS outage has totally ruined any respect I had for CloudFlare.
  Interesting video on the subject: https://youtu.be/SasXJwyKkMI
I'm on Linux with Firefox and have never had that issue before (particularly nexusmods which I use regularly). Something else is probably wrong with your setup.
- Thirded. All three (Linux, FF, nexus)
  ZERO ISSUES.
- "Wrong with my setup" - thats not how internet works.
  I'm based in south east asia and often work on the road so IP rating probably is the final crutch in my fingerprint score.
  Either way this should in no way be acceptable.
- In my case, it's usually the VPN.
omg ur a hacker
Did you mean Edge on Windows? 'Cause if so, welcome in!
It happened to me before until I did a Google search. It was my VPN web protection. It was too "overprotective".
Check your security settings, antivirus and VPN

Cry me a river

I can’t get over their CEO that looks like a nine year old. Not sure what it is about him

I think it's the beard, it makes his cheeks look puffed up a bit. His whole expression kinda looks like a grouchy toddler.

I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)

So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.

I completely understand where Perplexity is coming from, but at scale, implementations like ~~this~~ Perplexity's are awful for the web.

(Edited for clarity)
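
The "treat the challenge as a no and hand back a link" approach described above could look roughly like this. It is a sketch under assumptions: as far as I know Cloudflare marks challenge responses with a `cf-mitigated: challenge` header, the status-code fallback is just a heuristic of mine, and `FetchResult` and `handle` are hypothetical names, not the commenter's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass
class FetchResult:
    status: int
    headers: Mapping[str, str]
    body: str


def looks_like_challenge(result: FetchResult) -> bool:
    """Guess whether the response is a bot challenge rather than real content."""
    headers = {k.lower(): v.lower() for k, v in result.headers.items()}
    # Assumption: Cloudflare sets this header on challenge pages.
    if headers.get("cf-mitigated") == "challenge":
        return True
    # Heuristic fallback (assumption): blocked status served by Cloudflare.
    return result.status in (403, 503) and headers.get("server") == "cloudflare"


def handle(url: str, result: FetchResult) -> Dict[str, str]:
    """Back off on a challenge and give the human a deep link instead."""
    if looks_like_challenge(result):
        return {"action": "ask_user", "link": url}
    return {"action": "use_content", "text": result.body}
```

The design choice is that the agent never retries with a spoofed browser identity; a challenge is read as the site owner saying no to automation, and the user visits the page themselves.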

I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.
We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.
- What I meant by "things like this are awful for the web" was that automation through AI is awful for the web. It takes away from the original content creators without any attribution and hits their bottom line.
  My story was supposed to be one about responsible AI, but somehow I screwed that up in my summary.
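
On spotting bots that impersonate Google, mentioned a couple of comments up: I don't know how Akamai does it internally, but the commonly published approach is a reverse DNS lookup followed by a forward lookup to confirm. A sketch of that check, where the allowed domain suffixes are my recollection of Google's guidance and should be treated as an assumption, and the resolver functions are injectable so it can run without network access:

```python
import socket
from typing import Callable, List

# Assumption: hostnames of genuine Google crawlers end in one of these.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host: str) -> List[str]:
    # Returns IPv4 addresses only; IPv6 would need socket.getaddrinfo.
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Check a claimed Googlebot IP via reverse DNS, then confirm with forward DNS."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:                  # lookup failed: cannot verify
        return False
    if not host.endswith(ALLOWED_SUFFIXES):
        return False                 # hostname is not in an expected domain
    try:
        return ip in forward(host)   # hostname must resolve back to the same IP
    except OSError:
        return False
```

The forward lookup matters because anyone can set a reverse DNS record that claims to be Google; only the real owner of the domain controls what the hostname resolves to.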

Gee that's a real removed ain't it perplexity?

They do have a point though. It would be great to let per-prompt searches go through, but not mass scraping

I believe a lot of websites don't want both though

Does it not need to be scraped to be indexed, assuming it’s semi-typical RAG stuff?
- I assume their script does some search engine stuff like query google or bing and then "scrape" the links they go on
  Some selenium stuff

Ooh, that's tough, sweetheart. If the owners of those servers want you to visit, they'll just choose another WAF than CF's.

All zero of them.

I really hope Cloudflare doesn't eventually evolve into a shitty ass company, so far I like them very much, and all this massive L for AI only improves my opinion on them.

next step: cloudflare sends hit squads to blow up the source of these slimy data grabber attacks

I don't see a problem here. Maybe Perplexity should consider the reasons WHY Cloudflare have a firewall...?