Is there a simple way to severely impede web scraping and LLM data collection on my website?
I am working on a simple static website that gives visitors basic information about myself and the work I do. I want it as a way to introduce myself to potential clients, collaborators, etc., rather than relying solely on LinkedIn as my visiting card.
This may sound rather oxymoronic given that I am literally going to be placing (some relevant) details about myself and my work on the internet, but I want to limit the website's exposure to bots, web scraping and content collection for LLMs.
Is this a realistic expectation?
Also, any suggestions on privacy-respecting yet inexpensive domains that I can purchase in Europe would be super helpful.
Why not add basic HTTP Auth to the site? Most web hosts provide a simple way to password-protect a site or directory.
You can use a simple username and password for humans, and it will stop scrapers, as they won't get past the auth challenge unless they know the credentials.
I'm pretty sure you can even show the login details in the auth dialog itself, if you wanted to, rather than pre-sharing them.
With .htaccess, everyone can use the same credentials, and you can put a message in the popup like "use username admin, password = what's a duck? to log in".
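For example, on an Apache host where .htaccess overrides are allowed, the setup is just a password file plus a few directives. A minimal sketch, where the file paths, username and realm text are only illustrative:

```
# Create the shared password file once (on the server, outside the web root):
#   htpasswd -c /home/example/.htpasswd admin
# You will be prompted for the password, e.g. "what's a duck?".

# .htaccess in the directory you want to protect
AuthType Basic
# The realm string is what (some) browsers show in the login popup,
# so it can carry the hint for human visitors.
AuthName "use username admin, password = what's a duck?"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

Requests without valid credentials just get a 401 response, which is enough to stop generic crawlers.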
The other option would be an actual CAPTCHA.