Why even bother with a robots.txt file anymore? [Solved]

Participant
Discussion
Apr 08, 2026

We just pushed our new technical documentation live, and the server logs immediately lit up with unknown IP addresses pulling the entire directory down. Tech forums keep repeating the same advice: update your robots.txt file and add Disallow directives for GPTBot and ClaudeBot. Does relying on an honor-system file from the dial-up era actually stop modern data scraping, or is it completely delusional?
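For reference, the suggested change is only a few lines (GPTBot and ClaudeBot are the crawler tokens OpenAI and Anthropic document for their bots):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /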

Replies (3)

Participant
Apr 09, 2026

It is delusional as a complete defense. Major players like OpenAI usually respect robots.txt because they have public scrutiny and lawsuits to worry about. The real problem is rogue scrapers and shadow data brokers, who ignore your text file entirely: they spoof User-Agent strings to look like standard web browsers and rotate through thousands of residential IP addresses to bypass rate limiting. If you want to protect your content, drop the honor system and put a Web Application Firewall in front of the site to fingerprint and block these bots dynamically.
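To make that concrete, here is a minimal Flask sketch of the baseline a real WAF automates, i.e. refusing declared crawler User-Agents and rate-limiting each IP. The route, thresholds, and bot list are illustrative, not from any specific product:

import re
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

# Declared AI crawlers: trivial to match, but honest ones obey robots.txt anyway.
DECLARED_BOTS = re.compile(r"GPTBot|ClaudeBot", re.IGNORECASE)

# Naive sliding-window rate limit: 30 requests per 10 seconds per IP.
WINDOW_SECONDS = 10
MAX_REQUESTS = 30
hits = defaultdict(deque)

@app.before_request
def filter_scrapers():
    if DECLARED_BOTS.search(request.headers.get("User-Agent", "")):
        abort(403)  # a declared crawler that ignored robots.txt
    now = time.monotonic()
    window = hits[request.remote_addr]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS:
        abort(429)  # one IP bulk-pulling the directory

@app.route("/")
def docs_index():
    return "docs"

Note that this baseline is exactly what UA spoofing and residential proxy rotation defeat. The commercial WAFs earn their keep by fingerprinting TLS handshakes and scoring behavior across IPs instead of trusting either header.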

Participant
Apr 10, 2026

The technical reality is bleak, but the compliance side offers some leverage depending on your jurisdiction. Under Article 4 of the EU Copyright Directive, the text-and-data-mining exception only covers content whose rightsholder has not reserved rights in a machine-readable way, which is why protocols like TDMRep exist. If a legitimate AI company ignores a TDMRep signal, it can no longer rely on that exception, which makes the copying a clear copyright violation. Rogue scrapers will obviously ignore this, but setting up these legal flags helps keep your intellectual property out of the official releases of major commercial models.
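For anyone who has not seen it, TDMRep is lightweight. As I understand the W3C Community Group spec, you can declare the reservation via a well-known JSON file, an HTTP header, or an HTML meta tag; the file, served at /.well-known/tdmrep.json, looks roughly like this (double-check field names against the current spec before relying on it):

[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]

The tdm-policy URL is optional and points at your licensing terms; the header equivalent is simply tdm-reservation: 1 on each response.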

Participant
Apr 12, 2026

@yu-yan is right. If you leave data in the open, it will get taken. I stopped relying on policy files entirely and started using spider traps: I bury invisible links in my markup that no human visitor ever clicks, and when a rogue scraper follows one, my server drops the connection into a tarpit that feeds it endless gigabytes of randomly generated nonsense at a painfully slow rate. That ties up the scraper's resources and pollutes its training set. If we cannot stop them with legal threats, we have to make scraping our domains an expensive waste of time.
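If anyone wants to try it, the whole trick fits in a screenful of Flask. The /no-follow/ path below is made up; hide the bait link in your markup (e.g. an anchor styled display:none) and Disallow the path in robots.txt, so only bots that ignore both ever arrive:

import random
import string
import time

from flask import Flask, Response

app = Flask(__name__)

def drip_nonsense():
    # Stream tiny chunks of random text forever, with long pauses,
    # so the scraper's connection stays open while earning nothing.
    while True:
        yield "".join(random.choices(string.ascii_lowercase + " ", k=80)) + "\n"
        time.sleep(8)

@app.route("/no-follow/<path:bait>")
def tarpit(bait):
    # Anything landing here followed a link humans cannot see and robots.txt forbids.
    return Response(drip_nonsense(), mimetype="text/plain")

One caveat: every open tarpit connection also occupies one of your own workers, so serve it from an async worker or cap concurrent connections.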
