Why even bother with a robots.txt file anymore? [Solved]

Participant
Discussion
Apr 08, 2026

We just pushed our new technical documentation live, and the server logs immediately lit up with unknown IP addresses pulling the entire directory down. Tech forums keep repeating the same advice: update your robots.txt file and add Disallow directives for GPTBot and ClaudeBot. Does relying on an honor-system file from the dial-up era actually stop modern data scraping, or is it completely delusional?
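For reference, the suggested change is only a few lines (GPTBot and ClaudeBot are the crawler tokens OpenAI and Anthropic document for their bots):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /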

Replies (3)

Participant
Apr 09, 2026

It is delusional as a complete defense. Major players like OpenAI usually respect robots.txt because they have public scrutiny and lawsuits to worry about. The real problem is rogue scrapers and shadow data brokers, who ignore your text file entirely: they spoof User-Agent strings to look like standard web browsers and rotate through thousands of residential IP addresses to bypass rate limiting. If you want to protect your content, drop the honor system and put a Web Application Firewall in front of the site to fingerprint and block these bots dynamically.
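To make that concrete, here is a minimal Flask sketch of the baseline a real WAF automates, i.e. refusing declared crawler User-Agents and rate-limiting each IP. The route, thresholds, and bot list are illustrative, not from any specific product:

import re
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

# Declared AI crawlers: trivial to match, but honest ones obey robots.txt anyway.
DECLARED_BOTS = re.compile(r"GPTBot|ClaudeBot", re.IGNORECASE)

# Naive sliding-window rate limit: 30 requests per 10 seconds per IP.
WINDOW_SECONDS = 10
MAX_REQUESTS = 30
hits = defaultdict(deque)

@app.before_request
def filter_scrapers():
    if DECLARED_BOTS.search(request.headers.get("User-Agent", "")):
        abort(403)  # a declared crawler that ignored robots.txt
    now = time.monotonic()
    window = hits[request.remote_addr]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS:
        abort(429)  # one IP bulk-pulling the directory

@app.route("/")
def docs_index():
    return "docs"

Note that this baseline is exactly what UA spoofing and residential proxy rotation defeat. The commercial WAFs earn their keep by fingerprinting TLS handshakes and scoring behavior across IPs instead of trusting either header.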

Participant
Apr 10, 2026

The technical reality is bleak, but the compliance side offers some leverage depending on your jurisdiction. Under Article 4 of the EU Copyright Directive, the text-and-data-mining exception only covers content whose rightsholder has not reserved rights in a machine-readable way, which is why protocols like TDMRep exist. If a legitimate AI company ignores a TDMRep signal, it can no longer rely on that exception, which makes the copying a clear copyright violation. Rogue scrapers will obviously ignore this, but setting up these legal flags helps keep your intellectual property out of the official releases of major commercial models.
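For anyone who has not seen it, TDMRep is lightweight. As I understand the W3C Community Group spec, you can declare the reservation via a well-known JSON file, an HTTP header, or an HTML meta tag; the file, served at /.well-known/tdmrep.json, looks roughly like this (double-check field names against the current spec before relying on it):

[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]

The tdm-policy URL is optional and points at your licensing terms; the header equivalent is simply tdm-reservation: 1 on each response.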

Participant
Apr 12, 2026

@yu-yan is right. If you leave data in the open, it will get taken. I stopped relying on policy files entirely and started using spider traps: I bury invisible links in my markup that no human visitor ever clicks, and when a rogue scraper follows one, my server drops the connection into a tarpit that feeds it endless gigabytes of randomly generated nonsense at a painfully slow rate. That ties up the scraper's resources and pollutes its training set. If we cannot stop them with legal threats, we have to make scraping our domains an expensive waste of time.
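If anyone wants to try it, the whole trick fits in a screenful of Flask. The /no-follow/ path below is made up; hide the bait link in your markup (e.g. an anchor styled display:none) and Disallow the path in robots.txt, so only bots that ignore both ever arrive:

import random
import string
import time

from flask import Flask, Response

app = Flask(__name__)

def drip_nonsense():
    # Stream tiny chunks of random text forever, with long pauses,
    # so the scraper's connection stays open while earning nothing.
    while True:
        yield "".join(random.choices(string.ascii_lowercase + " ", k=80)) + "\n"
        time.sleep(8)

@app.route("/no-follow/<path:bait>")
def tarpit(bait):
    # Anything landing here followed a link humans cannot see and robots.txt forbids.
    return Response(drip_nonsense(), mimetype="text/plain")

One caveat: every open tarpit connection also occupies one of your own workers, so serve it from an async worker or cap concurrent connections.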
