
Efforts are underway to expand the Robots Exclusion Protocol and Meta Robots tags to prevent AI crawlers from using publicly available web content to train generative AI models. The proposal, spearheaded by Krishna Madhavan and Fabrice Canel of Microsoft, aims to create a simple, universal rule to block mainstream AI training bots. Since legitimate crawlers typically honor these protocols, the initiative is a potential game-changer for publishers who want to keep their content out of AI training data.
The Role of the IETF
The IETF, an international standards body established in 1986, is central to this development. It formalized the Robots Exclusion Protocol (originally created in 1994) as an official standard, RFC 9309, in 2022. Now the IETF is positioned to extend the protocol to address AI training concerns, continuing its mission of standardizing voluntary internet practices.
Three Methods to Block AI Training Bots
The draft proposal outlines three mechanisms for preventing AI bots from using website data for training:
1. Robots.txt Protocol
New rules will allow site owners to specify whether their content can be used for AI training. For instance:
- DisallowAITraining: Prevents AI crawlers from using the site's data for model training.
- AllowAITraining: Permits the site's data to be used for AI training.
These updates empower publishers to define clear boundaries for AI bot activity; a sketch of how a crawler might honor the new directive follows.
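The exact directive syntax is still being worked out in the draft, but a minimal Python sketch can illustrate how a compliant crawler might honor a DisallowAITraining rule before collecting training data. The path-style matching, the example.com URL, and the helper names below are assumptions for illustration, not the draft's final wording.

```python
# Minimal sketch (not the draft's reference implementation): a compliant
# crawler checking for a DisallowAITraining rule before using a page
# for model training. Assumes the directive takes a path, like Disallow.
import urllib.request


def fetch_robots_txt(base_url: str) -> str:
    """Download a site's robots.txt; return an empty string if unavailable."""
    try:
        with urllib.request.urlopen(f"{base_url}/robots.txt", timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def ai_training_allowed(robots_txt: str, path: str) -> bool:
    """Return False if a DisallowAITraining rule covers the given path."""
    disallowed = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallowaitraining":
            disallowed.append(value.strip() or "/")
    return not any(path.startswith(prefix) for prefix in disallowed)


# Hypothetical usage:
robots = fetch_robots_txt("https://example.com")
print(ai_training_allowed(robots, "/articles/some-post"))
```

A production crawler would also respect User-agent groups and the standard Disallow rules; the sketch skips those details for brevity.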
2. HTML Meta Robots Tag
Enhanced meta robots tags will allow publishers to specify crawler permissions directly in their web pages. Examples include:

```html
<meta name="robots" content="DisallowAITraining">
<meta name="examplebot" content="AllowAITraining">
```
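For the page-level signal, a compliant crawler would need to read these meta tags before using a page. The standard-library Python sketch below shows one way that check could look; the sample HTML and the token matching are illustrative assumptions, not behavior specified in the draft.

```python
# Sketch: checking an HTML page for a DisallowAITraining meta robots tag.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content values of <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.robots_content = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            self.robots_content.append(attrs.get("content") or "")


def page_allows_ai_training(html: str) -> bool:
    """Return False if any meta robots tag contains DisallowAITraining."""
    parser = RobotsMetaParser()
    parser.feed(html)
    tokens = ",".join(parser.robots_content).lower()
    return "disallowaitraining" not in tokens


sample = '<head><meta name="robots" content="DisallowAITraining"></head>'
print(page_allows_ai_training(sample))  # False: training use is disallowed
```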
3. Application Layer Response Header
Web servers can send headers in HTTP responses to signal AI training permissions. Proposed headers include:
- DisallowAITraining
- AllowAITraining
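The same signal can also be attached at the server level, without touching page markup. The minimal sketch below uses Python's built-in http.server to add the proposed header to every response; the header name comes from the proposal, but the value shown ("true") is a placeholder, since the draft's exact header syntax may still change.

```python
# Sketch: a web server attaching the proposed AI-training header to responses.
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoAITrainingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # Proposed application-layer signal; the value is a placeholder,
        # as the draft names the header but not its final syntax.
        self.send_header("DisallowAITraining", "true")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoAITrainingHandler).serve_forever()
```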
Why These Updates Matter
AI companies have faced lawsuits for using publicly available web data, often defending their actions as “fair use.” However, these new protocols offer a clear, standardized framework for publishers to manage their content’s use. By aligning AI crawlers with the ethical standards already followed by legitimate search engines, publishers gain greater control over how their data is used in AI model training.
These developments promise to safeguard content creators’ interests while maintaining a fair balance in the AI and web ecosystems.
If it still feels overwhelming or unclear, don’t worry—we’ve got you covered! Explore our monthly SEO packages and let our experts handle the heavy lifting for you.