Why Robots.txt and Metadata Aren't Enough: The Technical Case for a Text and Data Mining (TDM) Registry

Understanding the gap between web protocols and AI licensing needs

When publishers first learned that AI companies were training models on copyrighted content, the response was predictable: "Can't we just use robots.txt?" After all, search engines have respected these simple text files for decades, telling crawlers which pages to index and which to avoid.

The short answer is no. Or at least, not in any way that solves the actual problem. But understanding why reveals something important about the infrastructure challenge the industry faces.

The Historical Dataset Problem

Most AI training doesn't work the way search indexing does. Google crawls your website today and respects the robots.txt file it finds there today. But AI models are trained on massive datasets that were assembled before current concerns about licensing emerged.

When an AI lab uses Common Crawl data from 2019, your 2024 robots.txt file is irrelevant. The content was already collected, stripped of its metadata, and bundled into training datasets that circulate independently of their original sources.

This isn't a technical failure. It's a timing problem. Publishers didn't know they needed to declare AI training preferences in 2019 because the question hadn't yet been asked. And even if they had, there was no standard way to express "yes to search indexing, no to AI training."

Robots.txt is a blunt instrument. It can say "don't crawl this" or "crawl this," but it can't express the granular preferences publishers need: "Yes to search, no to training. Yes to attribution tools, no to generative models. Yes to educational AI, no to commercial AI."
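To see how little the protocol can express, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt rules and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt: the protocol's entire vocabulary is allow/disallow
# per user agent and path -- nothing about purpose, attribution, or terms.
robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The answer is always a bare yes/no for a given crawler and URL.
print(parser.can_fetch("Googlebot", "https://example.com/books/chapter-1"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/books/chapter-1"))     # False
```

You can block a training crawler by name, but there is no field for "index this, don't train on it," no way to attach conditions, and nothing a crawler can read back as licensing terms.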

The Content Travels Problem

Even when robots.txt works perfectly on your own site, content doesn't stay there. Your books get excerpted on Reddit. Readers photograph pages and share them on Instagram. Academic papers quote passages. Review sites display sample chapters.

By the time an AI training dataset is assembled, your carefully crafted website permissions are irrelevant because your content appears in dozens of places you don't control. Each instance has been stripped of metadata, re-encoded, reformatted, and disconnected from its source.

This is why metadata embedded in files, like copyright pages in PDFs or EXIF data in images, also fails. The first time someone screenshots a page, copies text into a blog post, or converts a file format, that metadata disappears. What remains is the content itself, orphaned from its rights information.
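The stripping happens without anyone intending it. Here is a minimal sketch using the Pillow imaging library; "photo.jpg" is a hypothetical file standing in for a reader's photographed page.

```python
from PIL import Image  # Pillow: pip install Pillow

# Hypothetical input: a photo of a book page with EXIF metadata still attached
# (camera make, date, perhaps an artist/copyright tag).
original = Image.open("photo.jpg")
print(dict(original.getexif()))   # e.g. {271: 'Apple', 306: '2024:03:01 ...', ...}

# A routine re-save -- the kind of thing every screenshot, crop, or format
# conversion does implicitly -- writes the pixels but not the metadata.
original.save("reposted.jpg")

stripped = Image.open("reposted.jpg")
print(dict(stripped.getexif()))   # {} -- the rights information is gone
```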

The Discovery Problem

Suppose you solve both previous problems. You have a perfect robots.txt file from day one, and somehow your content never leaves your website. There's still a fundamental issue: AI companies can't discover your licensing terms in any structured way.

Each publisher's website is different. Terms of service are written in natural language that varies wildly between organizations. Some put AI policies in legal pages, others in press releases, others in blog posts. There's no programmatic way for an AI company building a training dataset to query thousands of publishers and get consistent, machine-readable answers about licensing availability.
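A small illustration of the problem, with invented publisher domains and policy snippets: free-text terms can be read by a person, but not queried by a pipeline.

```python
# Invented examples of how AI-use terms are expressed today: buried in legal
# pages, press releases, and blog posts, in free text that varies per publisher.
policies = {
    "publisher-a.example": "Use of our content for machine learning requires written consent.",
    "publisher-b.example": "We welcome research uses; commercial AI training is handled case by case.",
    "publisher-c.example": "All rights reserved. Contact our legal department for licensing questions.",
}

# The best a dataset builder can do is guess with keyword matching -- which
# misses publisher-c entirely and can't tell "research" from "generative
# training", let alone return licensing terms a system could act on.
for domain, text in policies.items():
    mentions_ai = any(word in text.lower() for word in ("ai", "machine learning", "training"))
    print(domain, "->", "mentions AI/training" if mentions_ai else "no signal")
```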

Imagine if every store priced its items differently: some in dollars, some in hours of labor, some in barter terms negotiated individually. You could still buy things, but you couldn't efficiently compare prices or build systems that scale across many vendors. That's the current state of AI content licensing.

What a Registry Changes

A TDM (Text and Data Mining) registry addresses each of these problems through a different technical approach:

Persistent Identification: Instead of relying on URLs or embedded metadata, which disappear when content moves, the registry uses the ISCC (International Standard Content Code), a fingerprint generated from the content itself. A book retains its ISCC identifier even after it has been reformatted, compressed, or excerpted. This solves the "content travels" problem; a simplified sketch of the idea, and of the structured lookup it enables, appears after this list.

Forward-Looking Preferences: When you register content in a TDM registry, you're not trying to retroactively control past crawls. You're making a declaration that AI companies can check before including your content in new training datasets. This works with the timing of how AI development actually happens.

Structured, Queryable Permissions: Rather than natural-language terms of service, the registry uses standardized TDM attributes. An AI company can query the registry programmatically: "Which books in the history category allow generative AI training?" The answer comes back in a format that systems can act on immediately.

Retroactive Discoverability: Even for content that was collected before publishers understood the issue, ISCC fingerprinting makes it possible to identify works after the fact. AI companies can check whether content in their existing datasets has since been registered with specific licensing terms, giving them a path to addressing the historical dataset problem.
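The sketch below makes the first and third points concrete. It is a deliberately simplified stand-in: the toy fingerprint is just a hash of normalized text, whereas real ISCC codes are similarity-preserving and tolerate far heavier transformations, and the registry entry and its TDM field names are invented for illustration.

```python
import hashlib
import re

def toy_fingerprint(text: str) -> str:
    """Simplified stand-in for an ISCC-style content code: derived from the
    content itself, so it survives trivial reformatting. (Real ISCC codes are
    similarity-preserving, not a plain cryptographic hash.)"""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

original = "Chapter One. The archive smelled of dust and old glue."
reflowed = "  Chapter One.\nThe archive smelled   of dust and old glue.  "

# Same identifier, even though the copy has been reflowed and re-encoded.
assert toy_fingerprint(original) == toy_fingerprint(reflowed)
fp = toy_fingerprint(reflowed)

# A hypothetical registry entry keyed by that fingerprint, carrying structured,
# machine-readable TDM preferences instead of free-text terms of service.
registry = {
    fp: {
        "title": "The Archive (invented example)",
        "publisher": "publisher-a.example",
        "tdm": {
            "search_indexing": "allow",
            "generative_training": "deny",
            "attribution_tools": "allow",
            "licensing_contact": "rights@publisher-a.example",
        },
    }
}

# An AI lab holding only the reflowed copy can still look the work up and get
# an answer its pipeline can act on immediately.
entry = registry.get(toy_fingerprint(reflowed))
print(entry["tdm"]["generative_training"])   # "deny"
```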

The Practical Difference

Here's a concrete example of how this changes publisher workflows:

Current approach: Post an AI policy on your legal page. Hope AI companies find it, read it, interpret it correctly, and honor it. No way to verify compliance. No mechanism for discovery if someone wants to license your content.

Registry approach: Register your catalog once with your TDM preferences. AI companies query the registry automatically before training. Your preferences are machine-readable, verifiable, and persistent regardless of where your content appears.
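As a sketch of what that automated check might look like inside a dataset-assembly pipeline: the lookup below is a stub standing in for a real registry query, and the fingerprints, field names, and policy values are all invented for illustration.

```python
from typing import Optional

# Hypothetical registry lookup -- in practice this would be a query to the TDM
# registry keyed by the work's ISCC; here it is stubbed with invented entries.
def registry_lookup(fingerprint: str) -> Optional[dict]:
    invented_entries = {
        "abc123": {"generative_training": "deny", "search_indexing": "allow"},
        "def456": {"generative_training": "allow", "search_indexing": "allow"},
    }
    return invented_entries.get(fingerprint)

def allowed_for_training(fingerprint: str) -> bool:
    entry = registry_lookup(fingerprint)
    if entry is None:
        return False  # unregistered work: fall back to the lab's own default policy
    return entry["generative_training"] == "allow"

# Every candidate work is gated on one structured check, instead of someone
# hunting down and interpreting each publisher's legal page by hand.
for fingerprint in ("abc123", "def456", "unregistered"):
    print(fingerprint, "->", "include" if allowed_for_training(fingerprint) else "exclude")
```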

The difference isn't just technical elegance. It's the difference between hoping companies respect your wishes and having infrastructure that makes your preferences discoverable and actionable.

Not a Replacement, but a Complement

This doesn't mean robots.txt and metadata are useless. They still serve important functions for search engines, web crawlers, and systems designed around them. But AI licensing operates on different timescales, with different technical requirements, and across different types of content distribution.

A TDM registry doesn't replace existing web protocols. It fills a gap those protocols were never designed to address: making content licensing preferences discoverable and verifiable across the fragmented, historical, metadata-stripped landscape of AI training datasets.

Publishers who understood this early recognized that the question isn't "Why do we need something new?" but rather "How did we ever expect the old tools to solve this particular problem?"

The infrastructure challenge AI licensing presents requires infrastructure built for that purpose, which is precisely what a TDM registry provides.


Amlet's TDM Registry uses ISCC technology to make publisher content discoverable and licensable in the AI economy, with persistent identifiers that work regardless of where content appears. Learn more about how registry-based licensing differs from traditional web permissions at amlet.ai.