Structured Data vs Scraping: How AI Learns About Your Website

The Experiment

We ran a controlled test: two AI agents, the same 10 questions, different data sources. Agent A used structured /.well-known/ai endpoints (4 JSON files). Agent B scraped raw HTML pages. Each was tasked with learning everything it could about the site.
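
Agent A's fetch plan can be sketched in a few lines. The experiment only says "4 JSON files", so the endpoint filenames below are illustrative assumptions, not the actual spec:

```python
from urllib.parse import urljoin

# Hypothetical endpoint filenames -- the article only says "4 JSON files",
# so these names are assumptions for illustration.
ENDPOINTS = ["ai.json", "permissions.json", "policies.json", "pages.json"]

def discovery_urls(base: str) -> list[str]:
    """Build the /.well-known/ai URLs a structured-data agent would fetch."""
    root = urljoin(base, "/.well-known/ai/")
    return [urljoin(root, name) for name in ENDPOINTS]

for url in discovery_urls("https://discover.rootz.global"):
    print(url)
```

Four predictable GETs replace an open-ended crawl: the agent knows in advance exactly what to request.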

Then we ran the same test against three real-world sites that don’t have AI Discovery: Intel, AdventHealth, and Naoris Protocol.

The results make the case for structured AI discovery data better than any whitepaper could.

The 10 Questions

Every AI agent visiting your site is trying to answer questions like these:

  1. What is this website?
  2. Who operates it?
  3. What content license applies? Can I quote? Can I train?
  4. What are the API rate limits?
  5. What policies does the site have?
  6. What technology does the site use for content signing?
  7. What pages are available?
  8. What is the core business?
  9. What products or services are offered?
  10. What data does the site collect about visitors?

Results: Confidence Scores

We scored each answer as HIGH, MEDIUM, or LOW confidence based on whether the agent could give a complete, accurate, unambiguous answer.

Site                    HIGH  MEDIUM  LOW  Method
discover.rootz.global      9       1    0  .well-known/ai (structured)
Intel                      7       2    0  HTML scraping
Naoris Protocol            6       2    2  HTML scraping
AdventHealth               4       2    4  HTML scraping

The Most Important Question AI Cannot Answer

The most critical question for any AI agent is: Am I allowed to be here? Can I quote this content? Can I use it for training?

Here is what the scraping agent found:

Site                    Can AI quote?  Can AI train?  How do you know?
discover.rootz.global   Yes            No             Machine-readable JSON: ["quote","summarize"]
Intel                   No             No             Buried in legal Terms of Use text
AdventHealth            Unknown        Unknown        Policy pages returned 404 or redirected
Naoris Protocol         Unknown        Unknown        Terms not retrievable

Two out of three real-world sites cannot even tell an AI agent whether it is allowed to read them.
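
The value of a machine-readable allow-list is that the check is a set-membership test, not legal interpretation. A minimal sketch, assuming a permissions document shaped like the `["quote","summarize"]` list the article quotes (the `ai_permissions` field name is an assumption):

```python
import json

# Example payload in the shape the article quotes: a machine-readable
# allow-list. The field name "ai_permissions" is an assumption.
raw = '{"ai_permissions": ["quote", "summarize"]}'

def may(action: str, doc: dict) -> bool:
    """An action is allowed iff it appears in the allow-list -- no parsing
    of legal prose, no inference, no ambiguity."""
    return action in doc.get("ai_permissions", [])

doc = json.loads(raw)
print(may("quote", doc))   # quoting is explicitly allowed
print(may("train", doc))   # training is not listed, so it is not allowed
```

Anything not on the list is denied by default, which is exactly the unambiguity the scraping agent could not get from a 404 or a wall of terms-of-use text.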

What Went Wrong (Without AI Discovery)

Intel

Intel has the most complete web presence of the three, but their Terms of Use explicitly prohibit “automated searches using bots, scrapers, or web scraping technologies without prior written permission” and ban “text mining, data mining, and harvesting metadata.” Yet their robots.txt does not block any AI-specific bots (GPTBot, ClaudeBot, etc.). This creates an ambiguous legal posture: technically accessible, legally restricted. No content signing exists.
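
The gap is easy to demonstrate: robots.txt and the legal terms are read by different audiences. A sketch using Python's standard `urllib.robotparser` against a hypothetical robots.txt that, like Intel's, blocks no AI-specific bots:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in Intel's situation: no AI-specific rules,
# even though the site's legal terms forbid scraping.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Both AI crawlers are technically permitted by robots.txt...
for bot in ("GPTBot", "ClaudeBot"):
    print(bot, rp.can_fetch(bot, "https://example.com/products.html"))
# ...while the human-readable Terms of Use say the opposite. Machines read
# robots.txt; lawyers read terms. That mismatch is the ambiguous posture.
```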

AdventHealth

A healthcare organization with 92,000 employees across nine states — and their policy pages are broken. The privacy policy redirects to the wrong subdomain. The terms of use returns a 404 error. The about page redirects to the Women’s Health department instead of corporate information. An AI agent visiting AdventHealth literally cannot determine the organization’s data handling practices or whether automated access is permitted. Meanwhile, a data breach notice sits on the homepage.

Naoris Protocol

The irony is striking: Naoris Protocol is a cybersecurity company building post-quantum blockchain infrastructure — yet their own website has zero content signing, no robots.txt (404), and their terms and conditions are not retrievable. A company dedicated to “making Web3 unbreakable” cannot provide basic transparency to AI agents about their own content policies.

What Structured Data Gets Right

The /.well-known/ai endpoint answered the same questions with 9 out of 10 HIGH confidence scores using just four compact JSON files:

  • Machine-readable permissions — No legal text to parse. ["quote","summarize","cache_24h"] is unambiguous.
  • Cryptographic signing — Content hash + secp256k1 signature + signer address. Independently verifiable.
  • Complete page inventory — All 10 pages declared with paths and titles. No crawling required.
  • Typed policies — Privacy, terms, data protection, AI usage — each with URL, type, and summary.
  • Structured knowledge — Organization identity, about text, and capabilities in one request.
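
The signing step above can be sketched as follows. This assumes the content hash is SHA-256 of the page bytes; the secp256k1 signature over that hash would be verified with a dedicated library (e.g. coincurve or ecdsa) and is noted but omitted here:

```python
import hashlib

def content_hash(body: bytes) -> str:
    """Assumed scheme: SHA-256 over the raw page bytes, hex-encoded."""
    return hashlib.sha256(body).hexdigest()

# Value the site publishes in its .well-known/ai endpoint (illustrative).
declared = content_hash(b"<html>...page body...</html>")

# What the agent actually downloaded.
fetched = b"<html>...page body...</html>"

# Tamper check: recompute and compare before trusting the content.
# A full verifier would then check the secp256k1 signature over this
# hash against the declared signer address.
print(content_hash(fetched) == declared)
```

If the recomputed hash matches and the signature checks out, the agent can attribute the content to the signer without trusting the transport at all.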

The one question that scored MEDIUM (visitor data collection) was due to policy summaries being truncated. We have already fixed this in plugin v1.6.0 by increasing summary length and adding a ?full_text=1 parameter to the policies endpoint.

The Cost Difference

For the structured path: 4 API calls, approximately 3,200 tokens of input data, zero failed requests.

For scraping AdventHealth: 4 fetch attempts, 2 returned usable data (50% failure rate), policy content completely inaccessible.

Structured data is not just better — it is dramatically more reliable and efficient.

Conclusion

The web was not built for AI. Every site AI visits today forces it to guess, infer, and sometimes hallucinate basic facts. The AI Discovery Standard gives websites a way to speak directly to AI agents — structured, signed, and machine-readable.

The question is not whether AI needs this. The question is whether your site will be discoverable when AI comes looking.

See the live endpoint