Audit glossary

Every signal AIAuditFix checks, grouped by category, with what it means and why it matters for AI agents and crawlers. Category percentages are the weight each contributes to your overall score.
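The relationship between per-check weights and category percentages can be sketched as below. The exact formula is not stated on this page, so everything here is an assumption: each category is scored as the weighted share of passed checks, skipped checks are left out of the denominator, and the function names and result labels are hypothetical.

```python
from typing import Dict, List, Tuple

# (check id, weight, result), where result is "pass", "fail", or "skip".
Check = Tuple[str, int, str]


def category_score(checks: List[Check]) -> float:
    """Weighted pass rate for one category, ignoring skipped checks (assumed rule)."""
    scored = [(weight, result) for _, weight, result in checks if result != "skip"]
    total = sum(weight for weight, _ in scored)
    if total == 0:
        return 0.0
    passed = sum(weight for weight, result in scored if result == "pass")
    return passed / total


def overall_score(categories: Dict[str, Tuple[float, List[Check]]]) -> float:
    """Sum of (category share of overall) x (category score)."""
    return sum(share * category_score(checks) for share, checks in categories.values())


# Illustrative results only; the weights are the ones listed in this glossary.
SAMPLE_CRAWLER_CHECKS = [
    ("crawler.robots_present", 8, "pass"),
    ("crawler.gptbot", 8, "fail"),
    ("crawler.bytespider", 4, "skip"),
]
```

Under these assumptions the sample category scores 0.5: 8 of the 16 weighted points that were actually evaluated.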


AI Crawler Access

25% of overall

robots.txt Present crawler.robots_present weight 8

robots.txt is how you tell crawlers, including AI bots, what they may fetch. Without one, every crawler falls back to its own default behaviour, which for most means treating the whole site as open to fetching.

robots.txt Parseable crawler.robots_parseable weight 5

A malformed robots.txt is ignored outright by some crawlers, so the rules you intended silently don't apply.

GPTBot Access Declared crawler.gptbot weight 8

GPTBot is OpenAI's web crawler, which gathers content that may be used to train the models behind ChatGPT. Whether you allow or block it should be a deliberate choice: not declaring a rule leaves it to OpenAI's default.

ClaudeBot Access Declared crawler.claudebot weight 8

ClaudeBot is Anthropic's crawler. Addressing it explicitly decides whether your content can be read and surfaced in Claude's answers.

PerplexityBot Access Declared crawler.perplexitybot weight 6

PerplexityBot feeds Perplexity's answer engine. Leaving it unaddressed means Perplexity applies whatever default it chooses, not what you'd choose.

Bytespider Addressed crawler.bytespider weight 4

Bytespider (ByteDance / TikTok) is a high-volume crawler with a mixed reputation. Decide explicitly whether to allow or block it rather than leaving it open.

CCBot Addressed crawler.ccbot weight 4

CCBot feeds Common Crawl, the open corpus that many AI models train on. Addressing it controls whether your content enters that training data.

Applebot-Extended Addressed crawler.applebot weight 4

Applebot-Extended specifically governs whether Apple may use your content for AI training — separate from ordinary Applebot search indexing.

AI Crawler Policy Consistent crawler.ai_policy_consistent weight 8

Addressing some AI crawlers but not others leaves coverage gaps. A consistent policy across all the major bots is clearer to operate and easier to defend.
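One way to spot coverage gaps is to list which of the crawlers named in this section have their own User-agent group. The sketch below is a simplified reading of robots.txt, not necessarily how AIAuditFix performs the check; the helper names and the sample file are hypothetical.

```python
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "Bytespider", "CCBot", "Applebot-Extended",
]

# Hypothetical robots.txt used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /private/

User-agent: CCBot
Disallow: /
"""


def declared_agents(robots_txt: str) -> set:
    """Return the lowercased user-agent tokens that appear in any group."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def unaddressed_crawlers(robots_txt: str, crawlers=AI_CRAWLERS) -> list:
    """List the crawlers that are only covered by the wildcard group, if at all."""
    declared = declared_agents(robots_txt)
    return [bot for bot in crawlers if bot.lower() not in declared]
```

For the sample file, PerplexityBot, Bytespider, and Applebot-Extended come back as unaddressed: they inherit the wildcard rule rather than a decision made for them.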

Crawl Delay Not Excessive crawler.crawl_delay weight 5

A large Crawl-delay throttles how fast crawlers may fetch your pages. Above roughly 10 seconds it materially slows how quickly AI systems can index you.
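Python's standard urllib.robotparser can report the Crawl-delay that applies to a given agent, which makes the threshold easy to test locally. This is a minimal sketch: the 10-second limit mirrors the rough figure above, and the function name and sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for illustration.
SAMPLE_DELAY_ROBOTS = """\
User-agent: *
Crawl-delay: 30

User-agent: GPTBot
Crawl-delay: 2
Disallow: /private/
"""


def excessive_crawl_delays(robots_txt: str, agents, limit_seconds: int = 10) -> dict:
    """Map each agent to its Crawl-delay when that delay exceeds the limit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delays = {}
    for agent in agents:
        delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
        if delay is not None and delay > limit_seconds:
            delays[agent] = delay
    return delays
```

In the sample, GPTBot has its own 2-second delay, while ClaudeBot falls through to the wildcard group and inherits the 30-second one.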

Content Discoverability

25% of overall

llms.txt Present disco.llms_txt weight 15

llms.txt is an emerging convention (llmstxt.org) for telling language models which of your URLs matter most, in clean Markdown. Publishing one signals AI-readiness.

llms.txt Parseable disco.llms_txt_parseable weight 8

An llms.txt with no '# ' heading or no links can't be interpreted the way the spec intends — models get nothing structured from it.

llms.txt Quality disco.llms_txt_quality weight 10

A bare llms.txt (just a title) gives a model almost nothing. Good ones have a description and several sectioned links to your most useful pages.
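The parseable and quality checks both come down to the Markdown structure described at llmstxt.org: a '# ' title, a '> ' summary line, '## ' sections, and links. The sketch below inspects those elements with plain string tests; it is an approximation rather than a full Markdown parser, and the function name and sample file are hypothetical.

```python
import re

# Matches inline Markdown links of the form [label](url).
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")

# Hypothetical llms.txt used only for illustration.
SAMPLE_LLMS_TXT = """\
# Example Co

> Example Co makes widgets for home workshops.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): first steps
- [API reference](https://example.com/docs/api.md)

## Policies

- [Pricing](https://example.com/pricing.md)
"""


def inspect_llms_txt(text: str) -> dict:
    """Report which structural elements of an llms.txt file are present."""
    lines = text.splitlines()
    return {
        "has_title": any(line.startswith("# ") for line in lines),
        "has_summary": any(line.startswith("> ") for line in lines),
        "sections": [line[3:].strip() for line in lines if line.startswith("## ")],
        "links": MD_LINK_RE.findall(text),
    }
```

A file containing only a title line would report has_title as true but no summary, sections, or links, which is the bare case described above.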

llms-full.txt Present disco.llms_full_txt weight 6

llms-full.txt is the companion file carrying your expanded content in one fetch, for models that want the full text rather than a link index.

Sitemap Present disco.sitemap weight 8

A sitemap tells crawlers which URLs exist and how recently they changed — the difference between your pages being discovered and being missed.

Sitemap in robots.txt disco.sitemap_in_robots weight 5

Declaring the sitemap inside robots.txt makes it discoverable by any crawler that reads robots, without it having to guess the conventional path.
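A quick way to confirm the declaration is readable is to parse robots.txt and ask for its Sitemap lines. This sketch uses urllib.robotparser's site_maps(), available from Python 3.8; the function name and sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for illustration.
SAMPLE_SITEMAP_ROBOTS = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def declared_sitemaps(robots_txt: str) -> list:
    """Return the Sitemap URLs declared in robots.txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when nothing is declared
```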

Meta Description disco.meta_description weight 8

AI summaries and search snippets frequently use the meta description. Without one, the model takes whatever text appears first — often menu links or boilerplate.

OG Title disco.og_title weight 5

og:title controls the headline shown when your link is shared or surfaced by an assistant or social platform.

OG Description disco.og_description weight 5

og:description is the blurb shown in link previews across social, chat, and AI tools.

Canonical URL disco.canonical weight 6

When the same content is reachable via several URLs, the canonical link tells crawlers which is authoritative and prevents duplicate-content dilution.

Homepage Indexable disco.noindex weight 8

A noindex directive on the homepage removes it from search and AI indexes entirely. On a homepage that's almost always a leftover staging-config mistake.
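The meta description, Open Graph, canonical, and noindex checks all read tags from the page head, so they can be illustrated together. The sketch below collects them with the standard html.parser module; the class name, function name, and sample markup are hypothetical, and it only looks at the generic robots meta tag, not crawler-specific ones or response headers.

```python
from html.parser import HTMLParser

# Hypothetical page head used only for illustration.
SAMPLE_HEAD = """\
<html><head>
<meta name="description" content="Widgets for every workshop.">
<meta property="og:title" content="Example Co">
<link rel="canonical" href="https://example.com/">
<meta name="robots" content="noindex, nofollow">
</head><body></body></html>
"""


class HeadSignals(HTMLParser):
    """Collect the discoverability tags discussed in this section."""

    def __init__(self):
        super().__init__()
        self.signals = {
            "description": None, "og:title": None, "og:description": None,
            "canonical": None, "noindex": False,
        }

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "meta":
            key = (a.get("name") or a.get("property") or "").lower()
            content = a.get("content", "").strip()
            if key in ("description", "og:title", "og:description") and content:
                self.signals[key] = content
            elif key == "robots" and "noindex" in content.lower():
                self.signals["noindex"] = True
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.signals["canonical"] = a.get("href") or None


def head_signals(html: str) -> dict:
    parser = HeadSignals()
    parser.feed(html)
    parser.close()
    return parser.signals
```

The sample page shows the staging-leftover pattern: good metadata and a canonical URL, undone by a noindex directive.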

HTTPS (Indexing Preference) disco.https weight 6

AI crawlers and search engines prefer and prioritise HTTPS URLs; HTTP-only content can be treated as lower quality and deprioritised.

Structured Data

20% of overall

Schema.org Markup Present schema.present weight 15

Schema.org JSON-LD lets crawlers and language models understand your page as structured entities — an Organization, a Product, an Article — not just prose.

Schema.org JSON Valid schema.parseable weight 10

A JSON-LD block that doesn't parse is ignored entirely; the markup might as well not be there.

Recognised Schema Type schema.type_detected weight 12

Models recognise the standard schema.org types (Organization, WebSite, Article, Product…). Custom or invented @type values are ignored.
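The presence, validity, and type checks can be pictured as three steps: find each JSON-LD script block, try to parse it, and collect the @type values, including those nested under @graph. The sketch below does this with the standard library; the class and function names and the sample page are hypothetical, and it does not verify that a type actually exists in the schema.org vocabulary.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: one valid block with an @graph, one block with a trailing comma.
SAMPLE_SCHEMA_PAGE = """\
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "Organization", "name": "Example Co", "url": "https://example.com/"},
  {"@type": "WebSite", "url": "https://example.com/"}
]}
</script>
<script type="application/ld+json">{"@type": "Article",}</script>
</head><body></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _collect_types(node, types):
    """Walk nested JSON-LD and gather every string @type value."""
    if isinstance(node, list):
        for item in node:
            _collect_types(item, types)
    elif isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            types.add(declared)
        elif isinstance(declared, list):
            types.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            _collect_types(value, types)


def schema_types(html: str):
    """Return (sorted @type values, number of blocks that failed to parse)."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    types, invalid = set(), 0
    for raw in extractor.blocks:
        try:
            _collect_types(json.loads(raw), types)
        except json.JSONDecodeError:
            invalid += 1
    return sorted(types), invalid
```

In the sample, the second block fails to parse because of its trailing comma, so its Article type never reaches the result.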

Organisation Identity schema.org_identity weight 10

An Organization node with name, url, and description tells AI who you are in a machine-readable form it can quote with confidence.

WebSite Schema schema.website_declared weight 8

A WebSite node helps assistants understand the site as a whole and can enable sitelinks-search-box behaviour.

BreadcrumbList schema.breadcrumbs weight 5

BreadcrumbList markup tells AI where a page sits in your site's structure — useful context for navigation and summarisation. We only scan the homepage, which is the root and has no parent path, so absence here is reported as a skip (not a fail) — but if you do publish one, we credit it as a positive signal.

Multiple Schema Types schema.multiple_types weight 5

Richer markup — multiple types or an @graph — gives models more entity context than a single bare node.

No Obvious Schema Errors schema.no_errors weight 10

Missing required fields (an Article with no headline, a Product with no name) make the markup unreliable, and AI may discard it rather than risk bad data.

Entity Facts in Visible Text schema.entity_facts_visible weight 10

AI retrieval often strips <script> JSON-LD before the model reads a page, so facts that live only in your schema can be invisible at answer time. We take the facts your Organization node declares — name, address, founding date, contact — and check each one is also readable in the rendered text.
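The idea behind this check can be sketched by stripping script and style content from the page, then looking for each declared fact in what remains. This is a simplified illustration using exact, case-insensitive substring matching; the class name, function name, and sample page are hypothetical, and a real check would likely need looser matching for addresses and phone formats.

```python
from html.parser import HTMLParser

# Hypothetical page: the phone number lives only in the JSON-LD block.
SAMPLE_FACTS_PAGE = """\
<html><head><script type="application/ld+json">
{"@type": "Organization", "name": "Example Co",
 "foundingDate": "2014", "telephone": "+1-555-0100"}
</script></head>
<body><h1>Example Co</h1><p>Building widgets since 2014.</p></body></html>
"""

SAMPLE_FACTS = {"name": "Example Co", "foundingDate": "2014", "telephone": "+1-555-0100"}


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside script or style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def missing_facts(html: str, facts: dict) -> list:
    """Return the keys of facts whose values do not appear in the visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split()).lower()
    return [key for key, value in facts.items()
            if " ".join(str(value).split()).lower() not in text]
```

For the sample page only the telephone number is reported as missing, since the name and founding year also appear in the body copy.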

Agent Protocols

15% of overall

MCP Server Card agent.mcp_card weight 20

An MCP Server Card at /.well-known/mcp/server-card.json (SEP-1649) declares the tools your site exposes to AI agents — the emerging standard for agent-tool discovery. See modelcontextprotocol.io.

MCP Card Quality agent.mcp_card_quality weight 10

A useful MCP card needs a name, description, and a tools array; without them an agent can't tell what your site can actually do.
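A minimal sketch of that quality check is shown below. It tests only the three fields named above (name, description, tools); the full server card schema may require more, and the function name and sample cards are hypothetical.

```python
import json

# Hypothetical cards used only for illustration.
SAMPLE_GOOD_CARD = json.dumps({
    "name": "example-widgets",
    "description": "Search and order widgets.",
    "tools": [{"name": "search_widgets"}],
})
SAMPLE_THIN_CARD = json.dumps({"name": "example-widgets", "tools": []})


def mcp_card_problems(card_json: str) -> list:
    """Return a list of human-readable problems; an empty list means the card passes."""
    try:
        card = json.loads(card_json)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(card, dict):
        return ["top level is not a JSON object"]
    problems = []
    for field in ("name", "description"):
        value = card.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty {field}")
    tools = card.get("tools")
    if not isinstance(tools, list) or not tools:
        problems.append("missing or empty tools array")
    return problems
```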

A2A Agent Card agent.a2a_card weight 15

An A2A Agent Card at /.well-known/agent-card.json describes your agent's skills so other agents can interoperate with it.

A2A Card Quality agent.a2a_card_quality weight 8

An A2A card needs name, description, and a skills/capabilities field to be actionable by another agent.

OpenAPI Spec Linked agent.openapi weight 15

A published OpenAPI description lets agents discover and call your API programmatically, without a bespoke integration for each one.

OAuth Discovery agent.oauth_discovery weight 10

OAuth Authorization Server Metadata (RFC 8414) lets agents discover how to authenticate to your API automatically instead of needing manual setup.

OAuth Protected Resource (RFC 9728) agent.oauth_protected_resource weight 10

OAuth Protected Resource Metadata (RFC 9728) is the resource-server sibling of OAuth discovery: it tells agents which authorization servers issue tokens for your API, so they can pick the right one without manual configuration.
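The two discovery documents chain together: an agent reads the protected resource metadata, takes the issuers listed in its authorization_servers field, and then fetches each issuer's authorization server metadata. The sketch below only builds the well-known URLs involved, following the path-insertion rule from RFC 8414 and RFC 9728; it makes no network requests, and the function names and example hosts are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def well_known_url(identifier: str, suffix: str) -> str:
    """Insert /.well-known/<suffix> between the host and any path component."""
    parts = urlsplit(identifier)
    path = parts.path.rstrip("/")  # a bare trailing slash is dropped before insertion
    return urlunsplit((parts.scheme, parts.netloc, f"/.well-known/{suffix}{path}", "", ""))


def discovery_urls(resource: str, resource_metadata: dict) -> dict:
    """List the metadata URLs an agent would fetch for this resource."""
    issuers = resource_metadata.get("authorization_servers", [])
    return {
        "protected_resource": well_known_url(resource, "oauth-protected-resource"),
        "authorization_servers": [
            well_known_url(issuer, "oauth-authorization-server") for issuer in issuers
        ],
    }


# Hypothetical metadata document used only for illustration.
SAMPLE_RESOURCE_METADATA = {
    "resource": "https://api.example.com",
    "authorization_servers": ["https://auth.example.com/tenant1"],
}
```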

WebMCP Signal agent.webmcp weight 8

WebMCP lets a page expose tools to an in-browser AI agent at runtime via navigator.modelContext.registerTool(). It's a runtime JavaScript API, so a static scan can only detect the API surface in your page source, not verify live tool registration.

API Docs Linked agent.api_docs_linked weight 14

Linking your API documentation from the homepage is the low-tech baseline that lets agents (and developers) find how to integrate with you.

API Catalog (RFC 9727) agent.api_catalog weight 10

An API Catalog at /.well-known/api-catalog (RFC 9727) is the standardised entry point for automated API discovery: a linkset+json document that points agents at your OpenAPI spec, docs, and status endpoint without them guessing paths.
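A linkset document groups links by relation type under each API's anchor, so reading the catalog is mostly a matter of picking out the relations you care about. The sketch below extracts the service-desc, service-doc, and status targets from a linkset+json body; the function name and the sample catalog are hypothetical.

```python
import json

RELATIONS = ("service-desc", "service-doc", "status")

# Hypothetical catalog body used only for illustration.
SAMPLE_CATALOG = json.dumps({
    "linkset": [{
        "anchor": "https://api.example.com/widgets",
        "service-desc": [{"href": "https://api.example.com/widgets/openapi.json",
                          "type": "application/json"}],
        "service-doc": [{"href": "https://example.com/docs/widgets"}],
        "status": [{"href": "https://api.example.com/widgets/health"}],
    }]
})


def catalog_links(linkset_json: str) -> dict:
    """Map each relation of interest to the list of target URLs found in the catalog."""
    document = json.loads(linkset_json)
    found = {relation: [] for relation in RELATIONS}
    for context in document.get("linkset", []):
        for relation in RELATIONS:
            for target in context.get(relation, []):
                if "href" in target:
                    found[relation].append(target["href"])
    return found
```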

Commerce Signals agent.commerce_signals weight 5

Storefront markers on the homepage (Product schema, a visible cart/checkout/shop link, og:type=product) signal to AI shopping agents that this site sells things — context for whether commerce-specific protocols like UCP are even relevant. Non-commerce sites skip this cleanly without affecting the score.

UCP Capabilities (emerging) agent.ucp_capabilities weight 5

The Universal Commerce Protocol (Google, Shopify, Etsy, Wayfair, Target, Walmart +20 endorsers, announced Jan 2026) lets AI shopping agents discover and negotiate with merchants through a JSON capability profile at /.well-known/ucp. Skipped for non-commerce sites.

Content Signals

15% of overall

Page Title Present content.title weight 8

The <title> is the single most-used signal for what a page is about — across search, link sharing, and AI summarisation.

H1 Heading Present content.h1_present weight 8

An <h1> states the page's primary topic inside the document itself — the first structural signal a parser reads.

Single H1 content.h1_single weight 6

Multiple <h1> elements create ambiguity about the page's primary topic for machines that rely on heading structure.

Language Declared content.lang weight 10

The lang attribute tells crawlers and assistants what language the content is in, which affects how it's processed and surfaced.

Content-Language Header content.content_language weight 5

The Content-Language response header reinforces the document language at the HTTP level for clients that read headers rather than markup.

Markdown Content Negotiation content.markdown_negotiation weight 15

Returning Markdown when an agent sends Accept: text/markdown gives models clean text without HTML noise — an emerging best practice for agent-friendly sites.
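On the server side, the decision reduces to reading the Accept header and comparing the preference for text/markdown against text/html. The sketch below is a simplified take on HTTP content negotiation, not a complete implementation; the function names are hypothetical and the tie-breaking rule (Markdown wins when its q-value is at least that of HTML) is an assumption.

```python
def parse_accept(header: str) -> dict:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    preferences = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unreadable q-value as "not acceptable"
        preferences[media] = quality
    return preferences


def pick_representation(accept: str) -> str:
    """Choose text/markdown only when the client asks for it explicitly."""
    prefs = parse_accept(accept)
    markdown_q = prefs.get("text/markdown", 0.0)
    html_q = prefs.get("text/html", prefs.get("text/*", prefs.get("*/*", 0.0)))
    return "text/markdown" if markdown_q > 0 and markdown_q >= html_q else "text/html"
```

A typical browser header never names text/markdown, so browsers keep receiving HTML. A server that negotiates this way should also send Vary: Accept so caches store the two representations separately.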

Link Response Headers content.link_header weight 8

RFC 8288 Link response headers let agents discover related resources — sitemap, llms.txt, your API — directly from the response, without parsing the page.
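From the agent's side, reading these headers means splitting the Link value into targets and grouping them by rel. The sketch below handles the common shape of the header with two regular expressions; it is not a full RFC 8288 parser (for example, it does not cope with commas inside quoted parameter values), and the function name, the sample header, and the relation names in it are illustrative.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_VALUE_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+(?:=(?:"[^"]*"|[^\s;,]+))?)*)')
# One parameter: ;name="quoted" or ;name=token
LINK_PARAM_RE = re.compile(r';\s*([\w-]+)(?:=(?:"([^"]*)"|([^\s;,]+)))?')

# Hypothetical header value used only for illustration.
SAMPLE_LINK_HEADER = (
    '</sitemap.xml>; rel="sitemap", '
    '</openapi.json>; rel="service-desc"; type="application/json"'
)


def links_by_rel(header: str) -> dict:
    """Group link targets by relation type."""
    result = {}
    for target, params in LINK_VALUE_RE.findall(header):
        for name, quoted, bare in LINK_PARAM_RE.findall(params):
            if name.lower() == "rel":
                # rel may carry several space-separated relation types
                for relation in (quoted or bare).lower().split():
                    result.setdefault(relation, []).append(target)
    return result
```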

Content/Code Ratio Signal content.text_ratio weight 8

A page that is almost all JavaScript shell with little server-rendered text is hard for non-executing crawlers to read — they see an empty page.
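A rough version of this ratio can be computed by removing script and style blocks, stripping the remaining tags, and dividing what is left by the size of the raw HTML. The sketch below uses regular expressions for brevity, which is cruder than a real HTML parser; the function name and sample pages are hypothetical, and no particular pass threshold is implied.

```python
import re

# Hypothetical pages used only for illustration.
SAMPLE_JS_SHELL = (
    '<html><body><div id="root"></div><script>'
    + "x=1;" * 200
    + "</script></body></html>"
)
SAMPLE_ARTICLE = (
    "<html><body><h1>Widgets</h1><p>"
    + "Plain server-rendered text. " * 20
    + "</p></body></html>"
)


def text_ratio(html: str) -> float:
    """Share of the raw HTML that is visible text, between 0.0 and 1.0."""
    if not html:
        return 0.0
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    without_tags = re.sub(r"(?s)<[^>]+>", " ", without_code)
    visible = " ".join(without_tags.split())  # collapse runs of whitespace
    return len(visible) / len(html)
```

The JavaScript shell yields a ratio of zero because nothing readable exists until the script runs, which is exactly what a non-executing crawler sees.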

Character Encoding Declared content.encoding weight 6

Declaring UTF-8 prevents garbled characters when crawlers and models decode your content.

Favicon Present content.favicon weight 6

A favicon is a small but real signal AI agents use for visual identification of your site in lists and previews.

Heading Hierarchy content.structured_headings weight 10

Headings that descend in order (h1 → h2 → h3 without skipping) give machines a clean outline of how the content is organised.
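Both heading checks, a single h1 and no skipped levels, can be run over the sequence of heading levels in document order. The sketch below pulls that sequence out with a regular expression and reports problems as plain strings; the function names and message wording are hypothetical.

```python
import re


def heading_levels(html: str) -> list:
    """Return heading levels in document order, e.g. [1, 2, 2, 3]."""
    return [int(level) for level in re.findall(r"(?i)<h([1-6])\b", html)]


def heading_problems(html: str) -> list:
    """Flag a missing or repeated h1 and any jump of more than one level down."""
    levels = heading_levels(html)
    problems = []
    h1_count = levels.count(1)
    if h1_count == 0:
        problems.append("no h1")
    elif h1_count > 1:
        problems.append(f"{h1_count} h1 elements")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"h{previous} followed by h{current}")
    return problems
```

Moving back up the outline, say from an h3 to an h2, is fine; only a descent that skips a level is flagged.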

Image Alt Texts content.image_alts weight 5

Alt text is how non-visual consumers — including AI — understand what your images convey.

Internal Links Present content.internal_links weight 5

Internal links are how crawlers discover the rest of your site. A homepage with very few of them looks like a dead end.
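The alt-text and internal-link checks can share one pass over the homepage markup. The sketch below counts images without alt text and collects the distinct same-host paths linked from the page; the class name and sample markup are hypothetical, and treating an empty alt attribute as missing is a simplifying assumption, since empty alt text is legitimate for decorative images.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical homepage body used only for illustration.
SAMPLE_BODY = """\
<a href="/pricing">Pricing</a>
<a href="https://example.com/docs/">Docs</a>
<a href="https://other.example.org/">Partner</a>
<a href="mailto:hi@example.com">Email</a>
<img src="a.png" alt="Widget on a workbench">
<img src="b.png">
"""


class LinkAndImageAudit(HTMLParser):
    """Collect same-host link paths and count images lacking alt text."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.host = urlsplit(base_url).netloc.lower()
        self.internal_links = set()
        self.images = 0
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            target = urlsplit(urljoin(self.base_url, a["href"]))
            if target.scheme in ("http", "https") and target.netloc.lower() == self.host:
                self.internal_links.add(target.path or "/")
        elif tag == "img":
            self.images += 1
            if not (a.get("alt") or "").strip():
                self.images_missing_alt += 1


def audit_links_and_images(html: str, base_url: str) -> LinkAndImageAudit:
    audit = LinkAndImageAudit(base_url)
    audit.feed(html)
    audit.close()
    return audit
```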