Audit glossary
Every signal AIAuditFix checks, grouped by category, with what it means and why it matters for AI agents and crawlers. Category percentages are the weight each contributes to your overall score.
AI Crawler Access
25% of overall
robots.txt is how you tell crawlers, including AI bots, what they may fetch. Without one, every crawler falls back to its own default behaviour.
A malformed robots.txt is ignored outright by some crawlers, so the rules you intended silently don't apply.
GPTBot is OpenAI's crawler for gathering content that may be used to train its models. Whether you allow or block it should be a deliberate choice; if you declare no rule, the outcome is left to OpenAI's default.
ClaudeBot is Anthropic's crawler. Addressing it explicitly decides whether your content can be read and surfaced in Claude's answers.
PerplexityBot feeds Perplexity's answer engine. Leaving it unaddressed means Perplexity applies whatever default it chooses, not what you'd choose.
Bytespider (ByteDance / TikTok) is a high-volume crawler with a mixed reputation. Decide explicitly whether to allow or block it rather than leaving it open.
CCBot feeds Common Crawl, the open corpus that many AI models train on. Addressing it controls whether your content enters that training data.
Applebot-Extended specifically governs whether Apple may use your content for AI training — separate from ordinary Applebot search indexing.
Addressing some AI crawlers but not others leaves coverage gaps. A consistent policy across all the major bots is clearer to operate and easier to defend.
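A minimal sketch of how a coverage gap could be detected, assuming a bot counts as addressed only when robots.txt names it in its own User-agent line. The function names and the sample file are hypothetical; the bot list is taken from the entries above.

```python
# Bots named in this glossary; extend the list to match your own policy.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot", "Applebot-Extended"]


def declared_agents(robots_txt: str) -> set:
    """Collect every User-agent token named in a robots.txt body, lowercased."""
    agents = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def unaddressed_bots(robots_txt: str) -> list:
    """Return the AI bots that have no explicit User-agent group."""
    declared = declared_agents(robots_txt)
    return [bot for bot in AI_BOTS if bot.lower() not in declared]


# Hypothetical robots.txt that addresses only two of the six bots.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /admin/
"""

print(unaddressed_bots(SAMPLE))
```

The wildcard group still applies to the four bots reported here, but nothing in the file shows that their treatment was a deliberate choice.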
A large Crawl-delay throttles how fast crawlers may fetch your pages. Above roughly 10 seconds it materially slows how quickly AI systems can index you.
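One way to check the Crawl-delay threshold is with Python's standard urllib.robotparser module. This is a sketch under stated assumptions: the 10-second limit comes from the entry above, and the helper names and sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

MAX_DELAY_SECONDS = 10  # threshold taken from the glossary entry above


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay that applies to user_agent, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def is_delay_excessive(robots_txt: str, user_agent: str) -> bool:
    """True when the applicable Crawl-delay exceeds the threshold."""
    delay = crawl_delay_for(robots_txt, user_agent)
    return delay is not None and delay > MAX_DELAY_SECONDS


# Hypothetical file: every crawler is asked to wait 30 seconds between fetches.
SLOW = "User-agent: *\nCrawl-delay: 30\nDisallow: /private/\n"

print(crawl_delay_for(SLOW, "GPTBot"))
print(is_delay_excessive(SLOW, "GPTBot"))
```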
Content Discoverability
25% of overall
llms.txt is an emerging convention (llmstxt.org) for telling language models which of your URLs matter most, in clean Markdown. Publishing one signals AI-readiness.
An llms.txt with no '# ' heading or no links can't be interpreted the way the spec intends — models get nothing structured from it.
A bare llms.txt (just a title) gives a model almost nothing. Good ones have a description and several sectioned links to your most useful pages.
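The three llms.txt entries above can be approximated with a small structural check. This sketch assumes the layout described at llmstxt.org (a '# ' title, a '> ' summary line, '## ' sections, Markdown links); the function name and sample content are hypothetical, and this is not necessarily how AIAuditFix scores the file.

```python
import re

# Matches Markdown links of the form [label](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def check_llms_txt(text: str) -> dict:
    """Summarise the structural signals found in an llms.txt body."""
    lines = text.splitlines()
    return {
        "has_title": any(line.startswith("# ") for line in lines),
        "has_description": any(line.startswith("> ") for line in lines),
        "section_count": sum(1 for line in lines if line.startswith("## ")),
        "link_count": len(LINK_RE.findall(text)),
    }


# Hypothetical llms.txt with a title, a summary, one section and two links.
GOOD = """\
# Example Co

> Example Co sells widgets and publishes integration guides.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): Install and first request
- [API reference](https://example.com/docs/api.md)
"""

print(check_llms_txt(GOOD))
print(check_llms_txt("# Example Co\n"))  # the "bare" case: title only
```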
llms-full.txt is the companion file carrying your expanded content in one fetch, for models that want the full text rather than a link index.
A sitemap tells crawlers which URLs exist and how recently they changed — the difference between your pages being discovered and being missed.
Declaring the sitemap inside robots.txt makes it discoverable by any crawler that reads robots, without it having to guess the conventional path.
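To see whether robots.txt declares a sitemap, the standard library parser can list its Sitemap lines (site_maps() is available from Python 3.8). A minimal sketch; the helper name and the sample URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list:
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when nothing is declared


WITH_SITEMAP = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"

print(declared_sitemaps(WITH_SITEMAP))
```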
AI summaries and search snippets frequently use the meta description. Without one, the model takes whatever text appears first — often menu links or boilerplate.
og:title controls the headline shown when your link is shared or surfaced by an assistant or social platform.
og:description is the blurb shown in link previews across social, chat, and AI tools.
When the same content is reachable via several URLs, the canonical link tells crawlers which is authoritative and prevents duplicate-content dilution.
A noindex directive on the homepage removes it from search and AI indexes entirely. On a homepage that's almost always a leftover staging-config mistake.
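The head-level signals above (meta description, og:title, og:description, canonical link, noindex) can all be read in one pass over the HTML. The following is a sketch using the standard html.parser module; the class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect the discoverability tags described above from an HTML document."""

    def __init__(self):
        super().__init__()
        self.signals = {
            "description": None,
            "og:title": None,
            "og:description": None,
            "canonical": None,
            "noindex": False,
        }

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            # Open Graph tags use property=, ordinary meta tags use name=.
            key = (attr.get("name") or attr.get("property") or "").lower()
            content = attr.get("content") or ""
            if key in ("description", "og:title", "og:description"):
                self.signals[key] = content
            elif key == "robots" and "noindex" in content.lower():
                self.signals["noindex"] = True
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.signals["canonical"] = attr.get("href")


def audit_head(html: str) -> dict:
    parser = HeadSignals()
    parser.feed(html)
    return parser.signals


# Hypothetical homepage that forgot og:description and still carries a staging noindex.
PAGE = (
    '<html><head><title>Example</title>'
    '<meta name="description" content="Widgets for every workshop.">'
    '<meta property="og:title" content="Example Co">'
    '<link rel="canonical" href="https://example.com/">'
    '<meta name="robots" content="noindex, nofollow">'
    '</head><body></body></html>'
)

print(audit_head(PAGE))
```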
AI crawlers and search engines prefer and prioritise HTTPS URLs; because HTTPS is treated as a quality signal, HTTP-only content can be deprioritised.
Structured Data
20% of overall
Schema.org JSON-LD lets crawlers and language models understand your page as structured entities (an Organization, a Product, an Article), not just prose.
A JSON-LD block that doesn't parse is ignored entirely; the markup might as well not be there.
Models recognise the standard schema.org types (Organization, WebSite, Article, Product…). Custom or invented @type values are ignored.
An Organization node with name, url, and description tells AI who you are in a machine-readable form it can quote with confidence.
A WebSite node helps assistants understand the site as a whole and can enable sitelinks-search-box behaviour.
BreadcrumbList markup tells AI where a page sits in your site's structure — useful context for navigation and summarisation. We only scan the homepage, which is the root and has no parent path, so absence here is reported as a skip (not a fail) — but if you do publish one, we credit it as a positive signal.
Richer markup — multiple types or an @graph — gives models more entity context than a single bare node.
Missing required fields (an Article with no headline, a Product with no name) make the markup unreliable, and AI may discard it rather than risk bad data.
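The parse and required-field checks above can be sketched in a few lines. The required-field table below only encodes the two examples given in the text (Article needs a headline, Product needs a name) and is an illustrative assumption, not schema.org's or any search engine's full rule set; the function names and sample markup are hypothetical.

```python
import json

# Illustrative only: the two examples named in the glossary entry above.
REQUIRED_FIELDS = {
    "Article": ["headline"],
    "Product": ["name"],
}


def iter_nodes(data):
    """Yield each typed node from a JSON-LD value, flattening lists and @graph."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        if "@graph" in data:
            yield from iter_nodes(data["@graph"])
        if "@type" in data:
            yield data


def check_json_ld(raw: str) -> list:
    """Return a list of problems found in one JSON-LD block (empty list if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    problems = []
    for node in iter_nodes(data):
        declared = node["@type"]
        types = declared if isinstance(declared, list) else [declared]
        for type_name in types:
            for field in REQUIRED_FIELDS.get(type_name, []):
                if not node.get(field):
                    problems.append(f"{type_name} is missing '{field}'")
    return problems


# Hypothetical @graph with a complete Organization and an Article lacking a headline.
MARKUP = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "name": "Example Co", "url": "https://example.com"},
        {"@type": "Article", "author": "A. Writer"},
    ],
})

print(check_json_ld(MARKUP))
print(check_json_ld('{"@type": '))  # truncated block that does not parse
```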
Agent Protocols
15% of overall
An MCP Server Card at /.well-known/mcp/server-card.json (SEP-1649) declares the tools your site exposes to AI agents, the emerging standard for agent-tool discovery. See modelcontextprotocol.io.
A useful MCP card needs a name, description, and a tools array; without them an agent can't tell what your site can actually do.
An A2A Agent Card at /.well-known/agent-card.json describes your agent's skills so other agents can interoperate with it.
An A2A card needs name, description, and a skills/capabilities field to be actionable by another agent.
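A sketch of the field checks described for both cards, assuming each card has already been fetched and parsed into a dict. It only tests for the fields named in the entries above and does not validate against either specification; the function name and sample cards are hypothetical.

```python
def missing_card_fields(card: dict, kind: str) -> list:
    """Return the fields a card lacks; kind is 'mcp' or 'a2a'."""
    missing = [field for field in ("name", "description") if not card.get(field)]
    if kind == "mcp":
        # The text asks for a tools array, so anything other than a list counts as missing.
        if not isinstance(card.get("tools"), list):
            missing.append("tools")
    elif kind == "a2a":
        # Either a skills or a capabilities field satisfies the entry above.
        if not (card.get("skills") or card.get("capabilities")):
            missing.append("skills/capabilities")
    else:
        raise ValueError(f"unknown card kind: {kind!r}")
    return missing


# Hypothetical cards, for illustration only.
mcp_card = {"name": "example-shop", "description": "Order lookup tools", "tools": [{"name": "find_order"}]}
a2a_card = {"name": "example-agent"}

print(missing_card_fields(mcp_card, "mcp"))
print(missing_card_fields(a2a_card, "a2a"))
```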
A published OpenAPI description lets agents discover and call your API programmatically, without a bespoke integration for each one.
OAuth Authorization Server Metadata lets agents discover how to authenticate to your API automatically instead of needing manual setup.
OAuth Protected Resource Metadata (RFC 9728) is the resource-server sibling of OAuth discovery: it tells agents which authorization servers issue tokens for your API, so they can pick the right one without manual configuration.
WebMCP lets a page expose tools to an in-browser AI agent at runtime via document.modelContext.registerTool(). It's a runtime JavaScript API, so a static scan can only detect the API surface in your page source, not verify live tool registration.
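What a static scan can and cannot see is easiest to show with a pattern match over the page source. This sketch looks for a modelContext.registerTool( call, as named in the entry above; the regular expression and function name are assumptions, and a match proves only that the call appears in the source, not that a tool is registered at runtime.

```python
import re

# Matches "modelContext.registerTool(" with optional whitespace around the dot.
REGISTER_TOOL_RE = re.compile(r"\bmodelContext\s*\.\s*registerTool\s*\(")


def mentions_webmcp(page_source: str) -> bool:
    """True when the page source contains a registerTool call on modelContext."""
    return bool(REGISTER_TOOL_RE.search(page_source))


# Hypothetical inline script.
PAGE_SOURCE = '<script>document.modelContext.registerTool({name: "search"});</script>'

print(mentions_webmcp(PAGE_SOURCE))
print(mentions_webmcp("<script>console.log('no tools here');</script>"))
```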
Linking your API documentation from the homepage is the low-tech baseline that lets agents (and developers) find how to integrate with you.
An API Catalog at /.well-known/api-catalog (RFC 9727) is the standardised entry point for automated API discovery: a linkset+json document that points agents at your OpenAPI spec, docs, and status endpoint without them guessing paths.
Storefront markers on the homepage (Product schema, a visible cart/checkout/shop link, og:type=product) signal to AI shopping agents that this site sells things — context for whether commerce-specific protocols like UCP are even relevant. Non-commerce sites skip this cleanly without affecting the score.
The Universal Commerce Protocol (Google, Shopify, Etsy, Wayfair, Target, Walmart +20 endorsers, announced Jan 2026) lets AI shopping agents discover and negotiate with merchants through a JSON capability profile at /.well-known/ucp. Skipped for non-commerce sites.
Content Signals
15% of overall
The <title> is the single most-used signal for what a page is about across search, link sharing, and AI summarisation.
An <h1> states the page's primary topic inside the document itself — the first structural signal a parser reads.
Multiple <h1> elements create ambiguity about the page's primary topic for machines that rely on heading structure.
The lang attribute tells crawlers and assistants what language the content is in, which affects how it's processed and surfaced.
The Content-Language response header reinforces the document language at the HTTP level for clients that read headers rather than markup.
Returning Markdown when an agent sends Accept: text/markdown gives models clean text without HTML noise — an emerging best practice for agent-friendly sites.
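On the server side, the Markdown negotiation above comes down to reading the Accept header. A minimal sketch, assuming the server keeps both an HTML and a Markdown rendering of the page; the function names are hypothetical, and the logic ignores preference ordering between types and simply serves Markdown whenever it is listed with a non-zero q-value.

```python
def wants_markdown(accept_header: str) -> bool:
    """True when text/markdown is listed in an Accept header with a non-zero q-value."""
    for part in accept_header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        if pieces[0].lower() != "text/markdown":
            continue
        quality = 1.0  # q defaults to 1 when omitted
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparseable q as "not acceptable"
        return quality > 0
    return False


def choose_content_type(accept_header: str) -> str:
    """Pick the Content-Type to respond with."""
    if wants_markdown(accept_header):
        return "text/markdown; charset=utf-8"
    return "text/html; charset=utf-8"


print(choose_content_type("text/markdown, text/html;q=0.8"))
print(choose_content_type("text/html,application/xhtml+xml"))
```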
RFC 8288 Link response headers let agents discover related resources — sitemap, llms.txt, your API — directly from the response, without parsing the page.
A page that is almost all JavaScript shell with little server-rendered text is hard for non-executing crawlers to read — they see an empty page.
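A rough way to see what a non-executing crawler sees is to count the words left once script and style content is ignored. This is a sketch with the standard html.parser module; the class name, function name and sample pages are hypothetical, and what count is "too low" is left for you to decide.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    """Count the words a crawler could read without executing JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


# Hypothetical JavaScript shell versus a server-rendered page.
SHELL = (
    '<html><head><title>App</title><script>window.__DATA__ = {};</script></head>'
    '<body><div id="root"></div><script src="/bundle.js"></script></body></html>'
)
RENDERED = '<html><body><h1>Widgets</h1><p>We build durable widgets for workshops.</p></body></html>'

print(visible_word_count(SHELL))
print(visible_word_count(RENDERED))
```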
Declaring UTF-8 prevents garbled characters when crawlers and models decode your content.
A favicon is a small but real signal AI agents use for visual identification of your site in lists and previews.
Headings that descend in order (h1 → h2 → h3 without skipping) give machines a clean outline of how the content is organised.
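The heading rules above (a single h1, no skipped levels on the way down) can be checked by recording heading levels in document order. A sketch with hypothetical names; it treats moving back up the outline, for example h3 to h2, as valid.

```python
from html.parser import HTMLParser


class HeadingLevels(HTMLParser):
    """Record the level of every h1 to h6 element in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html: str) -> list:
    """Return outline problems: wrong h1 count, or a level skipped when descending."""
    parser = HeadingLevels()
    parser.feed(html)
    problems = []
    h1_count = parser.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    previous = 0
    for level in parser.levels:
        if previous and level > previous + 1:
            problems.append(f"h{previous} is followed by h{level}")
        previous = level
    return problems


print(heading_problems("<h1>A</h1><h2>B</h2><h4>C</h4><h1>D</h1>"))
print(heading_problems("<h1>A</h1><h2>B</h2><h3>C</h3><h2>D</h2>"))
```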
Alt text is how non-visual consumers — including AI — understand what your images convey.
Internal links are how crawlers discover the rest of your site. A homepage with very few of them looks like a dead end.
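Alt text and internal links can both be counted in a single pass over the homepage. This sketch assumes "internal" means same host as the page URL and counts an empty alt attribute as missing, which will also flag purely decorative images; the class name and sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndImageScan(HTMLParser):
    """Count same-host links and images without alt text."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.internal_links = set()
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            target = urljoin(self.base_url, attr["href"])  # resolve relative links
            parsed = urlparse(target)
            if parsed.scheme in ("http", "https") and parsed.netloc == self.host:
                self.internal_links.add(target.split("#", 1)[0])  # ignore fragments
        elif tag == "img" and not (attr.get("alt") or "").strip():
            self.images_missing_alt += 1


# Hypothetical homepage fragment: two internal links, one external, one mailto.
FRAGMENT = (
    '<a href="/docs">Docs</a>'
    '<a href="https://example.com/pricing#plans">Pricing</a>'
    '<a href="https://other.example/">Partner</a>'
    '<a href="mailto:hi@example.com">Mail</a>'
    '<img src="a.png" alt="Widget on a bench"><img src="b.png">'
)

scan = LinkAndImageScan("https://example.com/")
scan.feed(FRAGMENT)
print(sorted(scan.internal_links))
print(scan.images_missing_alt)
```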