How to Optimize Static HTML for AI Crawlers
Key takeaways
- Most AI crawlers do not execute JavaScript. Static HTML or server-side rendering is the baseline requirement.
- The pages most likely to be cited are original research, definitions, comparisons, and how-tos — in that order.
- GPTBot raw requests grew +305% YoY May 2024 to May 2025; ChatGPT-User grew +2,825%; PerplexityBot grew +157,490% (Cloudflare, 2025).
- Add `llms.txt`, a markdown mirror per page, JSON-LD, and Content Signals in `robots.txt`.
- Recency matters: AI assistants shift cited publication dates forward by up to 4.78 years when reranking, per a 2025 study.
If you want a static HTML page to be crawled and cited by AI assistants like ChatGPT, Claude, Perplexity, and Google's AI Overviews, you need three things in place:
(1) the page content has to live in the initial HTML response, not behind JavaScript;
(2) the page has to be structured so that a language model can extract a complete answer to a specific question from a single chunk;
and (3) the site has to declare its preferences and signals to crawlers via robots.txt, sitemap.xml, and (increasingly) llms.txt.
This page is itself an example of all three.
The three kinds of AI crawlers
AI crawlers are not a single thing. They fall into three categories, and each has a different appetite. Optimizing for one type alone leaves citations on the table.
| Category | Examples | Honors robots.txt? | What they want |
|---|---|---|---|
| Training | GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, Google-Extended | Yes | Breadth, novelty, original data — bulk corpora |
| Index / search | OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot | Yes | Stable URLs, structured pages, fresh content |
| User-triggered | ChatGPT-User, Claude-User, Perplexity-User, Meta-ExternalFetcher | Largely no | Authoritative pages on demand, for citations |
If your goal is appearing in AI answers, the index/search and user-triggered bots are what produce visible citations. The training bots build long-term recognition of your brand and topic authority.
Page types most likely to be crawled and cited
Public studies by Ahrefs, Semrush, and Microsoft converge on the same ranking. The list below reflects citation lift, not raw crawl volume.
- Original research and proprietary data. Surveys, benchmarks, statistics with a methodology blurb and a sample size. AI assistants need numbers to cite.
- Definitional / glossary pages (“What is X”). Direct answer in the first paragraph, then context.
- Comparison pages (“X vs Y”, “best X for Y”). Tables and side-by-sides extract verbatim.
- How-to and step-by-step guides with numbered steps and code blocks where relevant.
- Pricing or cost pages with concrete numbers in plain text (not images).
- FAQ and Q&A pages using clear question headings.
- Reference and API documentation — this is where `llms.txt` earns its keep.
- Programmatic pages with consistent schemas (e.g. “{thing} statistics 2026”).
- Recent news and time-stamped analysis for YMYL or fast-moving topics.
- Free tools and calculators — cited heavily and convert well from referrals.
Page-level signals to add
Each item below corresponds to something in this page's HTML. View source if you want a copy-paste reference.
- Server-side rendered HTML. Content is in the response body, not injected by JavaScript.
- One `<h1>` phrased as the user's question. Subheads (`<h2>`, `<h3>`) phrased as subquestions.
- Direct answer in the lead paragraph, plus a Key takeaways block above the fold.
- Short paragraphs (2–4 sentences) and lists. Improves chunk extractability for retrieval-augmented generation.
- Stats and quotes in plain text, with units, dates, and an inline source.
- Visible Last updated date plus a machine-readable `<time datetime="...">` element and `dateModified` in JSON-LD (see the sketch after this list).
- Author byline with credentials and a real bio link.
- JSON-LD structured data matching content type: `Article`, `FAQPage`, `HowTo`, `Product`, `Dataset`.
- Stable canonical URL. Citations need a permanent target.
- Markdown mirror at `page.html.md` for any docs/reference content. This page has one at `/index.html.md`.
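Here is a minimal sketch of the freshness markup from the list above. The dates, author name, and URLs are placeholders; only the element and property names (`<time datetime>`, `dateModified`) come from the signals themselves.

```html
<!-- Visible date for readers, machine-readable datetime for crawlers -->
<p>Last updated: <time datetime="2025-06-01">June 1, 2025</time></p>

<!-- Article JSON-LD; keep dateModified in sync with the <time> element.
     All values below are illustrative. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Optimize Static HTML for AI Crawlers",
  "datePublished": "2025-05-01",
  "dateModified": "2025-06-01",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "url": "https://example.com/about"
  },
  "mainEntityOfPage": "https://example.com/index.html"
}
</script>
```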
Site-level signals to add
- `robots.txt` with explicit allow rules for the AI bots you want and Content Signals directives (a sketch follows this list). View this site's robots.txt.
- `llms.txt` at the site root, listing your most useful pages in markdown. View this site's llms.txt.
- `sitemap.xml` with accurate `<lastmod>` per URL. View sitemap.
- Allowlist verified AI bots in your WAF or Bot Management. Cloudflare's AI Crawl Control gives per-bot policies.
- Backlinks from authoritative sources. Citations skew heavily toward DR 80+ domains because high-authority pages are retrieved more often.
- Brand mentions across UGC platforms (Reddit, Wikipedia, YouTube transcripts, Stack Overflow). These influence both training and real-time RAG.
- Fast TTFB and 200-status responses. AI crawlers crawl at scale; 5xx and 429s reduce future crawl depth.
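As a sketch, here is what the `robots.txt` described in the first bullet might look like for a site that wants citations but not training use. The bot names are real; the Content-Signal line follows the format documented at contentsignals.org, and the policy choices are illustrative, not a recommendation.

```txt
# Index/search bots that produce visible citations
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Training bots: drop these blocks if you do want training use
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Content Signals for everything else
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

And a matching `sitemap.xml` entry with an accurate `<lastmod>` (URL and date are placeholders):

```xml
<url>
  <loc>https://example.com/index.html</loc>
  <lastmod>2025-06-01</lastmod>
</url>
```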
Gotchas worth flagging
Verify what an AI crawler actually receives: run `curl -A "GPTBot" https://example.com/page` on a sample URL and check that the content is present in the raw response.
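A slightly fuller version of that check, as a sketch. The user-agent string is GPTBot's token; the grep target is any phrase you know appears on the rendered page.

```sh
# Fetch the raw HTML as a non-JS crawler would and count a known phrase.
curl -s -A "GPTBot" https://example.com/page | grep -c "Key takeaways"

# If this prints 0 while the phrase is visible in a browser, the content
# is being injected by JavaScript and most AI crawlers will miss it.
```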
A static HTML checklist for AI crawl
Use this as a final pre-deploy review.
- Server-rendered HTML, content visible in `view-source:`
- One `<h1>` matching the page's question
- First paragraph or Key takeaways block answers the main question directly
- At least one stat, quote, or data point with a date and source
- `<time datetime="...">` and visible Last updated line
- Author byline with link to bio
- JSON-LD for the relevant schema type
- `<table>` or `<ol>`/`<ul>` for any list or comparison data
- Canonical tag, clean URL, present in `sitemap.xml` with `lastmod`
- `robots.txt` allows the bots you want; Content Signals set
- Optional but worth it: `.md` mirror and `/llms.txt` entry
Frequently asked questions
Do AI crawlers execute JavaScript?
Most production AI crawlers (GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended) do not execute JavaScript. Content rendered client-side is invisible to them. Use server-side rendering, static site generation, or prerendering to put your content in the initial HTML response.
Should I use llms.txt?
Yes, for documentation, API reference, and content-heavy sites. The llms.txt file is a markdown index at the root of your domain that points language models to LLM-friendly versions of your most useful pages. It coexists with robots.txt and sitemap.xml. Adoption is uneven across crawlers, but the cost of adding it is near zero.
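A minimal llms.txt sketch in the llmstxt.org format. The site name, descriptions, and URLs are placeholders; the structure (H1 name, blockquote summary, H2 sections of markdown links) comes from the spec.

```md
# Example Site

> One-line summary of what the site covers and who it is for.

## Docs

- [How to optimize static HTML for AI crawlers](https://example.com/index.html.md): full guide as markdown

## Optional

- [Changelog](https://example.com/changelog.html.md): release notes
```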
Which AI crawlers should I allow in robots.txt?
If your goal is to be cited in AI answers, allow OAI-SearchBot, PerplexityBot, Claude-SearchBot, and ChatGPT-User. Allow GPTBot, ClaudeBot, and Google-Extended only if you also want your content used for model training. Note that ChatGPT-User, Claude-User, and Perplexity-User are user-triggered and largely ignore robots.txt by design.
What page format gets cited most often?
Original research with concrete numbers, definitional or glossary pages, comparison pages, and how-to guides with numbered steps. Pages with quotes and statistics show roughly 30–40% higher visibility in AI answers, according to a 2023 study of 10,000 queries.
Sources
- Cloudflare blog, “From Googlebot to GPTBot: who's crawling your site in 2025” — blog.cloudflare.com
- Cloudflare AI Crawl Control documentation — developers.cloudflare.com/ai-crawl-control
- OpenAI bot reference — platform.openai.com/docs/bots
- Perplexity bot reference — docs.perplexity.ai/guides/bots
- llms.txt specification — llmstxt.org
- Content Signals — contentsignals.org
- Ahrefs, “How to Earn LLM Citations” — ahrefs.com/blog/llm-citations
- Semrush GEO guide — semrush.com