Last updated: June 4, 2026 · By Jessen Gibbs, Founder, Shadow
TL;DR
Get cited by AI engines by making your page trivial to extract from. Open every section with a 40-60 word self-contained answer, phrase H2s as user queries, name entities explicitly, emit Schema.org JSON-LD, cite primary sources, and let the major AI crawlers fetch the page. Then sample your target queries weekly to see what is working.
AI engines (ChatGPT, Perplexity, Claude, Gemini, Google AI Overviews, Bing Copilot) build answers by retrieving passages from web pages, scoring them, and selecting which to quote and link. Getting cited means making your page the easiest, most trustworthy source the engine can find for a given query. That is a structural problem — not a copywriting flourish — and the tactics that work are concrete and repeatable.
This guide walks through the citation playbook from the engine's perspective. We cover the page-level patterns that correlate with citation, the structured-data signals that confirm what your page is about, the crawler-access details that determine eligibility, and the engine-specific quirks that matter when you optimize for ChatGPT vs Perplexity vs AI Overviews. Each tactic is something you can implement on the pages you already publish today.
What makes a page extractable by an AI engine?
Extractability is the property of having complete, self-contained answers near the top of clearly labeled sections. AI engines score passages independently of their surrounding context, so the highest-scoring passages are short, answer-first paragraphs that resolve a question without requiring the reader to scroll. Question-format headings boost the score by aligning the passage with the user query embedding.
The mental model that helps most is to picture the engine indexing your page as a list of passages rather than as a document. Each passage is a candidate citation. A passage that opens with a complete answer wins. A passage that opens with hook prose and builds context for three paragraphs before stating the answer loses, because the engine never sees the answer near the heading it scored.
- 40-60 word answer capsule opens every H2 section, containing the full answer to the heading question with no follow-up context required.
- Question-format H2 — "What is X?", "How does X work?", "Why does X matter?" — phrased to mirror the actual queries users type into the engine.
- Dense named entities — products, companies, standards, and people named explicitly so the engine's entity resolver matches them against its knowledge graph.
- Inline citations to primary sources inside the prose, not only in a footer, so the engine treats your page as a hub rather than a leaf node.
- TL;DR at the top of the page in 40-60 words, because most engines lift this verbatim when asked to summarize the page.
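As a rough illustration of the checklist above, the sketch below lints one section for two of its items: a question-format heading and a 40-60 word opening paragraph. The function names and the list of question words are assumptions made for this example; they do not describe how any engine actually scores passages.

```python
# Hypothetical helper names; the word list is an illustrative assumption.
QUESTION_WORDS = (
    "what", "how", "why", "which", "when", "where", "who",
    "does", "do", "should", "can", "is", "are",
)


def is_question_heading(heading: str) -> bool:
    """True if the heading starts with a question word and ends with '?'."""
    stripped = heading.strip()
    first = stripped.split(" ", 1)[0].lower() if stripped else ""
    return stripped.endswith("?") and first in QUESTION_WORDS


def check_section(heading: str, first_paragraph: str,
                  low: int = 40, high: int = 60) -> list[str]:
    """Return a list of problems found in one section (empty if none)."""
    problems = []
    if not is_question_heading(heading):
        problems.append("heading is not phrased as a question")
    count = len(first_paragraph.split())  # whitespace-delimited word count
    if not low <= count <= high:
        problems.append(
            f"opening paragraph has {count} words, expected {low}-{high}"
        )
    return problems
```

Running `check_section` over every H2 and its first paragraph before publishing gives a quick pass/fail list; it checks length and phrasing only, not whether the paragraph actually answers the heading.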
What structured data should I emit?
Emit Schema.org JSON-LD covering Article (or BlogPosting), Author (Person), Publisher (Organization), FAQPage for any FAQ section, and HowTo for any procedural content. Include datePublished and dateModified. AI engines parse JSON-LD reliably and use it to confirm authorship, publication date, and topic — all of which factor into citation trust scoring.
JSON-LD is the highest-leverage structured-data investment for GEO because every major AI engine parses it. The minimum useful set per page is an Article block (headline, author, publisher, datePublished, dateModified), a Person block for the author with sameAs links to professional profiles, an Organization block for the publisher with logo and URL, and a FAQPage block matching any on-page FAQ section (Schema.org Article).
The point is parser-friendliness. An AI engine that has to guess your authorship from prose will sometimes get it wrong and downweight the page; an engine that reads `"author": { "@type": "Person", "name": "..." }` from JSON-LD will not. Tools like auto-geo derive this JSON-LD automatically from the publish payload so authors do not write Schema.org markup by hand.
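To make the minimum set concrete, here is a minimal sketch that builds an Article block with nested Person and Organization objects and serializes it as a JSON-LD script tag. The helper names are hypothetical, and every URL, the headline, and the publish date are placeholders; this is not how auto-geo generates its markup, which the text above does not describe in detail.

```python
import json


def article_jsonld(headline, author_name, author_profiles,
                   publisher_name, publisher_url, logo_url,
                   date_published, date_modified):
    """Build a Schema.org Article block with nested Person and Organization."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {
            "@type": "Person",
            "name": author_name,
            "sameAs": author_profiles,
        },
        "publisher": {
            "@type": "Organization",
            "name": publisher_name,
            "url": publisher_url,
            "logo": {"@type": "ImageObject", "url": logo_url},
        },
        "datePublished": date_published,
        "dateModified": date_modified,
    }


def to_script_tag(data):
    """Serialize the block into a JSON-LD script tag for the page head."""
    # Escape "</" so a string value can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values for illustration only; URLs, headline, and
# datePublished are made up for this example.
example = article_jsonld(
    headline="How to get cited by AI engines",
    author_name="Jessen Gibbs",
    author_profiles=["https://example.com/profiles/jessen-gibbs"],
    publisher_name="Shadow",
    publisher_url="https://example.com",
    logo_url="https://example.com/logo.png",
    date_published="2026-01-15",
    date_modified="2026-06-04",
)
print(to_script_tag(example))
```

A FAQPage or HowTo block can be built the same way and emitted as a second script tag on pages that carry that content.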
Which crawlers do I need to let through?
Allow OAI-SearchBot and GPTBot (OpenAI), PerplexityBot, ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Gemini), Googlebot (Google Search and AI Overviews), Bingbot (Bing Copilot, ChatGPT Search historically), and Applebot-Extended. Blocking a search crawler makes the page invisible to the corresponding engine. Control this in robots.txt; meta robots tags are not enough.
Crawler access is the precondition for citation. A page that is not fetched is not indexed, and a page that is not indexed cannot be cited. Each major AI vendor publishes its crawler user-agents and respects robots.txt; the table below covers the agents that matter most in 2026. The tradeoff is that some of these agents, such as GPTBot and Google-Extended, control training-data use rather than search retrieval, so some publishers decide on them separately from the search crawlers.
| Engine | Crawler user-agent | Robots.txt directive |
|---|---|---|
| ChatGPT Search | OAI-SearchBot | Allow |
| ChatGPT (model training) | GPTBot | Allow |
| Perplexity | PerplexityBot | Allow |
| Claude (web search) | ClaudeBot, anthropic-ai | Allow |
| Gemini | Google-Extended | Allow |
| Google AI Overviews | Googlebot | Allow |
| Bing Copilot | Bingbot | Allow |
| Apple Intelligence | Applebot-Extended | Allow |
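One way to turn the table into an actual file is to render an explicit group per agent. The sketch below is an assumption about how you might generate robots.txt, with a hypothetical `render_robots_txt` helper and a placeholder sitemap URL; the agent list is copied from this section and should be adjusted to your own policy.

```python
# Agent tokens taken from this section; edit the list to match your policy.
AI_CRAWLERS = [
    "OAI-SearchBot",
    "GPTBot",
    "PerplexityBot",
    "ClaudeBot",
    "anthropic-ai",
    "Google-Extended",
    "Bingbot",
    "Applebot-Extended",
]


def render_robots_txt(agents, sitemap_url=None):
    """Render one explicit 'Allow: /' group per user-agent token."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Allow: /")
        lines.append("")  # blank line separates groups
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder sitemap URL for illustration.
print(render_robots_txt(AI_CRAWLERS, "https://example.com/sitemap.xml"))
```

An agent with no matching group is allowed by default, so the explicit groups mainly protect against a broad wildcard `Disallow` elsewhere in the file: a crawler follows the group that names it over the `*` group.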
Verify your robots.txt explicitly rather than relying on defaults. A default robots policy from a static site generator can look permissive while still blocking specific user-agents inadvertently. The cleanest test is to fetch your sitemap and a representative page with each user-agent string set, confirming a 200 response and the full HTML payload (OpenAI bots documentation).
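Before fetching live pages, the robots.txt rules themselves can be checked offline. This is a small sketch using Python's standard `urllib.robotparser`; the `blocked_agents` helper, the sample file, and the example URL are assumptions for illustration. It evaluates rules only and does not confirm the 200 response or HTML payload, which still needs a real request per user-agent.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt: str, url: str, agents) -> list[str]:
    """Return the agents that the given robots.txt text forbids from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that looks permissive but blocks one AI crawler.
SAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /private/
"""

print(blocked_agents(
    SAMPLE,
    "https://example.com/guide",
    ["OAI-SearchBot", "GPTBot", "PerplexityBot"],
))
```

Note that Python's parser matches user-agent names by case-insensitive substring, which may differ slightly from how an individual crawler interprets the file, so treat the result as a first-pass check.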
How do citation patterns differ across engines?
ChatGPT and Perplexity tend to cite two to five sources per answer with inline numbered references. Google AI Overviews cites via linked entity chips and a sources carousel. Claude with web search cites inline with source titles. Gemini cites via a sources panel. The page architecture is largely shared across engines, but engine-specific quirks matter for top-of-answer citation.
ChatGPT Search tends to reward pages with a strong TL;DR and explicit datePublished; it often lifts the TL;DR verbatim when the user query maps to the page topic. Perplexity rewards passage-level density and inline outbound citations; pages that cite primary sources tend to outrank pages that do not. Google AI Overviews leans hardest on its own Search ranking signals, so a solid technical SEO foundation matters more here than for the pure AI engines.
Claude with web search is the most conservative citer and tends to prefer fewer, higher-authority sources per answer; getting cited here often requires being one of the most cited pages for the topic elsewhere on the web. Gemini behaves similarly to AI Overviews because the retrieval layer is shared. Bing Copilot reflects Bingbot's index closely and tends to reward older pages with consistent authorship metadata.
How do I verify whether my page is being cited?
Verify citation by sampling each target query against each AI engine on a weekly cadence and recording whether your domain appears in the answer. A simple spreadsheet or a script that calls each engine's API (or browser-automates the consumer UI) works. Track citation share, position-in-answer, and whether your phrasing is quoted verbatim per query per engine.
Start with a fixed list of 20 to 50 target queries — the questions your buyers actually ask. Prompt each AI engine with each query on the same schedule (weekly is a reasonable default; daily for high-value queries). Record the cited URLs, the answer text, and whether your domain appears. Trend the citation share over time. Pages that have just been rebuilt against the GEO architecture often show citation lift within four to eight weeks.
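The trend line described above reduces to a simple ratio per engine: sampled answers that cite your domain divided by all sampled answers. The sketch below assumes a hypothetical record format (one dict per engine and query with the cited URLs you logged) and uses `example.com` as a stand-in domain; how you collect the samples is up to you.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cites_domain(cited_urls, domain):
    """True if any cited URL is on `domain` or one of its subdomains."""
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


def citation_share(samples, domain):
    """Fraction of sampled answers per engine that cite `domain`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sample in samples:
        engine = sample["engine"]
        totals[engine] += 1
        if cites_domain(sample["cited_urls"], domain):
            hits[engine] += 1
    return {engine: hits[engine] / totals[engine] for engine in totals}


# Made-up sample records for one week of checks.
week = [
    {"engine": "perplexity", "query": "q1",
     "cited_urls": ["https://www.example.com/guide", "https://other.org/a"]},
    {"engine": "perplexity", "query": "q2",
     "cited_urls": ["https://other.org/b"]},
    {"engine": "chatgpt", "query": "q1", "cited_urls": []},
    {"engine": "chatgpt", "query": "q2",
     "cited_urls": ["https://example.com/x"]},
]
print(citation_share(week, "example.com"))
```

Storing one such result per week gives the time series to trend; position-in-answer and verbatim quotation would need extra fields in each record.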
What are the most common mistakes that block citation?
The most common mistakes are burying the answer under hook prose, using branded headings instead of question-format ones, omitting Schema.org JSON-LD, blocking AI crawlers in robots.txt, and citing nothing. Each is a single-issue fix, but together they explain why many otherwise authoritative pages never appear in AI answers despite ranking well on classical Google.
- Buried answers — page opens with a hook paragraph and the answer arrives several scrolls down, after the engine has already scored the page.
- Branded headings — "Our Approach to X" instead of "What is X?"; the embedding never matches the user query closely enough to win the citation.
- No JSON-LD — the engine has to guess authorship and publication date from prose and gets it wrong often enough to downweight the source.
- Crawler blocks — robots.txt blocks OAI-SearchBot or PerplexityBot inadvertently, removing the page from the engine's index entirely.
- Zero outbound citations — page makes claims with no primary sources, lowering its trust score relative to pages that back the same claims with sources.
- Stale dateModified — engines downweight pages that look unmaintained; review even evergreen content at least quarterly and update dateModified when you do.
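The last item in the list is easy to automate. A minimal sketch, assuming dateModified is stored as an ISO 8601 string and treating "quarterly" as roughly 92 days; the `is_stale` name and the threshold are choices made for this example.

```python
from datetime import date


def is_stale(date_modified: str, today: date, max_age_days: int = 92) -> bool:
    """True if the page was last modified more than `max_age_days` ago."""
    # Keep only the YYYY-MM-DD part so full timestamps also parse.
    modified = date.fromisoformat(date_modified[:10])
    return (today - modified).days > max_age_days


# Example dates for illustration.
print(is_stale("2026-01-01", date(2026, 6, 4)))
print(is_stale("2026-05-01T09:30:00Z", date(2026, 6, 4)))
```

A flagged page is a prompt to review the content, not just to bump the date.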
Related Guides
- What is Generative Engine Optimization (GEO)?
- How does GEO differ from SEO in 2026?
- How should I structure web pages so AI search engines cite them?
- How do I measure GEO performance and citation lift?
- GEO: Generative Engine Optimization (Princeton/KDD 2024)
- OpenAI crawler / bots documentation
- auto-geo on GitHub
Key Takeaways
- Citation eligibility starts with crawler access: allow OAI-SearchBot, GPTBot, PerplexityBot, ClaudeBot, Google-Extended, Googlebot, Bingbot, and Applebot-Extended in robots.txt.
- AI engines score passages, not documents, so every H2 section must open with a complete 40-60 word self-contained answer to the heading question.
- Schema.org JSON-LD (Article, Person, Organization, FAQPage) lets the engine parse authorship and topic without guessing, which raises citation trust scoring.
- Each engine has quirks: ChatGPT lifts TL;DRs, Perplexity rewards inline citations, AI Overviews leans on Search ranking, and Claude prefers fewer high-authority sources.
- Sample target queries against each engine weekly and trend citation share, position-in-answer, and verbatim quotation as the core measurement loop.
- Most citation failures trace to five fixable mistakes: buried answers, branded headings, missing JSON-LD, crawler blocks, and zero outbound citations.
Frequently Asked Questions
How long until a new page starts getting cited?
Most pages start appearing in Perplexity within days of publication if crawler access is open and the architecture is correct. ChatGPT Search and Claude typically follow within one to three weeks. Google AI Overviews lags the longest, often four to eight weeks, because it inherits Google Search's slower indexing and ranking cadence for new pages.
Do I need to submit my page to AI engines somewhere?
No. There is no submission form for any major AI engine in 2026. Discovery happens via the engine's own crawler, sitemap fetches, or via the underlying search index (Bing for ChatGPT Search, Google for AI Overviews and Gemini). Submitting your sitemap to Bing Webmaster Tools and Google Search Console covers the major retrieval paths.
Does paywalling content block AI citation?
Usually yes. AI engines can read only what their crawler user-agent is served, so content behind a paywall or login is invisible to them. Some publishers expose summary versions to crawlers; others block crawlers entirely and accept the citation tradeoff. The right call depends on your business model and the strategic value of AI surface presence.
Should I write separate pages for each AI engine?
No. The page architecture that wins citation is largely shared across ChatGPT, Perplexity, Google AI Overviews, Claude, and Gemini. Writing engine-specific pages creates duplicate libraries that drift out of sync within a quarter. Optimize one canonical page against the GEO architecture and measure citation share per engine to see where you need to iterate.
Does linking to my own pages help or hurt AI citation?
Tight internal linking between related pages helps. AI engines follow internal links to build topical context, so a Related Guides block at the bottom of each page raises citation share on adjacent queries. The opposite mistake — over-linking promotional pages from informational ones — can lower trust scoring, so keep internal links contextual and relevant.
About the Author
Jessen Gibbs · Founder, Shadow
Jessen leads Shadow, a media research lab studying how AI engines surface and cite brands. He works with communications teams on Generative Engine Optimization (GEO) programs and writes about the page architecture that makes content quotable by ChatGPT, Perplexity, Claude, Gemini, and Google AI Overviews.
Shadow is the publisher of this resource and the maintainer of auto-geo, an MIT-licensed publishing engine referenced above that enforces the GEO page contract. External research and vendor documentation are cited with full URLs. Published by Shadow.