How ChatGPT, Perplexity and Google AI Overviews choose which sources to cite

A practical breakdown of how modern AI answer engines retrieve, rank and cite web sources — and what you can do to be one of them.

LumenEntity Research
Visibility & AI search team

When a user asks a question in ChatGPT (with search enabled), Perplexity or Google's AI Overviews, the model does not answer from its weights alone. It runs a retrieval pipeline: pull a small set of candidate documents from the web (or a curated index), rank them, and ground the generated answer in them. The citations shown with the answer are the documents that survived this pipeline.

The four stages of an AI answer

  1. Query understanding — the user's prompt is rewritten into one or more search queries.
  2. Retrieval — those queries hit a search index (Bing for ChatGPT and Copilot, Google for AI Overviews and Gemini, a proprietary blend for Perplexity).
  3. Reranking — candidate documents are re-scored by a smaller LLM or learned ranker on semantic relevance to the question.
  4. Generation — the top documents are passed into the answering model as context, which writes the response and attaches citations.
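The four stages above can be sketched as a toy pipeline. This is only an illustration under simplifying assumptions: word overlap stands in for the search index and the learned reranker, a string template stands in for the answering model, and every name here (`Doc`, `rewrite_query`, `retrieve`, `rerank`, `generate`, `answer`) is hypothetical rather than any engine's real API.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its words with edge punctuation stripped."""
    return {w.strip(".,?!:;()\"'").lower() for w in text.split()} - {""}


def rewrite_query(prompt: str) -> list[str]:
    """Stage 1 (toy): real engines rewrite with an LLM; here the prompt is passed through."""
    return [prompt]


def retrieve(queries: list[str], index: list[Doc], k: int = 10) -> list[Doc]:
    """Stage 2 (toy): keep up to k docs that share at least one word with any query."""
    terms = set().union(*(tokenize(q) for q in queries))
    return [d for d in index if terms & tokenize(d.text)][:k]


def rerank(question: str, candidates: list[Doc], top_n: int = 2) -> list[Doc]:
    """Stage 3 (toy): order candidates by Jaccard word overlap with the question."""
    q = tokenize(question)

    def score(doc: Doc) -> float:
        t = tokenize(doc.text)
        return len(q & t) / (len(q | t) or 1)

    return sorted(candidates, key=score, reverse=True)[:top_n]


def generate(question: str, sources: list[Doc]) -> str:
    """Stage 4 (toy): a template stands in for the answering model and its citations."""
    citations = " ".join(f"[{i}] {d.url}" for i, d in enumerate(sources, 1))
    return f"Answer to {question!r}, grounded in: {citations}"


def answer(prompt: str, index: list[Doc]) -> str:
    """Run the four stages in order."""
    candidates = retrieve(rewrite_query(prompt), index)
    return generate(prompt, rerank(prompt, candidates))
```

Even in this toy version, a document that never matches the query is never seen by the reranker, and the order of the candidate set is not the order of the final citations.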

Two practical consequences follow from this. First: if you are not in the underlying search index (Bing or Google), you are very unlikely to be retrieved, and a page that is never retrieved cannot be cited. Classic SEO is still table stakes. Second: ranking well on Bing or Google for the rewritten query gets you into the candidate set, but the rerank step picks the winners, and that step rewards different signals than a normal SERP.

What the rerank step rewards

  • Passage-level relevance. A single paragraph that directly answers the question beats a long article that buries the answer.
  • Authority of the domain. Editorially curated sources, official documentation, .gov and .edu domains, and Wikipedia consistently outrank thin content.
  • Freshness, weighted by topic. For evergreen topics, age barely matters. For news, software versions or pricing, recency dominates.
  • Clarity of structure. H2 / H3 / list / table markup helps retrievers chunk a page into atomic answers.
  • Schema.org annotations. Article, FAQPage, HowTo and Product markup gives both Google's AI Overviews and Bing-backed answer engines machine-readable facts about a page.
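To make "freshness, weighted by topic" concrete, here is a minimal scoring sketch. No engine publishes its reranking formula, so everything in it is an assumption chosen for illustration: the multiplicative form, the exponential decay, the `HALF_LIFE_DAYS` values and the function names.

```python
from datetime import date

# Hypothetical half-lives: how many days until a page's freshness score halves.
HALF_LIFE_DAYS = {"evergreen": 3650, "pricing": 90, "news": 7}


def freshness(published: date, today: date, topic: str) -> float:
    """Exponential decay of a page's age, with a topic-dependent half-life."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS[topic])


def rerank_score(relevance: float, authority: float,
                 published: date, today: date, topic: str) -> float:
    """Combine passage relevance, domain authority (both 0..1) and freshness."""
    return relevance * authority * freshness(published, today, topic)


if __name__ == "__main__":
    today = date(2025, 1, 1)
    for topic in ("evergreen", "news"):
        old = rerank_score(0.9, 1.0, date(2024, 1, 1), today, topic)
        new = rerank_score(0.6, 1.0, date(2024, 12, 31), today, topic)
        print(topic, "older page wins" if old > new else "newer page wins")
```

Under these assumed half-lives, a year-old page with the better passage still wins on an evergreen topic, while on a news topic the day-old page wins despite weaker relevance.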

The crawlers you should know about

AI engines either reuse a major search index or run their own crawler. If you block the wrong user agent in robots.txt, you disappear from the corresponding answer surface. The most relevant agents in 2026:

  • GPTBot: OpenAI's crawler for collecting model training data.
  • OAI-SearchBot: OpenAI's crawler for the search index behind ChatGPT Search.
  • ClaudeBot and anthropic-ai: Anthropic's crawlers for Claude.
  • PerplexityBot: Perplexity's crawler.
  • Google-Extended: a robots.txt token, not a separate crawler, that allows or denies use of your content for Gemini (formerly Bard) and Vertex AI. It does not affect Google Search; AI Overviews are part of Search and follow your Googlebot rules.
  • CCBot: Common Crawl, used as a training corpus by many open models.
  • Bingbot: still the index of record for ChatGPT and Copilot answers.
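A quick way to see which of these agents a given robots.txt blocks is Python's standard-library `urllib.robotparser`. The sketch below is a local check only, and it assumes each crawler honors robots.txt the way this parser interprets it; the agent list, the helper name `blocked_agents` and the sample file are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Agent tokens taken from the list above.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Google-Extended", "CCBot", "Bingbot",
]


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks only OpenAI's training crawler.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(blocked_agents(SAMPLE, "https://example.com/blog/post"))
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`.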

A short, practical checklist

  1. Confirm your robots.txt allows GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, Googlebot and Bingbot.
  2. Publish an llms.txt at the root of your domain pointing AI agents at your most important canonical pages.
  3. Add Article and FAQPage JSON-LD to editorial pages, and Organization + WebSite to your home page.
  4. Lead every article with a TL;DR of 3 to 5 bullets. It is the passage retrievers are most likely to pull.
  5. Keep update dates honest and visible — many rerankers parse them.
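For step 2, an llms.txt file can be generated from a short list of canonical pages. This sketch assumes the Markdown layout described in the public llms.txt proposal (a title heading, a one-line blockquote summary, then sections of links); the helper name `build_llms_txt` and the example.com pages are placeholders, not part of any standard tooling.

```python
def build_llms_txt(site_name: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render an llms.txt body: title, blockquote summary, then link sections.

    sections maps a section heading to (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Site",
        "Guides on AI search visibility.",
        {"Guides": [("How AI engines cite sources",
                     "https://example.com/blog/ai-citations",
                     "Retrieval, reranking and citation overview")]},
    ))
```

Write the returned string to `/llms.txt` at the root of the domain.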

Frequently asked questions

Does ChatGPT use Google or Bing to search the web?
ChatGPT's browsing and ChatGPT Search use Bing as their primary index. That is why Bing SEO has gained importance in the AI era.

Does Perplexity have its own crawler?
Yes. PerplexityBot crawls the web for Perplexity's index in addition to using public search APIs. You can allow or disallow it in robots.txt.

What is llms.txt?
llms.txt is an emerging convention, similar in spirit to robots.txt and sitemap.xml, that gives AI agents a curated map of your most important pages in a model-friendly format.

Is Schema.org markup still worth adding?
Yes, perhaps more than before. Both Google's AI Overviews and Bing-backed answer engines use structured data to extract atomic facts (price, rating, author, publish date) reliably.
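
As an illustration of the structured data discussed above, the sketch below builds an Article JSON-LD snippet with Python's `json` module. The property names (`headline`, `author`, `datePublished`, `dateModified`) are standard Schema.org vocabulary; the helper name `article_jsonld` and the dates are placeholders, and a real page would fill them from its own metadata.

```python
import json


def article_jsonld(headline: str, author_name: str,
                   date_published: str, date_modified: str) -> str:
    """Return a <script> tag holding Article JSON-LD for one page."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Organization", "name": author_name},
        "datePublished": date_published,  # ISO 8601 date
        "dateModified": date_modified,    # keep this honest and in sync with the page
    }
    # Escape "</" so a value can never close the script tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(article_jsonld(
        "How ChatGPT, Perplexity and Google AI Overviews choose which sources to cite",
        "LumenEntity Research",
        "2026-01-15",  # placeholder date
        "2026-01-15",  # placeholder date
    ))
```

The output goes in the page's `<head>`; validate it with a structured-data testing tool before shipping.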