University of Toronto's AI Search Paper, Broken Down

TL;DR

A 2025 University of Toronto preprint by Nick Koudas and colleagues found that ChatGPT, Perplexity, and Gemini overwhelmingly cite earned media (third-party sources) over brand-owned content, in contrast to Google's more balanced source mix (Koudas et al., 2025).
The finding matches what we see in client data. The brands earning AI citations are the brands earning third-party coverage and active reviews, not just publishing more on-domain content.
The paper proposes a four-part GEO agenda. Three parts are solid. One part, machine-scannable content engineering, is supporting hygiene at best, not the headline answer.
The implication for founder-operator brands is clear. Earned media velocity is the lever, not more on-site pages.
This is a preprint, not yet peer-reviewed. Use the findings as directional evidence that aligns with observed practice, not as proof of settled science.

What the paper actually claims

Nick Koudas, a Professor of Computer Science at the University of Toronto, and his colleagues published a preprint on arXiv in September 2025 titled "Generative Engine Optimization: How to Dominate AI Search" (Koudas et al., 2025). The paper runs controlled experiments across multiple verticals, languages, and query paraphrases, comparing how ChatGPT, Perplexity, and Gemini source their answers against how Google sources its results.

The headline finding is the part that matters for any team building AI visibility. The AI engines show a systematic and overwhelming bias toward earned media, meaning third-party authoritative sources, when generating their answers. Google's source mix is more balanced across brand-owned content, social, and earned coverage. The AI engines lean far harder on the earned side.

The paper also documents that the AI engines differ from each other on domain diversity, source freshness, cross-language stability, and how sensitive their answers are to small changes in query phrasing. From those findings, the authors propose a four-part GEO agenda: machine-scannable content, earned media dominance, engine-specific strategies, and overcoming big-brand bias for smaller players.

Fig 01 · The paper's four-part GEO agenda

01 · Earned media dominance (HEADLINE)
Third-party coverage, listicles, podcasts, Reddit threads. The AI engines pull from these more than from brand-owned content.

02 · Overcoming big-brand bias (CRITICAL)
AI engines lean toward known brands by default. Niche players have to earn the citation through coverage and category authority.

03 · Engine-specific strategy (USEFUL)
ChatGPT, Perplexity, and Gemini source differently. The same brand can be cited heavily in one and missed in another. Optimize per engine.

04 · Machine-scannable content engineering (SUPPORTING)
FAQ structure, claim-evidence patterns, entity definitions. Useful production discipline. Not the lever that earns the citation.

Why the earned-media finding lines up with what we see

The paper's headline finding is the one that matters operationally, and it confirms a pattern we have watched play out across our client book for the last year. The brands that earn AI citations are the brands earning third-party coverage and active reviews. Not the brands publishing more pages on their own site.

Liam Lytton, founder of The 66th, has been saying this on client calls for months: "The brands we see getting a real surge of ChatGPT referrals do two things differently. They publish useful content across the full funnel, not just bottom-of-funnel money pages, and they actively ask for reviews on Google Business Profile, Trustpilot, and whatever category platforms are relevant to them. Those two behaviors correlate with citation velocity more than any single technical lever."

The "useful content across the full funnel" part still matters, because it is the substrate that earned coverage attaches to. A third-party blog cannot cite your category positioning if no positioning exists on your site. A podcast cannot reference your guide if no guide exists. The owned content makes the earned coverage possible. But the earned coverage is the layer the AI engines actually pull from.

Hedra, an AI technology company we work with, earned 108 new AI citations across the engines we monitor last quarter. The pattern in their citation set matches the paper: heavy on third-party developer coverage, comparison roundups, Reddit threads, and product reviews. Their own product pages contribute, but they are not the dominant source.

AetherHaus, a wellness brand in Vancouver, hit position one on Google for "cold plunge Vancouver" in three months and started seeing AI citations on local prompts within weeks. The citations leaned on Reddit threads and local listicles, not on AetherHaus's own service page.

Where the paper's framing needs a small adjustment

The fourth part of the paper's GEO agenda is machine-scannable content engineering: FAQ structure, claim-evidence patterns, and entity definitions front-loaded into the first paragraph. The paper presents it alongside earned media and engine-specific strategy as a peer lever.

It is not a peer lever. It is supporting hygiene. Google's own May 2026 guidance on AI search said this directly: site owners should not rewrite content for AI extraction or chunk pages for retrievers (Google Search Central, 2026). The same scannable patterns that help AI models lift quotes also help human readers, which is why we use them. We treat them as production discipline, not as an unlock.

The risk of leading with content engineering is that teams spend three months adding FAQ schema and reformatting headings instead of building the earned media velocity that actually moves citations. The paper is right that scannable content helps. It is also right, at least implicitly, that the earned media lever is bigger. Lead with the bigger lever.

Our position is that GEO is bundled with SEO rather than a separate discipline. The fundamentals that earn Google rankings are the fundamentals that earn AI citations. We covered the long version of that thesis in our piece on what generative engine optimization actually is.

What this means if you are building AI visibility right now

If you are a founder-operator brand without massive existing authority, the paper's "overcoming big-brand bias" point is the one to take seriously. AI engines lean toward known brands by default. Your lever is earned media velocity, not more on-site content.

That means deliberate digital PR, podcast appearances, third-party listicle placements, Reddit conversations where your category is being discussed, and a steady review acquisition motion across the platforms your buyers actually consult. Each of those creates a third-party source the AI engines can pull from when they answer questions about your category. Without them, your own content has to do all the work, and the data suggests it cannot.

The order matters. Earned media velocity does not replace strong content. It compounds on top of it. The brands that win the citation layer have both: enough useful content for third parties to reference, and enough active review and PR work that the references actually exist.

For the tactical setup on tracking which AI engines are citing you, our piece on brand citations in ChatGPT walks through how to read each citation as a diagnostic. For the broader playbook, the three layer GEO framework shows where earned media sits in the stack.

Frequently Asked Questions

Is the University of Toronto paper peer-reviewed?

Not yet. It is a preprint posted to arXiv in September 2025. The findings line up with what practitioners have been observing, but the work has not yet gone through formal peer review. Cite it as directional evidence, not as settled science.

Does the paper say schema and FAQ structure do not matter?

No. The paper includes machine-scannable content as one of four parts of its proposed GEO agenda. We agree it matters, but we frame it as supporting production hygiene rather than the load-bearing lever. The bigger lever, per the same paper, is earned media.

What counts as earned media for AI citation purposes?

Any third-party source that names your brand without being controlled by you. Industry listicles, comparison roundups, podcast appearances, Reddit threads, news coverage, review platforms, and independent guides all qualify. The common thread is that the source is editorially independent of your site.
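If you are auditing a list of URLs an AI engine cited, the definition above can be turned into a rough first-pass filter. The sketch below is ours, not the paper's method. It assumes you maintain a list of domains you control (the `OWNED_DOMAINS` value is a placeholder) and that every URL includes its scheme. It cannot tell that a brand-run profile on a third-party platform is not editorially independent, so treat the output as a starting point for manual review.

```python
from urllib.parse import urlparse

# Hypothetical placeholder: the domains your brand controls.
OWNED_DOMAINS = {"example-brand.com"}


def source_type(url: str, owned_domains: set[str] = OWNED_DOMAINS) -> str:
    """Label a cited URL "owned" if its host is a brand-controlled domain
    (or a subdomain of one), otherwise "earned"."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for domain in owned_domains:
        if host == domain or host.endswith("." + domain):
            return "owned"
    return "earned"


def earned_share(urls: list[str], owned_domains: set[str] = OWNED_DOMAINS) -> float:
    """Fraction of cited URLs that are earned. Returns 0.0 for an empty list."""
    if not urls:
        return 0.0
    earned = sum(1 for u in urls if source_type(u, owned_domains) == "earned")
    return earned / len(urls)
```

Running `earned_share` over the citations collected for one prompt gives a quick read on how much of the answer rests on third-party sources versus your own pages.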

How long does earned media take to show up in AI citations?

It depends on the source and the engine. New high-authority coverage can start appearing in AI answers within 4 to 8 weeks. Lower-authority coverage takes longer or may never reach the citation pool. Pair earned media outreach with measurement on a fixed prompt set so you can see what actually moves.
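One way to run that measurement is to snapshot the domains cited for each prompt before an outreach push and again some weeks later, then compare. The sketch below assumes you have already collected those snapshots yourself. The prompts, the data shape, and the function name are illustrative, and collecting the answers from each engine is out of scope here.

```python
# Hypothetical fixed prompt set. Keep it unchanged between snapshots.
FIXED_PROMPTS = ["best cold plunge in vancouver", "cold plunge vs sauna"]


def citation_changes(
    prompts: list[str],
    before: dict[str, set[str]],
    after: dict[str, set[str]],
) -> dict[str, dict[str, list[str]]]:
    """For each prompt, list domains gained and lost between two snapshots.

    `before` and `after` map a prompt to the set of domains cited for it.
    A prompt missing from a snapshot is treated as having no citations.
    """
    changes = {}
    for prompt in prompts:
        old = before.get(prompt, set())
        new = after.get(prompt, set())
        changes[prompt] = {
            "gained": sorted(new - old),
            "lost": sorted(old - new),
        }
    return changes
```

Because the prompt set is fixed, a domain showing up under "gained" reflects a change in what the engine cites for that prompt, not a change in what you asked.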

Does the paper distinguish between AI engines?

Yes. The authors document meaningful differences across ChatGPT, Perplexity, and Gemini on source diversity, freshness, and phrasing sensitivity. A brand can be heavily cited in one engine and underweighted in another. Tracking per-engine share of voice is the right level of measurement.
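As a rough illustration of per-engine tracking, the sketch below computes share of voice as the fraction of answers from each engine that mention your brand. That definition is our working assumption, not one taken from the paper, and the `Answer` record and its field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    """One engine's answer to one prompt from the fixed prompt set."""
    engine: str
    prompt: str
    brand_mentioned: bool


def share_of_voice(answers: list[Answer]) -> dict[str, float]:
    """Per engine: answers mentioning the brand divided by total answers."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for answer in answers:
        totals[answer.engine] += 1
        if answer.brand_mentioned:
            hits[answer.engine] += 1
    return {engine: hits[engine] / totals[engine] for engine in totals}
```

Reporting one number per engine, rather than a single blended figure, keeps a strong showing in one engine from hiding a gap in another.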

Key Takeaways

A 2025 University of Toronto preprint found AI search engines lean overwhelmingly on earned media compared to Google. The finding is directional evidence that confirms what we already see in client work.
The headline lever for AI citation velocity is earned media plus active reviews, not more on-site content or more FAQ schema.
The paper's "machine-scannable content" recommendation is real but secondary. Treat it as production discipline, not as the unlock.
For founder-operator brands, the work is digital PR, podcast appearances, listicle placements, and review velocity. Owned content makes the references possible. The references do the citation work.
This is a preprint. Use the findings as supporting evidence, not as proof of settled science.