GPT Report

"Indexed" might not be the right term for the age of LLMs, but I think it will stick around. As ChatGPT grows as a source of both referral and crawler traffic, it's becoming a critical channel to optimize for.

First, let's define what "indexed" means for ChatGPT specifically. ChatGPT draws on your content in two ways: at training time, and through web search at inference time.

These sound complicated, but they're mostly AI industry jargon. Let's dive into each.

Training indexed

Each time OpenAI trains a new model, it pulls content from as many data sources as possible and compresses it into a model with billions of parameters. This includes public data like websites and private data (publishers, UGC platforms like Reddit and Stack Overflow, Shutterstock).

If you've got a website with substantial traffic (think: enough to show up in Google's CrUX report), then you're probably in the training set.

You can test this: ask ChatGPT about your site with web search disabled. If it answers non-obvious questions well, congrats! You're training indexed.

// nike-evals.mjs
// Minimal example using the Vercel AI SDK + OpenAI to run four
// Nike-related queries and auto-grade the answers.
//
// ➊ Install once:   npm i ai @ai-sdk/openai dotenv
// ➋ Put OPENAI_API_KEY=sk-… in a .env file
// ➌ Run:            node nike-evals.mjs

import 'dotenv/config';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = openai('gpt-4o');       // OpenAI, no web search
const prompts = [
  'best running shoes',
  'good shoes for a marathon',
  'which brands have good baseball cleats',
  'are nike shoes good for running'
];

for (const q of prompts) {
  // 1️⃣ get the model’s answer
  const answer = (await generateText({
    model,
    prompt: q
  })).text;

  // 2️⃣ ask the model to grade its own answer
  const grade = (await generateText({
    model,
    prompt: `Does the answer mention at least one specific Nike product and stay on-topic? \
Reply with **PASS** or **FAIL** only.\nQuestion: ${q}\nAnswer: ${answer}\nResult:`
  })).text;

  console.log(`\nQ: ${q}\nA: ${answer}\nEval: ${grade}`);
}

The AI developer industry calls these tests "evaluations" or "evals" for short. I highly recommend writing evals for terms your brand cares about.
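Asking a model to grade its own answer can be noisy, so it may help to pair it with a deterministic check. Here is a minimal sketch, assuming you have already collected the answers as plain strings. The helper names `mentionsBrand` and `passRate` are hypothetical and not part of any SDK, and the sample answers are made up.

```javascript
// Hypothetical helpers (not part of the AI SDK): a deterministic grader.

// True if the answer mentions any of the given brand terms (case-insensitive).
function mentionsBrand(answer, terms) {
  const text = answer.toLowerCase();
  return terms.some((term) => text.includes(term.toLowerCase()));
}

// Fraction of answers that pass, between 0 and 1.
function passRate(answers, terms) {
  if (answers.length === 0) return 0;
  const passed = answers.filter((a) => mentionsBrand(a, terms)).length;
  return passed / answers.length;
}

// Made-up sample answers standing in for real model output.
const sampleAnswers = [
  'The Nike Pegasus is a popular daily trainer.',
  'Many runners like the Brooks Ghost.',
];

console.log(passRate(sampleAnswers, ['nike'])); // 0.5
```

Substring matching is crude (a term can match inside a longer word), but it gives you a stable number to track between model releases.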

Web search indexed

OpenAI is working hard to take search volume from Google. ChatGPT's web search started as a partnership with Bing, but it's become clear that OpenAI is now building its own proprietary search tooling.

How does web search work? When you type a question into ChatGPT, it looks up related web pages. There are several ways OpenAI could have implemented this. Keyword search is the simplest, but I think it's pretty likely they use their own vector embedding models. Embeddings try to capture the meaning of a word or phrase rather than matching characters in a string.
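To see why plain keyword matching falls short, here is a minimal sketch of a naive keyword search. It is only an illustration, not how OpenAI implements anything; the function name `keywordSearch` and the sample documents are made up.

```javascript
// Naive keyword search: a document matches only if it contains every query word.
function keywordSearch(query, docs) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  return docs.filter((doc) => {
    const text = doc.toLowerCase();
    return words.every((word) => text.includes(word));
  });
}

// Made-up documents for illustration.
const docs = [
  'Our guide to the best running shoes',
  'How to pick sneakers for jogging',
];

// Matches the first document, which contains both words.
console.log(keywordSearch('running shoes', docs));

// Matches nothing: no document contains the literal word "footwear",
// even though the second one is clearly about the same topic.
console.log(keywordSearch('jogging footwear', docs));
```

The second query is the kind of miss that meaning-based search is meant to avoid.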

I asked o3 to help me explain vector embeddings/search and I like what it came up with:

Think of Embeddings Like Making a “Meaning Map”

1. Turn every sentence into coordinates.
   - Imagine you have a special GPS that doesn’t track where things are on Earth, but what they mean.
   - You feed a sentence like “Cats chase mice” into a neural-network “GPS.” Out comes a long list of numbers, maybe [0.23, -1.7, …].
   - That list is a point on an invisible 3-D-like map called vector space. (It’s actually thousands of dimensions, but picture 3-D for now.)
2. Nearby points = similar meaning.
   - The GPS was trained so that sentences about the same topic land close together.
   - “I love pizza” sits near “Pizza tastes great.”
   - Far away you’ll find “How do rockets work?”
3. Build a giant map for your library.
   - For every paragraph, tweet, or web page in your collection, pre-compute its coordinates and pin it on the map.
   - Store those pins in a super-fast “address book” (an index) that can look up neighbors in milliseconds.
4. Searching is just “who’s closest?”
   - A user asks: “Good study tips for exams?”
   - Convert that question into coordinates.
   - Ask the index: “Show me the 20 pins closest to this point.”
   - The texts behind those pins are your search results, because closeness on the map equals similar meaning.

Pretty good, right?

This process runs each time a user submits a question to ChatGPT with search enabled.
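The "who's closest?" step can be sketched in a few lines. This is a toy version under loud assumptions: the 3-D vectors are invented by hand (real embeddings come from a model and have far more dimensions), the helper names are hypothetical, and the brute-force scan stands in for the fast index a production system would use.

```javascript
// Cosine similarity: close to 1 means the vectors point the same way,
// 0 means they are unrelated.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k entries whose vectors are most similar to the query vector.
function nearest(queryVector, index, k) {
  return index
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(queryVector, entry.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hand-made 3-D "embeddings" for illustration only.
const index = [
  { text: 'I love pizza', vector: [0.9, 0.1, 0.0] },
  { text: 'Pizza tastes great', vector: [0.8, 0.2, 0.1] },
  { text: 'How do rockets work?', vector: [0.0, 0.1, 0.9] },
];

// Pretend this is the embedding of a pizza-related question.
const queryVector = [0.85, 0.15, 0.05];

// Prints the two pizza sentences; the rocket sentence is too far away.
console.log(nearest(queryVector, index, 2).map((result) => result.text));
```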

We can look at the OpenAI Web Search docs to see more about how it works. The inputs are search context size and location. The results include an array of citations: the text snippets used in the answer, along with the title and URL of each page referenced.

import 'dotenv/config';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const result = await generateText({
  model: openai.responses('gpt-4o-mini'),
  prompt: 'What happened in San Francisco last week?',
  tools: {
    web_search_preview: openai.tools.webSearchPreview({
      // optional configuration:
      searchContextSize: 'high',
      userLocation: {
        type: 'approximate',
        city: 'San Francisco',
        region: 'California',
      },
    }),
  },
  // Force web search tool:
  toolChoice: { type: 'tool', toolName: 'web_search_preview' },
});

// URL sources
const sources = result.sources;

To test if you're web search indexed, do the same as above and write a quick set of evals that are related to your brand or category and see if your site shows up.
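One way to score such an eval is to check whether any cited source belongs to your domain. Below is a minimal sketch, assuming each source is an object with a `url` field like the entries in the `sources` array from the example above. The helper name `citesDomain` is hypothetical and the sample sources are made up.

```javascript
// Hypothetical helper: does any cited source URL belong to the given domain?
function citesDomain(sources, domain) {
  const target = domain.toLowerCase();
  return sources.some((source) => {
    try {
      const host = new URL(source.url).hostname.toLowerCase();
      // Match the domain itself or any subdomain (www.nike.com matches nike.com).
      return host === target || host.endsWith(`.${target}`);
    } catch {
      return false; // skip entries without a parseable URL
    }
  });
}

// Made-up sources, assumed to be shaped like the `sources` array above.
const exampleSources = [
  { url: 'https://www.nike.com/running', title: 'Nike Running' },
  { url: 'https://example.com/reviews', title: 'Shoe reviews' },
];

console.log(citesDomain(exampleSources, 'nike.com'));   // true
console.log(citesDomain(exampleSources, 'adidas.com')); // false
```

Run it over the sources returned for each prompt in your eval set, and the share of prompts where it returns true gives you a rough visibility score for your site.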

Check if you're indexed (no code needed)

We built a tool to make checking this easy:

We're building GPT Report to help you optimize your site and increase your odds of showing up in responses from ChatGPT and other LLMs. Join the waitlist.