How We Track ChatGPT Visibility: Real Queries, Real Tests, Real Citations
We test your brand against the actual queries your audience types into ChatGPT - pulled from Google Search Console plus your site's niche vocabulary. Each query gets executed against live ChatGPT, and we record exactly when your brand is cited and when it is not.
One of 48 criteria in AEO Rank, the citation-readiness score we run against every site we audit.
By Alex Shortov
Quick Answer
Visibility tracking runs against real ChatGPT API queries (or the ChatGPT Pro subscription via Codex CLI to control cost), seeded from a Google Search Console snapshot of typically 1,000+ queries that gets filtered down to the top ~100 commercially relevant ones. Each tracked query is executed and we capture three signals per result: was the brand mentioned (HIT or MISS), which competitors were mentioned, and which URLs ChatGPT cited as sources. The data persists across runs so we can show citation trends week over week, not just a snapshot.
Audit Note
In our audits, we've measured ChatGPT visibility tracking on live sites, compared implementations, and audited the gaps that keep scores low.
How do you know which queries to test in ChatGPT?
We pull up to 1,000 Google Search Console queries, run a keyword classifier to filter to the top 100 commercially relevant ones, then test those against live ChatGPT.
How does the system tell a brand mention from a competitor mention?
The brand-variation matcher recognizes prose mentions and non-apex hosts, so "HelpSquad's chat widget" and "app.helpsquad.com" both count as brand mentions, not noise.
Why test against real ChatGPT instead of guessing what it might say?
Real ChatGPT testing measures actual citation behavior on your customers' queries, while invented "best X 2026" templates produce numbers that mean nothing in practice.
How often does ChatGPT visibility get refreshed?
ChatGPT visibility refreshes on a recurring weekly cadence per domain so the report tracks citation trend over time, not a single moment-in-time snapshot.
What do I do with a list of MISS queries?
Feed MISS queries straight into Studio's /design-cluster as topic candidates and into the visibility report's gap section so operators know what to fix next.
What this article answers
- How do you know which queries to test in ChatGPT?
- How does the system tell a brand mention from a competitor mention?
- Why test against real ChatGPT instead of guessing what it might say?
- How often does ChatGPT visibility get refreshed?
- What do I do with a list of MISS queries?
Key takeaways
- Queries come from Google Search Console snapshots merged with niche enrichment - real questions your audience actually asks, not invented “best X 2026” templates.
- A keyword classifier filters 1,000+ raw queries down to the top ~100 commercially relevant ones - separates brand, noise, and target queries so we test only what matters.
- Every query is executed against live ChatGPT and the response is parsed for brand mentions, competitor mentions, and cited URLs - three independent signals per query.
- The brand-variation matcher recognizes prose mentions (not just exact-string matches) plus non-apex hosts (so app.helpsquad.com counts as a HelpSquad mention).
- Recurring weekly runs mean we track citation trend, not just one snapshot - sees whether published content actually moved the needle on the same query.
- MISS queries directly feed two downstream actions: the topic backlog for Studio’s /design-cluster (each MISS becomes a candidate article), and the visibility report’s gap section that shows operators what to fix next.
Why Real-Query Testing Beats Made-Up Templates
Real customers do not type “best X 2026” templates, so trackers built on those queries report meaningless wins while real visibility lives in awkward, specific, business-context-loaded questions.
The lazy way to build an “AI visibility tracker” is to make up template queries: “best X 2026,” “top 10 X tools,” “X for small business.” Run those through ChatGPT, count mentions, ship a dashboard. That number means nothing.
The reason it means nothing: your real customers do not type “best X 2026.” They type the actual questions sitting in their head when they reach for ChatGPT. Those queries are specific, awkwardly phrased, business-context-loaded - “what’s the cheapest way to do live chat for a Shopify store with under 1,000 visits a day,” not “best live chat 2026.” A visibility score built on template queries tells you whether your brand wins generic comparisons, which has roughly zero correlation with whether real people see you in their real conversations with AI.
Our tracker pulls real Google Search Console queries as the seed. These are queries that actually drove traffic to your site - meaning real people asked them, and Google had enough signal to point them at you. If a query already drove traffic from Google, it is exactly the kind of query the same person might next type into ChatGPT instead. That seed is what makes the visibility data actionable.
How the Query Seed Gets Built
For GSC-connected domains we pull 1,000-plus recent queries, then run a five-pass classifier that sets aside brand queries, drops noise, scores relevance, merges near-duplicates, and ranks what remains into the commercially relevant target list.
For domains with GSC connected, we pull the most recent snapshot - typically 1,000+ queries with impressions, clicks, click-through-rate, and average position metadata. For domains without GSC connected (new sites or sites we haven’t been granted access to), we substitute a fallback seed built from the brand profile + niche enrichment via Codex CLI - up to ~100 phrases generated from a site crawl.
Either way the next step is the keyword classifier, which runs a 5-pass filter to convert raw queries into a clean target list:
- Brand detection (rules) - queries that include the customer’s brand name or any registered variation get tagged as brand-queries. These are excluded from the visibility target list because the customer obviously wins their own brand searches.
- Noise detection (rules) - queries with adult content, foreign-language fragments, internal-search artifacts, error-page indicators, or platform-search noise (demandware, wp-admin, etc.) get filtered out.
- Phrase + token relevance - remaining queries are scored against the customer’s niche vocabulary. Queries whose tokens overlap with the customer’s services, topic phrases, and audience descriptors score high; queries whose tokens are off-topic score low.
- Near-duplicate dedup - queries that are minor variants (“live chat software” vs “live chat softwear” vs “live chat softwares”) get merged so we don’t waste API calls testing the same intent 12 ways.
- Rank by relevance × demand - the surviving queries get sorted by relevance score × GSC impression count, and the top ~100 become the target set.
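The five passes above can be sketched as a single loop plus a sort. This is a minimal illustration, not the production classifier: the names GscQuery, BRAND_TERMS, NOISE_MARKERS, NICHE_VOCAB, and classify are hypothetical, the relevance score is a simple token-overlap ratio, and the near-duplicate key only strips plural endings (a real dedup pass would also need to handle misspellings such as “softwear”).

```python
import re
from dataclasses import dataclass


@dataclass
class GscQuery:
    text: str
    impressions: int


# Hypothetical inputs; a real brand profile would supply these sets.
BRAND_TERMS = {"helpsquad"}
NOISE_MARKERS = {"demandware", "wp-admin"}
NICHE_VOCAB = {"live", "chat", "shopify", "support", "widget"}


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; hyphenated terms stay in one piece."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def relevance(text: str) -> float:
    """Share of query tokens that appear in the niche vocabulary."""
    toks = tokens(text)
    return sum(t in NICHE_VOCAB for t in toks) / len(toks) if toks else 0.0


def dedup_key(text: str) -> str:
    """Crude near-duplicate key: sorted tokens with trailing 's' removed."""
    return " ".join(sorted(t.rstrip("s") for t in tokens(text)))


def classify(queries: list[GscQuery], top_n: int = 100) -> list[GscQuery]:
    merged: dict[str, GscQuery] = {}
    for q in queries:
        toks = set(tokens(q.text))
        if toks & BRAND_TERMS:        # pass 1: brand queries are set aside
            continue
        if toks & NOISE_MARKERS:      # pass 2: platform and internal-search noise
            continue
        if relevance(q.text) == 0:    # pass 3: off-topic queries score zero
            continue
        key = dedup_key(q.text)       # pass 4: merge minor variants, sum demand
        if key in merged:
            merged[key].impressions += q.impressions
        else:
            merged[key] = GscQuery(q.text, q.impressions)
    # pass 5: rank by relevance x demand and keep the top N
    ranked = sorted(
        merged.values(),
        key=lambda q: relevance(q.text) * q.impressions,
        reverse=True,
    )
    return ranked[:top_n]
```

Summing impressions across merged variants is one possible choice; it keeps the demand signal of the whole intent rather than of a single spelling.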
Operators see the classified output in the Visibility Wizard before any test runs - brand queries listed separately, target queries highlighted, noise hidden by default. The operator can promote, demote, or add custom queries before the test fires.
How Each Query Actually Gets Tested
Once the target set is finalized, the tracker executes each query against ChatGPT. We support two execution paths:
Path 1 - OpenAI Direct API (default for new domains since 2026-05-19). Each query is sent to the ChatGPT API. We capture the raw response text + any cited URLs + the model version. This path has predictable per-query cost and is appropriate when API spend is acceptable.
Path 2 - ChatGPT Pro via Codex CLI (subscription-based, no per-token cost). Runs from a managed Mac Mini against the ChatGPT Pro subscription using the Codex CLI. Same query, same prompt shape, no per-call charge. This path is preferred when running large query sets at high frequency.
Both paths return the same structure - response text, cited URLs, response metadata - so downstream parsing is identical. The choice is operational, not architectural.
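One way to picture “operational, not architectural” is a shared result type behind a small interface. The sketch below is an assumption about shape only: QueryResult, Executor, CannedExecutor, and run_all are hypothetical names, and the executor returns a canned response instead of calling the OpenAI API or the Codex CLI.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class QueryResult:
    query: str
    response_text: str
    cited_urls: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)  # e.g. path, model version


class Executor(Protocol):
    """Anything that can turn a query into a QueryResult."""

    def run(self, query: str) -> QueryResult: ...


class CannedExecutor:
    """Stand-in for either path; returns a fixed response, makes no network call."""

    def __init__(self, path_name: str, text: str, urls: list[str]):
        self.path_name = path_name
        self.text = text
        self.urls = urls

    def run(self, query: str) -> QueryResult:
        return QueryResult(query, self.text, list(self.urls), {"path": self.path_name})


def run_all(executor: Executor, queries: list[str]) -> list[QueryResult]:
    """Downstream code depends only on QueryResult, never on the path."""
    return [executor.run(q) for q in queries]
```

Because the parser only ever sees QueryResult, switching a domain from one path to the other would not require touching the matching or reporting code.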
How the Tracker Decides Whether You Were Cited
The brand-variation matcher handles apex domain matches, subdomain hosts, and prose mentions in three rules so subdomain docs and conversational references both count as citations.
This is where naive trackers break. A query like “what’s the best live chat for Shopify” returns a ChatGPT response with 3-5 product mentions. A simple string-match against “HelpSquad” misses mentions like “HelpSquad’s chat widget” or “the team at HelpSquad” - the brand was clearly cited, but no exact match.
Our brand-variation matcher handles three cases:
- Apex domain match - cited URL contains helpsquad.com → HIT.
- Non-apex host match - cited URL contains app.helpsquad.com, support.helpsquad.com, or any subdomain → HIT, since it’s clearly the same brand. (Without this rule, a customer running their docs on a subdomain looks invisible.)
- Prose mention match - response text contains the brand name with normalization for punctuation, possessives (“HelpSquad’s”), and known abbreviations registered in the brand profile.
We also capture competitor mentions the same way for any competitor entity registered in the brand profile. The output for a single query looks like:
Query: "best live chat for Shopify"
Brand cited (HelpSquad): true
Competitors cited: ["Tidio", "LiveChat", "Crisp"]
Cited URLs: ["https://www.helpsquad.com/shopify", "https://www.tidio.com/integrations/shopify"]
That structured output is what populates the visibility report.
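The three matching rules and the structured output above could be sketched roughly as follows. Entity, host_matches, prose_mentions, is_cited, and parse_response are hypothetical names, not the tracker's real API. One deliberate difference from a plain “URL contains” check: the sketch compares the parsed hostname, so a look-alike host such as nothelpsquad.com does not count as a match.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Entity:
    name: str
    apex: str
    aliases: list[str] = field(default_factory=list)  # registered abbreviations


def host_matches(url: str, apex: str) -> bool:
    """True for the apex domain itself or any subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return host == apex or host.endswith("." + apex)


def prose_mentions(text: str, names: list[str]) -> bool:
    """Case-insensitive whole-word match; possessives like "HelpSquad's" still hit."""
    normalized = text.replace("\u2019", "'")  # curly apostrophe to straight
    return any(
        re.search(rf"(?<!\w){re.escape(name)}(?!\w)", normalized, re.IGNORECASE)
        for name in names
    )


def is_cited(entity: Entity, text: str, cited_urls: list[str]) -> bool:
    if any(host_matches(u, entity.apex) for u in cited_urls):
        return True  # apex or non-apex host match
    return prose_mentions(text, [entity.name, *entity.aliases])


def parse_response(query, text, cited_urls, brand, competitors) -> dict:
    """Build the per-query record: brand HIT/MISS, competitors, cited URLs."""
    return {
        "query": query,
        "brand_cited": is_cited(brand, text, cited_urls),
        "competitors_cited": [
            c.name for c in competitors if is_cited(c, text, cited_urls)
        ],
        "cited_urls": cited_urls,
    }
```

Running competitors through the same is_cited function mirrors the statement that competitor mentions are captured the same way as brand mentions.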
The ChatGPT tracker checks each query response against four citation tests before scoring you as cited or missed.
| Citation Test | What It Looks For | Outcome |
|---|---|---|
| Direct domain link | Footnote points at your domain | Cited |
| Brand name mention | Your brand named in prose | Cited (soft) |
| Competitor only | Only competitors named | Missed |
| No relevant answer | Engine refuses or generic | Excluded |
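Read top to bottom, the table behaves like a precedence order. The sketch below encodes that reading; the order of the checks and the treatment of “no relevant answer” as the fall-through case are assumptions, and Outcome and score_response are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    CITED = "Cited"
    CITED_SOFT = "Cited (soft)"
    MISSED = "Missed"
    EXCLUDED = "Excluded"


def score_response(
    domain_linked: bool, brand_in_prose: bool, competitors_named: bool
) -> Outcome:
    """Apply the four citation tests in table order; the first match wins."""
    if domain_linked:        # a footnote points at the brand's domain
        return Outcome.CITED
    if brand_in_prose:       # brand named in the text, but not linked
        return Outcome.CITED_SOFT
    if competitors_named:    # only competitors appear in the answer
        return Outcome.MISSED
    return Outcome.EXCLUDED  # refusal or generic answer: nothing to score
```

Keeping Excluded separate from Missed means a refusal would not drag down the citation rate for a query the engine never really answered.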
How Often Does the Tracker Run?
Visibility runs recur weekly by default with the same target query list, so week-over-week citation rates surface in trend graphs that prove whether new articles moved the needle.
Visibility runs are recurring. New domains run once on setup, then schedule into a recurring cadence based on the customer’s plan tier (typically weekly). Each recurring run uses the same target query list as the previous run (re-classified periodically as new GSC data lands), so trends are comparable across weeks.
The output of every run goes into aeo_monitor_results keyed by (domain, query, engine, run_date). That is what powers the trend graphs in Studio - operators can see whether their citation rate on a specific query went up after publishing a related article, or stayed flat, or dropped because a competitor published faster.
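As an illustration of how rows keyed by (domain, query, engine, run_date) can turn into a trend line, the sketch below computes the citation rate per run and the change between consecutive runs. The boolean hit column and the function names are assumptions; the source only names the table and its key.

```python
from collections import defaultdict
from datetime import date

# Each row: (domain, query, engine, run_date, hit). The `hit` flag is assumed.
Row = tuple[str, str, str, date, bool]


def citation_rate_by_run(rows: list[Row], domain: str, engine: str) -> dict[date, float]:
    """Share of tracked queries that were HITs, per run date, in date order."""
    hits: dict[date, int] = defaultdict(int)
    totals: dict[date, int] = defaultdict(int)
    for row_domain, _query, row_engine, run_date, hit in rows:
        if row_domain == domain and row_engine == engine:
            totals[run_date] += 1
            hits[run_date] += int(hit)
    return {run: hits[run] / totals[run] for run in sorted(totals)}


def week_over_week(rates: dict[date, float]) -> dict[date, float]:
    """Change in citation rate between each run and the one before it."""
    runs = list(rates)
    return {cur: rates[cur] - rates[prev] for prev, cur in zip(runs, runs[1:])}
```

The comparison is only meaningful because each run reuses the same target query list; if the list changed between runs, the denominators would no longer line up.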
The weekly visibility digest (see weekly-visibility-digest) summarizes the deltas in an email. Operators get the new HIT count, the new MISS list, and the gap between the customer and the leading competitor on each query.
What Should You Do With MISS Queries?
MISS queries become topic backlog candidates, weighted cluster-design inputs, and FAQ section seeds so every measured visibility gap automatically becomes an actionable AEO project.
MISS queries are the most valuable output of the visibility tracker. Every MISS is a measured opportunity - a real query where a real person asked ChatGPT and your brand was not mentioned, but the question is relevant enough to your business to be worth winning.
Three actions Studio takes with MISS queries automatically:
1. Topic backlog seeding. Each MISS query becomes a candidate topic in aeo_topic_ideas. Operators see them surfaced when they run /topic-ideas next, ranked by impact = visibility gap × commercial relevance × competitive softness.
2. Cluster design input. /design-cluster reads the customer’s MISS queries and identifies topical groupings - 7 MISS queries about “insurance billing” become a candidate cluster (pillar + 6 children) instead of 7 standalone articles.
3. Content pipeline injection. When an article is being written about a topic that has MISS queries attached, the writer’s evidence library includes those queries as explicit “questions to answer” in the FAQ section. The article literally addresses the question ChatGPT couldn’t answer for you.
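The impact formula from the first action is a plain product of three factors. A minimal sketch, assuming each factor is normalized to the 0-1 range (the source does not say how they are scaled); MissQuery and rank_by_impact are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MissQuery:
    text: str
    visibility_gap: float        # assumed scale: 0 (no gap) to 1 (fully missed)
    commercial_relevance: float  # assumed scale: 0 to 1
    competitive_softness: float  # assumed scale: 0 (crowded) to 1 (wide open)

    @property
    def impact(self) -> float:
        """impact = visibility gap x commercial relevance x competitive softness."""
        return (
            self.visibility_gap
            * self.commercial_relevance
            * self.competitive_softness
        )


def rank_by_impact(misses: list[MissQuery]) -> list[MissQuery]:
    """Highest-impact MISS queries first, as a topic backlog would list them."""
    return sorted(misses, key=lambda m: m.impact, reverse=True)
```

Because the factors multiply, a query that scores near zero on any one of them drops to the bottom of the backlog regardless of the other two.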
The compound effect: every MISS query becomes either a new article topic, a section in an existing article, or a FAQ entry. The data loop closes - measure visibility, identify gaps, write content to close gaps, re-measure visibility, see the gap close (or document why it didn’t).
What This Tracker Does Not Do
The tracker does not predict future citations, test every phrasing, distinguish training-data from web retrieval, or account for ChatGPT response personalization across different users.
Honesty matters here. The tracker is not magical and not all-knowing:
- It does not predict whether a future query will cite you. It measures current state on the queries you choose to track.
- It does not test every possible phrasing of every possible question - it tests the top ~100 most commercially relevant queries, because testing 10,000 would mostly waste compute on tail queries with no business value.
- It does not distinguish between ChatGPT’s training-data knowledge and its real-time web retrieval. A HIT could be either, and ChatGPT itself does not always reveal which.
- It does not account for personalization in ChatGPT responses. Different ChatGPT users get different responses based on context, history, and account-level signals. We test in a clean session so results are comparable across runs, but they may not match what an individual user with personalization sees.
What it does do is give you a deterministic, repeatable, week-over-week measurement of citation visibility on the queries that drive your business. That measurement is what lets you decide whether AEO work is paying off.
External Resources
- OpenAI ChatGPT API reference - https://platform.openai.com/docs/api-reference/chat
- Codex CLI - https://github.com/openai/codex
- Google Search Console - https://search.google.com/search-console/about
- AEO Visibility Report sample - https://www.aeocontent.ai/knowledge/aeo-score-methodology