What are "phantom pages" in a React Server Components site?
Phantom pages are URLs and content a crawler sees inconsistently or wrongly in a React Server Components (RSC) site — most often a Next.js App Router build. The word covers four distinct failures that share one root: the document a crawler fetches first is not always the page a person eventually sees.
The four symptoms are concrete. Rendered words arrive only after streaming or hydration, so they miss the cheap first-pass HTML a crawler reads. Client-side route transitions and prefetches mint ?_rsc= payload URLs that carry no new content but still get crawled. A not-found state returns HTTP 200 instead of 404 — a soft-404. And the resulting index fills with thin or duplicate entries while crawl budget burns on non-canonical parameter URLs. None of these are bugs in your code, exactly — they are the predictable shape of split rendering meeting a crawler that reads HTML cheaply and runs JavaScript expensively.
Why does the App Router create them — what does the rendering split actually do?
RSC splits rendering between server and client. In the App Router, Server Components render on the server to HTML for the initial document, while Client Components hydrate in the browser. The framework also produces a separate RSC payload in the Flight protocol, which became stable with React 19: a compact, streamable text description of the UI tree that is neither raw HTML nor standard JSON.
Be precise about the risk: App Router Server Components do server-render to HTML by default, so content is not categorically missing. The gap bites conditionally — when content sits behind a Suspense boundary that streams later, lives inside a client-only component that renders after hydration, or when a slow or failed stream leaves deferred chunks unread on the crawler's first pass. Separately, on Link prefetches and client transitions Next.js issues requests carrying a ?_rsc= query parameter and an RSC: 1 header; behind a misconfigured CDN, that payload can be cached and served in place of the HTML page, breaking it for visitors and crawlers alike.
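To make the CDN hazard concrete, here is a minimal Python sketch of how an edge layer might tell the two kinds of request apart. It relies only on the two signals named above (the ?_rsc= parameter and the RSC: 1 header). The function names and the key format are hypothetical, not part of Next.js or any CDN product.

```python
from urllib.parse import parse_qs, urlsplit


def is_rsc_request(url: str, headers: dict[str, str]) -> bool:
    """True when a request asks for the RSC payload rather than the HTML page."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    has_header = lowered.get("rsc") == "1"
    has_param = "_rsc" in parse_qs(urlsplit(url).query, keep_blank_values=True)
    return has_header or has_param


def cache_key(url: str, headers: dict[str, str]) -> str:
    """Hypothetical cache key that keeps payload and HTML variants apart.

    Simplification: only the path is keyed. A real key would also carry
    any other query parameters that change the response.
    """
    path = urlsplit(url).path or "/"
    variant = "rsc" if is_rsc_request(url, headers) else "html"
    return f"{variant}:{path}"
```

A cache that keys on the path alone would store both responses under one entry, which is the misconfiguration described above. Folding the variant into the key is one way to keep them separate.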
Why does a Next.js "not found" page often return HTTP 200 instead of 404?
Because streaming locks the status code. Once the response shell — the preamble React flushes first — is sent, the HTTP status and headers can no longer change. A Vercel maintainer put the constraint plainly: with streaming rendering you cannot modify the status or headers after preinitialization. So if notFound() (or redirect()) fires after JSX has begun rendering, which is common when a Suspense boundary sits anywhere in the path, the response stays HTTP 200 for a not-found state — a soft-404. Next.js injects Google-recommended fallback tags such as <meta name="robots" content="noindex">, but the status code is still 200.
The fix is ordering. Fetch and validate data first, then call notFound() before returning any JSX, so the 404 is set before headers flush. Alternatively, validate the route in middleware before streaming and rewrite to /404 with an explicit 404. Either approach sets the real status, and one of them is required once content streams.
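The soft-404 signature described above (status 200 plus an injected noindex robots tag) can be flagged from a fetched response. The sketch below is an audit helper, not part of Next.js. The classify function and its labels are hypothetical, and a page that is noindexed on purpose would also match, hence the "suspected" label.

```python
from html.parser import HTMLParser


class _RobotsMeta(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives.append((attr.get("content") or "").lower())


def classify(status: int, html: str) -> str:
    """Label a response as a real 404, a suspected soft-404, ok, or other."""
    parser = _RobotsMeta()
    parser.feed(html)
    noindex = any("noindex" in directive for directive in parser.directives)
    if status == 404:
        return "hard-404"
    if status == 200 and noindex:
        # 200 with noindex is the pattern left behind by a late notFound().
        return "suspected-soft-404"
    return "ok" if status == 200 else "other"
```

Run against a URL that should not exist, a "suspected-soft-404" result suggests notFound() fired after the shell was already flushed.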
How does a build-time-rendered estate avoid this entire class?
It removes the split that creates the problem. This estate is static and build-time-rendered: generate.py reads one canonical dataset, splices fragments into HTML templates at build time, and writes finished pages to disk. The page a crawler fetches, the page a person sees, and the page an AI retriever reads are byte-for-byte the same — the principle we set out in build-time rendering for SEO.
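The build step described above can be sketched in a few lines of Python. This is an illustration of the pattern only, not the contents of generate.py. The template, the record fields (slug, title, body), and both function names are assumptions made for the example.

```python
from html import escape
from pathlib import Path
from string import Template

# Hypothetical page template; a real build would load templates from disk.
PAGE = Template(
    "<!doctype html>\n<title>$title</title>\n<h1>$title</h1>\n<p>$body</p>\n"
)


def render_pages(records: list[dict[str, str]]) -> dict[str, str]:
    """Map each record to a relative output path and its finished HTML."""
    pages = {}
    for record in records:
        html = PAGE.substitute(
            title=escape(record["title"]),
            body=escape(record["body"]),
        )
        pages[f"{record['slug']}/index.html"] = html
    return pages


def write_pages(pages: dict[str, str], out_dir: Path) -> None:
    """Write the rendered pages to disk as plain static files."""
    for relative_path, html in pages.items():
        target = out_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html, encoding="utf-8")
```

Because every word is in the file before it is served, there is no later step that could change what a crawler reads.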
With no client framework delivering content, the phantom-page failures cannot arise here by construction. There is no later stream to defer the words, so nothing is missing from the first pass. There are no client transitions minting ?_rsc= URLs, so the index has nothing spurious to absorb — the discipline that keeps crawl budget on real pages. And because a static file is served with its own status, a missing page is a true 404, never a 200 locked open over a stream. This is a deliberate posture for content that must be found, and it is the spine of programmatic SEO without penalty.
If your team needs RSC, how do you keep it crawlable?
RSC is the right tool for genuinely interactive app surfaces — this is a scope argument, not a verdict against the framework. For content that must be crawled or cited, four mitigations point in the correct direction, though none is a guaranteed fix.
- Keep indexable content in the server HTML, out from behind a late Suspense stream or a client-only component, so it is present on the cheap first pass.
- Validate, then notFound() before returning JSX (or validate in middleware) so a missing page returns a real 404.
- Canonicalize and contain ?_rsc= URLs via a robots disallow like Disallow: /*?_rsc=*, a next.config redirect that strips the parameter, or disabling Link prefetch. These are partial measures, since a disallowed URL can still surface as "Indexed, though blocked by robots.txt".
- Verify directly: View Source for the headline text, reload with JavaScript off to approximate the crawler, run curl -I on a known-bad URL, and check Search Console for soft-404 and "Alternative page with proper canonical tag" states.
This is the practice we run under platform & web operations.
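Two of the checks above lend themselves to a small script: stripping the _rsc parameter to recover the canonical URL, and confirming that the headline appears in the raw HTML outside any script. The sketch below is a Python approximation of View Source with JavaScript off, not a model of any particular crawler. Both function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_rsc(url: str) -> str:
    """Return the URL with any _rsc query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "_rsc"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


class _TextOutsideScripts(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def headline_in_source(html: str, headline: str) -> bool:
    """True when the headline text is present in the HTML without running JS."""
    parser = _TextOutsideScripts()
    parser.feed(html)
    parser.close()
    # Collapse whitespace on both sides so line breaks do not cause misses.
    text = " ".join("".join(parser.chunks).split())
    return " ".join(headline.split()) in text
```

A headline that only appears as a string inside a script tag fails the check, which matches the case where content renders after hydration.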