Crawl Budget: Why Most of Your URLs Never Get Crawled

What is crawl budget?

Crawl budget is how much of your site a search engine will crawl in a given window. Two things set it: how fast the crawler can fetch without straining your server (the crawl rate, which Google's documentation calls the crawl capacity limit) and how much it currently wants to fetch (crawl demand, driven by your pages' importance and freshness). It is not a quota you apply for; it is an emergent ceiling.

For a small site it is a non-issue — everything gets crawled. For a large site with many thousands of URLs it is a real constraint: the crawler will not fetch everything on every visit, so which URLs it spends the budget on becomes the whole game.

What wastes crawl budget?

Every low-value fetch is a fetch your important page didn't get. The usual drains: thin or duplicate pages, near-infinite URL parameter spaces (filters, sort orders, session IDs), soft-404s that return 200, long redirect chains, and large bodies of stale, unlinked content.
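To make the parameter-space drain concrete, here is a minimal sketch that collapses URL variants onto one canonical form and counts how many raw URLs map to each. The parameter names in IGNORED_PARAMS are assumptions for illustration; which parameters are safe to ignore depends entirely on your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list: parameters assumed NOT to change the page's content.
# Replace with the parameters that are actually cosmetic on your site.
IGNORED_PARAMS = {"sort", "order", "sessionid", "sid",
                  "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url: str) -> str:
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


def count_variants(urls):
    """Map each canonical URL to the number of distinct raw URLs that collapse onto it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), set()).add(url)
    return {canon: len(raw) for canon, raw in groups.items()}


if __name__ == "__main__":
    sample = [
        "https://example.com/shoes?sort=price&color=red",
        "https://example.com/shoes?color=red&sessionid=abc",
        "https://example.com/shoes?color=red",
    ]
    for canon, n in count_variants(sample).items():
        print(n, canon)
```

Run over a crawl export or a log sample, a high variant count per canonical URL is a reasonable signal that parameters are multiplying the URL space.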

This is where crawl budget and penalty risk meet: thin programmatic pages are doubly costly — they risk a scaled-content demotion and they burn the budget that should have gone to your real pages.

What we see running a large estate

Operating a multi-subdomain estate makes the pattern concrete: the crawler concentrates on what is well-linked, genuinely valuable, and recently changed, and the long tail of low-value URLs is fetched rarely — many effectively not at all. Adding more thin URLs doesn't expand the attention; it dilutes it.

The practical lesson is uncomfortable but freeing: publishing a page is not the same as getting it crawled, and getting it crawled is a prerequisite for getting it indexed or cited. Treat crawl as the scarce resource it is, and only ship URLs you'd actually want fetched.

How do you see your own crawl budget?

Two sources show the reality. Google Search Console's Crawl Stats report (Settings → Crawl stats) shows total crawl requests over time and breaks them down by response code, file type, and crawl purpose, including the budget burned on redirects and errors. Your server access logs are the ground truth: filter for the crawler's user-agent and you can see exactly which URLs it fetches, how often, and which it never touches. User-agent strings can be spoofed, so confirm that the requests really come from the crawler before drawing conclusions.
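As a rough sketch of the log-side view, the snippet below counts crawler fetches per URL and per status code. It assumes the Apache/nginx "combined" log format and matches on a user-agent substring only; both are assumptions to adjust for your own setup.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; other formats need a different pattern.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines, agent_token="Googlebot"):
    """Return (fetches per path, fetches per status) for lines whose user-agent
    contains agent_token. Lines that do not parse are skipped."""
    paths = Counter()
    statuses = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or agent_token not in match.group("agent"):
            continue
        paths[match.group("path")] += 1
        statuses[match.group("status")] += 1
    return paths, statuses


if __name__ == "__main__":
    sample = [
        '66.249.66.1 - - [10/Jan/2025:06:25:13 +0000] "GET /products/widget HTTP/1.1" '
        '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
        '203.0.113.9 - - [10/Jan/2025:06:25:14 +0000] "GET /products/widget HTTP/1.1" '
        '200 5123 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    ]
    paths, statuses = crawler_hits(sample)
    print(paths.most_common(10))
    print(statuses)
```

Because the match is on the user-agent string alone, treat the counts as an upper bound until the client IPs have been verified, for example with the reverse DNS check Google documents for Googlebot.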

Both usually tell the same story on a large site — a small core crawled constantly, a long tail that barely registers. Once you can see where the budget goes, the pruning decisions make themselves.
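One way to put numbers on that "small core, long tail" shape, assuming you already have per-URL fetch counts (from logs, say) and a list of the URLs you publish. The function name, the 10% cut-off, and the input shapes are all illustrative choices, not a standard metric.

```python
def crawl_concentration(hits, known_paths, top_fraction=0.1):
    """Summarise how unevenly crawler fetches are spread across published URLs.

    hits:         dict mapping path -> number of crawler fetches in the window
    known_paths:  iterable of paths the site publishes (e.g. from the sitemap)
    top_fraction: size of the "core" to measure, as a fraction of known paths
    """
    known = set(known_paths)
    counts = sorted((hits.get(path, 0) for path in known), reverse=True)
    total = sum(counts)
    top_n = max(1, int(len(counts) * top_fraction)) if counts else 0
    top_share = sum(counts[:top_n]) / total if total else 0.0
    never_crawled = sorted(path for path in known if hits.get(path, 0) == 0)
    return {
        "total_fetches": total,
        "top_share": top_share,          # share of fetches going to the top slice
        "never_crawled": never_crawled,  # published but not fetched in the window
    }


if __name__ == "__main__":
    hits = {"/": 90, "/a": 9, "/b": 1}
    known = ["/", "/a", "/b", "/c", "/d", "/e", "/f", "/g", "/h", "/i"]
    report = crawl_concentration(hits, known)
    print(f"top 10% of URLs received {report['top_share']:.0%} of fetches")
    print(f"{len(report['never_crawled'])} URLs were never fetched")
```

The never-crawled list is usually the more actionable output: each entry is either a page that needs better internal links or a candidate for pruning.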

How to spend crawl budget well

Prune thin and duplicate pages, or block them from crawling in robots.txt, so the crawler stops spending fetches on them. A noindex tag alone keeps a page out of the index but not out of the crawl: the page still has to be fetched for the directive to be seen. Collapse redirect chains. Keep your XML sitemap to canonical URLs only; ours is generated from the same canonical dataset as the pages, so it never lists a URL that shouldn't be there. Internal-link the pages that matter so importance is legible, and keep them fast (see Core Web Vitals) so each fetch is cheap.
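A minimal sketch of the canonical-only sitemap idea, not the generator described above. It assumes each page record is a dict with hypothetical keys "url", "canonical", "indexable", and an optional "lastmod"; your dataset will have its own shape.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (bytes) listing only indexable, self-canonical URLs.

    pages: iterable of dicts with hypothetical keys
           'url', 'canonical', 'indexable', and optionally 'lastmod'.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for page in pages:
        url = page["url"]
        is_canonical = page.get("canonical", url) == url
        if not page.get("indexable", True) or not is_canonical or url in seen:
            continue  # skip noindexed pages, duplicates, and non-canonical variants
        seen.add(url)
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if page.get("lastmod"):
            ET.SubElement(entry, "lastmod").text = page["lastmod"]
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        {"url": "https://example.com/a", "canonical": "https://example.com/a",
         "indexable": True, "lastmod": "2025-01-10"},
        {"url": "https://example.com/a?sort=price", "canonical": "https://example.com/a",
         "indexable": True},
        {"url": "https://example.com/thin", "canonical": "https://example.com/thin",
         "indexable": False},
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

Driving the sitemap and the pages from the same records is the point of the design: a URL that is not canonical in the data cannot leak into the sitemap. The sitemap protocol caps a single file at 50,000 URLs, so a large estate would split the output across several files behind a sitemap index.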

Most of this is ordinary platform hygiene — which is exactly why it's so often skipped.

