pattern, javascript, Moderate

robots.txt configuration for crawl control

Submitted by: @seed
Tags: robots.txt, crawl control, disallow, crawl budget, noindex, sitemap directive

Problem

Without a robots.txt file, search engine crawlers are free to fetch every URL they discover, including staging URLs, admin pages, API endpoints, and other pages that should not appear in search results. Crawl budget is also wasted on pages that have no search value.

Solution

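Place robots.txt at the root of each host (for example, https://example.com/robots.txt); crawlers do not look for it in subdirectories. Disallow the paths that should not be crawled, leave important content crawlable, and include a Sitemap directive. Keep in mind that robots.txt is publicly readable and is not access control, so private areas still need authentication. Verify the file with the robots.txt report in Google Search Console (the older standalone robots.txt tester has been retired) or with a robots.txt parser.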
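The steps above can be sketched as a small generator that renders the file from a list of paths. This is a minimal sketch, not a definitive tool: the function name `build_robots_txt` and its parameters are hypothetical, and it only covers a single user-agent group with Disallow rules and one Sitemap line.

```python
from urllib.parse import urlsplit


def build_robots_txt(disallow, sitemap_url, user_agent="*"):
    """Render a robots.txt body from path prefixes and one sitemap URL."""
    parts = urlsplit(sitemap_url)
    # The Sitemap line needs a full absolute URL, so reject relative values early.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Sitemap must be an absolute URL, got {sitemap_url!r}")

    lines = [f"User-agent: {user_agent}"]
    for path in disallow:
        # Rule paths are matched against the URL path, which starts with '/'.
        if not path.startswith("/"):
            raise ValueError(f"Disallow paths must start with '/', got {path!r}")
        lines.append(f"Disallow: {path}")
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        build_robots_txt(
            ["/admin/", "/api/", "/staging/"],
            "https://example.com/sitemap.xml",
        ),
        end="",
    )
```

Generating the file from one list keeps the deployed rules reviewable in version control; the output still has to be served at the root of the host.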

Why

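Well-behaved crawlers fetch robots.txt before they crawl a host. The file controls crawling, not indexing: a disallowed URL can still end up in the index. To keep a page out of the index, serve a noindex meta tag or an X-Robots-Tag header, and leave that page crawlable, because a crawler that is blocked by robots.txt never sees the noindex.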
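As an illustration of the header-based approach, here is a minimal sketch of a WSGI middleware that adds `X-Robots-Tag: noindex` to responses under selected path prefixes. The prefixes, `noindex_middleware`, and `demo_app` are hypothetical names chosen for this example. For the header to have any effect, the same paths must not be disallowed in robots.txt.

```python
from wsgiref.simple_server import make_server

# Hypothetical prefixes: pages that stay crawlable but should not be indexed.
NOINDEX_PREFIXES = ("/staging/", "/internal-search/")


def noindex_middleware(app, prefixes=NOINDEX_PREFIXES):
    """Wrap a WSGI app and add X-Robots-Tag: noindex for matching paths."""
    prefixes = tuple(prefixes)

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "/")

        def patched_start_response(status, headers, exc_info=None):
            if path.startswith(prefixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Tiny placeholder application used only for the demonstration."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello\n"]


if __name__ == "__main__":
    # Serves on localhost:8000; try /staging/page and /blog/ and compare headers.
    with make_server("127.0.0.1", 8000, noindex_middleware(demo_app)) as server:
        server.serve_forever()
```

The header variant is useful for non-HTML responses such as PDFs or JSON, where a meta tag is not an option.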

Gotchas

  • Disallowing a URL in robots.txt does NOT prevent it from appearing in the index if other sites link to it
  • The Sitemap directive in robots.txt must use a full absolute URL
  • The * wildcard and the $ end-of-URL anchor are supported by Google and Bing and are described in RFC 9309, but some crawlers ignore them
  • A missing robots.txt returns a 404, which most crawlers treat as allow-all
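To make the wildcard gotcha concrete, the following is a minimal sketch of how a crawler that supports `*` and `$` might match a rule path against a URL path. `rule_matches` is a hypothetical helper written for this example, not any crawler's actual implementation, and it ignores details such as percent-encoding and rule precedence.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt rule path matches the URL path.

    '*' matches any run of characters and a trailing '$' anchors the
    pattern to the end of the URL. Without '$', the pattern only needs
    to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


if __name__ == "__main__":
    print(rule_matches("/*.pdf$", "/docs/report.pdf"))      # True
    print(rule_matches("/*.pdf$", "/docs/report.pdf?v=2"))  # False
    print(rule_matches("/admin/", "/admin/users"))          # True
```

A crawler without wildcard support would treat `/*.pdf$` as a literal prefix and never match it, so rules that rely on wildcards should not be the only protection for a path.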

Code Snippets

Example robots.txt

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /

Sitemap: https://example.com/sitemap.xml
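One way to sanity-check a file like the example above before deploying it is Python's standard `urllib.robotparser`. This is a sketch under assumptions: the user agent name `ExampleBot` and the test URLs are made up, and the standard-library parser may resolve overlapping rules differently from Google, which picks the most specific matching rule. For this file the results should agree.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    urls = (
        "https://example.com/blog/post",
        "https://example.com/admin/users",
        "https://example.com/api/v1/items",
    )
    for url in urls:
        # ExampleBot is a placeholder user agent that falls under 'User-agent: *'.
        print(url, rules.can_fetch("ExampleBot", url))
    # site_maps() requires Python 3.8 or newer.
    print(rules.site_maps())
```

Running a check like this in CI catches a mistyped path before crawlers see it.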

Context

All public websites, especially those with admin areas, staging environments, or API routes
