robots.txt configuration for crawl control
robots.txt, crawl control, disallow, crawl budget, noindex, sitemap directive
Problem
Without a robots.txt file, crawlers treat every URL on the host as open to crawling. Search engines may then crawl staging URLs, admin pages, API endpoints, and other pages that should not appear in search results, and crawl budget is spent on pages with no search value.
Solution
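Place robots.txt at the root of each host it should govern (for example, https://example.com/robots.txt); crawlers do not look for it in subdirectories, and each subdomain needs its own file. Disallow the paths crawlers should stay out of. Everything not disallowed is allowed by default, so an explicit Allow is only needed to carve an exception out of a disallowed path. Always include a Sitemap directive. Because robots.txt is publicly readable and is not access control, protect admin and staging areas with authentication as well. Verify the deployed file with the robots.txt report in Google Search Console, which replaced the legacy robots.txt Tester.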
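The rules can also be checked locally before deployment. The following is a minimal sketch using Python's standard-library urllib.robotparser; the host example.com, the sample paths, and the helper names load_rules and check_paths are illustrative assumptions. This parser matches plain path prefixes in file order, so results for files that rely on wildcards or longest-match precedence may differ from what a given search engine does.

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the example in this pattern; example.com is a placeholder host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def check_paths(parser: RobotFileParser, base: str, paths: list[str],
                agent: str = "*") -> dict[str, bool]:
    """Map each path to True if the given user agent may crawl it."""
    return {path: parser.can_fetch(agent, base + path) for path in paths}


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    sample = ["/admin/users", "/api/v1/items", "/blog/post"]  # hypothetical paths
    for path, allowed in check_paths(rules, "https://example.com", sample).items():
        print(f"{path}: {'allowed' if allowed else 'disallowed'}")
    print("Sitemaps:", rules.site_maps())  # site_maps() needs Python 3.8+
```

Running a check like this in CI is one way to catch an accidental "Disallow: /" before it reaches production.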
Why
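robots.txt is the first file a well-behaved crawler requests from a host. It controls crawling, not indexing, so it does not guarantee that a page stays out of the index. To exclude a page reliably, use a noindex robots meta tag or an X-Robots-Tag response header, and leave that URL crawlable: a crawler blocked by robots.txt never fetches the page and therefore never sees the noindex.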
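As an illustration of the header-based approach, here is a minimal sketch of a WSGI middleware that adds X-Robots-Tag: noindex to responses under selected path prefixes. The names add_noindex, demo_app, and NOINDEX_PREFIXES, as well as the prefixes themselves, are assumptions made for this example and not part of any framework.

```python
# Hypothetical path prefixes that should stay out of search results.
NOINDEX_PREFIXES = ("/staging/", "/admin/")


def add_noindex(app, prefixes=NOINDEX_PREFIXES):
    """Wrap a WSGI app so responses under the given prefixes carry X-Robots-Tag: noindex."""
    prefixes = tuple(prefixes)

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "/")

        def patched_start(status, headers, exc_info=None):
            if path.startswith(prefixes):
                # The header is only seen if the crawler is allowed to fetch
                # this URL, so these paths must not be disallowed in robots.txt.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Tiny stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


application = add_noindex(demo_app)
```

The same header can usually be set in the web server or CDN configuration instead; the middleware form is shown only because it is self-contained.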
Gotchas
- Disallowing a URL in robots.txt does NOT prevent it from appearing in the index if other sites link to it
- The Sitemap directive in robots.txt must use a full absolute URL
- The * wildcard and the $ end-of-URL anchor are supported by Google and Bing and are defined in RFC 9309, but not every crawler implements them
- A missing robots.txt returns a 404, which most crawlers treat as allow-all
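To make the wildcard gotcha concrete, the sketch below shows how a crawler that does support * and $ would typically interpret a path pattern. The function name rule_matches and the sample paths are assumptions for illustration; it models only single-pattern matching, not precedence between Allow and Disallow rules.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    '*' matches any run of characters. A trailing '$' anchors the pattern to
    the end of the URL; without it, the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each '*' into '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


if __name__ == "__main__":
    checks = [
        ("/*.pdf$", "/files/report.pdf"),
        ("/*.pdf$", "/files/report.pdf?download=1"),
        ("/*.pdf", "/files/report.pdf?download=1"),
        ("/admin/", "/administrator"),
    ]
    for pattern, path in checks:
        print(pattern, path, rule_matches(pattern, path))
```

A crawler without wildcard support may read the same line as a literal prefix instead, so rules that depend on * or $ are best treated as hints for the major search engines only.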
Code Snippets
Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /
Sitemap: https://example.com/sitemap.xml
Context
All public websites, especially those with admin areas, staging environments, or API routes