
Robots.txt is a file used to tell search engines not to crawl certain pages, typically areas of your site you wouldn't want someone searching on Google to stumble upon, like login or admin pages and shopping cart functions.
It is a useful tool but if you’ve never heard of it or aren’t sure what it actually is, then it can seem confusing. So, to help you understand what a robots.txt file is and when you should be using it, here's a quick guide to robots.txt files.
Robots.txt is a simple, plain-text document that sits at the very root of your domain (e.g. https://www.example.com/robots.txt).
Before a crawler, whether that's Googlebot or an AI scraper, starts reading your website, it's programmed to check this file first. The file gives the bot a list of instructions detailing which pages or folders it's allowed to look at, and which ones it needs to ignore.
It operates on a handful of directives, but two core ones do most of the work: User-agent, which names the bot a rule applies to (an asterisk means all bots), and Disallow, which lists the paths that bot shouldn't crawl.
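Here's what a minimal robots.txt looks like in practice (the folder name is just an illustration):

```
# Applies to every crawler
User-agent: *
# Don't crawl anything under /private/
Disallow: /private/
```

A blank Disallow line (Disallow: with nothing after it) means the opposite: the bot is free to crawl everything.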
Every website should technically have a robots.txt file, even if it's just a blank one telling bots they're free to roam. But you should actively configure its rules in the following scenarios:
You don’t want your WordPress login page or your team’s internal portal showing up in Google search results. A simple disallow rule keeps the crawlers away from the backend of your business.
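For a WordPress site, that rule might look like this (adjust the paths to match your own setup; the Allow line is a common pattern that keeps WordPress's AJAX endpoint reachable):

```
User-agent: *
# Keep crawlers out of the login page and admin area
Disallow: /wp-login.php
Disallow: /wp-admin/
# Common exception: admin-ajax.php is used by front-end features
Allow: /wp-admin/admin-ajax.php
```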
If you run an eCommerce site, you likely have product filters (such as sorting by size, colour or price). These filters generate thousands of unique, dynamic URLs and if Google tries to crawl every single combination, it could waste all its time on duplicate pages instead of indexing your actual products. You can use robots.txt to block those URLs.
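A sketch of how that might look, assuming your filters use query parameters such as ?colour= or ?sort= (the parameter names here are examples, not a standard):

```
User-agent: *
# Block faceted-navigation URLs generated by product filters
# (Google and Bing support the * wildcard in paths)
Disallow: /*?colour=
Disallow: /*?sort=
Disallow: /*&price=
```

Check your own filter URLs in a browser first so the patterns match the parameters your site actually generates.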
This is getting more important in 2026. Companies like OpenAI, Google, and Anthropic use web scrapers to gather data for their Large Language Models. If you want to protect your proprietary content and stop your site from being used as free training data, you can use your robots.txt file to specifically block these bots.
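For example, the rules below block the AI crawlers these companies have publicly documented. The list of user-agent strings changes over time, so it's worth checking each vendor's documentation before relying on it:

```
# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Google's token for AI training (separate from normal Googlebot)
User-agent: Google-Extended
Disallow: /

# Anthropic's crawler
User-agent: ClaudeBot
Disallow: /
```

Note that these rules don't affect normal search crawling, so your Google rankings are untouched.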
Google doesn’t have infinite resources, so it assigns your site a ‘crawl budget’. This means it will only crawl a limited number of pages within a given period. If you have a massive website with tens of thousands of pages, you should use robots.txt to block the low-value pages, forcing Google to spend its time crawling your most important, revenue-generating content.
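One hedged sketch of a crawl-budget setup, with illustrative section names standing in for whatever your site's low-value areas actually are:

```
User-agent: *
# Low-value sections that eat crawl budget (example paths)
Disallow: /internal-search/
Disallow: /print-versions/
Disallow: /tag/
# Everything else, including product and category pages, stays crawlable
```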
Many marketers make the mistake of assuming robots.txt hides sensitive information, but it doesn't. A robots.txt file is a ‘Staff Only’ sign, not a lock. Good bots (like Google) will respect the sign, but malicious bots, hackers, and email scrapers will completely ignore it. Plus, the file itself is public: anyone can type /robots.txt into their browser and see exactly what you are telling crawlers to do.
If you have genuinely sensitive data, don’t rely on robots.txt. Instead, put it behind a password-protected login.
If you need a hand using robots.txt files, or with any other aspect of technical SEO, don’t hesitate to get in touch with us today.