A robots.txt file tells major search engines which parts of your website they are allowed to visit.

But even though the major search engines support the robots.txt file, they may not all follow its rules in the same way.

Next, let us break down what the robots.txt file is and how to use it.

What is the robots.txt file?

Robots (also called bots or spiders) visit your website every day. Search engines such as Google, Yahoo, and Bing send these bots to your site so your content can be crawled, indexed, and shown in search results.

Bots are generally a good thing, but in some cases you don't want them running around your website crawling and indexing everything. This is where the robots.txt file comes in.

By adding certain directives to a robots.txt file, you can direct bots to crawl only the pages you want crawled.

However, it is important to understand that not every bot will follow every rule you write in your robots.txt file. Google, for example, will ignore any instructions you place in the file about crawl frequency.

Do you need a robots.txt file?

No, a robots.txt file is not required for a website.

If a bot visits your website and you don't have one, it will simply crawl your website and index pages as it normally would.

The robots.txt file is only needed if you want more control over what is being crawled.

Some of the benefits of having one include:

  • Help manage server overload
  • Prevent wasteful crawling by bots that are visiting pages you don’t want them to visit
  • Keep certain folders or subdomains secret

Can a robots.txt file prevent content indexing?

No, you cannot use the robots.txt file to prevent content from being indexed and displayed in search results.

Not all bots follow instructions in the same way, so some may still index content that you have set not to be crawled or indexed.

In addition, if external links point to the content you are trying to keep out of search results, search engines may still index it.

The only reliable way to ensure your content is not indexed is to add a noindex meta tag to the page. This line of code goes in the <head> of your page's HTML and looks like this:
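<meta name="robots" content="noindex">

This is the standard noindex tag; when a crawler that supports it fetches the page, it will drop that page from its search results.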

It should be noted that if you want search engines to honor a noindex tag on a page, that page must remain crawlable in robots.txt; if the page is blocked, crawlers will never see the tag.

Where is the robots.txt file located?

The robots.txt file always lives at the root of the website's domain. For example, our own file can be found at https://www.hubspot.com/robots.txt.

On most websites, you should be able to access and edit the actual file via FTP or through the file manager in your host's cPanel.

Some CMS platforms let you manage the file from your admin area. HubSpot, for example, makes it easy to customize your robots.txt file from your account.

If you are using WordPress, you can access the robots.txt file in the public_html folder of your website.

The robots.txt file in the public_html folder on the WordPress website

WordPress includes a robots.txt file by default, and a new installation will include the following:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

The above tells all bots to crawl all parts of the website except anything in the /wp-admin/ or /wp-includes/ directories.

But you may want to create a more powerful file. Let us show you how, below.

Uses for a Robots.txt file

There are many reasons you might want to customize your robots.txt file, from controlling crawl budget to blocking certain sections of your website from being crawled and indexed. Let's explore a few of those reasons now.

1. Stop all crawlers

Blocking all crawlers from accessing your site is not something you want to do on a live website, but it is a great option for a development site. When you block crawlers, it helps keep your pages from showing up in search engines, which is ideal if your pages aren't ready for viewing yet.
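A minimal rule set for blocking every crawler from an entire site looks like this (the same pattern appears again in the disallow section below):

# Block all crawlers from the whole site
User-agent: *
Disallow: /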

2. Disallow crawling of certain pages

One of the most common and useful ways to use the robots.txt file is to restrict search engine robots from accessing certain parts of your website. This helps maximize your crawl budget and prevent unwanted pages from appearing in search results.
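For illustration, a rule that keeps every bot away from a single page might look like the sketch below; the /thank-you/ path is just a made-up example:

# Hypothetical example: keep all bots off one page
User-agent: *
Disallow: /thank-you/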

It's important to note that just because you tell a bot not to crawl a page, that doesn't mean the page won't be indexed. If you don't want a page to appear in the search results, you also need to add a noindex meta tag to the page.

Example Robots.txt file directives

A robots.txt file is made up of blocks of directives. Each block starts with a user-agent line, and the rules for that user agent are placed underneath it.

When a particular search engine bot lands on your website, it looks for the user agent that applies to it and reads the block that references it.

You can use several different directives in the file. Let's break them down now.

1. User Agent

The user-agent directive allows you to target certain bots or spiders. If you only want to target Bing or Google, for example, this is the directive you would use.

Although there are hundreds of user agents, the following are examples of some of the most common user agent options.

User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-Mobile
User-agent: Googlebot-News
User-agent: Bingbot
User-agent: Baiduspider
User-agent: msnbot
User-agent: Slurp (Yahoo)
User-agent: Yandex

It is important to note that user agents are case sensitive, so be sure to enter them correctly.

Wildcard user agent


The asterisk lets you easily apply directives to all existing user agents at once. So any time you want a rule to apply to every bot, you can use this user agent.

User-agent: *

Keep in mind that each bot will only follow the block of rules that most specifically applies to it.
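For example, if a file contained both of the blocks below (the directory names are made up), Googlebot would follow only the block addressed to it and ignore the wildcard block, while all other bots would follow the wildcard block:

# Hypothetical example: Googlebot obeys its own block, not the * block
User-agent: Googlebot
Disallow: /google-only-section/

User-agent: *
Disallow: /everyone-section/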

2. Disallow

The disallow directive tells search engines not to crawl or visit certain pages or directories on the website.

Here are a few examples of how to use the disallow directive.

Block access to specific folders

In this example, we tell all robots not to crawl any content in the /portfolio directory on our website.

User-agent: *
Disallow: /portfolio

If we only wanted Bing to stay out of that directory, we would add it like this instead:

User-agent: Bingbot
Disallow: /portfolio

Block PDF or other file types

If you don't want your PDFs or other file types crawled, the following directives should help. Here we tell all bots that we don't want any PDF files crawled. The $ at the end tells search engines that this marks the end of the URL. So if I have a PDF at mywebsite.com/site/myimportantinfo.pdf, search engines will not access it.

User-agent: *
Disallow: *.pdf$

For PowerPoint files, you can use:

User-agent: *
Disallow: *.ppt$

A better option might be to create a folder for your PDFs or other files, disallow crawlers from crawling it, and use a noindex meta tag to keep the content out of search results.

Block access to the entire website

This directive is especially useful if you have a development website or test folders. It tells all bots not to crawl your website at all. It is important to remember to remove this rule when your website goes live, otherwise you will run into indexing issues.

User-agent: *
Disallow: /

The * (asterisk) you see above is what's called a "wildcard" expression. Using it here means the rules that follow apply to all user agents.

3. Allow

The allow directive lets you specify pages or directories that you do want bots to visit and crawl. It can act as an override to the disallow option seen above.

In the example below, we tell Googlebot that we don't want the portfolio directory crawled, but we do want one specific portfolio item to be accessed and crawled:

User-agent: Googlebot
Disallow: /portfolio
Allow: /portfolio/crawlableportfolio

4. Sitemap

Including the location of the sitemap in your file can make it easier for search engine crawlers to crawl your sitemap.

If you submit your sitemap directly to each search engine's webmaster tools, you don't need to add it to your robots.txt file. If you do include it, the directive looks like this:

Sitemap: https://yourwebsite.com/sitemap.xml

5. Crawl delay

The crawl-delay directive tells bots to slow down when crawling your website so your server doesn't get overwhelmed. The example below asks Yandex to wait 10 seconds after each crawl action it performs on the website.

User-agent: Yandex
Crawl-delay: 10

This is a directive to be careful with. On a very large website, it can greatly reduce the number of URLs crawled each day, which can backfire. On smaller sites, however, it can be useful when bots are sending a bit too much traffic. Note: Crawl-delay is not supported by Google or Baidu. If you want to ask their crawlers to slow down on your site, you need to do it through their own tools.

What are regular expressions and wildcards?

Pattern matching is a more advanced way of controlling how bots crawl your website through the use of characters.

There are two common expressions, and both Bing and Google support them. These directives can be especially useful on e-commerce websites.

  • Asterisk: * is treated as a wildcard and can represent any sequence of characters
  • Dollar sign: $ is used to mark the end of a URL

A good example of using the * wildcard is when you want to prevent search engines from crawling pages that contain a question mark. The code below tells all bots not to crawl any URLs containing a question mark.

User-agent: *
Disallow: /*?

How to create or edit a Robots.txt file

If there is no existing robots.txt file on your server, you can easily add one by following the steps below.

  1. Open your preferred text editor to start a new document. Common editors that may already be on your computer are Notepad, TextEdit, or Microsoft Word (just be sure to save the file as plain text).
  2. Add the directives you want to include in the document (see the example file after this list).
  3. Save the file with the name "robots.txt".
  4. Test your file as shown in the next section.
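For reference, here is a small sample file assembled from the directives covered above; the /portfolio path and the sitemap URL are placeholders you would swap for your own:

# Hypothetical example robots.txt
User-agent: *
Disallow: /portfolio
Disallow: *.pdf$

Sitemap: https://yourwebsite.com/sitemap.xml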

Upload your .txt file to your server using FTP or through the file manager in your cPanel. How you upload it will depend on the type of website you have.

In WordPress, you can use plugins such as Yoast, All In One SEO, and Rank Math to generate and edit your file. You can also use a robots.txt generator tool to help you prepare the file, which may help minimize errors.

How to test the Robots.txt file

Before putting the robots.txt code you created live, you will want to run it through a tester to make sure it is valid. This helps prevent issues caused by incorrect directives.

The robots.txt testing tool is only available in the old version of Google Search Console. If your website is not connected to Google Search Console, you will need to do that first. Visit the Google support page, then click the "Open robots.txt tester" button on the page. Select the property you want to test, and you will be taken to a screen like the one shown below.

To test your new robots.txt code, just delete the current content in the box, replace it with your new code, and run the test.

Robots.txt tester on Google Support

Hopefully this article has made you less afraid to dig into your robots.txt file, because doing so is one way to improve your rankings and support your SEO efforts.
