Often overlooked, often under-optimized, often neglected... that is your robots.txt file. No, these are not miniature robots that you build to make your morning coffee, but files that instruct the search engines how to crawl your website. With a robots.txt file, you tell the crawlers exactly where critical files and folders are located so that they do not inadvertently miss them.
1. Be sure your sitemap.xml file is referenced in your robots.txt file.
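For example (placeholder domain; use your own absolute sitemap URL):
Sitemap: https://www.example.com/sitemap.xml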
2. If you have a lot of images on your website or blog, use a standard image sitemap generator and reference that sitemap in your robots.txt file as well. The same goes if you host a lot of videos on your website. These types of sitemap generators are widely available on the internet.
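For example, assuming hypothetical file names for the generated sitemaps, you would add one Sitemap line per file:
Sitemap: https://www.example.com/image-sitemap.xml
Sitemap: https://www.example.com/video-sitemap.xml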
3. Make sure you exclude all script folders, admin folders and other behind-the-scenes files and folders from being crawled.
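A sketch with typical folder names (the exact paths depend on your own setup):
User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
Disallow: /scripts/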
4. Go to Google Webmaster Tools and check that your robots.txt file is set up correctly. Simply log in to Google Webmaster Tools > Configuration > Crawler Access.
5. The most efficient tool for generating your robots.txt file is found directly in Google Webmaster Tools: Site Configuration > Crawler Access > Generate robots.txt. Do not rely on third-party generators.
6. The general syntax used in a robots.txt file is this:
User-agent: *
Disallow: /yourfolder/
Here, User-agent: * means all search agents (Google, MSN, Yahoo, etc.).
Disallow: /yourfolder/ blocks that folder from being crawled. Note that the sub-folders of this folder will not be crawled either.
7. Directly specify your image file location for Google's image crawler (Googlebot-Image).
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
For websites and blogs with lots of images, it is a good idea to specify your image folder to ensure that Google indeed crawls it. Properly optimized image names and alt text are a good source of link juice. A common example for WordPress users is wp-content/uploads/.
8. Typically you set the user agent to *, which is the broadest setting and applies your rules to all bots/user agents. But if you feel you have a unique situation and the standard approach might not be the best for you, reference the full list of user agents here.
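As a rough sketch, the crawler names below are Google's and Bing's main bots, while the folder names are placeholders; each User-agent block carries its own rules:
User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Bingbot
Disallow: /not-for-bing/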
9. Exclude unwanted, expired, or otherwise dead URLs using robots.txt.
If there are URLs you don't want the search engines to crawl, use the following syntax.
User-agent: *
Disallow: /directory/folder/
In the above example, all URLs beginning with /directory/folder/ won't be crawled.
10. A page might still get indexed even if it is excluded via robots.txt, because another website may have linked to that page. If this happens and you still do not want the page indexed, use the meta noindex tag to get it excluded.
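A minimal example of the tag, placed in the <head> section of the page you want kept out of the index:
<meta name="robots" content="noindex">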
11. Robots.txt is not the most reliable method to exclude or include a file or folder on the search engines; mistakes can slip through. The best approach is to use robots.txt AND the meta index/noindex tags together.
12. Make sure you follow the robots.txt syntax as described by the standard here.
13. The robots.txt file is also not the most reliable method to block or remove a URL from the search engines' index. For Google, the best method is to use the URL removal tool inside Google Webmaster Tools to remove the link.
14. To leave yourself notes and comments within robots.txt so that you can remember what you were thinking at a later date, simply use the hash symbol.
Example:
# Comments which will be ignored by crawlers go here.
15. Do not cram all the folder and file names into one line. The correct syntax is to put each folder on its own line:
User-agent: *
Disallow: /donotcrawl/
Disallow: /donotindex/
Disallow: /scripts/
16. URL paths and folder names are case sensitive. Also be sure not to make any typos or mistakes, as otherwise you will just have wasted your time implementing the robots.txt file.
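For example, these two rules are not interchangeable; the first will not block a folder that is actually served as /Images/:
Disallow: /images/
Disallow: /Images/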
17. Using the “Allow” directive.
Some crawlers, such as Google's, support a newer directive called “Allow”. It lets you specify exactly which files or folders should be crawled, even inside an otherwise disallowed path. However, this field is currently not part of the original "robots.txt" protocol.
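A sketch with placeholder folder and file names: everything under /private/ is blocked except the single file explicitly allowed:
User-agent: *
Disallow: /private/
Allow: /private/public-page.html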
18. Robots.txt for Blogger.com
Blogger.com users cannot upload files to the root directory, so they cannot use a robots.txt file. Instead, they can use the robots meta tag to control how bots crawl particular pages.
19. Even if your website lives in a sub-directory, make sure that your robots.txt is in the main root directory. Always; this is a universal standard.
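For example (placeholder domain), crawlers only ever request the file from the first location:
https://www.example.com/robots.txt (correct: root of the domain)
https://www.example.com/blog/robots.txt (ignored by crawlers)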
20. Make sure your robots.txt file has the right access permissions and is not writable by everyone.
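On a typical Linux host, a conservative choice is read access for everyone and write access only for the owner; the exact command and file path depend on your server:
chmod 644 robots.txt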