Robots.txt for SEO (Search Engine Optimization)

With all the websites and blogs I have, I think this topic belongs here. My blog is not focused on SEO (Search Engine Optimization), but it is related, because if you run a website or a blog and use it for your business, then SEO is a legitimate business concern too. A robots.txt file is a text file used to instruct search engine robots (they are just programs that crawl and index websites and blogs) whether or not to crawl your pages, index them, and follow the links on them. So why is it important for search engine optimization? In the early days of the internet, there was no such thing as duplicate content. Webmasters or website owners could spam search engines (e.g. Google, Altavista, Hotbot or Yahoo! Search) and rank very well if they knew how to abuse their weaknesses. How I loved those days, because it was so easy to get visitors to your websites even if you were not much of a geek.
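To give you an idea of what this file looks like, here is a minimal sketch. The file must be named robots.txt and uploaded to the root of your domain (e.g. http://www.yourdomain.com/robots.txt), and lines starting with "#" are just comments:

# Let every robot crawl everything
User-agent: *
Disallow:

# Or block every robot from crawling anything
User-agent: *
Disallow: /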

So what happened is that search engine companies, particularly Google, looked for a solution to fight spamming, and now there is a term called the "duplicate content penalty". Your website gets penalized if it is found to have duplicate content, whether it copies content from another site or you copy your own content and produce many pages from it. Whether you do it on purpose or are unaware that it is happening on your website, your website will be penalized. This is similar to the saying "Ignorance of the Law Excuses No One". If your website is a static HTML site, built with Dreamweaver or MS FrontPage, then you can keep it from having duplicate content unless you create it on purpose. But what about a dynamic one? Examples are WordPress, Joomla, a Dolphin dating site, an eCommerce site or a forum, which mostly run on a PHP/MySQL database. Pages are generated on the fly, and you have no idea how many duplicate pages your website is creating. To find out whether your website has been indexed by Google and whether you have a lot of duplicate content, try typing this into the search box: site:http://www.yourdomain.com

You will see all your indexed pages in Google. If, for example, you have only 100 pages on your site but see 3,000 pages indexed in Google, then obviously you have lots of duplicate content. The solution is to remove those indexed pages from Google. How? You only need to create and upload a text file named robots.txt. What you write in the file depends on which pages you want removed from the search engine. One example is a classified ads website. I'm using a PHP classified ads script on this blog and found out that it created thousands of duplicate pages. Those pages were supposed to be translated versions of my original classified ads pages in other foreign languages, but they were never translated, so they are also in English. What I have now is duplicate content. What I did was type into the search box site:http://www.classifieds.filentrep.com and, after hitting the search button, I found that there is a common variable in all the duplicate page URLs. A sample of those URLs follows:

http://www.classifieds.filentrep.com/index.php?setlang=por&catid=43
http://www.classifieds.filentrep.com/index.php?setlang=ned&catid=43
http://www.classifieds.filentrep.com/index.php?setlang=gre&catid=43

The variable was "setlang=". This variable sets the language translation. The pages were supposed to be in different languages; setlang=gre, for example, should mean the page is in Greek, but it is not. So to exclude those pages from indexing, below is what I put in the robots.txt file:

User-agent: *
Disallow: /*setlang=*

You can use the above sample by replacing "setlang=" with whatever common variable you see in your duplicate pages' URLs. If your site has a mod_rewrite setup that converts some pages to static HTML URLs but others are still in dynamic form, you can just put the "?" query-string character in place of the variable. This is what I did on my community site, which uses the Dolphin software, as shown below:

User-agent: *
Disallow: /*?*
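
That rule is broad: it blocks every URL with a query string from being crawled. Google also honors an Allow directive and, when rules conflict, generally follows the more specific (longer) pattern, so you can carve out an exception for dynamic pages you still want indexed. Below is only a hedged sketch, assuming a hypothetical "page=" parameter you want to keep crawlable; other crawlers may handle wildcards differently, so test before relying on it:

User-agent: *
Disallow: /*?*
Allow: /*?page=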

If you are using blog software like WordPress, duplicate content is a problem there too, but there are many plugins that solve it. Some plugins I use and recommend are Permalink Redirect and WP SEO Master. You also need to tweak your permalinks in the Settings section of your WordPress admin.
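
For permalinks, a common choice is a custom structure like /%postname%/ so posts get readable URLs instead of the default ?p=123 ones. Beyond the plugins, some WordPress owners also add a few generic lines to robots.txt to keep typical duplicate-page sources, such as on-site search results and feeds, out of the index. This is only a rough sketch under those assumptions, not the exact rules the plugins above produce:

User-agent: *
Disallow: /wp-admin/
Disallow: /*?s=
Disallow: /feed/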