Every web designer, developer, and programmer wants their site to be SEO friendly. This article discusses automated search engine robots, sometimes called “spiders” or “crawlers”, which are the seekers of web pages. How do they work? What is it they really do? Why are they important?
With all the fuss about indexing web pages for search engine databases, designers and developers often assume that robots are great and powerful beings. That is completely wrong. Search engine robots have only basic functionality, like that of early browsers, in terms of what they can understand in a web page. Robots just can’t do certain things. Robots don’t understand frames, Flash movies, images or JavaScript. They can’t enter password-protected areas and they can’t click all those buttons you have on your website. They can be stopped cold while indexing a dynamically generated URL and slowed to a stop by JavaScript navigation.
Search engine spiders and robots are pieces of code or software that have only one aim: to seek content on the internet and within every individual web page out there. These tools have a very important role in how effectively search engines operate.
Search engine spiders and robots visit websites and collect the information they need to determine the nature and content of each website, then add that data to the search engine’s index. They follow links from one website to another so that they can keep gathering information continuously. Their ultimate goal is to compile a comprehensive and valuable database that can deliver the most relevant results to the search queries of visitors.
But how exactly do search engine spiders and robots work?
The whole process begins when a web page is submitted to a search engine. The submitted URL is added to the queue of websites that will be visited by the search engine spider. Submission is optional, though, because most spiders will find the content of a web page if other websites link to it. This is why it is a good idea to enhance the link popularity of your website by building reciprocal links and getting links from other sites that cover the same topic as yours.
When the search engine spider visits the website, it checks whether a robots.txt file exists. The file tells the robot which areas of the site are off limits to its probe, such as directories that are of no use to search engines. Most search engine bots look for this text file, so it is a good idea to provide one even if it is blank.
The robots list and store all of the links found on a page and they follow each link to its destination website or page.
The robots then submit all of this information to the search engine, which compiles the data received from all the bots and builds the search engine database. At this stage the search engine’s engineers come in: they write the algorithms used to evaluate and score the information that the bots compiled. Once the information is added to the search engine database, it is available to visitors making search queries.
How Do Search Engine Robots Work?
Think of search engine robots as automated data retrieval programs, traveling the web to find information and links.
When you submit a web page to a search engine at the “Submit a URL” page, the new URL is added to the robot’s queue of websites to visit on its next foray out onto the web. Even if you don’t directly submit a page, many robots will find your site because of links from other sites that point back to yours. This is one of the reasons why it is important to build your link popularity and to get links from other topical sites back to yours.
When arriving at your website, the automated robots first check to see if you have a robots.txt file. This file is used to tell robots which areas of your site are off-limits to them. Typically these may be directories containing only binaries or other files the robot doesn’t need to concern itself with.
Robots collect links from each page they visit, and later follow those links through to other pages. In this way, they essentially follow the links from one page to another. The entire World Wide Web is made up of links, the original idea being that you could follow links from one place to another. This is how robots get around.
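To make the link-collecting step concrete, here is a minimal sketch in Python using only the standard library. It is not how any particular search engine does it; the class name, function name, sample page, and URLs are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content and URL, for illustration only.
page = '<p><a href="/about.html">About</a> <a href="http://example.org/">Friend</a></p>'
links = extract_links(page, "http://example.com/index.html")
print(links)
```

A real robot would put each collected URL into its queue of pages to visit later.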
The “smarts” about indexing pages online comes from the search engine engineers, who devise the methods used to evaluate the information the search engine robots retrieve. When introduced into the search engine database, the information is available for searchers querying the search engine. When a search engine user enters their query into the search engine, there are a number of quick calculations done to make sure that the search engine presents just the right set of results to give their visitor the most relevant response to their query.
You can see which pages on your site the search engine robots have visited by looking at your server logs or the results from your log statistics program. Identifying the robots will show you when they visited your website, which pages they visited and how often they visit. Some robots are readily identifiable by their user agent names, like Google’s “Googlebot”; others are a bit more obscure, like Inktomi’s “Slurp”. Still other robots may be listed in your logs that you cannot readily identify; some of them may even appear to be human-powered browsers.
Along with identifying individual robots and counting the number of their visits, the statistics can also show you aggressive bandwidth-grabbing robots or robots you may not want visiting your website. In the resources section at the end of this article, you will find sites that list names and IP addresses of search engine robots to help you identify them.
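If you do not have a log statistics program, a few lines of Python can give you a rough count of robot visits. This sketch assumes your server writes the common “combined” log format, with the user agent as the last quoted field; the sample log lines and the list of robot names are illustrative assumptions, so adjust both to match your own logs.

```python
import re
from collections import Counter

# Substrings to look for in the user agent field. Googlebot and Slurp are the
# names mentioned in the text; extend the list for your own logs.
KNOWN_ROBOTS = ["Googlebot", "Slurp"]

# Assumes the "combined" log format, where the user agent is the last
# quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_robot_visits(log_lines):
    """Count hits per known robot, matching names case-insensitively."""
    visits = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue
        agent = match.group(1).lower()
        for robot in KNOWN_ROBOTS:
            if robot.lower() in agent:
                visits[robot] += 1
                break
    return visits


# Made-up sample log lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Oct/2005:13:56:01 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '192.0.2.3 - - [10/Oct/2005:13:57:12 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.1 - - [10/Oct/2005:14:02:40 +0000] "GET /b.html HTTP/1.1" 200 900 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]
visits = count_robot_visits(sample_log)
print(visits)
```

Keep in mind that a user agent string can be faked, so a name match is a hint rather than proof of who is visiting.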
How Do They Read The Pages On Your Website?
When the search engine robot visits your page, it looks at the visible text on the page, the content of the various tags in your page’s source code (title tag, meta tags, etc.), and the hyperlinks on your page. From the words and the links that the robot finds, the search engine decides what your page is about. There are many factors used to figure out what “matters” and each search engine has its own algorithm in order to evaluate and process the information. Depending on how the robot is set up through the search engine, the information is indexed and then delivered to the search engine’s database.
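As a rough illustration of what a robot “sees”, the sketch below pulls the title, the named meta tags, and the visible words out of a page with Python’s standard HTML parser. The class name and the sample page are hypothetical, and a real search engine applies far more processing than this.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Gather the title, named meta tags, and visible words of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.words = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            # Script and style content is not visible text, so it is skipped.
            self.words.extend(data.split())


# Hypothetical page, for illustration only.
sample = """<html><head><title>Garden Tools</title>
<meta name="description" content="Spades and rakes">
<script>var hidden = 1;</script></head>
<body><h1>Spades</h1><p>Strong steel spades.</p></body></html>"""
reader = PageReader()
reader.feed(sample)
print(reader.title, reader.meta, reader.words)
```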
The information delivered to the databases then becomes part of the search engine and directory ranking process. When the search engine visitor submits their query, the search engine digs through its database to give the final listing that is displayed on the results page.
The search engine databases update at varying times. Once you are in the search engine databases, the robots keep visiting you periodically, to pick up any changes to your pages, and to make sure they have the latest info. The number of times you are visited depends on how the search engine sets up its visits, which can vary per search engine.
Sometimes visiting robots are unable to access the website they are visiting. If your site is down, or you are experiencing huge amounts of traffic, the robot may not be able to access your site. When this happens, the website may not be re-indexed, depending on the frequency of the robot visits to your website. In most cases, robots that cannot access your pages will try again later, hoping that your site will be accessible then.
What Are Search Engine Spiders?
A spider, also known as a robot or a crawler, is actually just a program that follows, or “crawls”, links throughout the Internet, grabbing content from sites and adding it to search engine indexes.
Spiders can only follow links from one page to another and from one site to another. That is the primary reason why links to your site (inbound links) are so important. Links to your website from other websites will give the search engine spiders more “food” to chew on. The more times they find links to your site, the more times they will stop by and visit. Google especially relies on its spiders to create its vast index of listings.
Spiders find Web pages by following links from other Web pages, but you can also submit your Web pages directly to a search engine or directory and request a visit by their spider. In fact, it’s a good idea to manually submit your site to a human-edited directory such as Yahoo, and usually spiders from other search engines (such as Google) will find it and add it to their database. It can be useful to submit your URL straight to the various search engines as well; but spider-based engines will usually pick up your site regardless of whether or not you’ve submitted it to a search engine.
Controlling Robot Indexing
Robot spiders cannot index unlinked files, so they will ignore all the miscellaneous files you may have in your web server directory. Web publishers can control which directories the robots should index by editing the robots.txt file, and web page creators can control robot indexing behavior using the Robots META tag.
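The Robots META tag sits in a page’s head and usually carries comma-separated directives such as noindex and nofollow. The sketch below shows one way an indexer might read it; the class and function names are made up, and it handles only the index, follow and none directives.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_meta_policy(html):
    """Return (may_index, may_follow) for a page; both default to True."""
    reader = RobotsMetaReader()
    reader.feed(html)
    blocked_all = "none" in reader.directives
    may_index = not (blocked_all or "noindex" in reader.directives)
    may_follow = not (blocked_all or "nofollow" in reader.directives)
    return may_index, may_follow


print(robots_meta_policy('<meta name="robots" content="noindex, follow">'))
```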
Following Links
Local search robot spider indexers locate files to index by following links, just like webwide search engine spiders. You can specify the starting page, and these indexers will request it from the server and receive it just like a browser. The indexer will store every word on the page and then follow each link on that page, indexing the linked pages and following each link from those pages.
Link Problems
Robot spider indexers will miss pages that have been accidentally unlinked from all of your starting points. Spiders will also have problems with JavaScript links, just like webwide search engine robots.
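The link-following behaviour, and the way an unlinked page gets missed, can be sketched with a tiny in-memory “site”. Everything here is hypothetical: the page names, the link map, and the choice of a breadth-first visiting order (real indexers may visit pages in a different order).

```python
from collections import deque

# A hypothetical site: each page maps to the pages it links to.
SITE = {
    "/index.html": ["/products.html", "/about.html"],
    "/products.html": ["/index.html", "/spades.html"],
    "/about.html": [],
    "/spades.html": ["/products.html"],
    "/old-news.html": ["/index.html"],  # no other page links here
}


def crawl(site, start):
    """Return pages in the order a breadth-first spider would index them."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in site.get(page, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order


indexed = crawl(SITE, "/index.html")
missed = sorted(set(SITE) - set(indexed))
print(indexed)
print(missed)
```

Here /old-news.html is never indexed because no page reachable from the starting point links to it.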
Dynamic Elements
Robot spider indexers will receive each page exactly as a browser will receive it, with all dynamic data from CGIs, SSI (server-side includes), ASP (active server pages) and so on. This is vital to some sites, but other sites may find that the presence of these dynamic elements triggers the re-indexing process, although none of the actual text of the page has been changed.
Most site search and webwide search engines can handle dynamic URLs (including question marks ? and other punctuation). However, there are others that will not follow these links: for help building plain URLs, see our page on Generating Simple URLs.
Server Load
Because they use HTTP, robot spider indexers can be slower than local file indexers, and can put more pressure on your web server, as they ask for each page. Some older webservers may crash during this process, either from the number of requests or because they uncover file corruption.
Updating Indexes
To update the index, some robot spiders will query the web server about the status of each linked page by asking for the HTTP header using a “HEAD” request (the usual request for an HTML page is a “GET”). For HEAD requests, the server may be able to send the page header information from an internal cache, without opening and reading the entire file, and so the interaction may be much more efficient. Then the indexer compares the modified date from the header with its own date for the last time the index was updated. If the page has not been changed, it doesn’t have to update the index. If it has been changed, or if it is new and has not yet been indexed, the robot spider will then send a GET request for the entire page, and store every word. An alternate solution is for robot spiders to send an “If-Modified-Since” request with the previous date they have stored as the file date. This HTTP/1.1 header option allows the web server to send back a code if the page has not changed, and the entire page if it has changed.
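Both update strategies can be sketched with Python’s standard library. The code below only builds the requests and compares dates; it does not contact any server. The function names, URL and dates are illustrative assumptions, and the “code” a server sends for an unchanged page is the 304 Not Modified status.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.request import Request


def build_head_request(url):
    """A HEAD request asks for the headers only, not the page body."""
    return Request(url, method="HEAD")


def needs_reindex(last_modified_header, last_indexed):
    """Compare the server's Last-Modified value with our last index time."""
    if not last_modified_header:
        return True  # no date available, so fetch the page to be safe
    return parsedate_to_datetime(last_modified_header) > last_indexed


def build_conditional_get(url, last_indexed):
    """A GET that lets the server reply 304 Not Modified with no body."""
    stamp = format_datetime(last_indexed, usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


# Hypothetical time of the last index run (must be timezone-aware, in UTC).
last_indexed = datetime(2024, 1, 1, tzinfo=timezone.utc)

head = build_head_request("http://example.com/page.html")
conditional = build_conditional_get("http://example.com/page.html", last_indexed)
print(head.get_method(), conditional.header_items())
```

With the HEAD approach the indexer makes the decision itself; with If-Modified-Since the server makes it, which saves one round trip when the page has changed.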
Duplicate Files
Robots should contain special code to check for duplicate pages, due to server mirroring, alternate default page names, mistakes in relative file naming (./ instead of ../, for example), and so on. Some search indexers have powerful algorithms to identify these duplicates and only store and search one copy.
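One simple way to spot exact duplicates is to fingerprint each page’s text and keep only the first URL seen for each fingerprint. This is a minimal sketch, not the “powerful algorithms” real indexers use: it catches only pages whose text is identical after whitespace and case are normalized, and the URLs and page text are made up.

```python
import hashlib


def content_fingerprint(text):
    """Hash the page text after collapsing whitespace and case."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_pages(pages):
    """Keep the first URL seen for each distinct fingerprint."""
    kept = {}
    for url, text in pages:
        kept.setdefault(content_fingerprint(text), url)
    return list(kept.values())


# Hypothetical pages: the first two are the same document under two names.
pages = [
    ("http://example.com/", "Welcome to our  garden shop"),
    ("http://example.com/index.html", "Welcome to our garden shop\n"),
    ("http://example.com/about.html", "About the shop"),
]
unique = unique_pages(pages)
print(unique)
```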
A Standard for Robot Exclusion
This document represents a consensus on 30 June 1994 on the robots mailing list (robots-request@nexor.co.uk), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.
It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.
The latest version of this document can be found on http://www.robotstxt.org/wc/robots.html.
Introduction
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
The Method
The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL “/robots.txt”. The contents of this file are specified below.
This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.
A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.
The choice of the URL was motivated by several criteria:
The filename should fit in file naming restrictions of all common operating systems.
The filename extension should not require extra server configuration.
The filename should indicate the purpose of the file and be easy to remember.
The likelihood of a clash with existing files should be minimal.
The Format
The format and semantics of the “/robots.txt” file are as follows:
The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form “<field>:<optionalspace><value><optionalspace>”. The field name is case insensitive.
Comments can be included in the file using UNIX Bourne shell conventions: the ‘#’ character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
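The format rules above are simple enough to sketch as a small parser. This is an illustrative reading of the rules, not a reference implementation; the function name, the dictionary layout and the sample file are assumptions made for the example.

```python
def parse_robots_txt(text):
    """Split a robots.txt body into records of user agents and disallows."""
    records = []
    current = None
    for raw_line in text.splitlines():
        had_comment = "#" in raw_line
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            # A comment-only line is dropped; a truly blank line ends a record.
            if not had_comment:
                current = None
            continue
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if current is None:
                current = {"user_agents": [], "disallow": []}
                records.append(current)
            current["user_agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
        # Any other field name is unrecognised and therefore ignored.
    return records


# Hypothetical file contents, for illustration only.
sample = """# robots.txt for a hypothetical site
User-agent: webcrawler
User-agent: infoseek
Disallow: /tmp/   # scratch space
# this comment does not end the record
DISALLOW: /cgi-bin/

User-agent: *
Disallow:
"""
records = parse_robots_txt(sample)
print(records)
```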
User-agent
The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is ‘*’, the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the “/robots.txt” file.
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
The presence of an empty “/robots.txt” file has no explicit associated semantics; it will be treated as if it was not present, i.e. all robots will consider themselves welcome.
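You can try the Disallow rules without writing your own parser: Python’s standard library ships urllib.robotparser, which is one existing implementation of this convention. The sketch below feeds it the rules from the examples above; the helper name allowed and the path /anything.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, path, agent="*"):
    """Parse the given robots.txt lines and ask whether path may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)


prefix_rule = ["User-agent: *", "Disallow: /help"]
directory_rule = ["User-agent: *", "Disallow: /help/"]
allow_everything = ["User-agent: *", "Disallow:"]
empty_file = []

for rules in (prefix_rule, directory_rule):
    for path in ("/help.html", "/help/index.html"):
        print(rules[1], path, allowed(rules, path))
print(allowed(allow_everything, "/anything.html"))
print(allowed(empty_file, "/anything.html"))
```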

The most important aspect of your website is content. Without informative content, users will quickly exit your site and not come back. Having plenty of quality content on your site’s pages is a great way to develop visitor loyalty. Returning visitors also increase your conversion rate, because many people do not buy on their first visit. Content will get your pages ranked higher than similar pages with less quality content. Quality content plays a major role in your search engine marketing campaign, and it is important that your site offers relevant information for both search engines and visitors.