Most ordinary users rely on one of the available search engines to find the information they need. But how do search engines provide this information, and where do they collect it from? Most search engines maintain their own database of information. This database covers the sites available on the web and ultimately holds detailed page information for every accessible site. Search engines do this background work by using robots to collect information and maintain the database. They catalog the gathered information and then present it publicly or, at times, for private use.
In this article we discuss the entities that roam the global internet environment, that is, the web crawlers that move around in cyberspace. We will learn:
· What they are and what purpose they serve.
· Pros and cons of using these entities.
· How we can keep our pages away from crawlers.
· Differences between the common crawlers and robots.
In the following portion we divide the discussion into two sections:
I. Search Engine Spider : Robots.txt.
II. Search Engine Robots : Meta-tags Explained.
I. Search Engine Spider : Robots.txt
What is the robots.txt file?
A web robot is a program, or search engine software, that visits sites regularly and automatically and crawls through the web’s hypertext structure by fetching a document and recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages to be crawled by web robots. For that reason they can exclude some of their pages from being crawled by following a standard convention. Most robots abide by the ‘Robots Exclusion Standard’, a set of constraints that restricts robot behavior.
The ‘Robots Exclusion Standard’ is a protocol used by the site administrator to control the movement of robots. When a search engine robot comes to a site, it looks for a file named robots.txt in the root of the domain (http://www.anydomain.com/robots.txt). This is a plain text file that implements the ‘Robots Exclusion Protocol’ by allowing or disallowing specific files or directories. The site administrator can disallow access to cgi, temporary or private directories, and can address rules to particular robots by specifying their user-agent names.
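As a rough illustration of where a robot looks, the following Python sketch derives the robots.txt address from any page URL. The function name robots_url and the sample page path are hypothetical, chosen only for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.anydomain.com/links/listing.html?page=2"))
```

Whatever page the robot starts from, the file is always looked up at the root of the same host.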
The format of the robots.txt file is very simple. Each record consists of two kinds of field: a User-agent field and one or more Disallow fields.
What is User-agent?
This is the technical name a robot uses to identify itself on the web. In the robots.txt file it is used to name the specific search engine robot that a record applies to.
For example:
User-agent: googlebot
We can also use the wildcard character “*” to address all robots:
User-agent: *
This means the record applies to every robot that visits the site.
What is Disallow?
The second field in the robots.txt file is the Disallow field. These lines tell robots which files or directories must not be crawled. For example, to stop a robot from downloading email.htm the syntax is:
Disallow: /email.htm
To prevent crawling through a directory the syntax is:
Disallow: /cgi-bin/
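A minimal sketch of how such Disallow lines behave, using the robots.txt parser from Python's standard library. It assumes paths are written with a leading slash; index.html, search.pl and otherbot are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot
Disallow: /email.htm
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(robot_name, path) returns True when crawling is permitted.
blocked_page = parser.can_fetch("googlebot", "/email.htm")
blocked_dir = parser.can_fetch("googlebot", "/cgi-bin/search.pl")
open_page = parser.can_fetch("googlebot", "/index.html")
# No record names otherbot and there is no * record, so nothing restricts it.
other_robot = parser.can_fetch("otherbot", "/email.htm")

print(blocked_page, blocked_dir, open_page, other_robot)
```

Only the named robot is restricted, and only for the listed page and directory.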
White Space and Comments:
Any text from a # to the end of a line in the robots.txt file is treated as a comment. A comment at the beginning of robots.txt, like the following example, can record which site the file belongs to.
# robots.txt for www.anydomain.com
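The field and comment rules can be sketched as a small reader that drops comments and blank lines and splits each remaining line into a field name and a value. This is an illustrative helper (parse_fields is a hypothetical name), not a complete robots.txt parser.

```python
def parse_fields(text: str) -> list[tuple[str, str]]:
    """Return (field, value) pairs, ignoring comments and blank lines."""
    pairs = []
    for raw in text.splitlines():
        # Everything from '#' to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Field names are case-insensitive, so normalize them to lowercase.
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


sample = """\
# robots.txt for www.anydomain.com
User-agent: *
Disallow: /temp/   # temporary files
"""
print(parse_fields(sample))
```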
Entry Details for robots.txt:
1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field means the record applies to all robots. As nothing is disallowed, all robots are free to crawl everything.
2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots are allowed to crawl all files except those in the cgi-bin, temp and private directories.
3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl any of the directories. “/” matches every path on the site.
4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
The blank line marks the start of a new User-agent record. Dangerbot is blocked entirely; all other bots may crawl every directory except “temp”.
5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot is not allowed to fetch the listing page of the links directory; all other robots may crawl every directory but must not download the email.html page.
6) User-agent: abcbot
Disallow: /*.gif$
To exclude all files of a particular file type (e.g. .gif) we can use the robots.txt entry above.
7) User-agent: abcbot
Disallow: /*?
To stop a web crawler from crawling dynamic pages we can use the robots.txt entry above.
Note: The Disallow field may contain “*” to match any sequence of characters and may end with “$” to mark the end of the name. These wildcards are an extension honored by some crawlers, such as Googlebot, and were not part of the original standard.
E.g.: To exclude all gif files from Google's image crawler while still allowing other image files:
User-agent: Googlebot-Image
Disallow: /*.gif$
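The wildcard behavior in the note can be sketched by translating a Disallow pattern into a regular expression. This is an illustrative matcher written for this article, not the code any search engine uses; rule_to_regex and is_blocked are hypothetical names, and the sample paths are invented.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow pattern with '*' and a trailing '$' into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # '*' matches any run of characters; everything else is literal text.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    # Rules are matched from the start of the path.
    return rule_to_regex(rule).match(path) is not None


print(is_blocked("/*.gif$", "/images/logo.gif"))   # pattern ends at .gif
print(is_blocked("/*.gif$", "/images/logo.jpg"))
print(is_blocked("/*?", "/catalog.php?id=3"))      # any path with a query
print(is_blocked("/*?", "/catalog.html"))
```

Without the trailing “$” a rule is a prefix match, so “/*?” blocks any URL that contains a question mark anywhere after the first slash.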
Disadvantages of robots.txt:
Problem with the Disallow field:
Disallow: /css/ /cgi-bin/ /images/
Different spiders will read the field above in different ways. Some will ignore the spaces and read /css//cgi-bin//images/, and others may consider only /images/ or /css/ and ignore the rest.
The correct syntax is:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
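A small sketch of how such a line could be repaired automatically: split the value on whitespace and emit one Disallow line per path. The helper name split_disallow is hypothetical.

```python
def split_disallow(line: str) -> list[str]:
    """Rewrite 'Disallow: /a/ /b/' as one Disallow line per path."""
    field, _, value = line.partition(":")
    return [f"{field.strip()}: {path}" for path in value.split()]


for fixed in split_disallow("Disallow: /css/ /cgi-bin/ /images/"):
    print(fixed)
```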
Listing all files:
Specifying each and every file name within a directory is a very common mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html
The portion above can be written as:
Disallow: /ab/
Disallow: /op/
A trailing slash means that the whole directory is off limits.
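The same simplification can be sketched in code by reducing each file path to its parent directory. Note the assumption: this is only appropriate when every file in those directories should be blocked, since a directory rule also covers files that were never listed. The helper name collapse_to_directories is hypothetical.

```python
import posixpath


def collapse_to_directories(paths: list[str]) -> list[str]:
    """Replace per-file paths with their parent directories, keeping order."""
    directories = []
    for path in paths:
        directory = posixpath.dirname(path) + "/"
        if directory not in directories:
            directories.append(directory)
    return directories


files = [
    "/ab/cdef.html",
    "/ab/ghij.html",
    "/ab/klmn.html",
    "/op/qrst.html",
    "/op/uvwx.html",
]
for directory in collapse_to_directories(files):
    print("Disallow:", directory)
```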
Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Although field names are not case sensitive, data such as directory and file names are case sensitive.
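A short check of this with the robots.txt parser in Python's standard library. The /Private/ path and the page name are made up for the demonstration, since the record above disallows nothing.

```python
from urllib.robotparser import RobotFileParser

rules = """\
USER-AGENT: REDBOT
DISALLOW: /Private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Upper-case field names and robot name are still recognized.
exact_case = parser.can_fetch("redbot", "/Private/report.html")
# The path is compared as written, so a different case is a different path.
other_case = parser.can_fetch("redbot", "/private/report.html")

print(exact_case, other_case)
```

The rule blocks /Private/ but leaves /private/ open, so a path typed in the wrong case silently protects nothing.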
Conflicting syntax:
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
What happens here? Redbot is allowed to crawl everything, but does this permission override the Disallow field of the general record, or does the Disallow override the permission?
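One way to see how a parser resolves this: Python's standard library robots.txt parser checks records that name a robot before it falls back to the * record, which matches the usual convention that a robot obeys the record addressed to it. Other robots are not guaranteed to behave the same way, so treat this as a sketch of one implementation. The page path and otherbot are invented names.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Redbot has its own record with an empty Disallow, so it may crawl.
redbot_allowed = parser.can_fetch("Redbot", "/page.html")
# Any other robot falls back to the * record, which blocks everything.
otherbot_allowed = parser.can_fetch("otherbot", "/page.html")

print(redbot_allowed, otherbot_allowed)
```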
II. Search Engine Robots : Meta-tags Explained
What is the robots meta tag?
Besides robots.txt, search engines also have another tool for controlling how they crawl web pages. This is the META tag, which tells a web spider whether to index a page and follow the links on it. It can be more useful in some cases, because it can be used on a page-by-page basis. It is also helpful in case you don’t have the permission required to access the server’s root directory and edit the robots.txt file.
We place this tag in the header portion of the HTML.
Format of the Robots Meta tag:
Within the HTML document it is placed in the HEAD section.
<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to…….">
<title>……………</title>
</head>
<body>
Robots Meta Tag options:
There are four options that may be used in the CONTENT portion of the Robots Meta tag. These are index, noindex, follow, nofollow.
The tag in the example above allows search engine robots to index the page and follow all the links on it. If the site admin doesn’t want a page to be indexed or its links to be followed, they can replace “index,follow” with “noindex,nofollow”.
Depending on requirements, the site admin can use the robots meta tag with the following options:
<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Don’t index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page, but don’t follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Don’t index this page, don’t follow links from this page.
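To illustrate how a robot might read these options, here is a minimal sketch using Python's built-in HTML parser. The class and function names are hypothetical, and the sketch assumes that a page with no robots meta tag is treated as index,follow.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_policy(html: str) -> dict:
    """Return whether the page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


page = '<html><head><META NAME="robots" CONTENT="noindex,follow"></head><body></body></html>'
print(robots_policy(page))
```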