Many people who run a search don't understand how the results they see end up
there. Some think that sites are submitted to the engine, while others know
that a piece of software finds the pages. This article explains one piece of
that puzzle: the search engine crawler.
Today's search engines rely on software packages called spiders or robots.
These automated tools are used to search the web to discover new pages.
A brief history of search crawlers
The first crawler was the World Wide Web Wanderer, which appeared in 1993. It
was developed at MIT and its initial purpose was to measure the growth of the
web. Soon after, however, an index was generated from the results, effectively
the first "search engine."
Since then, crawlers have evolved considerably. Early crawlers were simple
creatures, able to index only specific bits of web page data such as meta
tags. Soon, however, search engines realized that a truly effective crawler
needs to index other information as well, including visible text, alt text,
images and even non-HTML content such as PDFs, word processor documents and
more.
How a crawler works
Generally, the crawler gets a list of URLs to visit and store. The crawler
doesn't rank the pages. It only goes out and gets copies, which it stores or
forwards to the search engine to be indexed and ranked later according to
various criteria.
Search crawlers are also smart enough to follow links they find on pages.
They may follow these links as they find them, or store them and visit them
later.
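The fetch, store and follow-links cycle described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `crawl` function, the `LinkExtractor` class and the in-memory `FAKE_WEB` dictionary that stands in for real HTTP requests are all made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first and return {url: html} copies.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the page's HTML, or None on failure.
    """
    frontier = deque(seed_urls)  # URLs waiting to be visited
    seen = set(seed_urls)        # URLs already queued, to avoid repeats
    store = {}                   # copies handed on for later indexing

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html  # no ranking here, just a stored copy

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # store now, visit later
    return store


# A tiny in-memory "web" stands in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/news.html": "<p>No links here.</p>",
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

Starting from a single seed URL, the sketch ends up with copies of all three pages, because the other two were discovered through links. Note that it does no ranking at all; that job belongs to the search engine.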
To date there are dozens of crawlers regularly indexing the web. Some are
specialized crawlers, such as image indexers, while others are more general
and therefore better known.
Some of the best-known crawlers include Googlebot (from Google), MSNBot
(from MSN) and Slurp (from Yahoo!). There is also the Teoma crawler (from Ask
Jeeves), as well as an assortment of crawlers from other engines, such as
shopping engines, blog search engines and more.
Generally, when a crawler comes to visit a site, it first requests a file
called "robots.txt." This file tells the search crawler which files it can
request, and which files or directories it's not allowed to visit.
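To see how a crawler might apply these rules, here is a minimal sketch using the robots.txt parser that ships with Python's standard library. The rules, the directory names and the "ExampleBot" crawler name are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every crawler is kept out of two directories.
SITE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

site_rules = RobotFileParser()
site_rules.parse(SITE_ROBOTS_TXT.splitlines())

# A polite crawler asks before every request.
print(site_rules.can_fetch("ExampleBot", "http://example.com/index.html"))          # True
print(site_rules.can_fetch("ExampleBot", "http://example.com/private/notes.html"))  # False
```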
The file can also be used to limit specific spiders' access to any or all of
the site, and to control how often the crawler visits the site by limiting its
speed or the times when the crawler can visit. (Yahoo!'s Slurp and MSNBot both
support the "Crawl-delay" directive, which tells the crawlers to slow down
their crawling.)
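Per-crawler rules and a crawl delay might look like the hypothetical file in the sketch below, again read with Python's standard robots.txt parser. "BadBot", the 10-second delay and the /search/ directory are assumptions made up for the example, and a delay is only honored by crawlers that choose to support the directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: slow Slurp down, shut one crawler out, allow the rest.
PER_CRAWLER_ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /search/

User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
"""

crawler_rules = RobotFileParser()
crawler_rules.parse(PER_CRAWLER_ROBOTS_TXT.splitlines())

print(crawler_rules.crawl_delay("Slurp"))                             # 10 (seconds between requests)
print(crawler_rules.can_fetch("BadBot", "http://example.com/"))       # False
print(crawler_rules.can_fetch("Googlebot", "http://example.com/"))    # True
print(crawler_rules.crawl_delay("Googlebot"))                         # None, no delay requested
```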
It's not imperative that a site have a robots.txt file, however, as a crawler
will assume it is OK to index the site if there isn't such a file.
Generally, today's crawlers are stripped-down versions of web browsers. Some,
like Googlebot, are reportedly built upon a text-based web browser called
Lynx. Therefore, one of the tools you can use to verify a site is the Lynx
browser. By loading the site in the browser you can see essentially what the
crawler "sees." You can then look for errors in the pages as well as any
navigation problems the crawler may come up against.
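If you don't have Lynx handy, a rough approximation of a text-only view can be sketched with Python's built-in HTML parser. This is not how Lynx or any crawler actually renders a page. The `TextOnlyView` class and the sample page are made up for illustration, and the sketch simply keeps visible text, image alt text and link targets while ignoring scripts and styles.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Keeps visible text, image alt text and link targets; drops the rest."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []      # text a text-only visitor would read
        self.links = []       # link targets the visitor could follow
        self._skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # A text browser shows the alt text in place of the image.
            self.chunks.append("[" + (attrs.get("alt") or "IMAGE") + "]")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


SAMPLE_PAGE = """
<html><head><title>Widgets</title>
<script>var menu = buildMenu();</script></head>
<body><img src="logo.gif" alt="Acme Widgets">
<h1>Our products</h1>
<a href="/catalog.html">Catalog</a>
<img src="spacer.gif"></body></html>
"""

view = TextOnlyView()
view.feed(SAMPLE_PAGE)
print(" ".join(view.chunks))  # Widgets [Acme Widgets] Our products Catalog [IMAGE]
print(view.links)             # ['/catalog.html']
```

The image without alt text shows up as a bare `[IMAGE]` placeholder, and a menu built by the script never appears at all. Those are the kinds of navigation problems a text-only check is meant to reveal.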
One other thing you may notice, as you view your web server log reports, is
that some crawlers visit many different times and with many different
configurations.
Yahoo!'s Slurp, for example, emulates many different platforms, from Windows
98 to Windows XP, and many different browsers, from Internet Explorer to
Mozilla. MSNBot also works like this, emulating different operating systems
and browsers.
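One way to spot these visits is to look at the user agent recorded with each request in your server log. The sketch below assumes the common "combined" log format, where the user agent is the last quoted field. The log lines are invented for this example, and the matching tokens are assumptions that should be checked against each engine's own documentation.

```python
import re
from collections import Counter

# Substrings used to recognize each crawler's user agent (assumed tokens).
CRAWLER_TOKENS = {
    "googlebot": "Googlebot",
    "slurp": "Slurp",
    "msnbot": "MSNBot",
}

# In the "combined" log format the user agent is the last quoted field.
USER_AGENT = re.compile(r'"([^"]*)"\s*$')


def crawler_name(log_line):
    """Return the crawler that made the request, or None for other visitors."""
    match = USER_AGENT.search(log_line)
    if not match:
        return None
    agent = match.group(1).lower()
    for token, name in CRAWLER_TOKENS.items():
        if token in agent:
            return name
    return None


# Hypothetical log lines, invented for this example.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Oct/2005:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.3 - - [10/Oct/2005:13:57:12 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.1 - - [10/Oct/2005:13:58:40 -0700] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

crawler_visits = Counter(
    name for name in map(crawler_name, SAMPLE_LOG) if name is not None
)
print(crawler_visits)  # Counter({'Googlebot': 2, 'Slurp': 1})
```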
They do this to ensure compatibility – after all, the search engines want to
be sure that the majority of their users find a site which they can use.
Therefore, as a design tip, you should test your site against various hardware
platforms and browsers as well. You don't have to use the variety that the
search engines use, but you should test against Internet Explorer, Netscape and
Firefox. Also, you should try your site on other platforms such as a Mac or
Linux just to ensure compatibility.
You may also notice, upon reviewing your reports, that crawlers like
Googlebot will visit often and request the same page(s) repeatedly. This is
common, as crawlers want to be sure the site is stable and to measure how
frequently the page changes.
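How a search engine actually measures change frequency isn't public, but the basic idea can be sketched by fingerprinting each copy of a page and counting how often the fingerprint differs between visits. The function names and the sample snapshots below are assumptions made up for this example.

```python
import hashlib


def fingerprint(html):
    """Return a digest that changes whenever the page content changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def change_rate(snapshots):
    """Fraction of consecutive visits on which the page differed."""
    digests = [fingerprint(page) for page in snapshots]
    if len(digests) < 2:
        return 0.0
    changes = sum(
        1 for before, after in zip(digests, digests[1:]) if before != after
    )
    return changes / (len(digests) - 1)


# Five visits to the same page; the content changed twice in between.
page_snapshots = ["<p>v1</p>", "<p>v1</p>", "<p>v2</p>", "<p>v2</p>", "<p>v3</p>"]
print(change_rate(page_snapshots))  # 0.5
```

A page with a high rate would be worth revisiting more often than one that never changes, which is one plausible reason the same page shows up in your logs again and again.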
If your site goes down temporarily when a crawler visits repeatedly like
this, don't worry. The crawlers are smart enough to leave and come back later
and try again. If, however, they continue to find the site down, or slow to
respond, they may opt to stay away for longer periods, or index the site more
slowly. This can negatively impact your site's performance in the search
engines.
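The "come back later, and stay away longer if the problem persists" behavior resembles a retry schedule with increasing delays. The sketch below is purely illustrative: real crawlers don't publish their schedules, and the doubling rule, the one-hour base and the one-week cap are all assumptions.

```python
def next_visit_delay(consecutive_failures, base_hours=1.0, max_hours=168.0):
    """Hours to wait before the next attempt: doubles per failure, capped."""
    if consecutive_failures <= 0:
        return base_hours
    return min(base_hours * 2 ** consecutive_failures, max_hours)


def simulate(outcomes):
    """Replay visit outcomes (True means the site responded) and return
    the delay chosen after each visit."""
    failures = 0
    delays = []
    for site_was_up in outcomes:
        failures = 0 if site_was_up else failures + 1
        delays.append(next_visit_delay(failures))
    return delays


# One good visit, three failed ones, then the site is back.
print(simulate([True, False, False, False, True]))  # [1.0, 2.0, 4.0, 8.0, 1.0]
```

Under these assumptions a brief outage costs little, while a site that stays down gets visited less and less often until it recovers.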
As time goes on, we'd expect these spiders to become even more advanced. As
new authoring technologies or new indexing options become available, the
search crawlers will be adapted to them. Remember, the goal of all the search
engines is to have the most complete index of files found on the web. This
means they want to be able to index more than just web pages.
So as you are designing your site, be sure to keep the crawlers in mind.
Don't build your site for crawlers – build it for users – but be sure to test it
thoroughly so that the crawlers see what you want them to without hindrances or
roadblocks. Remember: the crawler is a site owner's best friend.
About the Author
Rob Sullivan - SEO Specialist and Internet Marketing Consultant.
http://www.textlinkbrokers.com