How Search Engines Work
Search Engines
Imagine yourself in a library without a librarian. How would you find the
book you have been looking for? It would be practically impossible. Search engines,
like librarians, index the websites on the World Wide Web, store the web pages
they find in their databases, and present a list of websites that match the
word or words, known as “keywords”, typed in by the user. The major search engines
today are Google, MSN Search, AltaVista, AOL Search and All The Web.
These search engines make your search for the right site or content much easier and quicker.
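The idea of indexing pages and matching them against keywords can be sketched as a small inverted index. This is only an illustration under simple assumptions: the page URLs and text in `PAGES` are made up, and the helper names `build_index` and `search` are hypothetical. Real search engines use far more elaborate data structures.

```python
from collections import defaultdict

# Hypothetical mini-collection of pages; URLs and text are made up for illustration.
PAGES = {
    "example.com/guitars": "handmade guitar inlays and custom guitar parts",
    "example.com/library": "library catalogue of books on search engines",
    "example.com/search": "how search engines index web pages",
}

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every keyword in the query, sorted by URL."""
    keywords = query.lower().split()
    if not keywords:
        return []
    matches = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        matches &= index.get(word, set())
    return sorted(matches)

index = build_index(PAGES)
print(search(index, "search engines"))
```

The lookup touches only the index, not the pages themselves, which is why a search engine can answer a query without re-reading the web each time.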
The Crawling Process:
Search engines make use of a software program called a “spider” or “robot”,
which “crawls” the web, that is, parses the text of web pages across the web
so that they can later appear in the results. Robots read through the text in order
to assess the relevancy, reliability and importance of the content of web pages,
then index the pages and store them in their database. Search engine robots generally
cannot read text embedded in graphics, frames, Flash or similar technologies.
The process by which the search engines read and store the web pages in their
database is known as the “crawling” process.
Crawlers move from one web page, usually starting with the home page, to
other pages through the links found on the previously crawled page.
Robots often start their search with directories, which are websites holding a
large number of links to other websites, sorted into categories by human
editors. Hence, sites that have no inbound links and are not listed in
any of the directories are unlikely to be crawled by the robots.
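The link-following behaviour described above can be sketched as a breadth-first crawl. To keep the sketch runnable without network access, it assumes a made-up site held in the `SITE` dictionary; the page paths and the names `LinkExtractor` and `crawl` are hypothetical, not part of any search engine's software.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical site held in memory so the sketch runs without network access.
SITE = {
    "/": '<a href="/products">Products</a> <a href="/about">About</a>',
    "/products": '<a href="/">Home</a> <a href="/products/inlays">Inlays</a>',
    "/about": '<a href="/">Home</a>',
    "/products/inlays": "<p>No outgoing links here.</p>",
    "/orphan": '<a href="/">Home</a>',  # no other page links to this one
}

class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, start="/"):
    """Breadth-first crawl from `start`; return pages in the order visited."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url not in site:
            continue  # link points outside the known site
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

crawled = crawl(SITE)
print(crawled)
```

In this sketch `/orphan` is never visited, because no crawled page links to it. That mirrors the point that a page without inbound links is hard for a robot to discover.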
Indexing Algorithm:
The set of parameters by which web pages are indexed and ranked is
called the “indexing algorithm”. Search engines apply it to the pages their crawlers collect.
One well-known example is PageRank™, which Google uses to weigh pages by the links
pointing to them. These algorithms consider criteria such as the title of the web page,
the use of keywords in the title and the way they are used in the content, the relevancy
of the content to the keyword typed in, how many sites link to that website, and many
other parameters when producing the results. Different search engines have different
indexing algorithms and hence give different results for the same keyword.
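One of the criteria above, how many sites link to a page, can be illustrated with a simplified, textbook-style PageRank iteration. This is a sketch and not Google's actual implementation: the link graph `LINKS`, the damping factor of 0.85, the fixed iteration count and the function name `pagerank` are all assumptions made for the example, and every link target is assumed to be a key of the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: "c" has the most inbound links, "d" has none.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph the page with the most inbound links ends up with the highest score and the page nobody links to ends up with the lowest, which is the intuition behind counting links as a ranking signal.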
Spiders crawl a website periodically and note the changes
made to any of its pages. When producing results, search engines also give importance
to sites that are updated frequently. There is an optional meta tag, often called the
“revisit” tag (revisit-after), that is intended to suggest the time period after which
search engines should crawl your page again. Support for it varies between search engines,
so treat it as a hint rather than a guarantee, and use it only if you actually update your
content within the specified period. If a robot revisits your site based on the revisit tag
but still finds the old content, this could count against your rankings. Spiders update
their index on finding an updated web page, though the index update may not occur
as soon as the updated web page is crawled.
Managing the Spider:
You can tell the robot which of your pages should be indexed and which
should not by using the Robots Meta tag, a special Meta tag
placed within the head section of a page. The same tag can also specify whether
the robot should or should not follow the links on that page.
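How a robot might read the Robots Meta tag can be sketched as follows. The sample page in `PAGE` and the names `RobotsMetaParser` and `robots_policy` are hypothetical. The sketch only handles the common `noindex`, `nofollow` and `none` values, and individual search engines may interpret further values.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def robots_policy(html):
    """Return whether a page may be indexed and whether its links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    blocked_all = "none" in parser.directives  # "none" means noindex, nofollow
    return {
        "index": not blocked_all and "noindex" not in parser.directives,
        "follow": not blocked_all and "nofollow" not in parser.directives,
    }

# Hypothetical page that asks robots not to index it but to follow its links.
PAGE = """<html><head>
<title>Private page</title>
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>"""

policy = robots_policy(PAGE)
print(policy)
```

A page with no Robots Meta tag yields `index` and `follow` both true in this sketch, matching the usual default that pages are indexed and links are followed unless stated otherwise.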
Through a file named “robots.txt”, written in any text editor and placed in the
root directory of the site, you can control which portions of your website should be
crawled by the robot and which should not. Because a robot reads this file first when
it visits a site, you can also specify which robots may crawl your website,
or prevent a specific search engine from crawling it.
The file thus works as a gatekeeper between your website and the different search
engine robots, although it is only a set of instructions that well-behaved robots choose to follow.
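The effect of such a file can be checked with `urllib.robotparser` from Python's standard library. In this sketch the file contents in `ROBOTS_TXT`, the robot names `GoodBot` and `BadBot`, and the `www.example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one robot entirely, and keep
# every other robot away from the /private/ directory.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given robot may fetch a given URL.
good_home = parser.can_fetch("GoodBot", "https://www.example.com/index.html")
good_private = parser.can_fetch("GoodBot", "https://www.example.com/private/report.html")
bad_home = parser.can_fetch("BadBot", "https://www.example.com/index.html")

print(good_home, good_private, bad_home)
```

Here `GoodBot` may fetch the home page but not anything under `/private/`, while `BadBot` is refused everywhere. The file only states the site owner's wishes; it is up to each robot to check it and comply.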
In short, search engines use a set of programs to find,
index and list the websites available on the World Wide Web.