by Jayaram V
How Search Engines Work
Search engines crawl the World Wide Web to gather information about websites
and their content. This is usually done by robots, also called crawlers or
bots: complex programs that roam the internet at great speed, doing what
ordinary browsers do but with much greater efficiency and capacity. The
information gathered is then passed on to indexers, which index the content
according to a set of business rules, algorithms and other criteria and store
it in huge databases. Each search engine company develops its own rules and
criteria for organizing the data it collects, based upon the business model it
has chosen. Once the data is organized, client mechanisms such as search forms
can be used to access it using keywords and various other criteria.
A typical search engine usually has four components. Together they constitute
what we understand as the search engine mechanism.
- Information gathering mechanism
- Indexing mechanism
- Ranking mechanism
- Retrieval mechanism
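The four components listed above can be sketched in a few lines of Python.
This is only a toy illustration, not how any real search engine is built: the
pages live in a hypothetical in-memory dictionary instead of being crawled, the
ranking rule (counting keyword occurrences) is an assumption made for
simplicity, and all function names are invented for this example.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text. A real crawler would
# fetch these pages over HTTP and follow their links.
PAGES = {
    "example.org/a": "hindu temples and hindu festivals",
    "example.org/b": "buddhist temples in asia",
    "example.org/c": "festivals of asia",
}

def gather(pages):
    """Information gathering: yield (url, text) pairs, standing in for a crawler."""
    yield from pages.items()

def build_index(documents):
    """Indexing: map each word to the URLs containing it, with a count per URL."""
    index = defaultdict(dict)
    for url, text in documents:
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def rank(index, keywords):
    """Ranking: score each URL by how often the keywords occur in it."""
    scores = defaultdict(int)
    for word in keywords:
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties broken by URL so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))

def search(index, query):
    """Retrieval: turn a query string into keywords and return ranked URLs."""
    return rank(index, query.lower().split())

INDEX = build_index(gather(PAGES))
print(search(INDEX, "hindu temples"))  # ['example.org/a', 'example.org/b']
```

The first page ranks higher because it contains both keywords and repeats one
of them. Real ranking mechanisms weigh many more signals than this.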
Limitations of Search Engines
The search engine business is very cost intensive because of the amount of
work involved in gathering and indexing information and keeping it up to date.
To accomplish this task, search engine companies have to invest heavily in
state-of-the-art technology and in technically qualified staff to maintain,
manage and manipulate the information and make it useful, convenient and
meaningful for end users. The fast-expanding World Wide Web, with its
complexity and incongruity, poses a multitude of problems for search engine
companies in managing information and keeping their technologies scalable and
effective. Government interference, internet threats, cyber crime, linguistic
and regional variations, cross-cultural issues, the absence of uniform global
internet policies, usability issues and people's unwillingness to pay for
search are some of the serious issues that threaten the viability of the
search engine business and make it one of the most difficult to manage on a
long-term basis without recourse to search-based ads and paid listings. While
these alternatives save the companies from financial problems, there is no
guarantee that they do not undermine the quality of the information provided.
Despite the advances made in search engine technology, most search engines
do not have the necessary means to keep pace with the vast amount of data
that is constantly being added to the World Wide Web and the new websites that
are hosted every day. This results in inefficiencies in the way search engines
work, which are discussed below.
- Search engines have built-in limitations in responding to users' queries
because of the limitations of their indexing mechanisms or the algorithms they
use. They may also respond differently to each keyword, or combination of
keywords, letters and symbols, depending upon how they are programmed.
- Because of the limitations in processing and indexing information, and the
time and cost involved in removing irrelevant and useless information to keep
the indexes clean and up to date, a substantial portion of the content
available on the World Wide Web is either outdated or outside the reach of the
search engines and the public who use them. The so-called invisible web is
considered to be two to three times larger than the visible web.
- Search engines distribute information across several servers to manage load,
and not all of these servers are updated or available at the same time. So the
results of a search query may vary depending upon which server receives it.
- Most search engines limit the number of pages they crawl on a website. Even
on the pages they do crawl, they index only a certain portion of the content
and links. Google, for example, indexes the first 101 KB of a web page and the
first 120 KB of a PDF.
- Since most websites do not keep reliable date stamps, or records of the dates
on which they add or modify their content, searching by date in a search
engine is unreliable.
- Indexing is usually a long-drawn-out process and may take days or weeks
before the information is processed and made available to the public. So the
information is not always the latest.
- Spamming, keyword manipulation and search engine optimization techniques
dilute and slow down the efforts of search engines to maintain quality.
- Paid submission policies, such as those used by Yahoo and other companies,
and paid listings compromise the quality of results and the ranking of
websites on merit.
- The rules that search engines have evolved to deal with duplicate content on
the web often work against the original providers of the information. Because
of the limitations in date stamping, search engines do not have a reliable
mechanism to distinguish original content from duplicates. As a result,
providers of original content often suffer from illegal copying and
reproduction.
- The ranking criteria used by search engines do not necessarily
bring up the best websites in each category. Hinduwebsite is
one good example.
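The effect of a size limit on indexing, as in the 101 KB example in the list
above, can be illustrated with a short sketch. The helper `indexed_words` and
the tiny 101-byte limit are hypothetical stand-ins chosen so the example stays
small; they do not describe any search engine's actual implementation.

```python
def indexed_words(page_text, limit_bytes):
    """Return the set of words an indexer would see if it read only the
    first limit_bytes bytes of the page (a hypothetical helper)."""
    visible = page_text.encode("utf-8")[:limit_bytes]
    # errors="ignore" drops a multi-byte character cut in half by the limit.
    return set(visible.decode("utf-8", errors="ignore").lower().split())

# A page whose final word lies beyond the limit.
page = "introduction " * 10 + "conclusion"

words = indexed_words(page, limit_bytes=101)  # tiny stand-in for a 101 KB cap
print("introduction" in words)  # True
print("conclusion" in words)    # False
```

Anything placed after the cut-off point, such as the word "conclusion" here,
never reaches the index and so cannot be found by a keyword search.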
Directory Services
A directory is a database of information about websites and their pages,
organized into categories, usually alphabetically, according to a set of
predefined criteria. The work is usually done by humans rather than by machines
and automated software. Users navigate the directory through a series of menus
arranged in a predictable manner to find the information they want. Unlike
search engines, which require state-of-the-art technology to gather and index
information, the creation and maintenance of a directory requires a great deal
of manpower to organize, evaluate and categorize information. Hence directories
are slow to develop and usually smaller than the indexes created by commercial
search engine companies. One of the best examples of a web directory is the one
maintained by dmoz.org, which, being a public domain, non-commercial directory,
is used by several search engines and by websites like Hinduwebsite.com.
Among commercial directories, Yahoo's is perhaps the best known and the
largest. Besides general directories, there are also specialized directories
dealing with a specific subject or category, also known as metasites.
The Directory vs. Search Engine
Directories are very useful when you are researching a general topic, a
popular category or a particular subject. For example, if you are looking for
information on religion, you can go to the society and culture section of a
directory to begin your search. If you are looking for information on a
particular religion such as Hinduism or Buddhism, you can scroll down the
religion category in the directory and easily locate links to it. Besides
categories of information, directory services usually provide an internal
search engine with which you can look for information within the directory
using a keyword or combination of keywords. Search engines are more useful
when you are looking for in-depth, more recent or more specialized information
on a subject, or for information that is beyond the scope of the categories in
a directory. The standard practice is to begin your search with directories
and then move on to search engines.
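The two ways of using a directory described above, browsing through categories
and searching within the directory by keyword, can be sketched as follows. The
category tree, the `_sites` key and the example.org addresses are all
hypothetical and serve only as an illustration.

```python
# Hypothetical directory: nested categories, with site listings under "_sites".
DIRECTORY = {
    "Society and Culture": {
        "Religion": {
            "Buddhism": {"_sites": ["buddhism.example.org"]},
            "Hinduism": {"_sites": ["hinduism.example.org"]},
        },
    },
    "Science": {"_sites": ["science.example.org"]},
}

def browse(directory, path):
    """Follow a list of category names, as a user would click through menus."""
    node = directory
    for category in path:
        node = node[category]
    return node

def subcategories(node):
    """List the category names under a node in alphabetical order."""
    return sorted(name for name in node if name != "_sites")

def find(directory, keyword, trail=()):
    """Internal search: return category paths whose name contains the keyword."""
    matches = []
    for name in subcategories(directory):
        here = trail + (name,)
        if keyword.lower() in name.lower():
            matches.append(" > ".join(here))
        matches.extend(find(directory[name], keyword, here))
    return matches

religion = browse(DIRECTORY, ["Society and Culture", "Religion"])
print(subcategories(religion))         # ['Buddhism', 'Hinduism']
print(religion["Hinduism"]["_sites"])  # ['hinduism.example.org']
print(find(DIRECTORY, "hindu"))        # ['Society and Culture > Religion > Hinduism']
```

Browsing works well when you already know which category a topic belongs to.
The keyword search helps when you do not.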
Meta Search Engines
Meta search engines do not use their own crawlers or databases to gather and
index information. Instead, they use a complex set of routines to query the
databases that various search engines make publicly available, and present the
results to the public in an organized way. The advantage of meta search tools
is that you can access several search engine databases and subject directories
simultaneously, without doing individual searches, and see the results
displayed in one place. The main disadvantage is that the results are not
necessarily comprehensive. First, a meta search tool can only fetch results
from as many search engines as time, technology and resources permit.
Secondly, because of the limits each search engine places on retrieving
information, you may not always get the best results or all the results.
Besides, meta search tools retrieve information through simple search
routines, so they are not ideal for advanced searches. Despite these
limitations, if you want an overview or a comparative view of how each search
engine responds to a particular keyword or set of keywords, meta search
engines are the best place to start.
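The idea of sending one query to several search engines and showing the
results in one place can be sketched as below. The three engine functions are
hypothetical stand-ins that return fixed lists, and the merging rule (adding up
the reciprocal of each result's position) is an assumption chosen for
simplicity, not the method of any particular meta search tool.

```python
from collections import defaultdict

# Hypothetical stand-ins for real search engines: each takes a query and
# returns its own ranked list of URLs. A real tool would call each engine
# over the network.
def engine_a(query):
    return ["site1.example", "site2.example", "site3.example"]

def engine_b(query):
    return ["site2.example", "site4.example"]

def engine_c(query):
    return ["site2.example", "site1.example"]

def meta_search(query, engines):
    """Send one query to every engine and merge the ranked lists."""
    scores = defaultdict(float)
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            # A result near the top of any engine's list earns more credit.
            scores[url] += 1.0 / position
    # Best combined score first; ties broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

print(meta_search("hinduism", [engine_a, engine_b, engine_c]))
# ['site2.example', 'site1.example', 'site4.example', 'site3.example']
```

A site returned by several engines rises to the top of the merged list, while
a site known to only one engine can still appear lower down.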
Specialized Search Tools
| Academic | Education | Biblio | Business Intelligence | Other |
| Health | Kids | Legal | Media/Music | News/Blogs |
| Public Records/People | Reference | Religion | Statistics | Politics |