A Complete Guide To Search Engines

Information Technology Resources

by Jayaram V

How Search Engines Work

Search engines crawl the world wide web to gather information about websites and their content. This is usually done by programs known as robots, crawlers or bots, which roam the internet doing what ordinary browsers do but with far greater speed, efficiency and capacity.

The information so gathered is then passed on to indexers, which index the content according to a set of business rules, algorithms and other criteria and store it as indexed data in huge databases. Each search engine company develops its own rules and criteria for organizing the data it collects, based on the business model it has chosen. Once the data is organized, client mechanisms such as search forms can be used to access it using keywords and various other criteria.

A typical search engine has four components. Together they constitute what we call the search engine mechanism.

  1. Information gathering mechanism.
  2. Indexing mechanism.
  3. Ranking mechanism.
  4. Retrieval mechanism.
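
The four components can be illustrated with a toy example. This is only a minimal sketch, not how any real search engine is built: the `PAGES` dictionary stands in for the crawled web, the URLs are made up, and ranking by simple word counts is an assumption chosen for brevity.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text.
# A real crawler would fetch these pages over the network.
PAGES = {
    "a.example/hinduism": "hinduism is a religion of india",
    "b.example/buddhism": "buddhism is a religion that began in india",
    "c.example/search": "search engines index the web",
}

def gather(pages):
    """1. Information gathering: yield (url, text) pairs, standing in for a crawler."""
    yield from pages.items()

def build_index(documents):
    """2. Indexing: map each word to the URLs that contain it, with a count per URL."""
    index = defaultdict(dict)
    for url, text in documents:
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def rank(index, terms):
    """3. Ranking: score each URL by how often the query terms occur in it."""
    scores = defaultdict(int)
    for term in terms:
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

def search(index, query):
    """4. Retrieval: turn a keyword query into a ranked list of URLs."""
    return rank(index, query.lower().split())

INDEX = build_index(gather(PAGES))
print(search(INDEX, "religion india"))
# prints ['a.example/hinduism', 'b.example/buddhism']
```

Each function corresponds to one component in the list above, so any of them could be replaced (for example, a different ranking rule) without touching the others.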

Limitations of Search Engines

The search engine business is very cost-intensive because of the amount of work involved in gathering and indexing information and keeping it up to date. To accomplish this, search engine companies have to invest heavily in state-of-the-art technology and in technically qualified staff who maintain and manage the information and make it useful, convenient and meaningful for end users. The fast-expanding world wide web, with its complexity and incongruity, poses many challenges to search engine companies in managing information and keeping their technologies scalable and effective. Government interference, internet threats, cyber crime, linguistic and regional variations, cross-cultural issues, the absence of uniform global internet policies, usability issues and people's unwillingness to pay for search are some of the serious issues that threaten the viability of the search engine business. They make it one of the most difficult businesses to sustain over the long term without recourse to search-based ads and paid listings. While these alternatives save the companies from financial problems, there is no guarantee that they do not undermine the quality of the information the companies provide.

Despite the advances made in search engine technology, most search engines do not have the necessary means to keep pace with the vast amount of data that is constantly being added to the world wide web and the new websites that are hosted every day. This results in some inefficiencies in the way search engines work, which are discussed below.

1. Search engines have built-in limitations in responding to users' queries because of the limitations of their indexing mechanisms or the algorithms they use. They may also respond differently to each keyword or combination of keywords, letters and symbols, depending on how they are programmed.

2. Because of the limitations in processing and indexing information, and the time and cost involved in removing irrelevant and useless information to keep the indexes clean and up to date, a substantial portion of the content available on the world wide web is either outdated or outside the reach of the search engines and the public who use them. The so-called invisible web is considered to be two to three times larger than the visible web.

3. Search engines distribute information across several servers to manage load, and not all of them are updated or available at the same time. So the results of a search query may vary depending on which server receives your query.

4. Most search engines limit the number of pages they crawl on a website. Even for the pages they do crawl, they index only a certain portion of the content and links on a page. Google, for example, indexes the first 101KB of a Web page and the first 120KB of a PDF.

5. Since most websites do not keep reliable date stamps, or records of the dates on which they add or modify their content, the date searching capability of search engines is unreliable.

6. Indexing is usually a long-drawn-out process and may take days or weeks before the information is processed and made available to the public. So the information is not always the latest.

7. Spamming, keyword manipulation and search engine optimization techniques dilute and slow down the efforts of search engines to maintain quality.

8. Paid submission policies used by Yahoo and other companies and paid listings compromise the quality and the actual ranking of websites based on merit.

9. The rules developed by search engines to deal with duplicate content on the web often work against the original providers of the information. Search engines do not have a reliable mechanism to distinguish original content from duplicates because of the limitations in date stamping. As a result, providers of original content often suffer due to illegal copying and reproduction.

10. best websites in each category. Hinduwebsite is one good example.
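
The size cap described in item 4 can be pictured with a short sketch. It assumes that "101KB" means 101 × 1024 bytes and that content beyond the cap is simply dropped before indexing; the function name and the sample page are made up for illustration, and real search engines may behave differently.

```python
# Assumption: 101KB is read as 101 * 1024 bytes; the figure itself comes from the text.
INDEX_LIMIT_BYTES = 101 * 1024

def indexable_text(raw: bytes, limit: int = INDEX_LIMIT_BYTES) -> str:
    """Return only the part of a page that falls inside the indexing cap."""
    # errors="ignore" drops a multi-byte character that the cut may have split.
    return raw[:limit].decode("utf-8", errors="ignore")

# A made-up page of about 117KB whose last word lies beyond the cap.
page = b"intro " * 20000 + b"conclusion"
words = set(indexable_text(page).split())

print("intro" in words, "conclusion" in words)
# prints True False
```

Under these assumptions, a keyword that appears only near the end of a long page would never reach the index, so a search for it would not find the page.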

Directory Services

A directory is a database of information about websites and their pages, organized alphabetically into categories. The work is usually done by humans rather than by machines and automated software, using a set of predefined criteria. Users navigate the directory through a series of menus organized in a predictable manner to find the information they want. Unlike search engines, which require state-of-the-art technology to gather and index information, the creation and maintenance of a directory requires a great deal of manpower to organize, evaluate and categorize information. Hence directories are slow to develop and usually smaller than the indexes created by commercial search engine companies. One of the best examples of a web directory is the one maintained by dmoz.org, which, being a public domain, non-commercial directory, is used by several search engines and by websites like Hinduwebsite.com. Among commercial directories, Yahoo's is perhaps the best known and the largest. Besides general directories, there are also specialized directories dealing with a specific subject or category, also known as metasites.

The Directory vs. Search Engine

Directories are very useful when you are researching a general topic, a popular category or a particular subject. For example, if you are looking for information on religion, you can go to the society and culture part of a directory to begin your search. If you are looking for information on a particular religion such as Hinduism or Buddhism, you can scroll down the religion category in the directory and locate links to it easily. Besides categories of information, directory services usually provide an internal search engine with which you can look for information within the directory using a keyword or a combination of keywords. Search engines are more useful when you are looking for in-depth, more recent or more specialized information on a subject, or for information that is beyond the scope of the categories in a directory. The standard practice is to begin your search with directories and then move on to search engines.
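
The two ways of using a directory described here, browsing down through categories and searching within the directory by keyword, can be sketched as follows. The category tree, the `browse` and `find` helpers and the `.example` site names are all hypothetical; real directories hold far richer records than a list of site names.

```python
# Hypothetical directory: nested categories, each ending in a list of hand-picked sites.
DIRECTORY = {
    "Society and Culture": {
        "Religion": {
            "Hinduism": ["hinduwebsite.com"],
            "Buddhism": ["buddhism.example"],
        },
    },
    "Computers": {
        "Internet": ["searchtools.example"],
    },
}

def browse(directory, path):
    """Follow category names down the tree, like a user clicking through menus."""
    node = directory
    for category in path:
        node = node[category]
    return node

def find(directory, keyword, trail=()):
    """Internal directory search: yield (category path, site) for matching sites."""
    for name, child in sorted(directory.items()):
        if isinstance(child, dict):
            # A sub-category: keep descending and remember the path taken.
            yield from find(child, keyword, trail + (name,))
        else:
            for site in child:
                if keyword.lower() in site.lower():
                    yield trail + (name,), site

print(browse(DIRECTORY, ["Society and Culture", "Religion", "Hinduism"]))
# prints ['hinduwebsite.com']
print(list(find(DIRECTORY, "hindu")))
# prints [(('Society and Culture', 'Religion', 'Hinduism'), 'hinduwebsite.com')]
```

Browsing works only if you know which category to start from, while the keyword search visits every category, which is why directories usually offer both.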

Meta Search Engines

Meta search engines do not use their own crawlers or databases to gather and index information. Instead, they use a complex set of routines to query the databases that various search engines make publicly available, and present the combined results to the public in an organized way. The advantage of meta search tools is that you can access several search engine databases and subject directories simultaneously, without doing individual searches, and see the results displayed in one place. The main disadvantage is that the results are not necessarily comprehensive. A meta search tool can only fetch results from as many search engines as time, technology and resources permit. Second, because of the limits each search engine places on retrieving information, you may not always get the best results or all the results. Besides, meta search tools retrieve information mainly through simple search routines, so they are not ideal for advanced search. Despite these limitations, if you want an overview or a comparative view of how each search engine responds to a particular keyword or set of keywords, meta search engines are the best place to start.
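
A meta search routine can be sketched in a few lines. The two "engines" below are hypothetical stand-ins that return fixed lists instead of querying a remote service, and the merging rule (summing the reciprocal of each result's position) is an assumption chosen for simplicity, not a description of how any particular meta search engine combines results.

```python
from collections import defaultdict

# Hypothetical stand-ins for remote search engines: each returns a ranked list of URLs.
def engine_a(query):
    return ["x.example", "y.example", "z.example"]

def engine_b(query):
    return ["y.example", "w.example"]

def meta_search(query, engines, per_engine_limit=10):
    """Query every engine, merge duplicate URLs and rank the combined list."""
    scores = defaultdict(float)
    for engine in engines:
        # Only the top few results are taken from each engine, which is one
        # reason meta search results are not necessarily comprehensive.
        results = engine(query)[:per_engine_limit]
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position  # higher positions contribute more
    # Highest combined score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

print(meta_search("hinduism", [engine_a, engine_b]))
# prints ['y.example', 'x.example', 'w.example', 'z.example']
```

In this sketch a URL returned by both engines rises to the top, and lowering `per_engine_limit` shows how results found only deep in one engine's list drop out of the merged view.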

Best Search Engine Tools
Google
All the Web
Ask Jeeves
Alta Vista
Gigablast
Lycos
Teoma
Yahoo
AOL Search
MSN Search
Netscape
Dipsie
Fybersearch
Mozdex
Whatuseek
Wisenut
ExactSeek
Lost Link/ Web Links
Link Centre
Scubtheweb
Jayde
AOL Search
HotBot
Search.com
Metacrawler
Dogpile
Mamma
C4
Canada.com
ixquick
Infogrid
WebInfoSearch
Query Server
800go
Debriefing
Highway 61
Link Master
Splat Search
37.com
OneSeek
MetaSpider
Vivisimo
PlanetSearch
surfwax
qbSearch
ProFusion
Proteus
Go2 Net
MegaGo.com
WebFile
myGO
Megacrawler
Search Climbers
IX Quick
Northern Light
Subjex
Zen Search
Kanoodle
NBCi/
Snap
Go
InfoSeek
7Search
Acclaim Search
AllCrawl
Amnesi
Ampleo
Deja.Com
Deoji
DevSearch
Frequent Finders
iBound
Info Hiway
Infomak
GoshDarn!
Jump City
Z Search
Meta Search Engines
clusty.com
HighBeam Research
Dogpile
Surfwax
Copernic
Metacrawler
IxQuick
Search.com
Fazzle
Infogrid
Vivisimo
Infonetware
Ithaki
KillerInfo
Mamma
Profusion
Kartoo
QueryServer
Turbo10
Weblens
Widow
philb.com
Zapmeta Searchy
Subject Directories
nthing.com
Bubl Link
Complete Planet
Infomine
Suite 101
Internet Public Library
Joe Ant
Librarian Index
Open Directory
Top Ten Links
Resource Discovery
Asiaco
Awesome Library
BBCI Directory
Galaxy.com
Gimpsy
GoGuides
Illumirate
Ranks.com
Specialized Search Tools
Academic, Education, Biblio, Bus. Intell., Other
Academic info
Best Info
Bubl Link
Infomine
Psigate
Leiden University
Music Schools
Shrock Guide
Education World
Education Index
E Journal
ERIC
Study Abroad
C&RL Newsnet
History
Library Catalogs
Competitive Intelligence
CEOExpress
libWeb
Big Book
Bizweb
Northern Lights
Recall
EEVL
ULB
Scirus
History
Health, Kids, Legal, Media/Music, News/Blogs
Biome
Chemdex
HealthAtoZ
HON
Health Finder
Medline Plus
Achoo
Awesome Library
Cybersleuth
ACLIN
AJKids
Kids Gov
Kidsclick
peachpod
Searchopolis
LawCrawler
Law Review
West Law
WLD
Ditto
FindSounds
Nasa Image
pollstar
Trashsurfing
picturesurfin
singingfish
Webseek
Moreover
Daypop
Public Records/People, Reference, Religion, Statistics, Politics
BRB Publications
Abika
Any Student
Bigfoot
DocuSearch
InfoSpace
IPL
Thorplus
Refdesk
Rutgers
Govdocs
Political Info


Suggestions for Further Reading