A Complete Guide To Search Engines
How Search Engines Work
Search engines crawl the World Wide Web to gather information about websites and their content. This is usually done by programs known as robots, crawlers or bots, which fetch pages much as an ordinary browser does, but automatically and at far greater speed and scale.
The information gathered is then passed to indexers, which index the content according to a set of business rules, algorithms and other criteria and store it as indexed data in huge databases. Each search engine company develops its own rules and criteria for organizing the data it collects, based on the business model it has chosen. Once the data is organized, client mechanisms such as search forms can be used to query it by keyword and various other criteria.
A typical search engine usually has four components. Together they constitute what we understand as the search engine mechanism.
- Information gathering mechanism
- Indexing mechanism
- Ranking mechanism
- Retrieval mechanism
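The four components can be sketched in a few lines of Python. This is only an illustrative toy, not how any commercial engine works: the `PAGES` dictionary stands in for pages a crawler would fetch, the example.com URLs are made up, and the ranking rule (counting how often the query words occur) is an assumption chosen for simplicity.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for crawled pages).
PAGES = {
    "http://example.com/a": "search engines crawl the web",
    "http://example.com/b": "directories are organized by humans",
    "http://example.com/c": "meta search engines query other search engines",
}

def gather(pages):
    """Information gathering: yield (url, text) pairs, as a crawler would."""
    for url, text in pages.items():
        yield url, text

def build_index(documents):
    """Indexing: map each word to the URLs containing it, with occurrence counts."""
    index = defaultdict(dict)
    for url, text in documents:
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def rank(index, query):
    """Ranking (assumed rule): score each URL by how often the query words occur."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

def search(index, query, limit=10):
    """Retrieval: return the top-ranked URLs for a keyword query."""
    return [url for url, _ in rank(index, query)[:limit]]

INDEX = build_index(gather(PAGES))
```

Calling `search(INDEX, "search engines")` returns the pages that mention those words, ordered by the toy score. A real engine would replace each function with a far more elaborate system, but the division of labor is the same.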
Limitations of Search Engines
The search engine business is very cost-intensive because of the amount of work involved in gathering and indexing information and keeping it up to date. To accomplish this, search engine companies have to invest heavily in state-of-the-art technology and in technically qualified staff who maintain and manage the information and make it useful, convenient and meaningful for end users. The fast-expanding World Wide Web, with its complexity and inconsistency, poses many challenges to search engine companies in managing information and keeping their technologies scalable and effective. Government interference, internet threats, cyber crime, linguistic and regional variations, cross-cultural issues, the absence of uniform global internet policies, usability issues and people's unwillingness to pay for search all threaten the viability of the search engine business and make it difficult to sustain over the long term without recourse to search-based ads and paid listings. While these alternatives save the companies from financial problems, there is no guarantee that they do not undermine the quality of the information provided.
Despite the advances made in search engine technology, most search engines lack the means to keep pace with the vast amount of data constantly being added to the World Wide Web and with the new websites hosted every day. This results in some inefficiencies in the way search engines work, which are discussed below.
1. Search engines have built-in limitations in responding to users' queries because of limitations in their indexing mechanisms or the algorithms they use. They may also respond differently to each keyword, or combination of keywords, letters and symbols, depending on how they are programmed.
2. Because of the limitations in processing and indexing information, and the time and cost involved in removing irrelevant and useless information to keep the indexes clean and up to date, a substantial portion of the content available on the World Wide Web is either outdated or outside the reach of the search engines and the public who use them. The so-called invisible web is considered to be two to three times larger than the visible web.
3. Search engines distribute information across several servers to manage load, and not all of those servers are updated or available at the same time. So the results of a search query may vary depending on which server received your query.
4. Most search engines limit the number of pages they crawl on a website. Even for the pages they do crawl, they index only a certain portion of the content and links on a page. Google, for example, indexes the first 101KB of a Web page and 120KB of PDFs.
5. Since most websites do not keep reliable date stamps or records of when they add or modify their content, date searching in search engines is unreliable.
6. Indexing is usually a long-drawn-out process and may take days or weeks before the information is processed and made available to the public. So the information is not always the latest.
7. Spamming, keyword manipulation and search engine optimization techniques dilute and slow down the efforts of search engines to maintain quality.
8. Paid submission policies used by Yahoo and other companies, along with paid listings, compromise the quality of results and the ranking of websites on merit.
9. The rules developed by search engines to deal with duplicate content on the web often work against the original providers of the information. Because of the limitations in date stamping, search engines do not have a reliable mechanism to distinguish original content from duplicates. As a result, providers of original content often suffer from illegal copying and reproduction.
10. best websites in each category. Hinduwebsite is one good example.
A directory is a database of information about websites and their pages, organized alphabetically into categories. The organizing is usually done by humans, rather than by machines and automated software, using a set of predefined criteria. Users navigate the directory through a series of menus arranged in a predictable manner to find the information they want. Unlike search engines, which require state-of-the-art technology to gather and index information, the creation and maintenance of a directory requires a great deal of manpower to organize, evaluate and categorize information. Hence directories are slow to develop and usually smaller than the indexes created by commercial search engine companies. One of the best examples of a web directory is the one maintained by dmoz.org, a public domain, non-commercial directory that is used by several search engines and by websites like Hinduwebsite.com. Among commercial directories, Yahoo's directory is perhaps the best known and the largest. Besides general directories, there are also specialized directories dealing with a specific subject or category, also known as metasites.
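The menu-style navigation of a directory can be pictured as a tree of categories. The sketch below is a minimal illustration, assuming a nested-dictionary layout; the category names, the example.org URLs and the `browse` and `menu` helpers are all hypothetical and are not taken from dmoz.org or Yahoo.

```python
# Hypothetical directory: nested categories, each holding hand-picked site entries.
DIRECTORY = {
    "Society and Culture": {
        "sites": [],
        "subcategories": {
            "Religion": {
                "sites": [],
                "subcategories": {
                    "Buddhism": {
                        "sites": ["http://example.org/buddhism"],
                        "subcategories": {},
                    },
                    "Hinduism": {
                        "sites": ["http://example.org/hinduism"],
                        "subcategories": {},
                    },
                },
            },
        },
    },
}

def browse(directory, path):
    """Follow a list of category names from the top menu down; return that category."""
    node = {"sites": [], "subcategories": directory}
    for name in path:
        # Raises KeyError if the category does not exist at this level.
        node = node["subcategories"][name]
    return node

def menu(directory, path):
    """Return the alphabetically sorted subcategory names shown at a given level."""
    return sorted(browse(directory, path)["subcategories"])
```

For example, `menu(DIRECTORY, ["Society and Culture", "Religion"])` lists the subcategories a user would see at that level, and `browse` with the full path returns the sites filed under one of them. Every entry has to be placed by hand, which is why directories grow slowly.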
The Directory vs. Search Engine
Directories are very useful when you are researching a general topic, a popular category or a particular subject. For example, if you are looking for information on religion, you can go to the society and culture part of a directory to begin your search. If you are looking for information on a particular religion such as Hinduism or Buddhism, you can scroll down the religion category in the directory and easily locate links to it. Besides categories of information, directory services usually provide an internal search engine with which you can look for information within the directory using a keyword or a combination of keywords. Search engines are more useful when you are looking for in-depth, more recent or more specialized information on a subject, or for information that is beyond the scope of the categories in a directory. The standard practice is to begin your search with directories and then move on to search engines.
Meta Search Engines
Meta search engines do not use their own crawlers or databases to gather and index information. Instead, they use a complex set of routines to query the databases made publicly available by various search engines, gather the results and present them to the public in an organized way. The advantage of meta search tools is that you can access several search engine databases and subject directories simultaneously, without doing individual searches, and see the results displayed in one place. The main disadvantage is that the results are not necessarily comprehensive. First, a meta search tool can only fetch results from as many search engines as time, technology and resources permit. Secondly, because of the limits each search engine places on retrieving information, you may not always get the best results or all the results. Besides, meta search tools retrieve information mainly through simple search routines, so they are not ideal for advanced searches. Despite these limitations, if you want an overview or a comparative view of how each search engine responds to a particular keyword or set of keywords, meta search engines are the best place to start.
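The idea of querying several engines and showing the results in one place can be sketched as follows. This is a toy under stated assumptions: `engine_a` and `engine_b` are hypothetical stand-ins that return fixed lists of made-up URLs, and the merging rule (interleaving the ranked lists and dropping duplicates) is one simple choice, not the method of any particular meta search tool.

```python
from itertools import zip_longest

# Hypothetical stand-ins for real search engines: each returns a ranked list of URLs.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2", "http://example.com/3"]

def engine_b(query):
    return ["http://example.com/2", "http://example.com/4"]

def meta_search(query, engines, per_engine_limit=10):
    """Query every engine, then interleave the ranked lists and drop duplicate URLs."""
    result_lists = []
    for engine in engines:
        try:
            # Each engine caps what it hands back, so take at most per_engine_limit results.
            result_lists.append(engine(query)[:per_engine_limit])
        except Exception:
            # An engine that fails is skipped, so the merged results may be incomplete.
            continue
    merged, seen = [], set()
    # Walk the lists rank by rank: all first results, then all second results, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

The `per_engine_limit` parameter and the skipped failures mirror the limitations described above: the merged list can only be as complete as the engines and the retrieval limits allow.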