SEARCH ENGINES
IN-DEPTH
Jadranka Stojanovski
Rudjer Boskovic Institute
Introduction
"Which of the following has helped you find your way to the Web sites you use?" was the question put to an unspecified number of Internet users. According to this study by Forrester Research, search engines remain the leading way users in the United Kingdom find web sites. The "UK Internet User Monitor" survey found that 81 percent of users said search engines helped them find the web sites they use, up from 67 percent in 1999. The next most popular source was following links, a method used by 59 percent of those surveyed (Table 1).
Table 1. UK Internet User Monitor, May 2000
Search engines seem to be everywhere. The majority of the web public use search engines to find information at least weekly, if not daily. The availability of free search engines that index words from millions of web pages has been one of the driving forces of the web. They are changing and growing rapidly, and no one knows for sure in which direction they are going.
This article will first provide an overview of the web space and search engine features in general. This will be followed by a detailed survey of the seven main search engines.
Web space
It is possible to search only a part of the web space, the "visible" web, which consists mainly of static web pages. Static web pages are manually produced, offer generic information, and most of them are indexable. Dynamic web pages, on the other hand, are computer generated, offer customised information and are not indexable.
The “invisible” web contains pages with authorisation requirements, pages excluded from indexing by the robot exclusion meta-tag, badly designed pages with frames, non-HTML pages and dynamically generated web pages. The “visible” web contains static, “publicly indexable” web pages (Lawrence and Giles, 1999).
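How an indexing robot might honour the robot exclusion meta-tag can be sketched as follows. This is only a minimal illustration in Python, not the code of any particular search engine; the names RobotsMetaParser and is_indexable, and the two sample pages, are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so normalise them.
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_indexable(html):
    """Return False when the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & parser.directives)


hidden_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
public_page = "<html><head><title>Library catalogue</title></head></html>"
print(is_indexable(hidden_page))  # False
print(is_indexable(public_page))  # True
```

A page that fails this check stays in the “invisible” web even though anyone with the address can read it.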
The approximate size of the “visible” web is growing very fast:
December 1997 | 320 M pages
February 1999 | 800 M pages
February 2000 | >1.2 G pages
July 10, 2000 | 2.1 G pages (growth of 7 M pages per day)
Search engines
When using a search engine, the user is searching a database of indexed web sites. All search engines have three primary components:
Relevancy ranking, and the way it is calculated, is treated as "top secret" by most search engines. Most search engines use the location and frequency of keywords on a web page as the basis for ranking it in response to a query. The exact mechanism is slightly different for each engine. In addition to location and frequency, some search engines base their relevancy ranking algorithm on popularity, measured by the number of links or by the number of "clicks" on the user side. Some search engines that support the meta description and keywords tags will also give pages an extra boost if search terms appear in these areas. All approaches have problems with cyberspamming. Sites that attempt “simple spam” (such as “stacking” or “stuffing” words on a page) are penalised by all major search engines.
According to Search Engine Watch (http://www.searchenginewatch.com), the biggest search engine is Google (Figure 2).
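Ranking by keyword location and frequency can be illustrated with a toy scoring function. Since the real algorithms are secret, everything here is an assumption made for illustration: the function name score_page, the weights, the 50-word "early" window and the two sample pages are all hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def score_page(query, title, body, title_weight=3.0, early_weight=2.0, early_window=50):
    """Toy relevancy score built from keyword location and frequency."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    score = 0.0
    for term in set(tokenize(query)):
        # Location: hits in the title count most, hits near the top of the page next.
        score += title_weight * title_tokens.count(term)
        score += early_weight * body_tokens[:early_window].count(term)
        # Frequency: share of the body text made up of the term.
        if body_tokens:
            score += body_tokens.count(term) / len(body_tokens)
    return score


pages = {
    "a.html": ("Search engines compared", "Search engines index web pages. Engines differ in size."),
    "b.html": ("Cooking at home", "A recipe site that mentions search once."),
}
ranked = sorted(pages, key=lambda url: score_page("search engines", *pages[url]), reverse=True)
print(ranked)  # ['a.html', 'b.html']
```

A scheme this naive also shows why word "stuffing" works against it: repeating a keyword raises both the location and the frequency terms, which is why engines add penalties on top.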
Figure 2. Sizes are as reported by each search engine and as of June 6, 2000.
GG=Google, WT=WebTop.com, AV=AltaVista, FAST=FAST, NL=Northern Light, EX=Excite, INK=Inktomi, Go=Go (Infoseek)
A simple test carried out by the author of this paper on September 25, 2000 (the query was the Croatian word "korisnici", “users” in English) shows different data (Table 2):
SEARCH ENGINE | SIMPLE SEARCH | ADVANCED SEARCH
Lycos | 12,988 | 12,837
Fast | 12,982 | 12,989
Google | 8,060 | -
Northern Light | 7,172 | 7,172
Alta Vista | 7,086 | 7,488
HotBot | 5,100 | 5,100
Excite | about 145 | about 145
Snap | 66 | 66
Go (Infoseek) | 1,079 | -
MSN Search | 1,921 | -
Web Crawler | 22 | -
Table 2. Hits by simple query
Lycos and Fast are obviously using the same (Fast) database, which is the biggest one at the moment.
The general search engine features are:
Alta Vista
A very good, comprehensive, fast and powerful search engine. AltaVista provides many search construction options for sophisticated searchers, and has therefore long been favoured by information professionals. Besides traditional Boolean search options, AltaVista has many field limits, and also has an interesting forced phrase searching feature.
AltaVista translates its results into several different languages, which can come in handy when you run across a page in a language you don't understand. The translation is not precise, but it can be good enough to give an idea of what is on the page. AltaVista also has a special image and media finder. The image finder returns a result list complete with image thumbnails.
PROS | CONS
powerful search features | Inconsistent results
size | Only 10 hits per page
translation service (only 5k) | No sorting options
image search |
high quality index |
international approach |
intuitive interface |
Search features and results:
Search inconsistencies:
Excite
Excite has a popular, medium-sized database, also used by Webcrawler. Excite's features, such as its personalised page with news and portfolio tracking, are tops. Unfortunately, its search engine relevancy could be improved.
Excite provides a few more options than Google for creating a detailed search, but the beauty of Excite is its concept searching. If you type a word in the search box, you search not only for that word itself, but also for forms of the word, synonyms of the word, and other words related to it. Users doing more complex searches should be careful: using a Boolean operator in the search string turns the search into an exact keyword search, eliminating any concept searching for word variations, synonyms, etc.
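The behaviour described above can be mimicked with a small query-expansion sketch. It does not reproduce Excite's actual method: the RELATED_TERMS table is a hand-made, hypothetical stand-in for whatever data the engine derives, and the function expand_query is likewise only illustrative. The Boolean operators themselves are simply dropped here, because the point is only to show that their presence switches expansion off.

```python
# Hypothetical table of word forms and related words.
RELATED_TERMS = {
    "car": {"cars", "automobile", "vehicle"},
    "cheap": {"inexpensive", "budget"},
}
# Operators are recognised only in capitals.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def expand_query(query):
    """Return, for each query word, the set of terms accepted as a match."""
    words = query.split()
    if any(word in BOOLEAN_OPERATORS for word in words):
        # A Boolean operator turns the query into an exact keyword search.
        return [{word.lower()} for word in words if word not in BOOLEAN_OPERATORS]
    return [{word.lower()} | RELATED_TERMS.get(word.lower(), set()) for word in words]


for group in expand_query("cheap car"):
    print(sorted(group))
for group in expand_query("cheap AND car"):
    print(sorted(group))
```

The first query matches pages about budget automobiles; the second matches only pages containing the literal words "cheap" and "car".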
PROS | CONS
offers sophisticated personalization My Excite | Middle size database
“more like this” feature | No truncation controlled by user
choosing “site” | No international approach
very relevant results for very popular queries | Boolean must be typed in all CAPS
News Search for web newspapers access | Towards user-consumer
numerous reference databases (dictionaries, almanac, encyclopaedia) |
Search features and results:
Fast
True to its name, Fast is a very fast search engine, especially considering that it has one of the largest databases. Results seem to be listed in order of how many keywords match. It seems to index all the keywords in a document. The main problem with this search engine is redundancy: if a site has multiple pages on the same topic, it returns them all, requiring the searcher to sift through many redundant listings.
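The sifting that the searcher has to do by hand amounts to keeping one listing per site. A minimal sketch, assuming the result list is already in ranked order; the function name collapse_by_site and the example.org and example.net URLs are hypothetical:

```python
from urllib.parse import urlsplit


def collapse_by_site(result_urls):
    """Keep only the first (highest-ranked) hit from each host, preserving order."""
    seen_hosts = set()
    collapsed = []
    for url in result_urls:
        host = urlsplit(url).netloc.lower()
        if host not in seen_hosts:
            seen_hosts.add(host)
            collapsed.append(url)
    return collapsed


results = [
    "http://www.example.org/engines/intro.html",
    "http://www.example.org/engines/part2.html",
    "http://www.example.net/search.html",
    "http://WWW.EXAMPLE.ORG/engines/part3.html",
]
print(collapse_by_site(results))
```

Here the four hits shrink to two, one for each host.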
PROS | CONS
size | Not so fresh index
speed | Lack of command Boolean searching, truncation, and many field searches
has no stopwords |
100 hits per page in advanced search |
relevance ranking algorithm may be significantly more effective than others |
Search features and results:
Google
Google is a popularity engine like Direct Hit. Despite its name, which sounds like a baby babbling, Google claims to use a complicated mathematical analysis based on hyperlinks on the web to return high-quality search results, so you don't have to sift through junk. Google gives an excerpt of the text that matches the query, with the search terms in bold. Google has presently indexed more than 1 billion web pages, more than any other search engine. It uses the Open Directory for directory listings.
This is the simplest search engine to use, but it provides very few options for searchers to construct a detailed search. In fact, just about the only options are a minus sign to exclude terms and quotation marks to force terms to be searched as a phrase.
Google's results list provides a similar pages feature that automatically constructs a new search for pages similar to a specified page, a good option if the searcher finds one good page and wants more like it.
PROS | CONS
simple interface | Limited searching capabilities
fast!!! | Link search must be exact
often has excellent results |
stopwords search could be forced |
can go to the web, or a "cached" copy, which Google stored when it retrieved the page!!! |
Option of 10, 30, or 100 records per page of results |
Search features and results:
Search inconsistencies
HotBot
HotBot uses both the Open Directory Project and Inktomi's databases. Like most search engines, it suffers when it comes to finding relevant listings. While the Open Directory Project and Inktomi represent two of the largest databases, their size doesn't mean the algorithms used to sort the data can guarantee accurate results. Still, HotBot is one of the better individual search engines and includes powerful advanced search options.
The best way to use HotBot is to go directly to the Advanced Search page, which provides many more search options. HotBot's form interface is a favorite among many users. It allows the searcher to build a sophisticated search without having to remember operators and limits.
HotBot results employ Direct Hit technology, which provides a list of the top ten most popular links for any given search. The theory behind Direct Hit is that the sites people visit most for a given search are also likely to be the most relevant sites for that search.
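The idea of ranking by popularity can be sketched as a per-query click counter. This is an illustration of the theory only, not Direct Hit's actual implementation; the names click_log, record_click and most_popular, and the sample URLs, are hypothetical.

```python
from collections import Counter, defaultdict

# For every query, count how often each result was clicked.
click_log = defaultdict(Counter)


def record_click(query, url):
    """Register that a user who searched for `query` went on to visit `url`."""
    click_log[query.lower()][url] += 1


def most_popular(query, limit=10):
    """Return up to `limit` URLs for the query, most-clicked first."""
    return [url for url, _ in click_log[query.lower()].most_common(limit)]


for url in ["a.html", "c.html", "a.html", "b.html", "c.html", "a.html"]:
    record_click("jazz", url)

print(most_popular("Jazz", limit=2))  # ['a.html', 'c.html']
```

A query nobody has clicked on yet returns an empty list, which is one reason such a list can only supplement ordinary keyword results.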
PROS | CONS
double system: Inktomi and Direct Hit database and Open Directory | Database size shrunk
Interface is user-friendly | Has stop-words
Many field searching options | Consumer oriented
Text Mode - Much faster to use at www.hotbot.com/text |
New index every two weeks? |
Search features and results:
Search inconsistencies
Go.com
Go.com uses the search capabilities of the former Infoseek, and still employs many of the same features. It provides quality results thanks to its ESP search algorithm. It also has a large human-compiled directory.
Go.com supports Boolean searching and allows searching with many limits. Additionally, the advanced search provides a variety of specialised search engines for finding specific information. Go.com provides the opportunity to search in levels, by performing one broad search first, then narrowing it down by searching again within the given results, rather than searching the Web all over again. Its translation option allows the translation of any page.
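Searching in levels simply means that the second query runs over the first result set instead of the whole index. A minimal sketch, in which the function search and the three sample pages are hypothetical and plain substring matching stands in for a real index lookup:

```python
def search(pages, term):
    """Return the subset of pages whose text contains the term (case-insensitive)."""
    term = term.lower()
    return {url: text for url, text in pages.items() if term in text.lower()}


web = {
    "one.html": "Library services in Croatia",
    "two.html": "Library software for schools",
    "three.html": "Tourism in Croatia",
}

broad = search(web, "library")      # first level: the whole collection
narrow = search(broad, "croatia")   # second level: only the previous results
print(sorted(broad))   # ['one.html', 'two.html']
print(sorted(narrow))  # ['one.html']
```

Because the function returns the same kind of collection it accepts, any number of narrowing steps can be chained.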
PROS | CONS
sorts by site and date | Less powerful search features
rich supplemental resources | Small database
rich portal content | Consumer oriented
additional reference databases |
translation service - eng, fra, ger, ita, spa, por - larger documents than Alta Vista |
refine options |
good for diacriticals |
Search features
Lycos
Bringing together data from FAST, Direct Search and the Open Directory Project, Lycos also supports Boolean searching, and has by far the most extensive options for proximity searching of any search engine on the Web. Lycos, like Go.com, provides the option of searching by levels, where you can search within a previous set of results. It will also offer suggested searches following your initial search. Lycos' advanced search provides a variety of specialised search engines for locating specific information. An automated tracking feature at Lycos allows users to register and have their searches updated automatically. Lycos' results page offers a popular links region, where the most popular links for certain searches are distinguished from regular results.
PROS | CONS
long tradition | Small database in regular Lycos
conglomeration of databases, online services and Internet properties | Slow to refresh the database
popular Top 5% sites |
large database in Lycos Pro (Fast) |
advanced features in Lycos Pro |
extensive portal content |
Search features and results:
Northern Light
PROS | CONS
parallel with their large database of web pages NL offers Special Collection | Not very relevant results
large database | Only 10 hits at a time
rich search features (proximity, Boolean, truncation, Power Search, Business Search) |
different Help levels - general, search, power search |
Current News (the most recent - two weeks) |
Search features and results:
Search inconsistencies
Tips for formulating searches
Future development
With thousands of millions of web pages nowadays, it is getting harder and harder for search engines to keep up with the demand for accuracy. What are the search engines likely to do about it in the near future? Search engines are developing towards adapting to the user's needs. In the future, a sophisticated robot will search for concepts, not words. A spider will be "trained" to pick out only high-quality web pages, making its selection more precise. Indexing tools will probably have to perform more sophisticated analysis of the web pages they are indexing. They will measure how many links are in the page, internal as well as external, how much text is included, and how many graphic images are animated. Implementation of standards will be a crucial demand, and the movement from HTML to XML is already taking place. Librarians and their experience could be of great importance in the future classification and cataloguing of web pages, according to adopted metadata standards.
Conclusion
As already mentioned, search engines are inconsistent, inaccurate, unreliable, sometimes inaccessible, incomplete, out of date, error prone, don’t live up to their advertisements, and provide millions of irrelevant items. If we take a look at the Lycos top ten most popular searches, which are as follows:
we can see that all are about entertainment, games, and sex. Who wants to speak about relevancy ranking any more?
References
Beyond the hype: Dissecting AltaVista's claims. (1999, November 1). The Search Engine Report. http://www.searchenginewatch.com/sereport/99/11-avclaims.html
Botluk, D. (2000, September). Update to Search Engines compared. http://www.llrx.com/features/engine3.htm
Habib, D. P. & Balliot, R. L. (1999, September 19). How to search the World Wide Web: A tutorial for beginners and non-experts. http://204.17.98.73/midlib/tutor.htm
Hasebrook, J. P. (1999, April-June). Searching the Web without losing your mind: Traveling the knowledge space. WebNet Journal, 1(1).
Notess, G. (1999a). Search engine showdown. http://www.notess.com/search/stats/
Notess, G. (1999b). Multiple search engines, search engine showdown. http://www.notess.com/search/multi/
Repman, J. & Carlson, R. D. (1999, May). Surviving the storm: Using metasearch engines effectively. Computers in Libraries, 19(5).
Search engine shootout. http://home.cnet.com/category/topic/
TargetedListings (1999). Understanding how search engines rank Web pages. http://www.targetedlistings.com/tips6.html
WebSideStory. http://www.websidestory.com/content.cfm?Pg=3&PR=25
Wiens, B. (2000, September 15). Websearching. http://www.benwiens.com/websearch.html
Who's the biggest of them all? (1999, November 1). The Search Engine Report.