DESIRE Information Gateways Handbook
3.7. Scalability

In this chapter...
 
  • an overview of scalability issues
  • user interface and usability
  • administration and management
  • systems issues
Introduction
 

Scalability is an issue that needs to be considered when designing any system for long-term data storage. It is not sufficient to design your system to meet current requirements; you also need to take into account (or at least be aware of) how your collection of data is likely to grow in the coming years. A system that is perfectly adequate for storing, manipulating and providing access to a small number of records may be quite unable to cope if the amount of data increases by one or two orders of magnitude.

This chapter will look at the problems and issues specific to subject gateways that arise because of such increases in database size and will consider approaches to dealing with these problems.


Background
 

At present, subject gateways tend to consist of no more than a few thousand records because of the manual effort required to select and catalogue Internet resources. Even a 'large' subject gateway typically has only about six or seven thousand records. This is very small in comparison with traditional online bibliographic databases. Consequently, the problems associated with storing and retrieving large collections of bibliographic data, such as recall and precision in searches and search engine functionality, have not yet been significant.

It seems unlikely that individual subject gateways are capable of growing significantly in size, given current funding models. Only directories that have limited or no quality criteria, high levels of funding or possibly voluntary effort - such as Yahoo!, OCLC's NetFirst or the Open Directory Project - seem to be capable of producing manually-created databases with sizes of the order of hundreds of thousands of records.

The likely method of growth for subject gateways seems instead to be via collaborative effort. There are two approaches to building a collaborative subject gateway. The first is for a number of different organisations to contribute records to a central database. The problems with such an approach are likely to be concerned with the size of the database, maintaining reasonable performance on a single machine and providing network access to it. The second approach is for each organisation to maintain its own database, allowing the end-user to search across one or more of them depending on the nature of their query. In some cases a combination of the two approaches may be appropriate. These methods allow a real or virtual increase in size of the collection of resources presented to the end-user.

Cross reference
Interoperability, Co-operation between gateways

We have also begun to see the creation of harvesting software which enables the automated indexing of Internet resources whilst retaining a degree of quality because of the ability to choose the seeding URIs for the robot. The first phase of the DESIRE project developed some harvesting tools that can be used in conjunction with the ROADS and Zebra software. Such mechanisms have the potential to create databases at least one order of magnitude larger than those of current gateways. This increase in size of the database presented to the end-user and the ability to pass a single search to a number of different databases produce new problems that need to be addressed.
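To make the idea concrete, here is a minimal sketch (in Python) of the kind of seeded harvester described above: it starts from a hand-picked list of URLs, records page titles and follows links to a fixed depth. The seed list, depth limit and politeness delay are illustrative assumptions only; this is not part of the DESIRE, ROADS or Zebra software.

  import time
  import urllib.request
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin

  class LinkAndTitleParser(HTMLParser):
      # Collects outgoing links and the page title from one HTML document.
      def __init__(self):
          super().__init__()
          self.links, self.title, self._in_title = [], "", False

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              href = dict(attrs).get("href")
              if href:
                  self.links.append(href)
          elif tag == "title":
              self._in_title = True

      def handle_endtag(self, tag):
          if tag == "title":
              self._in_title = False

      def handle_data(self, data):
          if self._in_title:
              self.title += data

  def harvest(seed_urls, max_depth=1, delay=1.0):
      # Breadth-first harvest from the seed URLs; returns a url -> title index.
      index = {}
      frontier = deque((url, 0) for url in seed_urls)
      seen = set(seed_urls)
      while frontier:
          url, depth = frontier.popleft()
          try:
              with urllib.request.urlopen(url, timeout=10) as response:
                  html = response.read().decode("utf-8", errors="replace")
          except Exception:
              continue  # unreachable pages are simply skipped
          parser = LinkAndTitleParser()
          parser.feed(html)
          index[url] = parser.title.strip() or url
          if depth < max_depth:
              for link in parser.links:
                  absolute = urljoin(url, link)
                  if absolute.startswith("http") and absolute not in seen:
                      seen.add(absolute)
                      frontier.append((absolute, depth + 1))
          time.sleep(delay)  # be polite to the servers being harvested
      return index

  # Hypothetical seed; in the SOSIG case the seeds come from catalogue records.
  print(harvest(["http://www.example.org/"], max_depth=1))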

E X A M P L E

Case study - SOSIG Link Harvester Index

The SOSIG Link Harvester Index is an online database separate from the main SOSIG Internet Catalogue. Whereas the resources found in the SOSIG Internet Catalogue have been selected manually by subject experts, those in the SOSIG Link Harvester Index have been collected by software called a harvester (similar mechanisms may be referred to as robots or Web crawlers). The records in the Internet Catalogue provide the list of seeding URLs for the harvester.

Cross reference
Harvesting, indexing and automated metadata collection

Experiments are also taking place using useful 'lists of lists', not normally added to the catalogue, as seeding URLs.

Note: problems with large subject gateway databases are not limited to the user interface - the SOSIG Link Harvester Index has already had to be limited to 50,000 records because of indexing limitations in the ROADS software.


Scalability Issues
 

Overview

Part of the scalability problem is concerned with interface and usability issues. These include the presentation of large results sets to the user, the means by which the cross-search paradigm is presented and the ranking or filtering of any results produced. Another part of the problem is concerned with the management of such collections: for example, the need for automated mechanisms for link checking and perhaps for detecting changes to sites that require their descriptions to be updated. Finally there are issues relating to the computer systems used to run the subject gateway service, such as the need for databases that can handle much larger collections of data.

The rest of this chapter therefore consists of three sections; the first will look at user interface and usability issues, the second will consider administration and management issues and the third will consider the systems issues involved in maintaining large collections of records.

User interface and usability issues

With a relatively small database, the issue of precision in searching is not very important, since the user can scroll quickly through a results set to discover which are the most useful records. However, as the size of the database increases, so does the average number of records retrieved, and it then becomes much more difficult to select the most relevant and useful ones. This problem can be approached in two ways:

  • by increasing the precision of the search so that fewer irrelevant results are returned
  • by ranking and filtering the results set so that the most relevant results stand out in some manner.

Mechanisms for increasing precision of searches

Here are some ways in which the precision of searches can be increased:

  1. Allow searching by individual fields, such as title, as a way of increasing the usefulness of the search terms. Fields containing 'extra' information such as geographical area or type of resource will also be helpful for sorting relevant from irrelevant information (see the sketch after this list).
  2. Allow the use of keywords. Keywords may be added to records as a means of describing the main topics dealt with in the resource being catalogued. This generally increases the 'recall' of searches. However, if keywords are combined with fielded searching, so that the keyword field can be specified, the precision of the results can also be improved.
  3. Allow the use of controlled vocabularies. These serve mainly to improve the recall of keyword searches and are usually organised into hierarchical structures, making it easier for the user to find the most relevant and specific term. Keyword searching using controlled vocabularies may cause problems with cross-searching, however, and requires the cross-searched catalogues to use the same vocabularies or to have a cross-mapping scheme drawn up for them.
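The following short Python sketch illustrates how fielded and keyword searching narrow a result set. The record structure and field names ('title', 'keywords') are hypothetical and do not correspond to the ROADS template.

  records = [
      {"title": "Social Statistics Handbook", "keywords": ["statistics"]},
      {"title": "Teaching Statistics Online", "keywords": ["education", "statistics"]},
      {"title": "Demography Data Archive", "keywords": ["statistics", "demography"]},
  ]

  def search(records, term, field=None):
      # Broad free-text search across all fields, or a more precise fielded search.
      term = term.lower()
      def matches(record):
          if field is None:  # broad search: look in every field
              return term in " ".join(str(value) for value in record.values()).lower()
          value = record.get(field, "")
          if isinstance(value, list):  # e.g. the keyword field: exact term match
              return any(term == keyword.lower() for keyword in value)
          return term in str(value).lower()
      return [record for record in records if matches(record)]

  print(len(search(records, "statistics")))                    # 3: matches in any field
  print(len(search(records, "statistics", field="title")))     # 2: title search is narrower
  print(len(search(records, "demography", field="keywords")))  # 1: precise keyword hit

As the example shows, restricting the search to a field, and especially to a keyword field, reduces the number of irrelevant hits returned for the same search term.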

Cross reference
Subject indexing and classification

Displaying large results sets

Typically, large results sets cannot be displayed on a single Web page. This is because of the time taken to retrieve the data and because of scrolling problems for the end-user. The ROADS software limits the total number of records which can be returned by a search but, as the size of the database increases, the proportion of searches resulting in 'too many hits' will also increase. In addition to reducing the number of hits returned, by increasing the precision of searches, it may also be sensible to investigate mechanisms for improving the way in which records are displayed. These may include:

  1. Limiting the number of records displayed at a time (note that ROADS doesn't currently support this feature). Remember that end-users may still not look through many pages of results even when they are presented in small chunks.
  2. Ranking and/or filtering the results. It may be possible to use metadata both to rank and filter results, for example to display results only for resources that are of undergraduate level or above. Such a technique could also be combined with recommendations (quality ratings) from other people in the end-user's subject area. A detailed discussion of these techniques is beyond the scope of this chapter; however, some work in this area is currently under way in the DESIRE II project. A simple sketch of both techniques follows this list.
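The Python sketch below shows one way chunked display and metadata filtering might be combined. The 'level' field, the level hierarchy and the page size of ten are illustrative assumptions, not features of any particular gateway software.

  def filter_by_level(results, minimum_level,
                      order=("school", "undergraduate", "postgraduate", "research")):
      # Keep only results at or above a minimum audience level.
      rank = {name: position for position, name in enumerate(order)}
      threshold = rank[minimum_level]
      return [r for r in results if rank.get(r.get("level", ""), -1) >= threshold]

  def paginate(results, page, page_size=10):
      # Return one page of results plus the total page count.
      start = (page - 1) * page_size
      total_pages = max(1, -(-len(results) // page_size))  # ceiling division
      return results[start:start + page_size], total_pages

  hits = [{"title": f"Resource {n}", "level": "undergraduate"} for n in range(1, 96)]
  relevant = filter_by_level(hits, "undergraduate")
  page_one, pages = paginate(relevant, page=1)
  print(len(page_one), "of", len(relevant), "results on page 1 of", pages)  # 10 of 95 ... 10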

Cross reference
Quality selection: Quality ratings

Browsing larger collections (including cross-browsing)

Most subject gateways provide a browsing interface to their data in addition to a search interface. Many of the issues raised above apply equally to the browse interface. For example, as the number of records in the database grows, the lists of records presented in the browse interface are likely to become too long to be shown on a single Web page.

The browse interface is typically designed (at least in part) around the controlled vocabulary (classification scheme) for keywords described above. As the database increases in size, the number of records per section will also increase unless the granularity of the classification scheme is increased. Therefore, there are some design decisions that need to be taken concerning the depth and complexity of the classification scheme used.

Cross reference
Subject indexing and classification, User interface implementation

It is worth noting that a combination of browse and search interfaces may help the end-user. This may be achieved by embedding a restricted search interface into each sub-section of the browse interface, returning results that are only applicable to that sub-section.
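A restricted search of this kind can be implemented by filtering on the classification code attached to each record, as in the short Python sketch below. The classification codes and section prefix used here are purely illustrative.

  records = [
      {"title": "Social Research Methods", "classification": "300.72"},
      {"title": "European Economics Data", "classification": "330.94"},
      {"title": "Economics Research Papers", "classification": "330.05"},
  ]

  def section_search(records, section_prefix, term):
      # Search only within the records belonging to one browse section.
      term = term.lower()
      return [r for r in records
              if r["classification"].startswith(section_prefix)
              and term in r["title"].lower()]

  # A search box embedded in the section with prefix '330' only returns records
  # from that section, even though 'research' also occurs elsewhere.
  print(section_search(records, "330", "research"))  # [{'title': 'Economics Research Papers', ...}]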

Administration and Management Issues

As the number of records in a subject gateway database increases, the techniques used to manage it may need to change. Manual checking of records is likely to be feasible for a small database, but it quickly becomes impractical at 7,000 records and untenable at 50,000.

Some areas where automated checking of records may be possible are:

  1. Link checking. The ROADS software provides an automated link checker which will confirm the validity of the URLs in all the records in a subject gateway's database on a regular basis (see the sketch after this list).
  2. Resource updates. There is a danger that the descriptions of resources held in subject gateways will become out of date as the resources themselves are updated. It may be possible to develop robot-based tools that check for potentially 'significant' changes to the resources described in a subject gateway's database, automatically warning resource cataloguers of the records that are likely to need updating.
  3. Review-by dates. By embedding a 'review-by' date into every resource description you can be notified automatically that a record hasn't been checked recently. Note that ROADS supports this feature out of the box.
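The Python sketch below illustrates how two such automated checks might work: a link check using HTTP HEAD requests, and a review-by check against the date a record was last looked at. The record fields and the 90-day review interval are illustrative assumptions and do not reflect the ROADS implementation.

  import urllib.request
  from datetime import date, timedelta

  def link_is_live(url, timeout=10):
      # Return True if the URL answers an HTTP HEAD request without error.
      request = urllib.request.Request(url, method="HEAD")
      try:
          with urllib.request.urlopen(request, timeout=timeout):
              return True
      except Exception:
          return False

  def needs_review(record, today=None, interval_days=90):
      # Flag records whose last review is older than the review interval.
      today = today or date.today()
      return record["last_reviewed"] + timedelta(days=interval_days) < today

  records = [
      {"title": "Example Gateway Entry", "url": "http://www.example.org/",
       "last_reviewed": date(2000, 1, 10)},
  ]

  for record in records:
      if not link_is_live(record["url"]):
          print("Broken link:", record["url"])
      if needs_review(record, today=date(2000, 4, 20)):
          print("Due for review:", record["title"])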

Cross reference
Collection management

Systems Issues

It is clear that as a database grows the amount of disk space it requires will also grow. Memory and CPU power requirements will probably also increase. It is possible that database software that copes with 10,000 records may not cope efficiently with 100,000 records. For example, there is some evidence that the file system based database software supplied with ROADS by default does not cope well with databases larger than about 50,000 records. In theory, ROADS allows you to plug in alternative back-end databases. However, it is not clear how many services are actively using this feature.

There may also be performance problems associated with cross-searching large numbers of large databases. The searching system has to wait for results to come back from all the databases that it is searching. This may tie up network and other resources on that system. Research is currently being done within the DESIRE project into the areas of parallel searching and results interfaces which return results to the user as and when they become available. Findings in this area will be published on the DESIRE Web site.
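The following Python sketch illustrates the idea of parallel cross-searching, where results are presented as each database answers rather than after the slowest one has finished. The per-gateway search functions here are stand-ins that simulate network delay; they are not the DESIRE cross-search interface.

  import time
  from concurrent.futures import ThreadPoolExecutor, as_completed

  def make_gateway(name, delay, hits):
      # Build a fake gateway search function with a fixed response time.
      def search(query):
          time.sleep(delay)  # simulate network and database latency
          return name, [f"{name}: {hit}" for hit in hits if query in hit.lower()]
      return search

  gateways = [
      make_gateway("Gateway A", 0.2, ["Social statistics portal"]),
      make_gateway("Gateway B", 1.0, ["Statistics teaching pages"]),
      make_gateway("Gateway C", 0.5, ["Demography statistics archive"]),
  ]

  def cross_search(query):
      # Send one query to every gateway and print results as they arrive.
      with ThreadPoolExecutor(max_workers=len(gateways)) as pool:
          futures = [pool.submit(search, query) for search in gateways]
          for future in as_completed(futures):  # fastest gateway reports first
              name, hits = future.result()
              print(f"{name} returned {len(hits)} hit(s)")

  cross_search("statistics")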


Glossary
 

DESIRE - Project funded under the European Union's Telematics for Research Programme to enhance and facilitate Web usage among researchers in Europe (producer of this handbook)
OCLC - Online Computer Library Center, Inc.
ROADS - A set of software tools for setting up and maintaining Web-based subject gateways.
SOSIG - The Social Science Information Gateway

References
 

Combine, http://www.lub.lu.se/combine

DESIRE, http://www.desire.org/results/training/D8-2af.html

OCLC, http://www.oclc.org/

Open Directory Project, http://dmoz.org/

SOSIG Harvester, http://www.sosig.ac.uk/roads/cgi/search.pl?form=harvester

Yahoo!, http://www.yahoo.com/


Credits
 

Chapter authors: Phil Cross, Andy Powell


Last updated : 20 April 00
© 1999-2000 DESIRE