DESIRE Information Gateways Handbook
2.6. Collection management

In this chapter...
 
  • the importance of keeping collections up to date
  • methods for maintaining collections
  • what do those error codes really mean?
  • a link checking case study: SOSIG
  • creating a collection management policy
  • priorities for administrators
Introduction
 

This chapter will look at some of the day-to-day administrative tasks required for running and maintaining an information gateway and the staff effort required for these tasks.

Whilst setting up and configuring a database for a gateway is labour intensive, it is a one-off task. The longer-term and more time-consuming work lies in creating and maintaining the collection: notably, in keeping the records up to date and error free. An out-of-date collection of resource descriptions is of little use to anyone and may even be harmful to users. It is important that sufficient staff effort is allocated for regular housekeeping duties, the main ones being:

  • checking that resources are still available and links within records are still correct
  • making sure that descriptions of resources are up to date and still adequately reflect the content of the resources themselves

The Internet is a volatile and fast changing environment; resources and information that are available today may not be available tomorrow. It has been estimated that at any one time between 5 and 8% of the Web's content is unavailable (Pitkow, 1998). There may be a number of reasons for resources not being available, ranging from networks being out of action, servers being out of order, or information being updated, to the resource's being removed permanently from the network. Whatever the reason, resources that are not available should be removed from your collection (if only on a temporary basis while the problem is solved).

Similarly, Internet resources do not tend to be static; they grow and change on a regular basis. Unless resource descriptions are checked on a routine basis, you may find that the records bear no resemblance to the resource itself, which may have changed or expanded beyond recognition within a few months or weeks.


Maintaining collections
 

There are various tasks involved in making sure that an information gateway's collection maintains its integrity:

  • validating records (spell checking, etc.) to ensure that the record is accurate
  • link checking records to ensure that resources are still physically available
  • updating resource descriptions to ensure that the record still adequately reflects content of the resource or Web site

Validating records

A basic housekeeping duty is to ensure that catalogue records are as accurate as possible, not only in terms of the factual information they provide about a resource, but also in terms of the content of the record itself, e.g. making sure they do not contain spelling mistakes, that cataloguing guidelines are adhered to, etc. There are various internal procedures which can help gateways maintain accuracy within their records. These include:

  1. Spell checking records. This can be done manually; some gateways employ staff to check and edit records before they are added to a live database. A less time-consuming way would be to use an automatic spell checker; however, there can be problems with spell checkers understanding discipline-specific or technical terms.
  2. Cutting and pasting URLs and other pieces of factual information to avoid the possibility of typing errors.
  3. Authority files. The use of lists of controlled terms and vocabulary can help enormously to cut down spelling mistakes and ensure consistency within the records.
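
The authority file idea in item 3 can be illustrated with a few lines of Python. This is only a minimal sketch, not part of any gateway's software: the `keywords` field name, the dictionary layout of a record and the sample terms are all assumptions made for the example.

```python
def check_against_authority(record, authority_terms, field="keywords"):
    """Return the values in `field` that are not in the authority file.

    `record` is a plain dict and `field` holds a list of strings; both are
    hypothetical layouts chosen for this sketch.
    """
    # Compare case-insensitively so 'Sociology' and 'sociology' match.
    allowed = {term.strip().lower() for term in authority_terms}
    return [
        value
        for value in record.get(field, [])
        if value.strip().lower() not in allowed
    ]


# Hypothetical authority file and record (note the misspelt keyword).
authority = ["Sociology", "Social policy", "Economics"]
record = {
    "title": "An example resource",
    "keywords": ["sociology", "Socail policy", "Economics"],
}

rejected = check_against_authority(record, authority)
print(rejected)  # ['Socail policy']
```

A check of this kind could run before a record is added to the live database, so that a cataloguer sees the rejected terms and can correct them straight away.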

For further information on ensuring accuracy and consistency within the collection see the chapter on cataloguing.

Link checking

Much of the information available over the Web is intentionally ephemeral in nature, designed only to be useful in the short term (e.g. TV listings, news bulletins, price lists). The average life span of a Web document is estimated at around 50 days, with HTML files being modified or deleted more frequently than images or other media (Pitkow, 1998). Gateways generally try to ensure that the resources they catalogue will have a degree of longevity and often include URL stability as one of their selection criteria. However, the inconstant nature of the Web means that it is still necessary to check resources regularly and update the records of those that have moved, are temporarily unavailable, or have been permanently deleted from the Internet. It is important to have collected contact information about the administrators or maintainers of the sites on which the resources reside. When a resource is unavailable, sending an email message to the administrator is often the quickest way to find out what the problem really is and whether or not it is temporary or permanent.

Automatic link checking software is available to help gateways keep a check on the resources described within their catalogues. The programs generally work by checking each of the URLs (often by sending an HTTP 'HEAD' request, which asks the server for the headers of a page without its content) and compiling a report of any errors they find. The software can normally be scheduled to run at regular intervals (ideally at least once a week) and can be set to run at 'quiet' times, e.g. overnight, to reduce the load on the network. Once the error report has been generated, it usually then requires human effort to go through the report and decide which of the resources should be edited or removed from the catalogue. Working through an error report is much like detective work; you need to use patience, information finding skills and knowledge of the Internet to track down the problems and put them right.
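
To make the process concrete, the following Python sketch shows one way such a checker might be written using only the standard library. It is an illustration under stated assumptions rather than a description of any particular package: the `(record_id, url)` layout, the function names and the decision to treat every non-2xx response as a problem are all choices made for the example.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Send an HTTP HEAD request and return the status code.

    Returns None when no HTTP response was received at all
    (for example a DNS failure or a timeout).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; keep the code.
        return error.code
    except (urllib.error.URLError, OSError):
        return None


def check_links(records, fetch=head_status):
    """Return a list of (record_id, url, status) for every problem link.

    `records` is an iterable of (record_id, url) pairs; that layout is an
    assumption of this sketch. `fetch` can be swapped for a stub in tests.
    """
    problems = []
    for record_id, url in records:
        status = fetch(url)
        # Treat anything other than a 2xx response as something to review.
        if status is None or not 200 <= status < 300:
            problems.append((record_id, url, status))
    return problems


# Example with a stub in place of the network, so it runs offline.
fake_results = {"http://example.org/a": 200, "http://example.org/b": 404}
report = check_links(
    [("rec1", "http://example.org/a"), ("rec2", "http://example.org/b")],
    fetch=fake_results.get,
)
print(report)  # [('rec2', 'http://example.org/b', 404)]
```

Passing the fetch function in as a parameter keeps the reporting logic separate from the network access, which makes the checker easy to test without contacting any real sites. Some servers do not answer HEAD requests properly, so a production checker would probably need to fall back to a GET request in those cases.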

As well as commercial software packages there are a number of link checking programs available in the public domain (freely available) or as shareware packages (for a small fee).

For a listing of some link checking shareware programs available see:

What do those error codes really mean?

You will sometimes see error codes when you are attempting to connect to Web pages or looking at the output of link checking reports. These are HTTP status codes and whilst they appear to be frustratingly cryptic they can tell you a lot about the type of problem that you are encountering.

404 - Page Not Found

This is the most common error code that gateway administrators will come across. Web site maintainers often change the structure of their sites, as the information they provide grows or as the maintainers get new ideas about how to arrange and present the information. One of the most common reasons for a 404 error is simply that the resource has been moved to a different part of the site. To find the new location you can often work systematically up the directory structure of the URL, deleting the last part of the address back to the preceding slash (/) each time, until you find a page that links to the resource. Sometimes the resource may have moved to another Web site altogether (this often happens when the resource is located on a commercial site); it is worth doing a search on one of the big search engines (such as AltaVista) to try to locate its new address. In the worst case, the resource has been deleted permanently and the record should be removed from the collection. If you cannot locate the resource simply by looking around the site, an email message to the administrator will often solve the mystery.
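
The technique of moving up the directory structure can be sketched in Python as follows. The function name `parent_urls` and the example address are hypothetical; the sketch simply lists the addresses a person would try by hand, from the nearest directory up to the site's home page.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the URLs obtained by walking up the path one level at a time."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    candidates = []
    # Drop the last segment repeatedly until only the site root is left.
    while segments:
        segments.pop()
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        # Query strings and fragments are discarded for the parent pages.
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in parent_urls("http://www.example.org/dept/projects/report.html"):
    print(candidate)
# http://www.example.org/dept/projects/
# http://www.example.org/dept/
# http://www.example.org/
```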

Some of the other frequent error codes are:

| Error Code | Problem | Possible Reason and Action |
|---|---|---|
| 401 | Unauthorised Request Access | The resource may be protected by a username and password - contact the maintainers for more information. |
| 402 | Payment Required | The request requires a charge to be applied to the transaction. |
| 403 | Forbidden | Access to the directory is forbidden. The resource may no longer be available for public access or the Web site administrator may have changed the directory permissions by mistake! |
| 500 | Internal Error | These types of error messages are very frustrating, as it is often hard to pin down what the problem is. It may be a problem caused by attempted execution of a CGI script. The best course of action is to monitor it as a problem and email the maintainer of the site for more information about the nature of the problem and to find out whether it is temporary. |
| 501 | Not Implemented | The server does not support the method being requested. |
| 503 | Server Busy | The server is unable to process the request for the page because of the high number of other requests. These tend to be temporary errors; try again at another time. |
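
When processing a report, it can help to have the codes above translated into plain language automatically. The sketch below is one possible way to do this in Python: it takes the standard name for each code from the `http.HTTPStatus` enumeration in the standard library (these names sometimes differ from the informal labels used in this chapter, e.g. 503 is officially 'Service Unavailable') and pairs it with a suggested action. The wording of the actions and the `describe` function are assumptions made for the example.

```python
from http import HTTPStatus

# Suggested actions, paraphrased from the codes discussed in this chapter.
ACTIONS = {
    401: "May need a username and password; contact the maintainers.",
    402: "Payment is required to access the resource.",
    403: "Access is forbidden; check whether this is deliberate.",
    404: "Look for a new location, or remove the record.",
    500: "Monitor the URL and email the maintainer if it persists.",
    501: "The server does not support the request method used.",
    503: "Probably temporary; try again at another time.",
}


def describe(status):
    """Return a one-line description of an HTTP status code."""
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        # Raised for numbers that are not registered status codes.
        phrase = "Unknown status"
    action = ACTIONS.get(status, "Not covered here; investigate manually.")
    return f"{status} {phrase}: {action}"


print(describe(503))
# 503 Service Unavailable: Probably temporary; try again at another time.
```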

A link checking case study: SOSIG

SOSIG uses the link checking software that is supplied as part of the ROADS system. The program is scheduled to run automatically just after midnight on Sunday when the network traffic is generally low. The program runs through each of the URLs in the SOSIG database (over 7,000) and for each it sends a HEAD request for the page. If the request is successful the software moves on to the next URL; if it encounters a problem it writes the URL and the unique ID number for the record into a file. Once the link checker has processed all of the URLs, the problem resources are sorted and presented according to the error codes discussed in the section above. The error report is made available through the SOSIG online administration centre (see Figure 1); additionally a copy is emailed to the SOSIG staff responsible for processing the report.

Figure 1 SOSIG Link Checking Summary Report

SOSIG currently has one member of staff assigned to link checking, who spends approximately one day a week going through the report and updating or deleting records as appropriate. As the number of records in the collection grows, so does the number of problem resources, and it is likely that the amount of time required to maintain the collection will increase over time.

The errors reported are given an order of priority and the '404 Page Not Found' problems are dealt with first of all. These are probably the most straightforward of the errors; either the resource has moved and the record has to be edited to have the new address or it is no longer available and it needs to be deleted from the database. Either way, having error pages appear when users try to connect to resources is likely to reduce their confidence in the collection.

The next errors dealt with would be any errors to do with authorisation (error 401), payment (error 402) or permissions (error 403). These errors are not as common as the 404 errors and they tend to appear when a resource that had previously been publicly available is now restricted to use within an organisation or community and some form of payment or authorisation is required. These problems may become more common as the Web matures and commercial practices become more established. Occasionally the problem is simply that the Web site administrator has inadvertently changed the permissions on the directory and is unaware that there is a problem. SOSIG has found that the best way to deal with these problems is to get in touch with the maintainers of the resource by email and ask what the situation is; generally replies return within a day and the record can be dealt with appropriately.

The final errors that are dealt with are the 500 errors, generated by the server from which you are requesting the resource. They tend to be more unpredictable and it is usually quite difficult to pinpoint the problem; often URLs listed as giving 500 errors are working perfectly well when checked again. The reason for this may be that the server was undergoing maintenance or updating when the link checker requested the URL. SOSIG tends to monitor 500 errors over a few weeks and an email message will be sent to the maintainers of those resources that persistently record an error. The ROADS link checking software does have a feature which allows you to automatically delete URLs that are consistently unavailable, but this is not used as it is felt that the 500 errors are too unpredictable and staff prefer to make a judgement on each resource.
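
The order of working described in this case study, and the practice of watching 500 errors over several weeks, could be expressed in Python roughly as follows. This is a hedged sketch and not the ROADS software: the `(record_id, url, status)` tuple layout, the function names and the three-week threshold are all assumptions made for the example.

```python
def priority(status):
    """Lower numbers are handled first: 404, then 401-403, then 5xx."""
    if status == 404:
        return 0
    if status in (401, 402, 403):
        return 1
    if status is not None and 500 <= status < 600:
        return 2
    return 3  # anything else, including no response at all


def sort_report(problems):
    """Sort (record_id, url, status) tuples into working order."""
    return sorted(problems, key=lambda problem: priority(problem[2]))


def persistent_server_errors(weekly_reports, min_weeks=3):
    """Return record IDs with a 5xx error in each of the last `min_weeks` runs.

    `weekly_reports` is a list of reports, oldest first. The three-week
    default is an arbitrary choice for this sketch.
    """
    if min_weeks < 1:
        raise ValueError("min_weeks must be at least 1")
    recent = weekly_reports[-min_weeks:]
    if len(recent) < min_weeks:
        return set()
    failing_each_week = [
        {record_id for record_id, _url, status in report if priority(status) == 2}
        for report in recent
    ]
    return set.intersection(*failing_each_week)


week1 = [("r1", "http://example.org/x", 500), ("r2", "http://example.org/y", 404)]
week2 = [("r1", "http://example.org/x", 500)]
week3 = [("r1", "http://example.org/x", 503), ("r3", "http://example.org/z", 403)]

print(sort_report(week1))
print(persistent_server_errors([week1, week2, week3]))  # {'r1'}
```

The function only reports the persistent cases; it deliberately does not delete anything, in keeping with the preference for staff to make a judgement on each resource.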

For more details of the link checking software and the ROADS software in general see:

Updating resource descriptions

The dynamic nature of the Web is a problem when it comes to keeping manually catalogued records of resources up to date and relevant. Web documents, unlike their printed equivalents, are very easy to edit and modify; studies have shown that most Web pages are not static but expand and evolve over time. For a gateway's collection to maintain its integrity and usefulness, the records must also reflect the changes in the resources. This is a time-consuming job that requires ongoing staff effort to be assigned to the task.

There are a number of steps which gateways can take to help to identify and review resources that need their descriptions to be updated:

  1. Making full use of administrative metadata such as review-by dates. When records are created, a date can be added by which this record should be reviewed. A simple script can pull out all of the records that require reviewing at any particular time.
  2. Using automated processes to email resource maintainers to ask whether there have been any changes to the resource since the record was created.
  3. Using automated processes to delete time-dependent resources, e.g. conference announcements.
  4. Using Web page tracking tools (such as Mind-it http://mindit.netmind.com/) to monitor changes in resources (these generally report changes when the size of the file is altered).
  5. Taking the opportunity to update descriptions of records that are being edited as a result of running a link checker.
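
The 'simple script' mentioned in item 1 might look something like the following Python sketch. The record layout and the `review_by` field name are hypothetical; a real gateway would read these values from its own database.

```python
from datetime import date


def due_for_review(records, today=None):
    """Return records whose review-by date is on or before `today`."""
    today = today or date.today()
    return [
        record
        for record in records
        if record.get("review_by") is not None and record["review_by"] <= today
    ]


# Hypothetical records; r3 has no review-by date and is never selected.
records = [
    {"id": "r1", "review_by": date(2000, 3, 1)},
    {"id": "r2", "review_by": date(2000, 9, 1)},
    {"id": "r3"},
]

due = due_for_review(records, today=date(2000, 4, 20))
print([record["id"] for record in due])  # ['r1']
```

Records without a review-by date are skipped here; a gateway might instead prefer to list them separately so that a date can be assigned.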

Creating a collection management policy
 

The Web has often been described as a 'moving target'; it is constantly changing and expanding, and trying to catalogue its content is a difficult business. Gateways need to think about what they are trying to provide for their users: a catalogue of the entire Web or a focused collection of selected material? A previous chapter on quality selection criteria has dealt with the need for gateways to consider formalising a Scope Policy to help clarify the type of service they are offering. It will also be helpful to think about a policy for managing collections. A collection management policy will allow you to formalise not only the scope and selection criteria for a gateway but also deselection criteria, that is, the principles under which you may choose to edit or delete records from the collection. A collection management policy might include:

Guidelines for deselecting a resource:

  • if the resource is no longer available
  • if the currency or reliability of the resource has lessened
  • if another Internet site or resource offers more comprehensive coverage

Guidelines for editing a record:

  • if the information content of the resource has changed so that the resource description and keywords need to be updated
  • if any of the factual details of the resource have changed (e.g. new admin email, new short title)
  • to correct any errors made in the original record

Collection management policies may change over time to reflect the changing nature and content of the Web. As more resources become available it may be necessary to delete entries from the collection, replacing them with more suitable material.

For examples of gateway collection management policies see:


Priorities for administrators
 

When one is faced with limited time and resources, there will always be a conflict between building up the gateway collection and adequately maintaining the existing collection. In order to continue to offer useful services, gateway administrators need to ensure that they balance effort spent in creating new records with preserving the integrity of the current collection. It is advised that gateways make as much use as possible of automated tools to monitor and track changes in resources, so that any human effort is directed at the more intellectual tasks of revising and correcting records.


Glossary
 

ADAM: Art, Design, Architecture and Media gateway (UK)
authority file: cataloguing tool that offers the cataloguer a set list of options from which they must choose to fill a particular field; ensures consistency of entry within catalogue fields
ROADS: Resource Organisation And Discovery in Subject-based services. eLib-funded project developing software for use by Internet subject services.


References
 

Mind-it by NetMind, http://mindit.netmind.com/

ROADS, http://www.roads.lut.ac.uk/

SOSIG, http://www.sosig.ac.uk/

W. Koehler, 'Digital Libraries and World Wide Web Sites and Page Persistence', Information Research Volume 4 No. 4 (June 1999).

J. E. Pitkow, 'Summary of WWW Characterizations', in Proceedings of the Seventh International World Wide Web Conference, 14-18 April 1998, Brisbane, Australia (Elsevier Science B.V., 1998).

Credits
 

Chapter author: Debra Hiom

With contributions from: Phil Cross and Emma Place


Last updated : 20 April 00
© 1999-2000 DESIRE