There are various tasks involved in making sure that an information gateway's collection maintains its integrity:
- validating records (spell checking, etc.) to ensure that the record is accurate
- link checking records to ensure that resources are still physically available
- updating resource descriptions to ensure that the record still adequately reflects content of the resource or Web site
Validating records
A basic housekeeping duty is to ensure that catalogue records are as accurate as possible, not only in terms of the factual information they provide about a resource, but also in terms of the content of the record itself, e.g. making sure they do not contain spelling mistakes, that cataloguing guidelines are adhered to, etc. There are various internal procedures which can help gateways maintain accuracy within their records. These include:
- Spell checking records. This can be done manually; some gateways employ staff to check and edit records before they are added to a live database. A less time-consuming way would be to use an automatic spell checker; however, there can be problems with spell checkers understanding discipline-specific or technical terms.
- Cutting and pasting URLs and other pieces of factual information to avoid the possibility of typing errors.
- Authority files. The use of lists of controlled terms and vocabulary can help enormously to cut down spelling mistakes and ensure consistency within the records.
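As an illustration of the authority-file idea, the following minimal Python sketch checks a record's keywords against a list of controlled terms. The function names, the one-term-per-line file layout and the sample terms are assumptions made for the example; they are not taken from any particular gateway's software.

```python
def load_authority_terms(lines):
    """Build a lookup of normalised term -> preferred form from authority-file lines."""
    terms = {}
    for line in lines:
        term = line.strip()
        # Skip blank lines and comment lines (the "#" convention is an assumption).
        if term and not term.startswith("#"):
            terms[term.casefold()] = term
    return terms


def check_keywords(keywords, authority):
    """Return (accepted, rejected): accepted terms use the preferred form."""
    accepted, rejected = [], []
    for keyword in keywords:
        preferred = authority.get(keyword.strip().casefold())
        if preferred is None:
            rejected.append(keyword)   # not a controlled term: needs human review
        else:
            accepted.append(preferred)
    return accepted, rejected


# Hypothetical sample data.
authority = load_authority_terms(["Sociology", "Social policy", "Economics"])
accepted, rejected = check_keywords(["sociology", "Econmics", "Social Policy"], authority)
print(accepted)  # ['Sociology', 'Social policy']
print(rejected)  # ['Econmics']
```

Matching is case-insensitive so that variant capitalisation is normalised to the preferred form, while anything outside the list is flagged for a cataloguer rather than silently corrected.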
For further information on ensuring accuracy and consistency within the collection see the chapter on cataloguing.
Link checking
Much of the information available over the Web is intentionally ephemeral in nature, designed only to be useful in the short term (e.g. TV listings, news bulletins, price lists). The average life span of a Web document is estimated at around 50 days, with HTML files being modified or deleted more frequently than images or other media (Pitkow, 1998). Gateways generally try to ensure that the resources they catalogue will have a degree of longevity and often include URL stability as one of their selection criteria. However, the inconstant nature of the Web means that it is still necessary to check resources regularly and update the records of those that have moved, are temporarily unavailable, or have been permanently deleted from the Internet. It is important to have collected contact information about the administrators or maintainers of the sites on which the resources reside. When a resource is unavailable, sending an email message to the administrator is often the quickest way to find out what the problem really is and whether or not it is temporary or permanent.
Automatic link checking software is available to help gateways keep a check on the resources described within their catalogues. The programs generally work by checking each of the URLs (often by sending an HTTP HEAD request, which asks the server for the headers of a page rather than its full content) and compiling a report of any errors they find. The software can normally be scheduled to run at regular intervals (ideally at least once a week) and can be set to run at 'quiet' times, e.g. overnight, to reduce the load on the network. Once the error report has been generated, it usually then requires human effort to go through the report and decide which of the resources should be edited or removed from the catalogue. Working through an error report is much like detective work; you need to use patience, information finding skills and knowledge of the Internet to track down the problems and put them right.
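The checking loop described above might be sketched in Python as follows. This is an illustrative outline using only the standard library, not the code of any existing link checker; the function names, the `(record_id, url)` input format and the shape of the report entries are all assumptions.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Send an HTTP HEAD request; return the status code, or None if no response."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered with an error status such as 404
    except (urllib.error.URLError, OSError, ValueError):
        return None       # malformed URL, DNS failure, refused connection, timeout


def check_links(records, fetch=head_status):
    """Check every (record_id, url) pair and return a list of problem entries."""
    problems = []
    for record_id, url in records:
        status = fetch(url)
        if status is None or status >= 400:
            problems.append({"id": record_id, "url": url, "status": status})
    return problems


# Demonstration with a stand-in fetcher so the sketch runs without network access.
fake_statuses = {"http://example.org/ok": 200, "http://example.org/gone": 404}
report = check_links(
    [("rec-1", "http://example.org/ok"), ("rec-2", "http://example.org/gone")],
    fetch=fake_statuses.get,
)
print(report)  # [{'id': 'rec-2', 'url': 'http://example.org/gone', 'status': 404}]
```

Passing the fetch function in as a parameter keeps the reporting logic separate from the network code, so the report can be tried out on sample data before the checker is scheduled against a live catalogue.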
As well as commercial software packages there are a number of link checking programs available in the public domain (freely available) or as shareware packages (for a small fee).
For a listing of some link checking shareware programs available see:
What do those error codes really mean?
You will sometimes see error codes when you are attempting to connect to Web pages or looking at the output of link checking reports. These are HTTP status codes and whilst they appear to be frustratingly cryptic they can tell you a lot about the type of problem that you are encountering.
404 - Page Not Found
This is the most common error code that gateway administrators will come across. Web site maintainers often change the structure of their sites, as the information they provide grows or as the maintainers get new ideas about how to arrange and present the information. One of the most common reasons for a 404 error is simply that the resource has been moved to a different part of the site. To find the new location you can often move systematically up the directory structure of the URL, deleting the last path segment (the text after the final slash) each time, until you find a page that links to the resource. Sometimes the resource may have moved to another Web site altogether (this often happens when the resource is located on a commercial site); it is worth doing a search on one of the big search engines (such as Alta Vista) to try to locate its new address. In the worst case, the resource has been deleted permanently and the record should be removed from the collection. If you cannot locate the resource simply by looking around the site, an email message to the administrator will often solve the mystery.
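The technique of moving up the directory structure can be automated to produce a list of candidate pages to inspect. The following Python sketch is one possible way of doing this; the function name and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Yield successively shorter URLs by dropping the last path segment each time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()
        path = "/" + "/".join(segments) + ("/" if segments else "")
        # Query string and fragment are discarded; they rarely apply to a parent page.
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))


for candidate in parent_urls("http://www.example.org/research/papers/1998/report.html"):
    print(candidate)
# http://www.example.org/research/papers/1998/
# http://www.example.org/research/papers/
# http://www.example.org/research/
# http://www.example.org/
```

Each candidate still has to be visited by a person, who looks for a link to the missing resource; the sketch only saves the retyping.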
Some of the other frequent error codes are:
| Error Code | Problem | Possible Reason and Action |
|---|---|---|
| 401 | Unauthorised Request Access | The resource may be protected by a username and password - contact the maintainers for more information. |
| 402 | Payment Required | The request requires a charge to be applied to the transaction. |
| 403 | Forbidden | Access to the directory is forbidden. The resource may no longer be available for public access or the Web site administrator may have changed the directory permissions by mistake! |
| 500 | Internal Error | These types of error messages are very frustrating, as it is often hard to pin down what the problem is. It may be a problem caused by attempted execution of a CGI script. The best course of action is to monitor it as a problem and email the maintainer of the site for more information about the nature of the problem and to find out whether it is temporary. |
| 501 | Not Implemented | The server does not support the method being requested. |
| 503 | Server Busy | The server is unable to process the request for the page because of the high number of other requests. These tend to be temporary errors; try again at another time. |
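The guidance in the table can also be encoded so that each line of a link checking report carries a suggested next step. The Python sketch below is illustrative only: the wording of the actions paraphrases the table, and the fallback rules for codes not listed there are assumptions.

```python
# Suggested next steps, paraphrased from the table above.
SUGGESTED_ACTIONS = {
    401: "contact the maintainers: a username and password may be required",
    402: "contact the maintainers: the resource may now require payment",
    403: "contact the maintainers: access is forbidden, possibly by mistake",
    404: "look for a new location; delete the record if the resource has gone",
    500: "monitor, and email the maintainer if the error persists",
    501: "check manually: the server does not support the request method",
    503: "try again later: the server is busy",
}


def suggest_action(status):
    """Return a suggested action for an HTTP status code (None = no response)."""
    if status in SUGGESTED_ACTIONS:
        return SUGGESTED_ACTIONS[status]
    if status is None:
        return "no response: try again later"            # assumed fallback
    if status < 400:
        return "no action needed"
    if status < 500:
        return "client error: review the record by hand"  # assumed fallback
    return "server error: monitor and try again later"    # assumed fallback


print(suggest_action(403))
print(suggest_action(410))
```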
A link checking case study: SOSIG
SOSIG uses the link checking software that is supplied as part of the ROADS system. The program is scheduled to run automatically just after midnight on Sunday when the network traffic is generally low. The program runs through each of the URLs in the SOSIG database (over 7,000) and for each it sends an HTTP HEAD request for the page. If the request is successful the software moves on to the next URL; if it encounters a problem it writes the URL and the unique ID number for the record into a file. Once the link checker has processed all of the URLs, the problem resources are sorted and presented according to the error codes discussed in the section above. The error report is made available through the SOSIG online administration centre (see Figure 1); additionally a copy is emailed to the SOSIG staff responsible for processing the report.
Figure 1 SOSIG Link Checking Summary Report
SOSIG currently has one member of staff assigned to link checking, who spends approximately one day a week going through the report and updating or deleting records as appropriate. As the number of records in the collection grows, so does the number of problem resources, and it is likely that the amount of time required to maintain the collection will increase over time.
The errors reported are given an order of priority and the '404 Page Not Found' problems are dealt with first of all. These are probably the most straightforward of the errors; either the resource has moved and the record has to be edited to have the new address or it is no longer available and it needs to be deleted from the database. Either way, having error pages appear when users try to connect to resources is likely to reduce their confidence in the collection.
The next errors dealt with would be any errors to do with authorisation (error 401), payment (error 402) or permissions (error 403). These errors are not as common as the 404 errors and they tend to appear when a resource that had previously been publicly available is now restricted to use within an organisation or community and some form of payment or authorisation is required. These problems may become more common as the Web matures and commercial practices become more established. Occasionally the problem is simply that the Web site administrator has inadvertently changed the permissions on the directory and is unaware that there is a problem. SOSIG has found that the best way to deal with these problems is to get in touch with the maintainers of the resource by email and ask what the situation is; generally replies return within a day and the record can be dealt with appropriately.
The final errors that are dealt with are the 500 errors, generated by the server from which you are requesting the resource. They tend to be more unpredictable and it is usually quite difficult to pinpoint the problem; often URLs listed as giving 500 errors are working perfectly well when checked again. The reason for this may be that the server was undergoing maintenance or updating when the link checker requested the URL. SOSIG tends to monitor 500 errors over a few weeks and an email message will be sent to the maintainers of those resources that persistently record an error. The ROADS link checking software does have a feature which allows you to automatically delete URLs that are consistently unavailable, but this is not used as it is felt that the 500 errors are too unpredictable and staff prefer to make a judgement on each resource.
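The order of priority and the monitoring of 500 errors described above could be expressed along the following lines. This Python sketch is not the ROADS software; it assumes each report entry is a dictionary with `id`, `url` and `status` keys, and the three-week threshold is an arbitrary stand-in for "a few weeks".

```python
from collections import Counter


def priority(status):
    """Lower numbers are handled sooner: 404 first, then 401-403, then 5xx."""
    if status == 404:
        return 0
    if status in (401, 402, 403):
        return 1
    if status is not None and 500 <= status < 600:
        return 2
    return 3  # anything else, including "no response"


def sort_report(problems):
    """Order report entries by priority, then by URL for a stable listing."""
    return sorted(problems, key=lambda p: (priority(p["status"]), p["url"]))


def persistent_server_errors(weekly_reports, min_weeks=3):
    """Return IDs of records showing a 5xx status in at least min_weeks reports."""
    counts = Counter()
    for report in weekly_reports:
        ids = {p["id"] for p in report
               if p["status"] is not None and 500 <= p["status"] < 600}
        counts.update(ids)  # each record counts at most once per week
    return sorted(rid for rid, n in counts.items() if n >= min_weeks)
```

Only the records returned by `persistent_server_errors` would prompt an email to the maintainers; nothing is deleted automatically, in keeping with the preference for staff to make a judgement on each resource.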
For more details of the link checking software and the ROADS software in general see:
Updating resource descriptions
The dynamic nature of the Web is a problem when it comes to keeping manually catalogued records of resources up to date and relevant. Web documents, unlike their printed equivalents, are very easy to edit and modify; studies have shown that most Web pages are not static but expand and evolve over time. For a gateway's collection to maintain its integrity and usefulness, the records must also reflect the changes in the resources. This is a time-consuming job that requires ongoing staff effort to be assigned to the task.
There are a number of steps which gateways can take to help to identify and review resources that need their descriptions to be updated:
- Making full use of administrative metadata such as review-by dates. When records are created, a date can be added by which this record should be reviewed. A simple script can pull out all of the records that require reviewing at any particular time.
- Using automated processes to email resource maintainers to ask whether there have been any changes to the resource since the record was created.
- Using automated processes to delete time-dependent resources, e.g. conference announcements.
- Using Web page tracking tools (such as Mind-it http://mindit.netmind.com/) to monitor changes in resources (these generally report changes when the size of the file is altered).
- Taking the opportunity to update descriptions of records that are being edited as a result of running a link checker.
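The "simple script" mentioned in the first step above, which pulls out records whose review-by date has passed, might look something like the following Python sketch. The record layout, the `review_by` field name and the ISO date format are assumptions made for the example.

```python
from datetime import date


def records_due_for_review(records, today=None):
    """Return records whose review-by date is on or before today, oldest first."""
    today = today or date.today()
    due = [r for r in records if date.fromisoformat(r["review_by"]) <= today]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(due, key=lambda r: r["review_by"])


# Hypothetical sample records.
records = [
    {"id": "rec-2", "review_by": "2000-01-15"},
    {"id": "rec-1", "review_by": "1999-06-01"},
]
due = records_due_for_review(records, today=date(1999, 12, 31))
print([r["id"] for r in due])  # ['rec-1']
```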