Section 2: Information Issues
Target audience
Section 2 of this handbook is aimed at gateway staff responsible for information management - the subject specialists and information professionals who will consider the content and organisation of the information within the gateway. It aims to cover the important decisions that need to be made when setting up a new gateway (such as choosing a metadata format, designing a user interface, writing a selection policy) but also covers issues that arise in the day-to-day running of an existing gateway (such as cataloguing, resource discovery and publicity and promotion). Each chapter offers some background, practical tips and hints, key references, a glossary, case studies and examples. Watch out for the links that will take you to related sections elsewhere in the handbook.
Introduction
Subject gateways are sometimes called the Internet equivalent of a library, and in terms of the selection process this is certainly true. Gateways are characterised by the focus and quality of their collections. They aim to provide their users with a quality controlled environment in which to search for information on the Internet and they do this by building selective collections where every resource that the gateway points to has been carefully selected for its quality. The selection process involves people making value judgements about Internet resources and selecting only those resources that satisfy certain quality criteria.

But what constitutes a 'high quality' Internet resource? Information gateways need to use a service-driven definition of quality, where resources are selected for their relevance to the user group as well as their inherent features. Selecting resources for a gateway therefore requires a clear understanding of the information needs of the end-users, as well as of the pros and cons of the design features of Internet sites. Information gateways consciously emphasise the importance of skilled human involvement in the assessment and 'quality control' of their selected Internet resources. Selection and evaluation of resources for a gateway is typically done by a librarian or subject specialist, reflecting the fact that selection is based on an evaluation of the semantic content of the resources. A formal selection policy can support the development of a consistent and coherent collection of high quality Internet resources.
Why develop and publish a selection policy for your gateway?
Many subject guides on the Internet do not explicitly state their selection policies, but there are a number of advantages in developing a formal selection policy for a gateway and publishing it on your site:
By publishing your selection policy on the gateway you can help your users to conceptualise the nature of the collection they are using. On the Web, users are very often faced with a search box or an index, and it is not always easy for them to understand exactly what they are searching. An explicit selection policy can help them to understand the nature of your gateway service. The Centre for Information Quality Management (CIQM) recommends that database providers offer a 'published specification' or 'user-level agreement' to 'lessen the gap between user expectations and the reality of searching' (Armstrong, 1997). A formal selection policy can help to meet this recommendation.

The integrity of a collection will depend on there being some consistency in the type and quality of resources that your staff decide to include in the collection. A formal selection policy can help to ensure that selection is consistent and that the quality of the collection remains high. It can ensure that the same member of staff makes consistent judgements about what to include in the collection, and that different members of the staff team make consistent judgements using the same selection criteria. The selection policy can also help new staff to understand quickly both the nature of the collection and the criteria they should use when selecting new resources to add to the gateway.

A formal policy can also help to ensure consistency of selection within a distributed team. For example, if a number of gateways are working collaboratively, an agreed selection policy can help to ensure that the combined collection has a consistent level of quality.
What is a selection policy?
In an information environment, a selection policy defines the criteria used for selecting resources to add to a collection. It will typically outline the scope of the collection and the criteria used when new resources are selected for the collection. The scope policy relates to the needs of the target user group, while the selection criteria relate to the inherent features of the Internet resources.

Defining the scope of the collection

Subject gateways do not aim to include every resource available on the Internet. The scope of a gateway defines the boundaries of the collection: the scope policy is a broad statement of the parameters of the collection, stating what is and is not to be included in the catalogue. In the selection process, the scope criteria are the first filter through which resources pass: those falling outside the scope are rejected, and the quality criteria are applied to the rest. Scope criteria tend to involve clear decisions; either a resource falls within the scope or it does not. A scope statement will typically outline:
It may also outline:
Defining the quality selection criteria

Subject gateways do not generally aim to point to every Internet resource that falls within their subject area and scope. They are characterised by their quality control, aiming to point only to the best resources available for their subject area and audience. The selection criteria outline the qualities that a resource must have to be included in the collection.
Developing a selection policy for your gateway
How should a gateway develop its selection policy? Each gateway needs to develop its own unique set of selection criteria to take the information needs of the user group and the aims of the service into account. The first steps are to define:
Once these steps have been taken, it is a matter of defining a formal scope policy and a set of selection criteria. The DESIRE project has created some tools for creating a scope and selection policy. The guidelines are not prescriptive and are designed to help an institution or service develop its own tailor-made policies in the light of its aims and audience. A comprehensive list of criteria is given, from which criteria relevant to the individual service can be chosen. The list has been drawn from a 'state of the art review' of current practice, library and Web literature.

Creating a scope policy

Some possible criteria for creating your scope policy are given below. For each heading you will need to outline the parameters to be used in your gateway. Not all of these will be appropriate for your audience and you may need to add additional criteria.
Creating quality selection criteria

Once you have defined the scope of your gateway, you will need to outline the level of quality that is acceptable within each individual resource. A list of possible quality selection criteria is given below, from which criteria relevant to the individual service can be picked.

Content criteria: evaluating the information
Form criteria: evaluating the medium
Process criteria: evaluating the system
Fuller descriptions of each of these criteria, with examples, can be found in an online tutorial called 'Internet Detective':
Guidelines for selecting and evaluating Internet resources
The staff responsible for selecting new resources to add to the gateway will need to be able to select resources that together create a consistent and coherent collection of high quality Internet resources. What constitutes a 'high quality' Internet resource? The definition of quality used here has been drawn from the commercial sector, where quality is seen to be closely related to customer satisfaction and to developing systems of continuous improvement. In the context of a subject gateway, the quality of a resource will depend on the users of the service, and the nature of the service, as well as the internal features of the resource itself. We suggest that for information gateways 'a high quality Internet resource is one that meets the information needs of the user'. This is a service-oriented definition, and so, when evaluating the quality of Internet resources, gateway staff must consider the user group that they are serving as much as the Internet resources they are evaluating. SOSIG (The Social Science Information Gateway) has come up with five steps that describe the selection process for gateway staff:
Skills and training required by gateway staff in selection and evaluation
The choices made by the staff who select resources for a gateway will determine the nature of the collection. Recruitment and training of staff will therefore be critical for your gateway.

Recruiting staff

Subject gateways typically employ librarians or subject specialists to select Internet resources to add to the gateways. This reflects an acceptance that to build a high quality collection you need:
Recruiting skilled and knowledgeable staff will help ensure the integrity of the gateway collection.

Training staff

Staff will need to be consistent in their selection criteria if the collection is to develop consistently. They will need to be familiar with the scope and selection criteria of your gateway, but will also need to develop skills for evaluating Internet resources. Training staff may involve:
Changing your selection criteria over time
It may be necessary to update a selection policy, as the priorities for selection may change over time as a gateway collection matures.

Adapting scope policies

A new gateway may wish to focus on developing a core collection very quickly before broadening the parameters. The scope may be much narrower in the early stages of collection development. For example, a new gateway may set narrow parameters for things such as:
A more mature gateway on the other hand may broaden its scope once a core collection has been developed to include resources beyond the very narrow scope initially used. It may choose to extend its subject coverage, work at a finer level of granularity or include resources from different countries and of different types. These decisions should be reflected in the scope policy of the service.

Adapting selection criteria

The Internet offers uneven coverage of subjects, and this may affect the quality selection criteria used within different parts of a gateway collection. For example, if a subject comes within the scope of the gateway but very few resources can be found about that subject, it may be that less stringent quality criteria should be used, to ensure that there is at least some subject coverage. Conversely, if there are many resources available for a subject, then very stringent quality criteria may be used to ensure that the highest quality resources are selected in preference to others with the same subject coverage. These issues relate to collection management, which is discussed in the Collection Management chapter of this handbook.
Quality ratings/labelling/PICS and other initiatives in this area
The Web and metadata communities have been exploring the potential for automated approaches to quality-related aspects of information management on the Internet. The main aim has been to create a system where the quality of an Internet resource can be described in a machine-readable form. If this were to be achieved a number of scenarios would become possible. For example:
There have been two main challenges:
PICS and RDF

PICS and RDF both aim to provide a technological infrastructure to support machine-readable quality ratings. PICS stands for Platform for Internet Content Selection. It has been approved by the W3C (World Wide Web Consortium) as an agreed standard for associating labels (metadata) with Web sites or Web pages. Essentially, these labels refer to the information content of the sites, and therefore provide a means of recording information about aspects of their quality. PICS has most famously been used to support the development of services that aim to protect children from X-rated sites on the Internet. RDF stands for Resource Description Framework and is a standard approved by the W3C. It has emerged as a successor to PICS, offering a broader infrastructure for assigning metadata labels to Internet sites and pages. RDF can be used with many different metadata vocabularies, and certainly there is potential for it to be used with a vocabulary that describes the quality of an Internet resource.

Metadata vocabularies for quality

The second challenge has been to create metadata vocabularies to describe various quality attributes of Internet resources. At the time of writing no vocabulary has emerged, but work is under way, particularly within the medical community, to create metadata labels for quality that can be incorporated into Internet resource discovery services. With the basic RDF framework in place, it is now possible for different communities to create their own quality vocabularies and apply them to their own services.

How does this work relate to information gateways?

This work has the potential to offer gateways a number of interesting possibilities, for example:
The missing link, as things stand, is the development of quality vocabularies. Gateways may see it as their role to create such vocabularies and to use RDF to create machine-readable metadata about the quality of Internet resources. At present we cannot offer an example of a gateway doing this, but some key sites where new developments will appear are listed below.
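To illustrate what such machine-readable quality metadata might eventually look like, here is a sketch of an RDF/XML label. The quality vocabulary (the `qual:` namespace and its properties) is entirely hypothetical, since, as noted above, no standard quality vocabulary yet exists:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:qual="http://example.org/quality-vocabulary#">
  <!-- A third-party quality rating attached to an Internet resource.
       The qual: vocabulary is invented for illustration only. -->
  <rdf:Description rdf:about="http://www.example.ac.uk/some-resource/">
    <qual:ratedBy>Example Subject Gateway</qual:ratedBy>
    <qual:dateRated>1999-06-01</qual:dateRated>
    <qual:authority>high</qual:authority>
    <qual:currency>updated monthly</qual:currency>
  </rdf:Description>
</rdf:RDF>
```

A user agent or service that understood this (hypothetical) vocabulary could then filter or rank resources by such ratings without further human intervention.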
Glossary
DutchESS: Dutch Electronic Subject Service
References
DutchESS, http://www.konbib.nl/dutchess/
EELS, http://www.ub.lu.se/eel/
European Link Treasury, http://www.en.eun.org/news/european-link-treasury.html
Information Quality WWW Virtual Library, http://www.ciolek.com/WWWVL-InfoQuality.html
Internet Detective, http://www.sosig.ac.uk/desire/internet-detective.html
Länkskafferiet (Link Larder), http://lankskafferiet.skolverket.se/information/kvalitetskriterier.html
PICS Home Page, http://www.w3.org/PICS/
RDF Home Page, http://www.w3.org/RDF/
Scout Report, http://scout.cs.wisc.edu/index.html
SOSIG, http://www.sosig.ac.uk/
J. Alexander & M. A. Tate, Evaluating Web Resources
C. Armstrong, 'Metadata, PICS and Quality', Ariadne, Issue 9, 1997
N. Auer, Bibliography on Evaluating Internet Resources
D. Brickley, T. Gardner, R. Heery & D. Hiom, Recommendations on Implementation of Quality Ratings in an RDF Environment
A. Cooke, Finding Quality on the Internet: a guide for librarians and information professionals
Credits
Chapter author: Emma Place

2.2. Resource discovery
Introduction
Subject gateways should aim to describe the best resources that the Internet has to offer in their field and for their target audience. They need to:
Finding high quality resources on the Internet can be a time-consuming job - which, of course, is exactly why gateways exist: to save the end-user some of the time and commitment required to discover and retrieve high quality information on the Internet. Locating resources to add to your gateway will require one of the biggest investments of staff time and effort, so it is important to find efficient and effective methods of working at this task:
Resource discovery issues for gateway managers
Gateway managers will need to provide the systems and strategies to support efficient resource discovery within their team. Resource discovery is labour-intensive and efficient strategies can help to maximise the number of resources added to the gateway. This section suggests some of the systems that managers can put in place to support efficient resource discovery within the team:
1. Avoiding duplicated effort

Duplicated effort can be wasted effort. There are issues of duplication:
Avoid duplication with other gateways

It is worth finding out whether other gateways already describe Internet resources in your field. If there are, you have to ask yourself whether it really makes sense to spend time and effort cataloguing the same resources twice. If existing gateways are already describing resources relevant to your users you should consider:
Avoid duplication within your team

Time can be wasted if members of your team are all trawling the same sources. Consider developing a team strategy for resource discovery, for example by:
2. Find the right people for the job

As with recruiting staff for cataloguing, financial and political considerations will determine whom you can take on to do the job of resource discovery.
Volunteers?
Pros: may be cheap and plentiful.
Cons: may be inconsistent and unreliable in their contribution, and it may be difficult to find volunteers with the subject expertise to select the high quality resources you want.

Subject specialists?
Pros: may know of the best sources to use to discover relevant resources for your gateway, and should be able to assess resources effectively, given their subject knowledge.
Cons: may be expensive, short of time, difficult to recruit, and unable or unwilling to spend time cataloguing.

Librarians/information professionals?
Pros: have training in selecting resources to meet the information needs of users, and may also be able to catalogue resources in addition to selecting them, since they may have training in cataloguing/information retrieval issues.
Cons: may be expensive/difficult to recruit.
3. Provide training in resource discovery

The Internet is always growing and changing, so there are always new tips and hints to be learned in Internet resource discovery - training staff can improve skills and effectiveness. Training may include:
4. Set up support systems for resource discovery staff

The following are ideas for support systems for resource discovery staff:
5. Set up systems to encourage your user community to suggest resources

Why not let the resources come to you! Encourage your users to send you details of any sites which they think should be added to the gateway. You will need:
Resource discovery strategies for staff
Gateway staff do the 'leg work' for their users - joining the lists, monitoring the sites and doing the searches that many users do not have the time to do, and filtering out items that are of poor quality or irrelevant to the users. It is easy to waste time when surfing the Internet - gateway staff need to develop efficient and effective strategies for locating high quality Internet resources. Some strategies are suggested below.

Resource discovery tools and methods
1. Browsing strategies

One of the richest sources of resources will be existing Web pages - especially authoritative ones in your field which list related or recommended resources. Trawling these sites is the equivalent of citation pearl-growing or snowballing, traditionally done by researchers looking for references - if they find one useful resource, they will follow the references from that resource to find others.

Trawling home pages of known experts

If you know of experts in your field, do a search to see if they have their own Web page. You may find that:
Bookmark any that look as if they may be developed over time, so that you can check them again in the future.

Trawling organisational home pages

Many organisations now have their own Web sites. These can be useful in two ways:
Consider which organisations are relevant to your audience and try to keep in touch with developments concerning them.
If you are creating a gateway for an academic audience then it can pay to monitor university Web pages. Look for:
Trawling subject-based sites

Many sites have a section of 'links' which can be mined for new resources. The better quality the original site, the better the related links are likely to be:
2. Mailing lists and their archives

Joining and monitoring email lists/checking mailing list archives

People often use email lists to announce new resources they have made available on the Internet. You have two possible strategies here:
Subject-based lists

If you can find a list that is relevant to your subject area and audience, you have a rich source. In the early days it is worth doing a search for relevant lists and asking colleagues to recommend them.
Generic email lists that announce new Internet sites

A number of email lists exist to alert people to new Internet sites. Be warned - these lists can be prolific!

3. Distribution lists and current awareness services

Internet current awareness services come in different forms and are becoming more sophisticated. Free email subscription services will send you updates, bulletins and email publications on a regular basis. It may be worth subscribing to services that are run by key individuals or organisations in your subject area. Other services are emerging where you can create your own personal profile on the Web, which the service then uses to email you incoming information that is likely to interest you.
4. Search tools

Searching the Internet can be time-consuming, since many of the search tools retrieve huge numbers of hits which take a lot of time to work through. However, searching can be a good strategy in some cases:
In our experience, search engines can be a waste of time if broad search terms such as 'social psychology' are used. Highly focused searching based on known sources, however, can be fruitful. For example, if you have a list of well-respected journals or organisations in your field, you could search for them by name, to see whether they have a presence on the Internet. Some hints for finding leads for focused searching:
Search engines

These are good for finding LOTS of information and for finding very precise pieces of information (so if you know exactly what you're after they can be very effective).
Be aware that search engines change over time and that different ones are more effective for searching for different types of information - do some research to find the best one for your needs. Bookmark complex searches so that you can run them again periodically to see if anything new has appeared.
5. Newsgroups and discussion forums

Internet discussion forums are a powerful and fun way to communicate with people around the world who are interested in the same things as you. Thanks to the Internet's rapid growth and the exploding popularity of the World Wide Web, people from all walks of life now participate on a regular basis.
6. URL-minders and Web agents

Some free Web services exist that help you to monitor changes made to Internet resources or to inform you of new sites that might interest you. You register the URLs of the sites you wish to monitor, or search queries you would like to have run, and the service sends you an email whenever a change is made to these resources or the search yields new results.
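The change-detection idea behind such services is straightforward to sketch in-house. The fragment below (a minimal illustration, not any particular service's actual method) stores a checksum of each monitored page and flags the page as changed when a freshly fetched copy no longer matches:

```python
import hashlib

def fingerprint(page_text):
    """Return a checksum of a page's content, used to detect changes."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()

def has_changed(stored_fingerprint, current_text):
    """True if the page content no longer matches the stored checksum."""
    return fingerprint(current_text) != stored_fingerprint

# Typical use: fetch each registered URL (e.g. with urllib), compare the
# result against the fingerprint saved on the previous visit, and email
# the gateway staff whenever has_changed() is True.
```

In practice the fetched pages would come over the network; the comparison logic itself needs nothing more than the stored checksum.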
Remember that these are automated services and will not always yield high quality results.
7. Non-Internet sources

You don't have to use the Internet to learn about Internet sites. Consider using non-Internet sources:
Issues for new gateways
New gateways may have different priorities for resource discovery from mature gateways as they will be focussing on developing a core collection very quickly. New gateways may want to consider the following issues:
Issues for mature gateways
Mature gateways will have already developed a core collection and may have widened their scope. Staff will need to adjust their resource discovery strategies in line with this. Mature gateways may consider the following issues:
Glossary
DutchESS: Dutch Electronic Subject Service
References
College and University Home Pages (world-wide), http://www.rirr.cnuce.cnr.it/universities/univ.html
Dejanews, http://www.dejanews.com/
The Directory of Scholarly and Professional E-Conferences, http://www.n2h2.com/KOVACS/
DutchESS, http://www.konbib.nl/dutchess/
EEVL, http://www.eevl.ac.uk/
EUNI - List of European Universities, http://www.ensmp.fr/~scherer/euni/euni_list.html
The Informant, http://informant.dartmouth.edu/
Library and Related Sources, http://www.exeter.ac.uk/~ijtilsed/lib/wwwlibs.html
Liszt, http://www.liszt.com/
Mailbase, http://www.mailbase.ac.uk/
Mind-it, http://mindit.netmind.com/
NewJour: Recent Issues, http://gort.ucsd.edu/newjour/nj2/
Search Engine Corner, http://www.ariadne.ac.uk/issue19/search-engines/
Search Engine Watch, http://searchenginewatch.com/
Manchester Metropolitan University's Department of Information and Communications Search Tools, http://www.mmu.ac.uk/h-ss/dic/main/search.htm
The Social Science Research Grapevine, http://www.grapevine.bris.ac.uk/
SOSIG, http://www.sosig.ac.uk
What's New in WWW Social Sciences Online Newsletter, http://www.mmu.ac.uk/h-ss/dic/main/search.htm
'What's New' on the Web server of the European Union, http://europa.eu.int/geninfo/whatsnew.htm
A. S. McNab & I. R. Winship, 'How to find out about new resources on the Internet', The New Review of Information Networking (1995), 147-53
Association of Public Data Users and International Association for Social Science Information Service and Technology (IASSIST), Strategies for Searching for Information on the Internet
TERENA & M. Isaacs, Internet Users' Guide to Network Resource Tools, Addison Wesley Longman, 1998
E. Worsfold, Finding Internet resources for SOSIG - strategies and sources, 1997
Credits
Chapter author: Emma Place

2.3. Metadata formats
Introduction
Information gateways are characterised by their creation of third-party metadata records - individual descriptions of Internet resources held in a database that have separate fields for different attributes of the resources, such as title, author, URL etc. These resource descriptions are used to:
Gateways adopt an approach where metadata is created by a third party, i.e. an independent subject specialist or information professional, rather than by the creator of the resource. This enables the quality control for which gateways are renowned - the resource descriptions all follow a standard format and are generated manually (at least in part), producing high quality metadata that benefits from semantic judgements about the nature and origin of the resources. The metadata created by gateways is their greatest asset, adding value to the Internet resources by creating independent, standardised third-party descriptions. The decision about which metadata format to use is an important one, as it affects the searching capabilities of the gateway and the value of the descriptions to the end-users. The creation of metadata will be one of the most time-consuming tasks in running a gateway, and so a balance between value and cost may be required in deciding on a format. This chapter will introduce some of these issues and provide some background information that information gateway managers will need to consider when choosing a metadata format for their gateway.
Why create metadata records?
Information gateways are services that give access to networked resources in particular subject areas, linguistic domains, and so on. Many Internet portals simply consist of static Web pages with lists of hyperlinks, perhaps with annotations. However, this approach has distinct disadvantages:
Gateways take advantage of database technologies which overcome both these problems, but require that a standard format be used for creating and storing the resource descriptions. Metadata formats are structured formats for Internet resource descriptions. For gateways, the metadata formats are the forms or templates that need to be filled in by the cataloguers to create a resource description. The use of metadata by an information gateway has many benefits over the simple HTML list approach, for example:
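To make the contrast with a flat HTML list concrete, here is a minimal sketch of the record-based approach (the records and field names are invented for illustration): each resource is a structured record with separate fields, so the collection can be searched field by field rather than scanned as a single page of links:

```python
# Each resource description is a structured record with separate
# fields - the database approach - rather than one line in a flat
# HTML list. Records and field names are illustrative only.
records = [
    {"title": "Social Science Information Gateway",
     "url": "http://www.sosig.ac.uk/",
     "description": "Selected, catalogued social science Internet resources.",
     "keywords": ["social science", "gateway"]},
    {"title": "Internet Detective",
     "url": "http://www.sosig.ac.uk/desire/internet-detective.html",
     "description": "Online tutorial on evaluating Internet resources.",
     "keywords": ["evaluation", "quality"]},
]

def search(records, field, term):
    """Field-specific search - only possible because each attribute
    of a resource is stored separately."""
    term = term.lower()
    return [r for r in records if term in str(r.get(field, "")).lower()]
```

A flat HTML list can only be read top to bottom; here a user can ask specifically for resources whose title, description or keywords match a term.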
Metadata attributes
Gateway staff will need to agree on the attributes of an Internet resource that they wish to describe. Metadata can be grouped into various kinds according to their use within the gateway. They might include:

Descriptive

Descriptive metadata contain information which may be usefully returned from a search of the gateway. A user may be able to decide from this information whether it is worth spending time looking at the resource itself.
Subject

Subject metadata can facilitate effective searching. They can also be used to organise the browsing structure of your gateway. A fuller discussion can be found elsewhere in this handbook.
Administrative

Administrative metadata are intended primarily to assist the gateway staff in maintaining the gateway. They are of less concern to users and may not be visible to them; however, they can be used, for example, to check that resource descriptions are still current.
Consideration of which particular administrative functions are required and an assessment of which particular administrative metadata elements are needed will be an important part of choosing (or adapting) a metadata format for use in a particular information gateway.

Core metadata

The possible metadata fields listed above are by no means exhaustive, but including them all would require considerable effort both in initial cataloguing and in keeping records up to date. Not all of them might be appropriate to your gateway. Attempts have been made to define standards for a 'core' of metadata which should be regarded as a bare minimum. One such standard is the Dublin Core.
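For flavour, a Dublin Core description is often embedded directly in an HTML page using meta tags along the following lines. The values (and the choice of elements shown) are invented for illustration; Dublin Core defines fifteen elements in all:

```html
<!-- A sketch of a Dublin Core description embedded in an HTML page,
     following the common meta-tag convention. Values are invented. -->
<head>
  <title>Example Resource</title>
  <meta name="DC.title" content="Guide to Social Science Resources">
  <meta name="DC.creator" content="A. Cataloguer">
  <meta name="DC.subject" content="social science; Internet resources">
  <meta name="DC.description" content="An annotated guide to selected sites.">
  <meta name="DC.date" content="1999-06-01">
  <meta name="DC.identifier" content="http://www.example.ac.uk/guide/">
  <meta name="DC.language" content="en">
</head>
```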
ROADS offers a number of metadata templates designed for different types of Internet resources. Each template contains attributes specific to the type of Internet resource. For example, the template for describing a mailarchive will have a different set of fields from the template for describing a Web document. ROADS also maintains a 'template registry' where the metadata fields used in the various kinds of ROADS templates are recorded. This ensures that ROADS services are potentially interoperable in this area. New fields can be nominated for addition to the registry.
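A ROADS/IAFA-style template record for a Web document looks roughly like the following. The exact field names are held in the ROADS template registry; those shown here are indicative only, with invented values:

```text
Template-Type: DOCUMENT
Title: Guide to Social Science Resources
URI-v1: http://www.example.ac.uk/guide/
Description-v1: An annotated guide to selected social science sites.
Keywords: social science; Internet resources
```

A template for a different resource type (a mailing list archive, say) would carry a different Template-Type and its own set of fields.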
Choosing metadata attributes
You should think carefully about which metadata attributes your gateway is going to use, and their format, when you first set up the gateway. If you do not, you may find yourself constrained by the absence of useful metadata, or have to add a new metadata field or convert an existing field to a different format when you already have several thousand resources in your database. Moreover, decisions about metadata will in turn affect the design of your interface (especially the parts of it used for cataloguing and/or submitting new resources for consideration).

Which metadata fields could be usefully searched on by your users?

You should consider your potential user community and also the nature of the resources which your gateway will cover. For example, if your gateway is intended to cover only geographically local resources in one language, a 'language' field will not be very informative unless your gateway is going to be cross-searched with others elsewhere.

And how are they going to search them?

This will affect not only what metadata fields you provide but also the cataloguing rules you adopt. For example, if you are ranking searches by the frequency of the occurrence of the search term, you may wish to make descriptions similar in length, otherwise resources with long descriptions may be more likely to be returned high up the order.
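The description-length effect is easy to demonstrate. In this sketch (invented descriptions and a deliberately naive scoring function), raw term counts favour the longer description simply because it contains more words, while normalising by description length removes that bias:

```python
def raw_score(description, term):
    """Naive ranking: how many times the search term occurs."""
    return description.lower().split().count(term.lower())

def normalised_score(description, term):
    """Occurrences per word, so long descriptions gain no advantage."""
    words = description.lower().split()
    return words.count(term.lower()) / len(words)

short = "social psychology journal abstracts"
long = ("extensive guide to psychology resources including psychology "
        "departments psychology journals and many other social science sites")
```

Under raw counting the long description wins; scored per word, the short, focused description ranks first. Keeping descriptions of similar length achieves much the same end at the cataloguing stage.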
Which metadata fields will be displayed to the users of the gateway? Will they need to be converted from the form in which they are stored and if so does an easy way of converting them exist? Which metadata fields will be used for housekeeping by the gateway staff and how? Metadata can supply information for partially automating this otherwise laborious aspect of gateway management. For example, you can have an automatic email sent to maintainers of resources occasionally to ask whether they have made any changes, or set a web-page tracking tool to monitor changes to resources. Which if any are optional? If you are collaborating (or thinking of it), which metadata fields will be shared with your collaborators? Are they likely to want extra information, such as language, which you would not otherwise include in your metadata? You will need to use the same schemes for e.g. classification or have a usable crosswalk to convert between schemes. You should also think about the issue of copyright.
Are you going to display your metadata in the same format as that in which you store it? If not, you will need a way of converting between formats. Can any of the software you are using generate useful metadata? For example, ROADS automatically records when a template was last updated. You may wish to use in addition software for creating metadata (see below). Harvesting software, if used, may also be able to harvest metadata.
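As a rough illustration of software-generated administrative metadata of the kind just described, the sketch below stamps a record with created and last-modified dates each time it is saved. The field names 'Record-Created' and 'Record-Last-Modified' are hypothetical, not drawn from any particular template format:

```python
from datetime import datetime, timezone

def save_record(record):
    """Stamp a record with software-generated administrative metadata.

    'Record-Created' and 'Record-Last-Modified' are hypothetical
    field names used only for this sketch.
    """
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    record.setdefault("Record-Created", today)  # set once, on first save
    record["Record-Last-Modified"] = today      # refreshed on every save
    return record

record = save_record({"Title": "Example resource"})
```

In a real gateway these fields would be written by the cataloguing interface or database layer rather than by the cataloguer.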
Who will generate the metadata (and which fields)? Metadata may be supplied by:
How much cross-checking will there be? (Time will need to be allowed for this). If you are allowing gateway users or information providers to submit resources, what information should they supply? What information may they also supply optionally? How important is it that (for example) descriptions or keywords are consistent across the gateway? If this is important, can you supply cataloguing rules or other guidance to help information providers and others who are submitting resources? How much effort can be expended on editing their contributions, given that gateway users and information providers cannot be compelled to follow your cataloguing rules?
How might you ensure that information such as dates is in a consistent format? Possible methods include:
In what language are your metadata records going to be kept? If this is different from the language of some of your resources, are you going to make any provision for searching in that language (e.g. an 'alternative title' field)?
Standard metadata formats
Information gateway managers will need to make decisions about which metadata format (or formats) to use within their service at a very early stage of its development. At present, however, the existence of a large and varied range of metadata formats and initiatives complicates these decisions. It is worth remembering also that the choice of metadata formats will often be influenced by other factors, both technological and social. For example, an information gateway that wishes to use the ROADS software toolkit with little modification will currently need to use the ROADS template format, or something very similar to it. Again, where gateway cross-searching or interoperability is seen to be important, there may be technical reasons why one format may have advantages over another. The nature of metadata development means that at any one time there are likely to be a variety of formats that could be chosen as the basis of an information gateway. For example, a review of metadata formats undertaken under DESIRE I identified and described over twenty formats that were in use (or under development) in 1996 (Dempsey et al., 1997). In order to help analyse the different metadata formats described in the review, the DESIRE I study produced a typology of metadata based upon their underlying complexity.
Figure 1. Typology of metadata formats (adapted from Dempsey and Heery, 1998).
Choosing a metadata format
Choosing a format from the variety of existing ones will depend upon various factors. In general, current information gateways tend to use relatively simple generic formats with some structure ('Band Two' formats such as ROADS templates or Dublin Core). These formats have the twin advantages of simplicity, which means that they are relatively easy to create and maintain, and the existence of some structure, which facilitates both interoperability and format conversion. However, in particular circumstances there may be good arguments for basing an information gateway on more complex formats ('Band Three' formats such as MARC or TEI headers) if this offers some competitive advantage to the gateway. For example, the USMARC format has been used for the cataloguing of Internet resources in the InterCat project and it would be possible to set up MARC-based information gateways. However, the use of these more complex formats may have implications for the level of expertise (technical and other) that would be required for cataloguing and may have other costs. As noted before, the choice of a particular format may be dictated by technological or social factors. For example, particular gateway software may dictate the use (or non-use) of particular formats. Information gateways that, for example, are running the ROADS software without much modification will need either to use one of the existing templates defined by the ROADS project or to create new (and similar) templates in the form of attribute-value pairs.

Example format 1: Dublin Core

The Dublin Core (DC) is the result of an international and interdisciplinary initiative to define a core set of metadata elements for electronic resources, primarily for resource discovery on the Internet. DC was initially conceived as a simple format that could be used for author-generated descriptions of Web resources.
However, the format has also attracted the attention of resource description professionals from a variety of communities such as libraries, museums, archives and government agencies.
The format has been developed by means of a series of invitational workshops, the first being held in Dublin, Ohio in March 1995. The workshop series and related work has resulted in the definition of fifteen core metadata elements as RFC 2413 (Weibel et al., 1998). These elements are intended to be repeatable and extensible in any application. The initial focus of DC was the Web, so the initiative has concentrated on the production of draft guidance for the encoding of DC elements, first in HTML (Kunze, 1999) and more recently in XML/RDF (e.g. Miller, Miller and Brickley, 1999).
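As an illustration of the HTML encoding approach, the sketch below renders a set of DC elements as <meta> tags using the 'DC.' name-prefix convention described in the drafts. It is a simplified approximation (no attribute escaping, no element qualifiers):

```python
def dc_to_html_meta(elements):
    """Render Dublin Core elements as HTML <meta> tags using the
    'DC.' name-prefix convention. Simplified sketch: values are not
    escaped and DC qualifiers are not supported."""
    return "\n".join(
        '<meta name="DC.%s" content="%s">' % (name, value)
        for name, value in elements.items()
    )

html = dc_to_html_meta({"Title": "DESIRE handbook", "Language": "eng"})
```

A tool such as DC-dot (listed in the references) generates tags of broadly this kind automatically.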
Example format 2: ROADS templates

ROADS templates are a development of the IAFA templates originally developed for anonymous FTP archives (Deutsch et al., 1994). IAFA templates are a simple text-based metadata format consisting of predefined sets of attribute-value pairs. Templates exist for a number of different resource types, but the templates most commonly used in existing ROADS-based gateways are those designated SERVICE, DOCUMENT and MAILARCHIVE.
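A minimal sketch of how such attribute-value templates can be read is shown below. It assumes one 'Attribute: value' pair per line and ignores the continuation lines and variant clusters that real ROADS/IAFA templates support:

```python
def parse_template(text):
    """Parse a simple attribute-value template (one 'Attribute: value'
    pair per line) into a dictionary. Continuation lines and clusters,
    which real ROADS templates support, are ignored in this sketch."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            attr, _, value = line.partition(":")
            record[attr.strip()] = value.strip()
    return record

sample = """Template-Type: DOCUMENT
Title: DESIRE handbook
Language: en"""
record = parse_template(sample)
```

The template and field names above are illustrative; the authoritative set of fields for each template type is held in the ROADS template registry.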
Format conversion
One of the advantages of using well-defined and structured metadata formats is that this allows conversion into other formats when necessary. This is useful in two main circumstances:
Format conversion is facilitated by the creation of crosswalks (or mapping tables) between metadata formats. Crosswalks can be used as the basis for the production of a specific conversion program or for the production of search systems that would permit the interrogation of heterogeneous metadata formats. A number of metadata format crosswalks have been published. One of the earliest DC-based crosswalks mapped Dublin Core to USMARC (Caplan and Guenther, 1996) and other crosswalks exist for other formats including Text Encoding Initiative (TEI) headers, ROADS templates and a variety of MARC formats, including the Universal MARC format (UNIMARC). A collection of metadata mappings is maintained on the UKOLN Web site (Day, 1996).
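A crosswalk is essentially a mapping table, and a conversion program a routine that applies it. The sketch below maps a few Dublin Core element names to MARC-style tags; the tag choices are simplified assumptions for illustration, not the full published Caplan and Guenther mapping:

```python
# Illustrative crosswalk from Dublin Core element names to MARC-style
# tags. The tag numbers are simplified assumptions for this sketch,
# not the complete published DC-to-USMARC mapping.
DC_TO_MARC = {"Title": "245", "Creator": "720", "Subject": "653"}

def convert(dc_record):
    """Convert a {DC element: value} record into a list of
    (MARC tag, value) pairs, silently skipping unmapped elements."""
    return [(DC_TO_MARC[e], v) for e, v in dc_record.items() if e in DC_TO_MARC]

marc = convert({"Title": "Metadata review", "Creator": "Dempsey, L."})
```

A real conversion would also have to handle subfield codes, indicators and repeatable fields, which is why published crosswalks are considerably more detailed than this table.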
Future proofing
Any choices concerning metadata will need to take into account possible future developments. The gateway may decide to expand by including new types of descriptions (possibly for new types of resource such as images or multimedia) or to include additional metadata (such as descriptions aimed at alternative audiences, rights metadata, digital preservation data). At the simplest level, updates and extensions to existing metadata element sets need to be accommodated. The gateway may want to ensure that:
Within the lifetime of the gateway, it may have to migrate to a different system which will require different metadata formats, whether these are new versions of existing formats or completely different. Re-structuring the metadata can be done more efficiently if the gateway follows some general guidelines for the content of metadata. Such guidelines might include recommendations that:
Conclusions
Choosing a metadata format is one of the most important decisions that needs to be made when setting up an information gateway. It is vital that the format is able to work with the software that forms the basis of the gateway service and it should also contain all fields (including administrative metadata) that have been identified as appropriate for the service in question (or the format should be extensible). It is possible that ongoing changes in technologies may require periodic conversion of the gateway database into new formats. This process will require the production of metadata crosswalks and/or format conversion programs.
References
BIBLINK, http://hosted.ukoln.ac.uk/biblink/
d2m, http://www.bibsys.no/meta/d2m/
DC-dot, http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
Dublin Core, http://purl.oclc.org/dc
EdNA, http://www.edna.edu.au/EdNA/
InterCat, http://purl.org/net/intercat
ROADS, http://www.ilrt.bris.ac.uk/roads/
P. L. Caplan & R. S. Guenther, 'Metadata for Internet resources: the Dublin Core Metadata Element Set and its mapping to USMARC', Cataloging and Classification Quarterly 22 (3/4) (1996), 43-58.
M. Day, Interoperability between metadata formats (Bath: UKOLN, 1996).
M. Day, Mapping BIBLINK Core (BC) to UNIMARC, BIBLINK project document (Bath: UKOLN, 10 September 1998).
M. Day, R. Heery & A. Powell, 'National bibliographic records in the digital information environment: metadata, links and standards', Journal of Documentation 55 (1) (1999), 16-32.
L. Dempsey & R. Heery, 'Metadata: a current view of practice and issues', Journal of Documentation 54 (2) (1998), 145-172.
L. Dempsey, R. Heery, M. Hamilton, D. Hiom, J. Knight, T. Koch, M. Peereboom & A. Powell, A review of metadata: a survey of current resource description formats (DESIRE deliverable D3.2 (1), March 1997).
P. Deutsch, A. Emtage, M. Koster & M. Stumpf, Publishing information on the Internet with Anonymous FTP (Internet-Draft, September 1994).
J. Hakala, P. Hansen, O. Husby, T. Koch & S. Thorborg, The Nordic Metadata Project: final report (Helsinki: Helsinki University Library, July 1998).
R. Heery, 'Review of metadata formats', Program 30 (4) (1996), 345-373.
R. Iannella & D. Campbell, The A-Core: metadata about content metadata (Internet-Draft, 21 June 1999).
J. Kunze, Encoding Dublin Core Metadata in HTML (Internet-Draft, 25 May 1999).
O. Lassila & R. Swick, eds., Resource Description Framework (RDF) model and syntax specification (W3C Working Draft, 1999).
Making of America project, The Making of America II testbed project white paper (Version 1.03, 16 March 1998).
E. Miller, P. Miller & D. Brickley, eds., Guidance on expressing the Dublin Core within the Resource Description Framework (RDF) (Dublin Core Metadata Initiative, Draft Proposal, 1999).
S. Weibel, J. Kunze, C. Lagoze & M. Wolf, RFC 2413, Dublin Core metadata for resource discovery (Internet Engineering Task Force, Network Working Group, September 1998).
S. Weibel, 'The State of the Dublin Core Metadata Initiative', D-Lib Magazine 5 (4) (April 1999).
S. L. Weibel & C. Lagoze, 'An element set to support resource discovery: the state of the Dublin Core', International Journal on Digital Libraries 1 (2) (January 1997), 176-186.
Credits
Chapter author: Michael Day
2.4. Cataloguing
Introduction
The role of cataloguing rules or guidelines is to specify how the content of a metadata format is entered. Once a metadata format has been chosen, consideration should then be given to how this metadata should be entered into the information gateway database and a set of cataloguing rules prepared. One of the key roles of Internet subject gateways is the creation of descriptive metadata about networked resources which can be used as a basis for searching and browsing the gateway. These descriptions can also help gateway users to identify whether the resources are really what they need, potentially saving them a considerable amount of time browsing through the unlimited amounts of information available elsewhere on the Internet (Sha, 1995, p. 467). Therefore, one of the most important (and time-consuming) activities for a subject gateway will be the provision of these descriptions. This activity is generally known as 'cataloguing' and is one of the key tasks of any information gateway.
Background
Cataloguing can be defined as the creation of surrogate records which can be used to facilitate the identification, location, access and use of resources (Levy, 1995). These descriptions are usually created in accordance with certain standards (cataloguing rules and metadata formats) and will often include additional features such as classification, subject analysis and authority control (Dillon and Jul, 1996, p. 198; Bryant, 1980). These tools and standards were originally developed for the cataloguing and indexing of traditional - mostly printed - collections. However, many of them have been revised to take account of resources based on newer technologies. Recent developments include:

1. ISBD(ER). In 1997, the IFLA Universal Bibliographic Control and International MARC Programme (UBCIM) published a revision of ISBD(CF) for 'Computer Files' for both online and offline 'Electronic Resources' (ISBD(ER), 1997; Sandberg-Fox and Byrum, 1998).
2. USMARC 856 field - 'Electronic Location and Access'. The use of this field enables the encoding of enough information to locate and retrieve networked resources, including an URL (Network Development and MARC Standards Office, 1997). Field 856 has been implemented in other 'flavours' of MARC such as UNIMARC (Holt, 1998). The use of the MARC formats for describing Internet resources has been extensively tested in North America, particularly through the work of a series of OCLC projects.
Information gateways build upon these practices, but have a particular focus on developing cataloguing practices and technologies that are designed specifically to manage Internet resources, taking into account the unique features of these resources. Gateways tend to opt for more flexible and less formal cataloguing solutions, using less complex metadata formats like Dublin Core. This is largely because these formats can be flexible and quick to respond to new developments in the ever-changing Internet environment. It also helps gateways to cope with the volatility of Internet resources - one of the key challenges in Internet cataloguing - as resources change, their associated records become out of date and require frequent updating. Information gateways have sought to develop relatively simple technologies and cataloguing procedures, which provide adequate descriptions but which also support the high level of maintenance that is required. As Clifford Lynch (1997, p. 44) has commented, if the Internet is to continue to thrive as a new means of communication, 'something very much like traditional library services will be needed to organize, access and preserve networked information'. The same article comments that combining 'the skills of the librarian and the computer scientist may help organize the anarchy of the Internet'.
Cataloguing issues for information gateways
Information gateways, like libraries, need tools that facilitate the identification, location, access and use of resources; they have therefore developed (or adapted) tools that can be used for the descriptive cataloguing of Internet resources and their indexing. In this, information gateways have the distinct advantage that they can build upon the past century and a half of experience which libraries and other organisations have of the task of cataloguing. Information gateways need to work on the following:
Metadata formats

Firstly, it must be noted that cataloguing issues are to some extent related to the decisions that information gateways need to make about metadata formats. That said, the use of a particular metadata format does not necessarily determine the adoption of any particular description standard or set of cataloguing rules. Formats such as Dublin Core, MARC or ROADS templates are merely frameworks into which data can be entered and by which it can be retrieved. The role of cataloguing rules or guidelines is to specify how the content of this format is entered. For this reason, once a metadata format has been chosen, consideration should then be given to how this metadata should be entered into the information gateway database and a set of cataloguing rules prepared.

Types of descriptive information required by an information gateway

During the cataloguing process for an information gateway, a resource will first be identified and selected and then described in some standardised way. Typically, a description will record a variety of different types of information:
Choosing content standards and developing cataloguing rules

Once a metadata format has been adopted and decisions have been taken on the particular information that resource descriptions need to contain, it is time to start the preparation of cataloguing rules or guidelines. Such guidelines can be as detailed (or not) as a particular gateway requires. In most cases, there will not be a requirement to develop rules as comprehensive as those in AACR2, for example, but cataloguing guidelines should often contain the following things:
Once developed, these guidelines can be distributed to those people who will be responsible for providing resource descriptions for the gateway.
Many of the decisions that need to be made relate to the particular formats that need to be used for things like dates, language codes or names.

Date formats

Dates tend to be important parts of content metadata. As well as being used to record the time when a resource was created or last modified, dates are also used to record administrative data about the metadata itself. For this reason, dates need to be entered in some agreed format so that they can be automatically processed by software. The main date formats currently in use are ISO 8601:1988 - as recommended for use in Dublin Core descriptions (Wolf and Wicksteed, 1997) - and the modified RFC 822 format used by ROADS templates (Deutsch et al., 1994, p. 14):
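For illustration, the same date in each style, and a normalisation of one form to the other using the Python standard library (the example date strings are illustrative renderings, not prescribed values):

```python
from datetime import datetime

# The same date expressed in the two styles of format discussed above.
iso_8601 = "1999-06-21"                      # ISO 8601 style (Dublin Core)
rfc_822 = "Mon, 21 Jun 1999 00:00:00 +0000"  # RFC 822 style (ROADS templates)

# Normalising the RFC 822 form to ISO 8601 for consistent storage:
parsed = datetime.strptime(rfc_822, "%a, %d %b %Y %H:%M:%S %z")
normalised = parsed.strftime("%Y-%m-%d")
```

Validating dates at the point of entry in this way (rejecting strings that fail to parse) is one practical method of keeping a database consistent.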
Language codes

Resource descriptions tend to include an element recording the language of the intellectual content of a resource. Gateways could (and some do) record these by using the names of languages in full, e.g.:
However, natural language may not be the best way of recording this information. It would be difficult (if not impossible) for machines to be able to tell that, for example, the words 'Welsh' and 'Cymraeg' refer to the same language, or that the terms 'English' and 'Old English' refer to quite different ones. For these reasons, a number of standardised language codes have been proposed, usually based on either two or three letters (e.g. ISO 639-1:1988, RFC 1766). The best current candidate for language codes is the three-letter (known as 'Alpha-3') code ISO 639-2:1998 with more than 460 codes (Byrum, 1999):
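A sketch of normalising natural-language names to Alpha-3 codes is shown below. The mapping is a tiny illustrative subset of ISO 639-2; note that Welsh also has the bibliographic code 'wel' alongside the terminology code 'cym' used here:

```python
# A tiny illustrative subset of ISO 639-2 Alpha-3 codes. Welsh also
# has the bibliographic code 'wel' alongside the terminology code 'cym'.
ISO_639_2 = {
    "english": "eng",
    "welsh": "cym",
    "cymraeg": "cym",      # the Welsh name for Welsh maps to the same code
    "old english": "ang",  # a distinct code from modern English
}

def language_code(name):
    """Normalise a language name to an Alpha-3 code, returning
    'und' (undetermined) when the name is not recognised."""
    return ISO_639_2.get(name.strip().lower(), "und")
```

This illustrates the point made above: with codes, software can tell that 'Welsh' and 'Cymraeg' are the same language while 'English' and 'Old English' are not.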
Name formats and authority files

Names are one of the more problematic areas in which information gateway cataloguing rules need to make decisions about content. There are (in general) two main ways in which personal names can be ordered:
However, there are a number of variations that exist within each of these ways. There is a need for rules that deal with things like titles, pseudonyms and hyphenation. These can be extremely complex. Rules concerning 'headings for persons' in AACR2 (1988 rev.), for example, take up 54 pages. Similar rules for corporate bodies take up 41 pages. In addition, in some cases there will be a requirement to be able to distinguish between two persons (or organisations) with the same name. Rules like AACR2 usually achieve this by adding more information to the name itself, e.g. dates of birth and death and titles, with appropriate punctuation:

Author-Name-v1: Hsia, R. Po-chia, 1955-

Libraries have considerable experience of dealing with names in catalogues, as can be attested by the extremely full treatment of name entries in codes such as AACR2. The sharing of bibliographic records between institutions has additionally led to the foundation of authoritative lists of names (i.e. verified access points) with cross-references, known as name authority files. A number of name authority lists exist, mostly produced by national libraries or national bibliographic agencies, for example:
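Returning to name headings: the AACR2-style practice of disambiguating names by appending dates to the inverted form can be sketched as below. This is a simplified approximation of the layout shown above; it ignores titles, pseudonyms and the many special cases covered by the full rules:

```python
def name_heading(surname, forenames, dates=None):
    """Build a simplified AACR2-style inverted name heading,
    e.g. 'Hsia, R. Po-chia, 1955-'. A rough sketch only: titles,
    pseudonyms and other special cases are not handled."""
    heading = "%s, %s" % (surname, forenames)
    if dates:
        heading += ", %s" % dates
    return heading

heading = name_heading("Hsia", "R. Po-chia", "1955-")
```

Encoding such conventions in code (or in a cataloguing interface) is one way to keep name entry consistent across several cataloguers.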
At the present time name authority data tends to be national in origin, based on a variety of national formats and made available in a wide variety of ways, not always in electronic form. As one response to this problem, the AUTHOR project, funded by the Commission of the European Communities (DG XIII) as part of Computerised Bibliographic Record Actions (CoBRA), has investigated the feasibility of the international exchange of name authority data (Zillhardt and Bourdon, 1998). If information gateways want to implement name authorities, the most logical place to start would be with the relevant national file, possibly supplemented by reference to LCNAF. Authority files can also be used for things like geographical names or subjects. Indeed, the Library of Congress Subject Headings (LCSH) are probably the best example of a library-originated subject authority file.

Subject information

Subject information, in the form of keywords, classification scheme codes, subject heading terms and so on, forms an important part of the resource descriptions provided by information gateways. Subject information can form the basis of part of the search system, or - in the case of classification codes or terms from a subject hierarchy - can form part of the gateway's browse structure. As Vizine-Goetz (1998, p. 93) has said, the 'knowledge structures that form traditional classification schemes hold great potential for improving resource description and discovery on the Internet and for organising electronic document collections'. More information on these issues can be found in the chapter on Classification. Any cataloguing guidelines developed for information gateways need to contain information on the selected (or adapted) subject schemes and documentation will be required so that terms from these schemes can be added at the cataloguing stage. This may require reference to the published scheme itself or a link to the selected part being implemented.
So, for example, a gateway based on a limited implementation of the 21st edition of the Dewey Decimal Classification (DDC21) will need at least a list of all of the classification codes in use and their meaning. More detailed implementations may require the use of the published DDC21 manuals and the employment of suitably trained staff.
Cataloguing tools and interfaces

The creation of Internet resource descriptions for information gateways will largely take place via an interface or cataloguing tool. With some metadata formats it may be possible to create resource descriptions using text editors (e.g. for ROADS templates) or Web-based tools (e.g. DC-dot for Dublin Core in HTML and RDF). Ideally, however, information gateways need cataloguing interfaces that can be adapted for their particular needs, which contain, for example, the gateway's own subject schemes as defaults and include some help in the form of cataloguing rules and examples. In principle, it should be possible to embed most of the cataloguing rules developed for an information gateway inside the cataloguing interface. The interface should also be able to validate certain elements (e.g. language codes or dates) before records are added to the database, and to add certain administrative metadata automatically. Developing a catalogue interface, however, is a time-consuming and specialised task which is influenced by the choice of underlying software tools and metadata formats. The ROADS toolkit, for example, comes with a template editor which can be used for creating resource descriptions but this would in most cases require some customisation by the addition of guidelines for the use of subject schemes and other guidelines. Other metadata formats may have their own creation tools; for example, most MARC formats could be created using a proprietary library-based cataloguing interface.

Catalogue maintenance

Another important factor that needs to be considered is the ongoing maintenance of the information gateway database. One of the characteristics of Internet information is that it is subject to rapid (and unadvertised) change.
The content of Web pages can be frequently updated (not always for the better), their virtual locations (usually in the form of URLs) can change, and even IP addresses can expire or move to another - sometimes inappropriate - organisation. For these reasons, a considerable task for any information gateway is keeping its resource descriptions up to date. This will, in part, require the use of automated tools like link-checkers, but may also entail some periodic checking of information content (possibly based on 'expiry-date' administrative metadata or random sampling). In any case, resource descriptions will need to be periodically updated (or removed) and any cataloguing tools will need to facilitate this.
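A minimal link-checking sketch along these lines is shown below. The record field name 'URI' is an assumption, and a production link-checker would also handle redirects, retries and robots.txt; the usage example passes in a stubbed status function so that no network access is needed:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def check_url(url, timeout=10):
    """Return a rough status for one URL: 'ok', 'moved' or 'broken'.
    A sketch only: real link-checkers also handle redirect chains,
    retries and robots.txt."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return "ok" if resp.status < 300 else "moved"
    except HTTPError as e:
        return "moved" if 300 <= e.code < 400 else "broken"
    except URLError:
        return "broken"

def records_to_review(records, status_of=check_url):
    """Flag records whose 'URI' (a hypothetical field name) no
    longer returns 'ok' and therefore need manual review."""
    return [r for r in records if status_of(r["URI"]) != "ok"]

# Usage with a stubbed status function (no network required):
stub = lambda u: "ok" if u == "http://good.example/" else "broken"
flagged = records_to_review(
    [{"URI": "http://good.example/"}, {"URI": "http://bad.example/"}],
    status_of=stub,
)
```

Separating the status check from the review logic, as here, makes the maintenance routine easy to test and to swap for a more capable checker later.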
Conclusions
As we have seen, the creation and maintenance of resource descriptions (or cataloguing) is an important part of the role of any information gateway. Gateways, therefore, need to consider in detail any cataloguing requirements that they have. This will mean decisions being made on:
All of these decisions (and their associated activity) will require the input of specialised staff and considerable commitment in terms of time to produce (or adapt) some cataloguing guidelines, to implement a suitable cataloguing interface and to train those people who will carry out the cataloguing task itself. Of course, there are a growing number of gateways with experience of doing these things, so new gateways would be advised to build on this experience before developing new solutions.
Glossary
AACR2 - Anglo-American Cataloguing Rules, 2nd edition
References
CORC, http://www.oclc.org/oclc/research/projects/corc/index.htm
InterCat, http://purl.org/net/intercat
H. Alvestrand, RFC 1766, Tags for the identification of languages (Internet Engineering Task Force, Network Working Group, March 1995).
R. Braden, ed., RFC 1123, Requirements for Internet hosts - application and support (Internet Engineering Task Force, Network Working Group, October 1989).
R. Bradshaw, Cataloguing rules for the ADAM database: a procedural manual (ADAM, the Art, Design, Architecture & Media Information Gateway, 1997).
P. Bryant, 'Progress in documentation: the catalogue', Journal of Documentation 36 (2) (1980), 133-163.
J. D. Byrum, 'ISO 639-1 and ISO 639-2: international standards for language codes. ISO 15924: international standard for names of scripts', 65th IFLA Council and General Conference, Bangkok, Thailand, 20-28 August 1999.
D. H. Crocker (rev.), RFC 822, Standard for the format of ARPA Internet text messages (Internet Engineering Task Force, 13 August 1982).
M. Day, ROADS cataloguing guidelines (Bath: UKOLN, the UK Office for Library and Information Networking, 1998).
P. Deutsch, A. Emtage, M. Koster & M. Stumpf, Publishing information on the Internet with Anonymous FTP (Internet Engineering Task Force Internet-Draft, September 1994).
M. Dillon & E. Jul, 'Cataloging Internet resources: the convergence of libraries and Internet resources', Cataloging & Classification Quarterly 22 (3/4) (1996), 197-238.
M. Dillon, E. Jul, M. Burge & C. Hickey, 'The OCLC Internet Resources Project: toward providing library services for computer-mediated communication', in A. P. Bishop (ed.), Emerging communities: integrating networked information into library services (Urbana-Champaign, Ill.: University of Illinois at Urbana-Champaign, Graduate School of Library and Information Science, 1994), 54-69.
M. Gorman & P. W. Winkler (eds.), Anglo-American Cataloguing Rules, 2nd ed. (Ottawa: Canadian Library Association; London: Library Association Publishing; Chicago, Ill.: American Library Association, 1988).
Guidelines for the Use of Field 856 (Washington, D.C.: Library of Congress, Network Development and MARC Standards Office, 1997).
B. Holt, 'Presentation of UNIMARC on the Web: new fields including the one for electronic resources', 64th IFLA General Conference, Amsterdam, Netherlands, 16-21 August 1998.
ISBD(ER): International Standard Bibliographic Description for Electronic Resources: revised from the ISBD(CF): International Standard Bibliographic Description for Computer Files (UBCIM Publications, New Series, 17. Munich: Saur, 1997).
ISO 639-1:1988, Code for the representation of names of languages (Geneva: International Organization for Standardization, 1988).
ISO 639-2:1998, Codes for the representation of names of languages - Part 2: Alpha-3 code (Geneva: International Organization for Standardization, 1998).
ISO 8601:1988, Data elements and interchange formats - Information interchange - Representation of dates and times (Geneva: International Organization for Standardization, 1988).
E. Jul, InterCat year-end statistics (e-mail to OCLC Internet Cataloging project list INTERCAT@oclc.org, 4 January 1999).
D. M. Levy, 'Cataloguing in the digital order', Digital Libraries '95: the Second Annual Conference on the Theory and Practice of Digital Libraries, Texas A & M University, Austin, Texas, USA, 11-13 June 1995.
C. Lynch, 'Searching the Internet', Scientific American 276 (3) (March 1997), 52-56.
M. Münnich, 'German authority work and control', Authority Control in the 21st Century, Online Computer Library Center (OCLC), Dublin, Ohio, 31 March-1 April 1996.
N. B. Olson (ed.), Cataloging Internet resources: a manual and practical guide, 2nd ed. (Dublin, Ohio: OCLC Online Computer Library Center, 1997).
A. Sandberg-Fox & J. D. Byrum, 'From ISBD(CF) to ISBD(ER): process, policy, and provisions', Library Resources and Technical Services 42 (2) (1998), 89-101.
V. T. Sha, 'Cataloguing Internet resources: the library approach', The Electronic Library 13 (5) (1995), 467-476.
D. Vizine-Goetz, 'OCLC investigates using classification tools to organize Internet data', in P. A. Cochrane & E. H. Johnson (eds.), Visualizing subject access for 21st century information sources (Urbana-Champaign, Ill.: University of Illinois at Urbana-Champaign, Graduate School of Library and Information Science, 1998), 93-105.
M. Wolf & C. Wicksteed, Date and Time Formats (submission to the World Wide Web Consortium (W3C), 15 September 1997).
S. Zillhardt & F. Bourdon, AUTHOR project: final report (Paris: Bibliothèque nationale de France, 5 June 1998).
Credits
Chapter author: Michael Day
2.5. Subject classification, browsing and searching
Introduction
Classification schemes, keywords and thesauri are central features of the formal resource descriptions provided by your service. The appeal of information gateways is based not only on the guaranteed high quality of the selected resources, but also on the facilities for subject-based access to the collection. In particular, information gateways typically provide access for both searching and browsing. Browsing (through a directory-like structure) is usually based on subject classification schemes or, exceptionally, thesauri. There are many such classification schemes from which to choose. You will need to decide which scheme suits the purpose of your gateway and the requirements of your target user group.
Issues for gateway managers
|
|
This chapter should help you answer the following questions:
|
Classification schemes
|
What is subject classification?
Libraries have long experience of classifying resources, mainly books. The purpose of classification is to make it easier for users to find and retrieve resources. Subject classification is a method of describing resources by their subject. Universal classification schemes designed for use by libraries were first developed in North America during the nineteenth century. The most famous (and most widely used) scheme is the Dewey Decimal Classification (DDC), which was first produced for a small college library in 1876. Classification schemes differ from other subject indexing systems, such as subject headings and thesauri, in that they aim to create collections of related resources in a hierarchical structure. The use of notations or codes facilitates the creation of hierarchical subject trees. For example, using UDC we can create the following hierarchy (adapted from McIlwaine, 1995, p. 17):
By building a hierarchical structure, a classification scheme enables users to look for related items which might otherwise be missed. This facilitates browsing, both within a physical library and online. One advantage of an online system is that you can assign more than one classification number to a resource: since resources do not need to be put in numerical order on a shelf, they can be (virtually) kept in two places at once. An Internet service can easily offer several different classification 'views' of the same resources.
Types of classification schemes
Classification schemes can be broadly divided into:
All of these classification types are used to some extent on the Internet (Koch and Day, 1997). Universal schemes like DDC and UDC are used by many Internet services and are readily available in machine-readable form. Subject services, however, are more likely to use a subject-specific scheme.
Advantages of using a classification scheme for organising Web resources
The use of classification schemes offers one way of providing improved access to Web resources. It is not enough to build a collection of resources on the Web of a specific standard or relevant to a particular audience. It is also necessary to organise and present those resources in such a way that the user can retrieve all the relevant resources quickly and easily. There are many Web guides which present resources in some kind of listing, either alphabetical or divided into ad hoc subject categories. These lists can soon become long and cluttered. Classification schemes have therefore begun to replace less sophisticated ways of listing resources. A site which uses a classification scheme to organise knowledge demonstrates several distinct advantages over sites which do not (Koch and Day, 1997):
1. Ease of browsing
Classified subject lists can easily be browsed in an online environment. Browsing is particularly helpful for inexperienced users or for users not familiar with a subject and its structure and terminology. In addition, the structure of the classification scheme can be displayed in different ways as a navigation aid. The classification notation does not even need to be displayed on the screen, so an inexperienced user can have the advantage of using a hierarchical scheme without the distraction of the notation itself.
2. Narrowing searches and viewing related resources
When queries are limited to individual parts of a collection (filtering), the number of false hits is reduced, i.e. precision is improved.
Classification schemes are hierarchical and can therefore also be used to get an overview of resources covering broader or narrower topics as you move up or down the hierarchy. This offers users the opportunity to view related resources which may be relevant to their information needs.
3. Providing context
The use of a classification scheme gives context to the search terms used. For example, the problem of homonyms (words which have the same spelling but a different meaning) can be partly overcome, because the context of the broader subject area or discipline will in most cases unambiguously indicate their meaning.
4. Partitioning and manipulating databases
Large classified lists can be divided logically into smaller parts if required.
Using an established or standard classification scheme has further advantages:
5. Potential to permit multilingual access to a collection
Since classification schemes often use language-independent notations (numerical or alphanumeric), these notations can be linked to as many of the available translations of the classification terms as you need. This offers the possibility of searching for terms belonging to a particular notation in various languages, and it also allows for the creation of browsing sections in more than one language. Other languages can be added later with very little effort, and without the need to classify the resources again. DDC and UDC have good multilingual capability, as their notations are entirely numerical and their schedules have been widely translated (into as many as 30 languages). A version of a scheme in an appropriate language will not always be available, however.
6. Improved interoperability
The use of an agreed classification scheme could enable improved browsing and subject searching across databases.
7. Greater stability
An established classification does not usually become obsolete.
The larger schemes are undergoing continuous revision, although they are normally also formally published in numbered editions. Some classifications may have to be changed when a new edition of a scheme is published, but it is unlikely that every single resource will have to be reclassified.
8. Greater familiarity
Some classification schemes are well known by a large user group. Regular users of libraries will be familiar with at least part of one or more of the traditional library schemes. Members of a subject community are likely to be familiar with their (subject-specific) schemes as well. In addition, some classification schemes are available in machine-readable form. Internet services which use established classification schemes may therefore have an advantage over those which use a home-grown scheme or none.
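The two advantages of a language-independent notation discussed above - hierarchical browsing and multilingual captions - can be sketched in a few lines. In the sketch below, the codes are loosely DDC-like but both the codes and the captions are invented for illustration; real schemes define their hierarchies in published schedules.

```python
# Sketch: language-independent notations carry the hierarchy, while
# captions can be swapped per language. All codes and captions here
# are invented, not taken from any published scheme.

captions = {
    "en": {"6": "Technology", "62": "Engineering", "621.3": "Electrical engineering"},
    "de": {"6": "Technik", "62": "Ingenieurwesen", "621.3": "Elektrotechnik"},
}

def parent(code, lang="en"):
    """Parent = the longest shorter code that is a prefix of this one."""
    shorter = [c for c in captions[lang] if c != code and code.startswith(c)]
    return max(shorter, key=len) if shorter else None

def breadcrumb(code, lang="en"):
    """Build the browsing trail shown to the user, top level first."""
    trail = []
    while code is not None:
        trail.append(captions[lang][code])
        code = parent(code, lang)
    return " > ".join(reversed(trail))

print(breadcrumb("621.3"))        # Technology > Engineering > Electrical engineering
print(breadcrumb("621.3", "de"))  # Technik > Ingenieurwesen > Elektrotechnik
```

Because only the captions table is language-specific, adding a further language to the browsing pages means adding one more dictionary of translations, with no reclassification of resources.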
Disadvantages of using a classification scheme for organising Web resources
However, classification schemes also have some disadvantages:
1. Splitting up logical collections of material
Classification schemes often split up collections of related material, although this can be partly overcome with good cross-references and by assigning multiple class numbers to one resource.
2. The illogical subdivision of classes
Some popular schemes do not always subdivide classes in a logical manner. This can make them difficult to use for browsing purposes.
3. Delays in assimilating new areas of interest
Classification schemes, since they are usually updated through formal processes by organised bodies, often have difficulty in reacting promptly to new areas of study and changing terminology.
The most appropriate classification scheme for your service
There are many factors to consider before choosing the most appropriate classification scheme for your service. Comparing the different types of scheme is one useful approach.
1. Creating your own scheme versus using an existing scheme
When a new gateway is being developed, you may be tempted to invent a new classification scheme for it. Inventing a new scheme has some advantages, but may also create new problems. Advantages of creating a new classification scheme:
Creating a new classification scheme also has disadvantages:
Choosing an existing classification scheme avoids having to deal with some of the above issues. The scheme already exists, and no additional time or money is needed to develop it.
2. Established library classification schemes versus schemes developed for Internet usage
The established library classification schemes have developed over a long period of time, sometimes as long as 100 years. This means that their conception of the world can be outdated, and this may be reflected in their structure. For example, all universal schemes have had to take account of the rapid growth in electronics and computing in the second half of the twentieth century. Updating classification schemes takes a long time, and sometimes the updated versions lack consistency, with new concepts being placed under illogical headings. Because of their size, the classification schemes are not updated very often and, when they are, they tend to be updated one subject at a time. Traditional schemes can therefore be rather complex to use. The strength of general library classification schemes, however, is that they are universal: they are built to classify the entire world of knowledge. The schemes developed for Internet usage are of course relatively young, often developed over the last few years. This means that they are often still incomplete and continuously being updated, trying to cover new subject areas as they go along. These schemes mirror the modern and changeable world. Sometimes they concentrate on a few areas of interest, ignoring the rest; sometimes they try to cover the whole world in the same way as the universal library classification schemes. However, many home-grown schemes display severe weaknesses which hamper correct and efficient usage: failures in logic and hierarchy; incorrect subdivision of classes and application of multiple hierarchies; errors in terminology and in internal links and relationships between classes; and so on. There is also no requirement for subject services to use all layers of the classification hierarchy in an established system.
Some current schemes organise material based on the first three levels only of a decimal scheme like DDC.
3. Universal classification schemes versus subject-specific schemes
Universal classification schemes and subject-specific schemes are designed with different purposes in mind. A new gateway will need to choose a scheme relevant to the target audience for whom the service is being created. Where a gateway gives access to resources from all areas of knowledge, published throughout the world and in many languages, and intended to be offered to an international multi-disciplinary community of users, an existing universal scheme should be selected. If the service is a subject-specific one aimed at researchers within, say, the engineering community, it would be better to use a subject-specific classification scheme, if a suitable scheme is available. An alternative might be to use the appropriate part of a universal scheme. Problems will occur for services covering subjects for which several different schemes exist (e.g. the earth sciences) or services which cover more than one subject area (e.g. the social sciences). In these cases, mapping and linking between schemes, the use of concordances for conversion, or extensions of a scheme may help.
4. National (monolingual) schemes versus international multilingual schemes
The choice between a national monolingual scheme and an international multilingual scheme also depends on your subject and target group, as well as on the purpose of the service. If a gateway aims only at a single user group within a country or at a specific language community, and does not see any other potential users for the service, it could probably successfully use a national or language-based classification scheme. You would also possibly gain from the familiarity of a nationally-based scheme if you use one which is common in libraries. If, on the other hand, a gateway aims at a user group which is international (or which is intended to become international in the future), it would be better to use an international multilingual scheme, if one is available. If a gateway is thinking of cross-browsing or cross-searching with other gateways, it needs to consider the possibility of mapping to other schemes at this stage. Note that some national schemes are available in a multilingual version: for example, the Nederlandse Basisclassificatie, the national scheme designed for use within the Dutch national cataloguing system, is available in English and (adapted) German versions as well. The English version is used on the Web in DutchESS; the German one is used by some German libraries which have adopted the Dutch Pica library cataloguing system.
Making your choice: issues to consider
Your decision about the classification scheme you are going to use should also entail exploring the following important issues:
1. The scope and coverage of your service, and its primary target audience
The scope of the service, its subject, language and geographic coverage, and its target user population should be the most important considerations in the choice of classification scheme. If the service includes all subjects and is aimed at a wide audience of Internet users, a universal classification scheme would be a good choice. If, however, the collection focuses on a limited subject area and there is a suitable international subject-specific scheme available, this should be used; if your service is a national service, you may want to consider a national general scheme. If no comprehensive scheme covering the geographic area or subject is available, a classification structure will have to be created especially for the service, either from scratch or (preferably) by extending an existing scheme.
2. Maintenance issues
The decision concerning which scheme to adopt may also be affected by the level of familiarity that your staff have with a specific scheme, as well as by the maintenance level provided by the owner of the classification system. If the staff are not familiar with the chosen scheme, this could slow down the growth of the gateway in the initial period.
3. Quality, status and availability of the scheme
Questions to be asked regarding this issue are:
4. Interoperability issues
The important consideration here is whether there are any mappings available between the candidate schemes and other established subject-specific or universal schemes which can secure interoperability with other services, now or in the future.
5. Costs
How do the costs of the different schemes and methods compare? This includes costs for information specialists, technicians and (if necessary) translators, as well as for the servers and software being used. The initial set-up of a service will require more investment, because all the issues discussed here need to be investigated and the system chosen will have to be set up. Once the service is up and running, the costs will be lower.
Amending and mapping classification schemes
Implementing classification schemes may present you with a number of issues. You may wish to adapt, restrict or extend the scheme you have chosen. There are also a number of very good reasons why you may want to map between multiple schemes. This section briefly summarises these issues.
Adapting a classification scheme
For classification schemes to be effective as browsing aids in subject gateways, they need in some cases to be reduced in complexity and/or reordered. A detailed record of the changes made should be kept, so that the locally used variant can be adapted easily whenever the original scheme is updated. For instance, when the hierarchy is rearranged, a mapping to the equivalent places in the original scheme should be kept. There are several ways in which classification schemes can be adapted:
1. Omitting empty classes
A very unequal distribution of resources throughout a classification scheme can be confusing for the user and frustrate the browsing process. Omitting empty classes may be necessary in order to create a user-friendly browsing structure. If there are only a few empty classes or branches, the best policy is to mark the classes as empty in your browsing structure and navigation area (as done in EELS).
The system will still appear as a coherent and logical whole. If there are many empty areas, the display could hide the empty classes. Our advice, however, is to classify the individual resources in as much detail as possible in the chosen system, but to display them for the time being in the broader/parent category. This allows for a fully expanded display as soon as there are enough resources for a meaningful finer substructure, without requiring any reclassification effort. In any case, all resources should be displayed in order to keep consistency between browsing and searching the service.
2. Rearranging hierarchies
It may be necessary to rearrange the hierarchy to make the browsing structure easier to use. Sometimes the hierarchy needs a more logical arrangement to help users find their way through it. Sometimes an important 'branch' deep down in the tree structure needs to be lifted closer to the top of the hierarchy so that it can be found more easily. In the end, if there is a potential conflict between the purpose of the gateway and the purpose of the classification scheme, it is the classification scheme which needs to be rearranged. If you are planning to include cross-browsing facilities in your gateway, however, rearranging hierarchies should be avoided, as it complicates interoperability with other systems.
3. Renaming captions
Renaming captions is another way of adapting a classification scheme. In a gateway designed for schoolchildren, for example, a classification scheme may use complicated technical terms which would be difficult for the target audience to understand. In these cases, renaming adds value and user-friendliness to the service (cf. DDC for children and DDC for end-users). The renaming should be done in a similar way throughout the service in order to keep the service consistent and the language level the same.
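The advice on empty classes above - classify at full depth, but display resources in the parent class until a class fills up - can be sketched as follows. The records, notations and the threshold of three resources per class are all invented for illustration.

```python
# Sketch: classify records in as much detail as possible, but display
# them in the nearest ancestor class with enough resources. The data
# and the threshold are illustrative only.
from collections import defaultdict

MIN_PER_CLASS = 3

records = [
    ("Gear design notes", "621.8"),
    ("Bearing handbook", "621.8"),
    ("Power electronics FAQ", "621.3"),
    ("Electric motors primer", "621.3"),
    ("Transformer basics", "621.3"),
]

def count_below(code):
    """Resources classified in this class or any of its subclasses."""
    return sum(1 for _, c in records if c == code or c.startswith(code + "."))

def display_class(code):
    """Lift a sparse class to its parent (drop the last notation segment)."""
    while count_below(code) < MIN_PER_CLASS and "." in code:
        code = code.rsplit(".", 1)[0]
    return code

pages = defaultdict(list)
for title, code in records:
    pages[display_class(code)].append(title)

# 621.8 is too sparse, so its two records are shown under 621;
# 621.3 has enough records to appear as a browsing page of its own.
print(dict(pages))
```

Because each record keeps its detailed classification, the finer browsing structure can be switched on later without any reclassification, exactly as recommended above.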
Extending a classification scheme
Sometimes an existing classification scheme is not detailed enough in particular areas, or omits subject categories closely related to the gateway's coverage. If these are important areas for the gateway, then the classification scheme needs to be extended. There are several different ways of extending a scheme:
Again, document your extensions carefully so that you can identify these parts of your service and exclude them when carrying out operations based on your original scheme, such as adding resources from another service or cross-browsing. Remember that any mappings also need to be changed when changing your local scheme, and that you will have to maintain all the changes throughout the lifetime of your service. The extensions may be very useful and necessary for the service, but they always involve extra costs, for instance in the form of extra work when adding resources to the service.
Conversion and mapping between classification schemes
Mapping between different classification systems will become an increasingly important activity for subject services, in order to perform the following tasks (among others):
Producing such a mapping is often difficult and time-consuming because of theoretical, conceptual, cultural and practical differences between the systems. Mappings have to apply many different types of equivalence; one-to-one relationships are certainly not sufficient. The mapping can be carried out between two or more systems directly, or as a mapping to a universal system like DDC used as a 'switching system' or 'interlingua'. The latter alternative is needed when trying to secure wide interoperability or when there is only a small overlap between the classifications used. If there are no 'official' conversion tables available, an improvement in the task of classification could still be made by extracting, from existing databases, linkages between different classification schemes, or between indexing terms and classifications for the same object, and using these linkages to construct a conversion algorithm. In this field, neither theory nor practice is very mature, so we recommend that you seek advice and assistance from experts in the area.
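The mechanics of a concordance can be sketched as below. The scheme codes and the mapping itself are invented; as noted above, a real mapping would also need to record the type of each equivalence (exact, broader, narrower) rather than treat every link the same.

```python
# Sketch: a concordance from a local subject-specific scheme to a
# universal scheme, allowing one-to-many links. All codes invented.

concordance = {
    "ENG-EL": ["621.3"],           # roughly exact equivalence
    "ENG-MECH": ["621", "621.8"],  # one local class maps to two codes
}

def translate(local_code):
    """Expand a local subject filter into partner-scheme codes,
    e.g. before cross-searching another gateway."""
    return concordance.get(local_code, [])

def invert(mapping):
    """Derive the reverse direction from the same table."""
    reverse = {}
    for src, targets in mapping.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    return reverse

print(translate("ENG-MECH"))  # ['621', '621.8']
print(invert(concordance))
```

Maintaining the mapping as data in one direction and deriving the other keeps the two directions from drifting apart as either scheme is revised.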
|
Keywords and thesauri
|
|
Why use keywords?
In addition to the use of classification within an information gateway, information retrieval can be enhanced through the insertion of terms, or keywords, in a keyword field within each record. Such a practice has been common in the library world for many years as a means of helping users to search abstracting and indexing services and library catalogues. While classification of the records in an information gateway allows the presentation of groups of related documents in well-defined subject areas, keywords are used to give a detailed description of the concepts covered by the individual document and are mainly used as an aid to searching. The concepts covered by keywords are usually more specific than those of classes within a classification scheme, and consequently several keywords may be needed to describe a document fully. Individual keywords may therefore describe sub-topics within the page or site catalogued, whereas usually only one or two class numbers will be assigned to describe the overall subject content. As noted elsewhere, keywords are generally applied to records as an aid to searching the catalogue (although they may also occasionally be used as a method of browsing - see the section on thesauri). Depending on the type of keyword system used and the policy adopted by the gateway in applying it, the added terms should improve the accessibility of individual records. They may also aid searchers by providing a feel for the philosophy and likely coverage of the gateway. An important function is to suggest to users new or more focussed terms with which they can search.
Controlled versus uncontrolled
It is strongly recommended that some sort of keyword system be used when cataloguing sites for an information gateway, but it is important to decide whether or not to use a controlled vocabulary as the source of the keywords used.
A policy involving the use of uncontrolled vocabularies would consist of inserting into a keyword field terms relating to the subject content of the page or site, which may or may not be contained within the title of the document or included in any description that may have been applied to it. The keywords used will usually be suggested by an inspection of the site being catalogued or by the cataloguer's knowledge of the subject area. If the keyword field is included in your search, then such keywords should improve recall. The drawback with the use of uncontrolled keywords is that there are no standard, agreed terms for particular topics. This can cause problems not only with different spellings but also with the use of different synonyms or near-synonyms to represent the same topic. Thus a search for the term 'labour relations' will not pick up records indexed with the term 'industrial relations'. Recall can be further improved by the correct and comprehensive application of a controlled vocabulary of standardised keywords. As with classification systems, controlled vocabularies may be general in nature, such as the Library of Congress Subject Headings (LCSH), or else devised for one particular subject domain, such as the MeSH vocabulary devised for the field of medicine. Since the majority of controlled vocabularies have been created for use with journal abstracting services, a suitable subject-specific system can usually be found by studying the major services in your subject area. Permission from the authors of the vocabulary should of course be obtained before using it within your gateway. A problem with the use of controlled vocabularies is the constantly evolving nature of human knowledge, resulting in the continual development of new terminology. As with classification schemes, major vocabularies periodically appear in new editions incorporating new terms, but it may happen quite frequently that a term cannot be found to describe the required content.
There may also be problems with the degree of specificity of the scheme; that is, a term which is sufficiently specific may not be found. The above problems can be alleviated by adding uncontrolled terms to records where a suitable controlled term cannot be found. A consequence of using a controlled vocabulary is the need to make users aware of the vocabulary so that they are able to search on the allowed or preferred terms. This adds an extra complication to the gateway's interface, since the user will need to be able to search a version of the vocabulary for a suitable term if they are to make the fullest use of controlled vocabulary indexing. If the user is expected to search a copy of the vocabulary to select terms for a search, it is best to maintain a local copy of it which features only those terms which are present in your catalogue. This is particularly the case when the vocabulary is a large one and many terms within it would result in 'no hits'.
Indexing policy
The search system your service uses and the search options you make available to the end-users will, of course, have a critical effect on the users' experience of the service. However, as mentioned previously, the indexing policy of the gateway and how the keywords are added will also have a significant effect. As well as deciding whether to supplement terms from a controlled vocabulary with uncontrolled terms, an indexing policy should stipulate to what degree of specificity documents are to be indexed. The main issue here is that in cases where only keywords representing the main topics of the document are applied, the precision of a search can be increased if the search system has a mechanism for restricting searches to the keyword field. It is generally recommended that you include all relevant keywords, including those occurring in the document's title and description, in the keywords field.
However, if you decide not to restrict searches to a keyword field, you should be aware of the potential problems this might cause. Search results are sometimes displayed using ranking mechanisms which look at the number of times a searched-for keyword occurs in each record found and use this to order the results. Repeating terms already used within the description, for instance, may skew this process.
Thesauri - hierarchical controlled vocabularies
Controlled vocabularies may consist of large numbers of terms; they are also likely to comprise terms which are related to each other in various ways, particularly in broader/narrower relationships. Most of the major controlled vocabularies consequently have their terms arranged into hierarchies very similar to those of classification schemes. The most common relationships between terms are:
The HASSET thesaurus produced by the Data Archive at the University of Essex, as used in the Social Science Information Gateway (http://www.sosig.ac.uk/roads/cgi/thesaurus.pl)
A hierarchical vocabulary or thesaurus makes it much easier both for the indexer to add relevant terms to the record and for the catalogue user to search on them. In principle, the user can begin at a top-level term and browse down through the thesaurus until they come to the term closest to the topic in which they are interested. Some method for searching the thesaurus by keyword will also be available. In practice, a combination of searching the thesaurus and then browsing a small part of it will often give the user the best results. The hierarchical structure is also useful in providing an overview of the structure of the subject domain (in a subject-specific system) for users who are unfamiliar with it, as with the browse structure derived from a classification scheme. It may also be possible to use a thesaurus in place of a classification scheme for browsing a catalogue, but the structure may not be as suitable for browsing as that of a classification scheme built for the purpose. The figure above shows the medical gateway OMNI (http://www.omni.ac.uk/search/thesaurus/), which uses the MeSH subject headings to index its records. Selecting a particular term within the thesaurus produces a display of all records which contain this term.
Multilinguality
You may wish to create your own multilingual database which will allow users to perform searches within the catalogue even though the original language of the record is unknown to them. Another approach would be to allow several separate databases in different languages to use the same thesaurus. As with classification schemes, it is possible for the terms within a thesaurus to be represented by a unique identifier.
If such a notation is used within catalogue records, as well as or in place of the terms themselves, the display of keywords in records (or within the thesaurus) can be done in any number of different languages. However, any multilingual approach will require a great deal of time and effort - which is one reason why there are very few such multilingual services available. |
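A thesaurus of the kind described above is, at its core, a set of term-to-term links. The tiny sketch below reuses the 'labour relations'/'industrial relations' example from the keywords discussion; the terms and links are illustrative and not taken from HASSET or MeSH.

```python
# Sketch: broader-term (BT) links plus USE references from
# non-preferred entry terms to preferred terms. Vocabulary invented.

broader = {            # term -> its broader term (BT)
    "industrial relations": "employment",
    "trade unions": "industrial relations",
    "strikes": "industrial relations",
}
use = {"labour relations": "industrial relations"}  # USE references

def preferred(term):
    """Map a non-preferred entry term to the preferred term."""
    return use.get(term, term)

def narrower(term):
    """Invert the BT links to list narrower terms (NT) for browsing."""
    return sorted(t for t, bt in broader.items() if bt == term)

print(preferred("labour relations"))     # industrial relations
print(narrower("industrial relations"))  # ['strikes', 'trade unions']
```

Only the BT links need to be stored: the narrower-term display is derived by inversion, and a USE table lets a search on 'labour relations' still find records indexed under 'industrial relations'.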
Staff issues
|
|
Subject classification and indexing are activities that in the library environment have been carried out by various trained professionals: subject specialists, cataloguers, information specialists or maintainers of (specialist) bibliographic databases. The quality of any browsing structure depends on the accuracy of the classification. The correct assignment of classification codes, keywords or thesaurus terms requires knowledge of the subject area as well as of the keyword system or classification scheme that is used. The process of assigning terms can be time-consuming. Once you have decided that you want to add keywords and/or classification codes to the resource descriptions in your gateway, you will have to decide who among your staff has the necessary skills. This should be considered in relation to the question of who is going to be responsible for selection and/or cataloguing of the resources. One possibility is to let the same people select, index and catalogue the resources, which may be efficient; another option is to let people with different backgrounds and skills do the various tasks, which may make better use of the individual skills of various professionals. A few possibilities:
|
Browsing and searching
|
|
The methods for classification and subject indexing discussed so far should be evaluated in terms of their use in enhancing the search and browse facilities in your gateway.
Browsing
Most services offer some kind of browsing facility. This may be based on an established classification scheme, a home-grown scheme, or some controlled vocabulary. The structure is typically presented to the user as a hierarchy, starting from a list of terms and narrowing down until the user arrives at a list of resources. A list of resources may also be presented at each stage of the hierarchy. Probably the best way to create a browsing structure is to use a classification scheme. Apart from providing a basis for the browsing structure, the numerical codes, as well as the terms in whatever languages they are available, may be used for searching purposes as well. Numerical codes used for classification need not be displayed on the browsing pages. As noted previously, thesauri with explicit and complete hierarchical structures are also suitable for this purpose.
Searching
Many services offer 'advanced' search options, where searches on formal attributes (author, title) can be combined with terms specifying the subject of the resource. The latter may be uncontrolled keywords or terms taken from thesauri, subject headings, authority files and other vocabularies. Searching free-text descriptions may also provide an additional way of finding resources, either in combination with controlled keywords and/or classification codes, or in searches restricted to this field. Classification schemes, although mostly used to provide a browsing structure, may also be used to enhance searching. These search options can be integrated in various ways in the user interface of your service. Sections of the classification scheme can be offered as a filter on the search, limiting the results of the query to a certain subject category of the database.
The best way to do this is probably to offer a list of all alternative sections/classifications for selection, allowing the user to choose either one or several sections. An expert alternative would be to offer the classification field for direct searching with a truncation option, if the notation is made visible. On the browsing pages a search option could be offered limiting the search to the currently viewed class and the subclasses below it. EELS and Yahoo! are examples of this approach. Harvesting the documents in your service (and/or in your subject area in general) and providing a full-text index are other ways of expanding the services offered by your gateway. The user could then choose to search the record descriptions, the full-text database, or both. The latter would of course increase recall (perhaps dramatically), but reduce precision. One example of cross searching a catalogue with a harvested index can be seen at http://eels.lub.lu.se/aeels/search.html
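As a concrete illustration of using classification codes as a search filter with truncation, the sketch below matches records whose class notation begins with a given prefix, so that a broad code also retrieves its subclasses. The record structure and codes are invented examples, not any particular gateway's format.

```python
# Sketch: truncation search on classification codes. A search on a broad
# code prefix (e.g. "62") also matches narrower codes such as "621.3".
# The record structure and codes here are hypothetical illustrations.

def filter_by_class(records, prefix):
    """Return records with at least one class code starting with `prefix`."""
    return [r for r in records
            if any(code.startswith(prefix) for code in r["class"])]

records = [
    {"title": "Power engineering portal", "class": ["621.3"]},
    {"title": "Mining journal index",     "class": ["622"]},
    {"title": "General science hub",      "class": ["5"]},
]

hits = filter_by_class(records, "62")
print([r["title"] for r in hits])
```

The same prefix test could equally be applied as a filter on an ordinary fielded search, restricting results to one section of the scheme.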
Cross-browsing and cross-searching
Some subject areas are currently covered by more than one gateway; engineering, for example, is covered by EELS, EEVL and AVEL. This can be confusing for users, who need extensive knowledge of all the existing gateways to decide which one(s) are most likely to answer their question. One gateway may be more suitable than another for a particular subtype of resource, but users will have to compare the various gateways to get to know their strong and weak points, their exact coverage, biases and so on. The same problems arise for people interested in inter-disciplinary resource discovery. A possible way out of this dilemma from the service's point of view is to opt for more co-operation with other services in the same subject area. One way to co-operate is to enable the cross-searching and/or cross-browsing of gateways. Cross-browsing two or more gateways is potentially a useful way of combining logically separate or distributed services, but it is difficult to achieve in practice. The gateways have to use identical classification schemes and the classification codes must be the same, so that a combined service can be generated, enabling a user to browse everything within the same virtual space; if identical schemes are not used, this becomes extremely difficult, if not impossible. Furthermore, classification is often a subjective activity and this would affect how combined subject gateways could be browsed. Nevertheless, cross-browsing through visible links between the browse sections of two or more gateways, without hiding their independence, can be accomplished by mapping methods as described previously; DESIRE II is currently testing different methods. Cross-searching is relatively easy to provide in a networked environment, especially where the same search and retrieval protocols are in use. 
The resource description format has to be similar, though, and fielded search requires in addition semantic equivalence between the content of the fields in all services. Cross-searching has been tested by the ROADS project and can already be implemented in gateways based on the ROADS software (Kirriemuir et al., 1998). Cross-searching of information gateways poses a problem for the use of controlled vocabularies. As with cross-browsing using classification schemes, cross-searching only becomes possible if either the different catalogues use the same controlled vocabulary or if a mapping has been made between two or more different schemes. The latter possibility poses the same problems as are found when cross-mapping classification schemes, and clearly it would be easiest if agreement could be reached on the best vocabularies to use within particular subject areas. Cross-searching and cross-browsing are more extensively covered in the Interoperability chapter. The User Interface Implementation chapter will tell you more about how to present browse and search facilities in your user interface. |
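The mapping approach mentioned above can be sketched very simply: a lookup table translates a class code from one gateway's scheme into the equivalent code(s) in another's, and the translated codes are used to pull matching records from the second catalogue. Both the scheme codes and the mapping table below are entirely invented, for illustration only.

```python
# Hypothetical sketch of mapping between two classification schemes so
# that gateways using different schemes can be cross-browsed. The codes
# and the mapping table are invented examples, not real schemes.

MAPPING = {  # scheme-A code -> equivalent scheme-B code(s)
    "A-eng":      ["B62"],
    "A-eng-elec": ["B621.3"],
    "A-chem":     ["B54", "B66"],   # mappings may be one-to-many
}

def translate(code):
    """Translate a scheme-A class code into scheme-B codes."""
    return MAPPING.get(code, [])

def cross_browse(code, catalogue_b):
    """Fetch scheme-B records matching a scheme-A browse class."""
    targets = set(translate(code))
    return [r for r in catalogue_b if r["class"] in targets]

catalogue_b = [
    {"title": "Electrical engineering links", "class": "B621.3"},
    {"title": "Organic chemistry index",      "class": "B54"},
]
print(cross_browse("A-eng-elec", catalogue_b))
```

In practice, as the text notes, building and maintaining such a table is the hard part: mappings are rarely one-to-one and reflect subjective classification decisions.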
Future developments - automated solutions
|
|
Automatic classification
As traditional classification is a time-consuming and expensive process, investigations into the use of automated solutions are clearly worthwhile. At the same time, classification is an activity where a significant level of human expertise, abstract thinking and understanding is needed, and this is not easy to replace with artificial intelligence or expert systems. There are no known examples of traditional library classification being undertaken completely by computer software. Knowledge structuring on the Internet has to cope with far larger numbers of resources, exponential growth rates and a high risk of change in existing documents. This is the background to a growing number of research projects and experimental systems which are trying to support knowledge-structuring activities on the Internet with automatic methods. Most of these projects use methods of derived indexing, i.e. they extract information from the documents and then use it for structuring tasks. Automated classification will probably not replace intellectual classification as far as quality subject services are concerned, but will rather support and complement selection and subject indexing efforts. Intellectual classification is always needed to validate and improve the automatic methods. However, robot-generated databases, as an add-on to quality services in a subject area, will be automatically classified. One practical goal in DESIRE II is to explore simple applications of automated classification methods on a robot-generated subject index to the Web. Many different tests will be carried out on the 'All' Engineering (AE) robot-generated database of engineering documents from the Internet. The effort required will be studied and the resulting outcomes evaluated. A pilot service of the 'All' Engineering Web index will offer a full classification and browsing structure with the most suitable solution found during the project. 
In addition, a comprehensive state-of-the-art report on projects, methods, alternatives and problems concerning automatic classification will be presented. The results of DESIRE II will be included in the next edition of this handbook.
Clustering
Clustering is a method which, like classification, aims to bring together groups of closely related documents. However, clustering is an automatic process which groups documents according to criteria expressed in an algorithm. The groups are normally not (hierarchically) related to each other and can be of very different sizes. The subject covered by a cluster is very hard to describe. Every time new documents are added to the collection the clusters have to be recalculated and the outcome can differ: documents can frequently move to other clusters. Clustering (a form of derived, a posteriori classification) should be contrasted with automatic classification methods that assign classes from established (a priori) classification systems. Clustering is not suitable for presenting a stable structure for browsing large gateways in which documents need to be grouped into clearly defined and related subject sections; indeed, it is not meant to be used for that purpose. |
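To make the contrast concrete, here is a minimal, hypothetical sketch of derived clustering: documents are grouped purely by keyword overlap, with no pre-existing scheme. The similarity measure, threshold and documents are arbitrary illustrations; real clustering algorithms are considerably more sophisticated.

```python
# Minimal illustration of derived (a posteriori) clustering: documents
# join groups by keyword overlap alone. Adding or changing documents can
# change the resulting clusters, which is why the text warns that
# clustering gives no stable browsing structure.

def jaccard(a, b):
    """Similarity of two keyword sets: shared terms / all terms."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: each document joins the first
    cluster whose seed document it sufficiently resembles."""
    clusters = []
    for name, terms in docs.items():
        for c in clusters:
            seed_terms = c[0][1]
            if jaccard(terms, seed_terms) >= threshold:
                c.append((name, terms))
                break
        else:
            clusters.append([(name, terms)])
    return [[name for name, _ in c] for c in clusters]

docs = {
    "d1": {"bridge", "steel", "load"},
    "d2": {"bridge", "concrete", "load"},
    "d3": {"thesaurus", "indexing", "vocabulary"},
}
print(cluster(docs))
```

Note that nothing in the output tells you what subject a cluster covers; that is exactly the weakness the paragraph above describes.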
Further information
|
|
A more detailed analysis of the use of classification schemes in Internet resource description and discovery and a list of services using them can be found in the DESIRE I report produced by Koch and Day (Koch and Day, 1997). This report describes the use of several classification schemes on the Internet in some detail and provides an introduction to the use of automated classification techniques on the Internet. Another useful Web page which lists some Internet-based services that use classification schemes for organising resource discovery services is Gerry McKiernan's Beyond Bookmarks page (McKiernan, 1996 and ongoing). |
Glossary
|
|
Assigned indexing Manual addition of meaningful terms to the records in a gateway to facilitate searching, usually taken from a pre-existing controlled vocabulary (see also derived indexing)
|
References
|
|
Biz/ed, http://www.bized.ac.uk/
DESIRE, http://www.desire.org/
EELS, http://eels.lub.lu.se/
OMNI, http://www.omni.ac.uk/
SOSIG, http://www.sosig.ac.uk/
D. Hiom, Mapping classification schemes (Bristol: SOSIG, 1998).
E. Miller, P. Miller & D. Brickley, Guidance on expressing the Dublin Core within the Resource Description Framework (RDF), 1999.
J. Kirriemuir, D. Brickley, S. Welsh, J. Knight & M. Hamilton, 'Cross-Searching Subject Gateways - The Query Routing and Forward Knowledge Approach', D-Lib Magazine (January 1998).
T. Koch & M. Day, The role of classification schemes in Internet resource description and discovery (DESIRE project: UKOLN, Bath, 1997).
T. Koch, 'Nutzung von Klassifikationssystemen zur verbesserten Beschreibung, Organisation und Suche von Internet Ressourcen', Buch und Bibliothek 50:5 (1998), 326-335.
T. Koch, A. Ardö & L. Noodén, 'The construction of a robot-generated subject index', EU Project DESIRE II D3.6a, Working Paper 1, 1999.
T. Koch & D. Vizine-Goetz, 'Automatic Classification and Content Navigation Support for Web Services. DESIRE II co-operates with OCLC' in Annual Review of OCLC Research 1998 (1999).
T. Koch, Controlled vocabularies, thesauri and classification systems available in the WWW (ongoing).
I. C. McIlwaine, Guide to the use of UDC: an introductory guide to the use and application of the Universal Decimal Classification, rev. ed. (The Hague: International Federation for Information and Documentation (FID), 1995).
G. McKiernan, Beyond bookmarks: schemes for organising the Web (Iowa State University, 1996 and ongoing). |
Credits
|
|
Chapter authors: Phil Cross, Michael Day, Traugott Koch, Marianne Peereboom, Ann-Sofie Zettergren |
2.6. Collection management |
||||
|
Introduction
|
|
This chapter will look at some of the day-to-day administrative tasks required for running and maintaining an information gateway and the staff effort required for these tasks. Whilst setting up and configuring a database for a gateway is labour intensive, it is a one-off task. The longer-term and time-consuming work lies in creating and maintaining the collection: notably, in keeping the records up to date and error free. An out-of-date collection of resource descriptions is of little use to anyone and may even be harmful to users. It is important that sufficient staff effort is allocated for regular housekeeping duties, the main ones being:
The Internet is a volatile and fast-changing environment; resources and information that are available today may not be available tomorrow. It has been estimated that at any one time between 5 and 8% of the Web's content is unavailable (Pitkow, 1998). There may be a number of reasons for a resource being unavailable, ranging from networks being out of action, servers being out of order or information being updated, to the resource being removed permanently from the network. Whatever the reason, resources that are not available should be removed from your collection (if only on a temporary basis while the problem is solved). Similarly, Internet resources do not tend to be static; they grow and change on a regular basis. Unless resource descriptions are checked routinely, you may find that the records bear no resemblance to the resource itself, which may have changed or expanded beyond recognition within a few months or weeks. |
Maintaining collections
|
|
There are various tasks involved in making sure that an information gateway's collection maintains its integrity:
Validating records
A basic housekeeping duty is to ensure that catalogue records are as accurate as possible, not only in terms of the factual information they provide about a resource, but also in terms of the content of the record itself, e.g. making sure they do not contain spelling mistakes, that cataloguing guidelines are adhered to, etc. There are various internal procedures which can help gateways maintain accuracy within their records. These include:
For further information on ensuring accuracy and consistency within the collection see the chapter on cataloguing.
Link checking
Much of the information available over the Web is intentionally ephemeral in nature, designed only to be useful in the short term (e.g. TV listings, news bulletins, price lists). The average life span of a Web document is estimated at around 50 days, with HTML files being modified or deleted more frequently than images or other media (Pitkow, 1998). Gateways generally try to ensure that the resources they catalogue will have a degree of longevity and often include URL stability as one of their selection criteria. However, the inconstant nature of the Web means that it is still necessary to check resources regularly and update the records of those that have moved, are temporarily unavailable, or have been permanently deleted from the Internet. It is important to have collected contact information about the administrators or maintainers of the sites on which the resources reside. When a resource is unavailable, sending an email message to the administrator is often the quickest way to find out what the problem really is and whether it is temporary or permanent. Automatic link checking software is available to help gateways keep a check on the resources described within their catalogues. The programs generally work by checking each of the URLs (often by issuing an HTTP 'HEAD' request for each page) and compiling a report of any errors they find. The software can normally be scheduled to run at regular intervals (ideally at least once a week) and can be set to run at 'quiet' times, e.g. overnight, to reduce the load on the network. Once the error report has been generated, it usually then requires human effort to go through the report and decide which of the resources should be edited or removed from the catalogue. 
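The general pattern of such a link checker can be sketched as follows. This is a minimal illustration, not any particular package: the URLs and report structure are invented, and a real run would call `head_status` over the live catalogue rather than use the canned results shown here.

```python
# Sketch of the link-checking pattern: issue an HTTP HEAD request per
# URL, collect failures, and build an error report grouped by status
# code for a human to work through.

import urllib.request
import urllib.error

def head_status(url, timeout=10):
    """Return the HTTP status for a HEAD request (or the error code)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def build_report(results):
    """Group (url, status) pairs by status code, ignoring successes."""
    report = {}
    for url, status in results:
        if status != 200:
            report.setdefault(status, []).append(url)
    return report

# Canned example statuses (a real checker would call head_status):
results = [
    ("http://example.org/ok.html",   200),
    ("http://example.org/gone.html", 404),
    ("http://example.org/priv.html", 403),
    ("http://example.org/lost.html", 404),
]
print(build_report(results))
```

Separating the report-building step from the fetching makes it easy to schedule the fetch overnight and process the grouped report later, as described above.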
Working through an error report is much like detective work; you need patience, information-finding skills and knowledge of the Internet to track down the problems and put them right. As well as commercial software packages there are a number of link checking programs available in the public domain (freely available) or as shareware packages (for a small fee). For a listing of some link checking shareware programs available see:
What do those error codes really mean?
You will sometimes see error codes when you are attempting to connect to Web pages or looking at the output of link checking reports. These are HTTP status codes and, whilst they can appear frustratingly cryptic, they can tell you a lot about the type of problem you are encountering.
404 - Page Not Found
This is the most common error code that gateway administrators will come across. Web site maintainers often change the structure of their sites as the information they provide grows or as the maintainers get new ideas about how to arrange and present the information. One of the most common reasons for a 404 error is simply that the resource has been moved to a different part of the site. To find the new location you can often move systematically up the directory structure of the URL, deleting the final segment of the path at each trailing slash (/), until you find a link to the resource. Sometimes the resource may have moved to another Web site altogether (this often happens when the resource is located on a commercial site); it is worth doing a search on one of the big search engines (such as AltaVista) to try to locate its new address. In the worst case, the resource has been deleted permanently and the record should be removed from the collection. If you cannot locate the resource simply by looking around the site, an email message to the administrator will often solve the mystery. Some of the other frequent error codes are:
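The directory-walking tactic for 404s can be automated. The sketch below (with a made-up URL) generates the chain of parent URLs by removing one path segment at a time, which you can then try in turn:

```python
# Helper for 404 detective work: yield successively shorter URLs,
# trimming one path segment at a time until only the site root is left.
# The URL below is an invented example.

from urllib.parse import urlsplit, urlunsplit

def parent_urls(url):
    """Yield the chain of parent URLs for a broken link."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()  # drop the final path segment
        path = "/" + "/".join(segments) + ("/" if segments else "")
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

for u in parent_urls("http://example.org/research/papers/1999/intro.html"):
    print(u)
```

Each candidate still has to be checked by hand (or with a HEAD request) to see whether it links to the moved resource.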
A link checking case study: SOSIG
SOSIG uses the link checking software that is supplied as part of the ROADS system. The program is scheduled to run automatically just after midnight on Sunday, when network traffic is generally low. The program runs through each of the URLs in the SOSIG database (over 7,000) and issues an HTTP HEAD request for each page. If the request is successful the software moves on to the next URL; if it encounters a problem it writes the URL and the unique ID number for the record into a file. Once the link checker has processed all of the URLs, the problem resources are sorted and presented according to the error codes discussed in the section above. The error report is made available through the SOSIG online administration centre (see Figure 1); additionally a copy is emailed to the SOSIG staff responsible for processing the report.
Figure 1: SOSIG Link Checking Summary Report
SOSIG currently has one member of staff assigned to link checking, who spends approximately one day a week going through the report and updating or deleting records as appropriate. As the number of records in the collection grows, so does the number of problem resources, and it is likely that the amount of time required to maintain the collection will increase over time. The errors reported are given an order of priority and the '404 Page Not Found' problems are dealt with first of all. These are probably the most straightforward of the errors: either the resource has moved and the record has to be edited to reflect the new address, or it is no longer available and needs to be deleted from the database. Either way, having error pages appear when users try to connect to resources is likely to reduce their confidence in the collection. The next errors dealt with are any to do with authorisation (error 401), payment (error 402) or permissions (error 403). 
These errors are not as common as the 404 errors; they tend to appear when a resource that had previously been publicly available is now restricted to use within an organisation or community and some form of payment or authorisation is required. These problems may become more common as the Web matures and commercial practices become more established. Occasionally the problem is simply that the Web site administrator has inadvertently changed the permissions on the directory and is unaware that there is a problem. SOSIG has found that the best way to deal with these problems is to get in touch with the maintainers of the resource by email and ask what the situation is; generally replies return within a day and the record can be dealt with appropriately. The final errors dealt with are the 500 errors, generated by the server from which you are requesting the resource. They tend to be more unpredictable and it is usually quite difficult to pinpoint the problem; often URLs listed as giving 500 errors are working perfectly well when checked again. The reason may be that the server was undergoing maintenance or updating when the link checker requested the URL. SOSIG tends to monitor 500 errors over a few weeks and an email message is sent to the maintainers of those resources that persistently record an error. The ROADS link checking software does have a feature which allows you to delete automatically those URLs that are consistently unavailable, but this is not used, as it is felt that 500 errors are too unpredictable and staff prefer to make a judgement on each resource. For more details of the link checking software and the ROADS software in general see:
Updating resource descriptions
The dynamic nature of the Web is a problem when it comes to keeping manually catalogued records of resources up to date and relevant. 
Web documents, unlike their printed equivalents, are very easy to edit and modify; studies have shown that most Web pages are not static but expand and evolve over time. For a gateway's collection to maintain its integrity and usefulness, the records must also reflect the changes in the resources. This is a time-consuming job that requires ongoing staff effort to be assigned to the task. There are a number of steps which gateways can take to help to identify and review resources that need their descriptions to be updated:
|
Creating a collection management policy
|
|
The Web has often been described as a 'moving target'; it is constantly changing and expanding, and trying to catalogue its content is a difficult business. Gateways need to think about what they are trying to provide for their users: a catalogue of the entire Web or a focused collection of selected material? A previous chapter on quality selection criteria has dealt with the need for gateways to consider formalising a Scope Policy to help clarify the type of service they are offering. It will also be helpful to think about a policy for managing collections. A collection management policy will allow you to formalise not only the scope and selection criteria for a gateway but also deselection criteria, that is, the principles under which you may choose to edit or delete records from the collection. A collection management policy might include:
Guidelines for deselecting a resource:
Guidelines for editing a record:
Collection management policies may change over time to reflect the changing nature and content of the Web. As more resources become available it may be necessary to delete entries from the collection, replacing them with more suitable material. For examples of gateway collection management policies see: |
Priorities for administrators
|
|
When one is faced with limited time and resources, there will always be a conflict between building up the gateway collection and adequately maintaining the existing collection. In order to continue to offer useful services, gateway administrators need to ensure that they balance effort spent in creating new records with preserving the integrity of the current collection. It is advised that gateways make as much use as possible of automated tools to monitor and track changes in resources, so that any human effort is directed at the more intellectual tasks of revising and correcting records. |
Glossary
|
|
ADAM Art, Design, Architecture and Media gateway (UK) |
References
|
|
Mind-it by NetMind, http://mindit.netmind.com/
ROADS, http://www.roads.lut.ac.uk/
SOSIG, http://www.sosig.ac.uk/
W. Koehler, 'Digital Libraries and World Wide Web Sites and Page Persistence', Information Research 4:4 (June 1999).
J. E. Pitkow, 'Summary of WWW Characterizations', in Proceedings of the Seventh International World Wide Web Conference, 14-18 April 1998, Brisbane, Australia (Elsevier Science B.V., 1998). |
Credits
|
|
Chapter author: Debra Hiom |
2.7. Working with information providers |
||||
|
Introduction
|
|
One of the most time-consuming, and therefore costly, tasks for information gateways is maintaining up-to-date descriptions of relevant resources. Identifying and describing quality resources is critical for the gateway. One possible means of making this process more efficient is to involve the 'information providers' (otherwise described as 'publishers' or 'resource owners') in the metadata creation process and to encourage them to contribute to the content of the gateway. This benefits the gateway in terms of saving costs and at the same time helps ensure the currency of the information held by the gateway. The benefit to the information provider lies in improved dissemination of their information. This is an alternative approach to the creation of resource descriptions 'by hand', where metadata is created centrally by the information gateway's own staff, or by library staff who are working within other institutions, or by subject experts. These various methods are in use to a greater or lesser extent in existing gateways. In the UK, for example, the Resource Discovery Network gateways have most of their metadata created by gateway staff or subject experts, but services such as the Arts and Humanities Data Service rely to a much greater extent on resource creators inputting data to the gateway. In the case of those gateways where metadata is created automatically by harvesting or crawling the web, it is also possible to involve information providers; this may be by agreeing procedures for identifying relevant material automatically, or by the information provider's alerting the gateway to new or updated data. In this chapter we will look at some of the issues which arise when gateways and information providers work more closely together. We will consider the benefits of this approach but also note any disadvantages. |
Identifying information providers
|
|
Whatever method of metadata creation is followed, a primary task for any gateway is to identify the key information providers in its field. These key providers may be individuals, groups or institutions who are creating or have some level of ownership of high quality resources. In the case of Higher Education funded gateways, the key information providers may be individual researchers, university departments, publishers, scholarly societies or commercial organisations working in the relevant subject area. The key providers may vary considerably as regards:
Taking these factors into account, the gateway will need to consider the overall profile of its key information providers in relation to gateway policy for metadata creation. The gateway needs to consider its own policy by asking:
It will also be useful to look at the wider picture and consider the cost of involving information providers. In order to justify setting up complex systems, the gateway will want to be assured that information providers can contribute a significant quantity of metadata. It may be that, to create economies of scale, gateways will need to co-operate with one another in setting up common methods for importing metadata from information providers. It is also likely that the information providers themselves will be contributing to a range of gateways and will want a common procedure to cover all gateways. Such procedures would need to be flexible enough to allow for differing practices among information providers while following internationally accepted standards and protocols which can be clearly defined. |
Building relationships with information providers
|
|
Having identified key providers and decided that they can contribute to the content of the gateway, the gateway can then build on this information in various ways.
Monitor key information providers
At the simplest level the gateway can ensure that a system is in place to monitor regularly the Web sites of key players. This may involve guidelines for staff and varying degrees of automated monitoring. For example, staff may bookmark sites to check regularly or use a URL-minder to notify them of changes made to key sites.
Enable submission of metadata
The gateway can offer a means for information providers to submit data about new resources. This may be a 'Submit a Resource' form on the gateway Web site.
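One simple way to automate the monitoring of key sites, in the spirit of a URL-minder, is to store a fingerprint of each page and flag those whose fingerprint has changed since the last visit. The sketch below uses invented page contents, and a real version would fetch each page before hashing:

```python
# Minimal change-detection sketch: compare stored content fingerprints
# against freshly fetched page bodies and report the pages that changed.
# The URLs and contents are invented examples.

import hashlib

def fingerprint(content: bytes) -> str:
    """Hash a page body so changes can be detected cheaply."""
    return hashlib.sha256(content).hexdigest()

def changed_pages(previous, current_contents):
    """Return URLs whose content no longer matches the stored hash
    (new pages, with no stored hash, are also reported)."""
    return [url for url, body in current_contents.items()
            if previous.get(url) != fingerprint(body)]

previous = {"http://example.org/": fingerprint(b"old front page")}
current = {"http://example.org/": b"new front page"}
print(changed_pages(previous, current))
```

A flagged page still needs a human look, since many changes (a new date stamp, say) will not affect the resource description at all.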
Information providers create the metadata
Gateways can offer metadata guidelines for providers who publish large numbers of relevant resources, so that they can create the metadata required. The metadata can then be automatically transferred to the gateway. Metadata creation may be manual, using a Web-based form, or semi-automated, using one of the available metadata creation tools (see the chapter on metadata creation).
Endorsement by influential institutions
It can be a condition of a grant that data resulting from funded projects should be deposited with a specified data repository. It might be that gateways could persuade funding agencies to insist that metadata is deposited with the relevant subject gateway.
Distributed collaborative cataloguing
The future business model for metadata creation may lie with distributed collaborative cataloguing. This would involve an incremental approach to building up metadata for resources. The 'publisher' or 'owner' of the resource might create initial simple metadata, using the Dublin Core element set, for example. Services that wish to offer access to the resource might enhance this basic metadata, for instance with a description targeted at the ultimate users of the service. If the resource meets the criteria for description by the national library and inclusion in a national bibliography, then the national library might augment the records with subject headings and classification codes and align names and headings with the relevant authority files. Other interested parties might create unique identifiers (ISSN, DOI, etc.) or add metadata concerned with rights management or digital preservation. In this model the information provider becomes the first step in a chain of metadata creators. There are pilot projects investigating shared metadata creation where a 'workspace' is used to create metadata collaboratively. At present, these projects are looking at collaboration between specific partners in the metadata creation process, for example libraries working together or publishers working with national libraries and identification agencies. Within these projects metadata can be enhanced incrementally and imported or exported in a variety of formats.
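As a hypothetical illustration of the first link in that chain, a resource owner might record a few simple Dublin Core elements and render them as HTML meta tags for a gateway (or any other service) to harvest. The element names follow the Dublin Core element set; the resource details below are invented.

```python
# Sketch: render a dict of Dublin Core elements as HTML <meta> tags,
# the kind of simple, owner-created metadata that later services in the
# chain could enrich. The resource details are invented examples.

def dc_meta_tags(record):
    """Render Dublin Core elements as one HTML meta tag per element."""
    lines = []
    for element, value in record.items():
        lines.append(f'<meta name="DC.{element}" content="{value}">')
    return "\n".join(lines)

record = {
    "title":   "Bridge Engineering Resource Index",
    "creator": "A. N. Example",
    "subject": "civil engineering; bridges",
    "date":    "1999-06-01",
}
print(dc_meta_tags(record))
```

Because the elements are simple named pairs, a later service can enrich the same record (adding classification codes, identifiers or rights information) without disturbing what the owner supplied.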
Community building
The gateway can build up a community of information providers. There may well be an overlap between providers and users of the gateway service, so this may be viewed as a marketing strategy. Traditional methods of dissemination (such as publishing, presentations, attending conferences) will form a basis for this activity. Growth of the community can be encouraged by invitational events for key players followed up by mailings and newsletters. A number of the eLib gateways in the UK have progressed from relatively simple catalogues of Internet resources to 'subject communities'. Depending on the business model by which the gateway is funded, membership of such a community of providers may confer benefits of preferential access costs or access credits.
|
Benefits and costs
|
|
There are a number of potential benefits resulting from information providers supplying metadata:
These need to be balanced against:
|
Is this right for your gateway?
|
|
Some factors that may affect the emphasis the gateway gives to metadata supply by information providers:
|
Conclusions
|
|
It is worthwhile building relationships with key information providers, especially as in many cases they are likely to be users of the information as well as contributors. Gateways may judge that at present information providers cannot supply enough metadata to make it worthwhile setting up systems to import it. However, it seems likely that, as metadata standards mature, organisations owning resources will recognise the advantages of creating metadata for their own purposes, whether for administration, rights management, marketing, their own resource discovery systems or passing data along the retail chain. Gateways need to be ready to take advantage of changes in the pattern of metadata creation when (if) this happens. Gateways will need to move towards a viable business model for metadata creation to ensure their long-term sustainability. |
Glossary
|
|
AHDS - Arts and Humanities Data Service |
References
|
|
P. B. Hansen & J. Hansen, INDOREG: INternet Document REGistration: project report (1997). |
Credits
|
|
Chapter author: Rachel Heery |
2.8. Publicity and promotion |
||||
|
Introduction
|
|
Publicity and promotion are rarely at the forefront of people's minds when planning an information gateway, yet they are often essential ingredients for a gateway's success. Good publicity can help enormously to bring an information gateway to the attention of the people that really matter, i.e. the gateway's target users. An effective publicity and promotional campaign takes time and effort to plan and deliver; it can also cost money. This chapter attempts to highlight some of the issues that should be considered when planning publicity and promotion activities. |
What are the issues?
|
|
The key issues at stake with publicity and promotion are:
- What is the intended audience?
- What kind of publicity and promotion is available?
- Are all types of publicity worth while?
- How can a limited budget best be targeted?
- Are there any failsafe methods for successful publicity and promotion?
- How can the interest of users be retained?
|
What is the intended audience?
|
|
You should think carefully about the audience which your publicity is intended to reach and win over. If you can characterise your user community carefully and target the publicity accordingly, it will be much more effective. |
What kind of publicity and promotion is available?
|
Publicity and promotional methods for gateways may be divided into three distinct forms: traditional media activities, electronic media activities and face-to-face activities. The underlying aims of each are very similar: to communicate to as many people as possible (ideally your target users) that your gateway exists and to convince them that they should use it. Once users find the gateway, the quality of the resources should make them into repeat visitors.
Traditional media activities
Traditional media activities are often overlooked as methods of publicity when Internet-related projects are planned. This is a shame, as they can be extremely powerful and far-reaching and can often produce the best results in terms of reaching the largest group of potential users. Traditional media can include paper-based materials (leaflets, posters, newsletters, papers, journals, magazines, etc.) as well as media such as television and radio.
Paper-based materials
Paper-based materials fall into two distinct groups: publications in the form of journals, magazines and newspapers, and paper publicity materials such as information sheets, leaflets and posters. Publications can be used effectively to access concentrated groups of target users directly. If you place an advertisement in a specialist journal that is read by large numbers of your target users, the results can be well worth the money. Paying for publicity by means of advertising is not the only route (although it should be considered, as the results can be impressive, far-reaching and cost effective). Writing review articles in journals or newsletters can be a good way to get some 'free' publicity. Obviously, the time involved in writing such articles should be considered and costed. Nevertheless, articles written by gateway staff are often a very successful means of publicity.
Another way for your gateway to appear in the user community literature is for it to be included or referenced in other people's articles. Of course this may be harder to achieve as it requires people to know about and value the gateway. However, as a gateway matures and becomes a feature of the user community, this kind of publicity becomes more likely. Targeting known journalists or writers within your user community can also pay dividends and produce some favourable results. Consideration should be given to all contacts that people associated with the gateway may have.
The benefit of carefully targeted articles or advertisements in your user community literature is that the materials immediately have context and are being viewed by people interested in the subject matter; this significantly increases the chances of people reading the article and subsequently visiting the gateway. Other paper-based materials such as information sheets, leaflets and posters can also be very effective as promotional materials. Developing a visually attractive information sheet about your gateway and distributing it to key users can help to raise the profile of the gateway. Several gateways have used this idea to great effect. Promotional materials do not need to stop at information sheets. Bookmarks, mouse mats, mugs and T-shirts have all been used and have potential. Naturally, the exact kind of materials chosen may be largely dependent on cost and funding.
All of the materials above have been sent to key sections of the target user community (subject librarians, University libraries, subject-specific book shops and museums), who have been asked to display them where their users could see them. Having a Biz/ed information sheet available in the Social Science library near the networked computers has obvious benefits. In several cases the promotional materials have been so popular that extra copies have been ordered by the people concerned. Correctly targeting the recipients of promotional activities can produce a cascading effect, so that the targeted people then pass on their knowledge of the gateway to more people locally.
Television and radio
Though perhaps not as appropriate for publicising gateways as some of the other media mentioned in this chapter, the use of television and radio does have enormous potential. Obviously the idea of placing a commercial for your gateway on the television or radio may be in the realms of science fiction, but getting the gateway mentioned as part of another programme may be a more down-to-earth ambition. This is especially true with the recent growth in popularity of Internet-focused programmes. Gateways are more likely to get mentioned if they are well established and have come to the attention of television and radio programme producers and researchers. Well placed contacts can also help to raise the profile of a gateway within the relevant circles.
Electronic media activities
Search engines and directory listings
It goes without saying that an information gateway should make sure that it is registered and listed in the leading Web search engines and directories. Tools such as Submit It! or any of the many others now available (see Yahoo's listing in this area) can make online submission to search engines a quick and easy task. All of the leading search engines and Internet portals must be targeted, although the issue of context is again very important.
Your gateway needs to be included in search engines like Alta Vista and Yahoo, as many people use these as their starting points when searching the Web. However, subject-specific, geographically limited and specialist search engines should also be considered. Is there a local search engine that your users may frequent? If so, then registering your gateway with the site could pay off. If you can get listed on the most popular site (in terms of your target audience), then the relevance of the materials will be high and so the chances of people following links to your site are much greater. Getting the most from search engines requires the use of metadata in your information gateway Web pages. This will not be a problem for a metadata expert!
Mailing lists and newsgroups
Many people are now familiar with the benefits of newsgroups and mailing lists and their power to contact large numbers of people with a specific interest. These can be excellent tools via which large numbers of target users can be contacted. All it takes is an email or a news posting and your gateway's latest features can be publicised to hundreds or thousands of people. However, it also only takes one inappropriate message to alienate lots of potential users. Be careful of sending too many or inappropriate messages to newsgroups or mailing lists, as promotion can easily turn to spam.
Face-to-face activities
The final area that should be considered in terms of promotion and publicity is that of face-to-face contact with potential users. Clearly, the most effective way to do this is at large gatherings of potential users such as conferences and workshops. A presentation, paper or demonstration at a leading conference which will be well attended by potential users can communicate directly with a large group of users who may be influential.
Running workshops for sections of your user community, especially for those who are themselves involved in training, can have similar results and is covered in more detail in the training and skills development chapter.
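The point above about embedding metadata in gateway Web pages can be made concrete. The sketch below (purely illustrative; the record fields and values are invented for the example) generates Dublin Core `<meta>` tags of the kind that can be placed in a page's `<head>` to help search engines index it:

```python
# Illustrative sketch: render a resource description as Dublin Core
# <meta> tags for embedding in a Web page's <head> section.
# The record fields and values below are invented for the example.
from html import escape

def dc_meta_tags(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags."""
    tags = []
    for element, value in record.items():
        tags.append('<meta name="DC.%s" content="%s">'
                    % (element, escape(value, quote=True)))
    return "\n".join(tags)

record = {
    "title": "SOSIG: Social Science Information Gateway",
    "description": "A selective catalogue of Internet resources for social scientists.",
    "subject": "social science; resource discovery",
}
print(dc_meta_tags(record))
```

How far any particular search engine makes use of such embedded metadata varies from service to service, so this should be treated as one discoverability measure among several.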
|
Are all types of publicity worth while?
|
|
The old saying that all publicity is good publicity probably has some truth, even when talking about information gateways. Any promotion and publicity that raises the profile of your gateway in your target community should be considered a good thing. Of course being voted the worst Web site by your user community should probably be avoided, but it may bring you a few curious visitors! |
How can I best target a limited budget?
|
|
The issue of how best to target a limited budget really depends on the makeup of your user community. If you have a wide user community, then you will have problems in targeting users. A well defined user community can often be more easily targeted as its members appear in concentrated groups or areas (within certain University Departments or organisations). A good example of this is the SOSIG user community, which can be relatively easily targeted via UK higher education social science departments. |
Are there any failsafe methods for successful publicity and promotion?
|
|
Unfortunately the answer to this is no. However, some of the existing gateways have demonstrated that certain techniques can be very cost effective; training trainers within your user community can produce very good results (e.g. Biz/ed) and well-placed publicity leaflets and posters in HEI libraries and departments can also communicate with large numbers of target users (as has happened in the cases of SOSIG and NMM Port). Your user community should be carefully characterised before any expensive promotional activities are embarked upon. Identify your users carefully and your promotional activities will be much more likely to succeed. |
How can you retain the interest of your users?
|
|
Once you have persuaded potential users to look at your gateway, you would like them to come back to it. A well-designed gateway which fulfils the expectations of its users will encourage them to return, but publicity can also help them to keep the gateway in mind. An email list can be a useful way of conveying information about developments in your gateway to interested users. Such a list has been run successfully for the SOSIG information gateway. |
References
|
|
Alta Vista, http://www.altavista.com/
Biz/ed, http://www.bized.ac.uk/
NMM Port, http://www.port.nmm.ac.uk/
OMNI, http://www.omni.ac.uk/
SOSIG, http://www.sosig.ac.uk
Submit It!, http://www.submitit.com/
D. Hiom, 'Around the table: Social scientists have their own favourite places on the Web', Ariadne 9 (May 1997).
D. Hiom, 'SOSIG: Providing access to internet information', Laser Link (Autumn 1998).
J. Kirriemuir, 'A report on the third annual OMNI seminar: A cure for information overload', CTICM Update 8:2 (December 1997).
C. Sladen, 'Ethical Business', Business Review (April 1998).
C. Sladen, 'Mergers and Take-overs', Business Education Today (May/June 1998). |
Credits
|
|
Chapter authors: Martin Belcher, Lesly Huxley |
2.9. User interface design |
|
Introduction
|
|
This chapter looks at the general user interface issues which should be considered when planning the development of an information gateway or when looking at the modification of an existing gateway. Many of the issues discussed apply to all online services and Web sites, so they can be re-used outside the information gateway arena. The importance of good user interface design:
Gateways in context
Information gateways are really just value-added Web sites. This statement is not meant to belittle the importance of information gateways (far from it!); rather it is meant to highlight the fact that they have many similarities with Web sites in general. For all that is said about the Web being an interactive medium and an empowering tool from the user's perspective, one small point is often overlooked: the only way a user can interact with even the most advanced Web site is via the user interface. The user interface is simply what the user sees on the screen through their browser. If what they see is hard to understand or difficult to use, then the vast majority of users will never make it to the real content or value-added features of the Web site. It doesn't matter how good the information on your Web site is - if the user can't access the information, they will go elsewhere.
Frustrated users
How many times have you visited 'great looking' Web sites and found them difficult to use, often so difficult that you have given up and gone elsewhere? Poor user interface design can hide even the most powerful and useful Web sites from all but the most advanced and patient users. Web site developers (including information gateway developers) have to consider seriously the issues of user interface implementation. A poor user interface will mean low usage of the site and its ultimate failure. The failure of Web sites is often due to their designers not considering their users and assuming too much technical knowledge. It should always be remembered that, by being in the position of developing or even just considering the development of an information gateway, you are probably in the category of an advanced user. You may not be as advanced as the system administrator or 'techie' in your organisation, but compared to the average man in the street you are an expert!
Never overestimate the skills of your users, unless you have direct evidence on which to base your judgements. |
Background
|
Definitions:
The science of user interface design, usability and accessibility has its origins in software development and general engineering. Many of the things we take for granted have been through a lengthy process of user interface design and development. Generally we don't notice interface design unless there is a problem, resulting either from poor design or from our attempting to use an object for something other than the purpose for which it was designed.
As mentioned above, most manufactured objects have some degree of user interface evolution and redesign involved in their development. Many household objects have been around for many years and so have the benefit of gradual development (scissors have been with us for hundreds of years). Unfortunately software design and development has been around for a much shorter period of time, and Web site design even less. The end result is that the usability of computer systems and Web sites is not completely understood or, in some cases, even recognised. However, in order to develop successful information gateways you must consider the user interface design carefully and thoroughly. Without sufficient effort being put into this area you may be set for failure from the outset.
So what issues do I need to consider in order to develop a successful user interface? |
Identify your target users
|
|
It may sound obvious, but you can't really start thinking about the design of a user interface until your users have been identified and characterised. User identification is important in other aspects of the development of an information gateway (scope policy, gateway aims and objectives, planning an information gateway project), so the question of who the target users are should already have been considered. Different groups of users will vary in their characteristics. Wherever possible, you should try to include as large a range of users as you can, but think carefully about designing for everyone. If your target users have slightly different characteristics from the general public, then you have to prioritise which characteristics you wish to address. When you are identifying your users, a minimum set of characteristics to consider might be:
Some of these characteristics can be obtained from correlation with general population characteristics, while others must be uniquely researched. |
User consultation
|
Once you have identified who your target users are, you may wish to consider having some degree of user consultation. Ideally, this would have been a part of the general development of the information gateway project/idea. The value of user consultation should not be underestimated. A few relatively simple techniques of user consultation can produce extremely powerful data which can influence the development of a user interface. In the past, user consultation was often not considered, as it was thought to be time-consuming, difficult and contrary to the prevailing culture of 'we know best'. All these issues can be addressed by adopting a number of techniques that are simple to implement, low cost and able to provide convincing evidence of the power of user consultation.
Questionnaires and surveys
The development and implementation of a simple questionnaire and survey of potential users can produce important information. Selecting the people to be surveyed is important (so as not to build any bias into the data collected), as is the careful wording and development of the questions that are being asked. Again, you would be well advised to consult some of the leading literature or any in-house experts.
A questionnaire is a good method of sorting and selecting the attendees for the next area of user consultation, focus groups.
Focus groups
The focus group is a simple concept, although easy to implement wrongly. The basic idea is to get some target users in a room, ask them questions about the proposed information gateway and collect their feedback on your questions and ideas. Suggestions and problems can often come to light from a simple focus group discussion. Participants can highlight areas that have never been considered by people too closely involved in the project. Focus groups do need to be run with care, as they can often produce misleading information and are easy to run badly (for example, it is very easy for the person running the focus group to lead the answers as well as the questions!). The science of focus groups has its own extensive literature and it would be worth consulting one or two of the leading publications in this area.
|
Task analysis
|
The outcome of any user consultation and/or user identification should be an understanding of the needs and requirements of the user community and an idea of what kind of tasks the average user is going to want to be able to perform. The ultimate aim of any user consultation should be to inform the gateway developers about the users' needs. Do the characteristics of the user community mean that they have any unique needs? For example, are they all on very slow network connections and only using text browsers, or are they all based in Higher Education Institutions (HEIs) and therefore have access to fast network connections? The development of a description of and set of characteristics for a typical user will help to determine a set of user needs. This in turn will provide evidence to feed into a user interface requirements specification. Information on task analysis can also be obtained from user consultation; getting participants in a focus group to discuss the kinds of tasks they might like to perform while using a gateway may help to decide the level of priority tasks should be given within the overall user interface design. Are the users' requirements, as described by the users, the same as those determined by the gateway developers? They should be similar but it is unlikely that they are the same.
|
Usability and accessibility
|
|
Usability and accessibility often go hand in hand; if a Web site is difficult to use then it may become inaccessible, as users cannot get to the information that they want. Making something more accessible often makes it more usable for all users. Designing for maximum accessibility helps designers to focus on users and content rather than on 'flashy' design issues. But accessibility also needs to be considered with regard to people with disabilities and giving equality of access to a Web site or information gateway. By making sure that a Web site is accessible to as wide an audience as possible you also necessarily increase the usability of the site. Catering for disabled accessibility may be something that a gateway would like to do or something that it is legally required to do (Hotwired 'Sites Must Retool for Disabled'). In either case the issues need to be looked at and carefully considered. More detailed information on accessibility is contained in the Usability and Accessibility chapter. |
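Some accessibility checkpoints lend themselves to simple automated checks. The sketch below is a rough illustration of the idea (it is not a tool from this handbook, and a real audit covers far more): it counts images in an HTML fragment that lack the alternative text on which users of text browsers and screen readers depend.

```python
# Illustrative sketch (not a tool from the handbook): flag <img> tags
# that lack an alt attribute, one common accessibility checkpoint.
from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    """Count images in an HTML fragment that have no alt attribute."""
    def __init__(self):
        super().__init__()
        self.missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            self.missing_alt += 1

checker = AltTextChecker()
checker.feed('<p><img src="logo.gif"><img src="map.gif" alt="Site map"></p>')
print(checker.missing_alt)  # the first image has no alt text
```

A check like this catches only the mechanical part of the checkpoint; whether the alt text is actually meaningful still needs human judgement.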
General Web design issues
|
|
Web design is a science in itself and there are countless books and online resources that offer extensive advice in this area. A few key issues should be considered when designing:
|
Developing a user interface requirements specification
|
|
Before any implementation of a user interface begins, a detailed user interface requirements specification should be developed. The document should state the characteristics of the target users and for which tasks they are going to use the information gateway. There should also be a list of user interface priorities, with clear indications as to what is an essential requirement and what is desirable. Without such a prioritised list, it is difficult to decide where staff effort should be spent in user interface development. Unless there is an order of priority, if only some things are implemented, there will be no guarantee that they will be important in terms of usability and accessibility. A good example of a well structured and well planned requirements specification is the W3C Web Accessibility Initiative Standard (WAI) and in particular the List of Checkpoints for Web Content Accessibility Guidelines 1.0. The document is useful in that it provides an excellent example of how to present a requirements specification document in an easy to understand and usable format. Additionally, it presents the definitive set of guidelines on how to implement a Web site of any description which has accessibility at its core. The document should be consulted by developers of all information gateways, current and planned. |
Case Studies
|
|
Glossary
|
|
Accessibility - the characteristics of Web content and whether or not it is accessible to people with disabilities |
References
|
|
Biz/ed,
G. E. Bader & C. A. Rossi, Focus Groups: A Step-By-Step Guide (1998).
R. A. Krueger, Focus Groups: A Practical Guide for Applied Research (1994).
J. Nielsen, Cost of user testing a Website
J. Nielsen, Guerrilla HCI
J. Nielsen, Differences between print design and Web design
J. Nielsen, How users read on the Web
J. Nielsen, Be succinct! (writing for the Web)
J. Nielsen, The top ten new mistakes of Web site design
A. N. Oppenheim, Questionnaire Design, Interviewing and Attitude Measurement (1992).
J. F. Templeton, The Focus Group: A Strategic Guide to Organising, Conducting and Analysing the Focus Group Interview (1994).
W3C, List of Checkpoints for Web Content Accessibility Guidelines 1.0
W3C, Web Accessibility Initiative Standard (WAI) |
Credits
|
|
Chapter authors: Martin Belcher, Phil Cross |
2.10. Integration of robot and manual indexes |
|
|
|
2.11. Distributed cataloguing |
|
Introduction
|
|
This chapter introduces the concept of distributed cataloguing and the potential for working collaboratively across the Internet. It looks at some of the human issues involved in distributing cataloguing effort, presents some models currently in use within information gateways and in particular looks at the experiences of SOSIG in employing a distributed model. Some further examples of distributed cataloguing models are also presented. Because of the open nature of the Web there is considerable potential for distributed collaborative cataloguing of networked resources. Information gateways can be built by teams of staff who are geographically dispersed but who can add resources to a database from their desktops via the WWW. This chapter concentrates mainly on issues surrounding distributed cataloguing into a central database. However, an additional or even complementary model is that of collaborative work with other gateways (see the chapter on co-operation for more details). Why would an information gateway want to consider distributed cataloguing? Distributing the cataloguing effort allows you potentially to share the responsibility with a number of organisations or participants and to maximise the coverage of the collection. In particular it allows gateways to:
|
Models for distributed cataloguing
|
|
There are numerous cataloguing models currently being employed by information gateways. The main contrast is that of the use of paid versus voluntary effort. However, even within this broad division there are several approaches, e.g.:
And within these organisational set-ups there are various ways of assigning roles and responsibilities. These range from allowing members of the team to have full responsibilities and access to the database to a very defined division of labour between selecting, evaluating and cataloguing resources. DESIRE I held a training workshop on the Distributed Cataloguing Model in 1997, which brought together staff from a number of European information gateways to share experiences of their models and of the tools, training materials and methods of delivery that support them. A report summarising the outcome of the workshop can be found at: http://www.desire.org/results/training/D8-2af.html |
Management issues
|
|
There are a number of issues to consider when setting up a distributed cataloguing system.
Recruitment
One of the most crucial issues for gateways is recruiting the right staff to work on the catalogue. The core skills of resource selection and cataloguing make librarians ideally placed to assume the role, as they have the training and the expertise required. However, academic subject experts or others with the appropriate subject knowledge may also be valuable. It is also important to bear in mind that, as well as subject knowledge, a fair degree of expertise in use of the Internet is also necessary, and that these two skills are not always found together. As well as deciding on the type of person required, gateways will also need to consider the best approach to finding and recruiting these people. Putting out a general call for staff will usually result in replies from enthusiastic individuals who are keen to do this sort of work. However, they may have difficulties in getting the support they need from their institution or place of work. Conversely, going through the institution will ensure commitment from the top down but may not result in the ideal candidates being selected from within the institution. A key decision is whether the staff will be volunteers, will include the work as part of their jobs or will be paid for their contributions. Paid staff will enable gateways to set and work to targets, allowing the development of the gateway to be planned and monitored. With voluntary effort gateways are relying on the goodwill of the people concerned and their ability to fit these duties around their main jobs and activities. It is quite possible that there will be very little return for the considerable investment made in training and development. Perhaps the ideal situation is to have staff who are supported by their institutions to incorporate the role into their day-to-day work.
Ensuring that paid staff have protected time to carry out their gateway duties may also be an issue; it is possible that external staff have been given this additional role on top of their existing work and will find it difficult to cope with both. Good communication between the central and distributed staff can help to prevent these problems arising.
Support tools and mechanisms
Gateways need to develop a system that allows staff to recommend or catalogue resources into the system remotely. Again, various methods are used by gateways; these range from emailing details of resources to central staff to Web-based cataloguing systems such as ROADS.
Training
Training staff to contribute to the gateway is essential. They will require training in:
Ideally this training would take place as a face-to-face workshop, although, given the possibility of contributors being located around the world, training could also take place through distance learning via email and the Web.
Documentation
Whether training is conducted remotely or face-to-face, extensive documentation is required to support the work of the staff. Various approaches are being used by existing gateways. Some have printed handbooks with all the information required; others have set up administration centres on the Web with online documentation and support.
Monitoring and support
Perhaps one of the greatest drawbacks of running a distributed team is dealing with the problems of working remotely. The job requires that staff should be self-motivated, yet it is very easy for staff to feel isolated without the advice and support of colleagues around them. A geographically dispersed team will rely heavily on remote communication through one-to-one email contact, use of mailing lists and Web conferencing systems for 'virtual meetings'. |
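Web-based cataloguing systems such as ROADS store their records as IAFA-style templates, which are essentially plain-text attribute/value pairs. The sketch below is a deliberately simplified illustration of that record shape (it handles only single-line "Field: value" pairs, and the sample field names are illustrative rather than a complete or authoritative template):

```python
# Rough sketch: parse a simplified IAFA/ROADS-style template, i.e.
# plain-text "Field: value" lines, into a dictionary. Real templates
# have more structure (clusters, continuation lines) than shown here.
def parse_template(text):
    """Parse single-line 'Field: value' pairs into a dictionary."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            field, _, value = line.partition(":")
            record[field.strip()] = value.strip()
    return record

sample = """Template-Type: DOCUMENT
Title: Guide to Official Statistics
URI-v1: http://www.example.ac.uk/stats/
Description-v1: An introductory guide."""

print(parse_template(sample)["Title"])  # Guide to Official Statistics
```

The plain-text format is part of what makes distributed contribution easy: a record can be drafted in any editor or Web form and submitted by email to the central team.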
A case study: SOSIG
|
SOSIG has successfully employed a distributed team of subject experts (known as Section Editors) for the past two years. Subject librarians from ten UK universities were appointed to select, evaluate and catalogue resources for the SOSIG catalogue. Each Section Editor is given responsibility for developing a subject area on the gateway. In some cases the Section Editors' roles are shared between two or more people at an institution, but the total effort does not exceed one day per week. A one-day workshop was held at the start of the project to train the staff in all aspects of working on an information gateway. This included:
Prior to the workshop an online administration centre was set up, which included all the tools and guidelines required to catalogue resources for the gateway. After the workshop, additional support was offered through email contact with the core staff. This one-to-one contact was initially very important, as the Section Editors had a very steep learning curve to ascend. The geographical distances between the staff meant that they were very reliant on email as a means of virtual support and assistance. As the Section Editors have direct access to the live database, to begin with all of the work submitted had to be checked centrally and any errors corrected and/or reported back to the appropriate Editor. This put a very high overhead on central effort for the first few months of the scheme; however, this requirement diminished gradually and now only random checks are made on the records. In addition to the Section Editors, SOSIG also has a number of European Correspondents. Correspondents are academics or librarians who have volunteered to submit new resources on an informal but regular basis. Correspondents have access to online training and support materials but they do not catalogue directly into the database; rather they are responsible for selecting resources and submitting suggestions to the central team through an online form. The responsibilities and duties for the gateway can be represented visually in two ways:
Figure 1: Workflow
Figure 2: Tasks and responsibilities
Various general lessons have been learnt in the process of establishing this distributed approach, and from other attempts by SOSIG to encourage distributed input, which may be relevant to other gateways. These are:
|
Recommendations
|
|
There is great potential for distributed cataloguing systems, as they open up the possibility of national or international strategies. They also provide a successful model for involving the library community in Internet resource discovery. Existing gateways have invested effort in developing systems that support the work of distributed teams, so that a librarian can work on a gateway from anywhere in the world as long as they have access to a networked PC and a Web browser. Distributed Internet cataloguing means that libraries can contribute to a shared service, rather than having each to build a local service. This is an efficient way of working, as it avoids duplicated effort and collaboration allows large-scale gateways with much better coverage to be developed. Building and managing distributed teams is a challenge; there are a number of issues that need to be dealt with. In summary, some of these are:
|
Glossary
|
|
ADAM - Art, Design, Architecture and Media gateway |
References
|
|
DutchESS Manual: handleiding voor vakspecialisten [manual for subject specialists], http://www.konbib.nl/dutchess/manual/
EELS Project, http://www.ub2.lu.se/eel/about.html
EEVL, http://www.eevl.ac.uk/volunt.html
Friends of ADAM, http://www.adam.ac.uk/friends/
Länkskafferiet (Link Larder), http://lankskafferiet.skolverket.se/information/brief_presentation.html
SOSIG Correspondents Pages, http://www.sosig.ac.uk/desire/ecorresp.html
T. Hooper, L. Huxley & P. Hollands, DESIRE: Subject-based training materials.
L. Huxley, 'DESIRE on Planet SOSIG: Training for the Distributed Internet Cataloguing Model', Ariadne 12 (1997).
E. Worsfold, 'Distributed and Part-Automated Cataloguing: A DESIRE Issues Paper' (March 1998). |
Credits
|
|
Chapter author: Debra Hiom |
2.12. Multilingual issues |
||||
|
Introduction
|
|
Gateways need to address the language needs of their audiences. Users may want to search a multilingual collection using queries in one language, or to retrieve documents in a number of specific languages, preferably also via an interface in the language of their choice. In some cases they may require a translation or summary in a language other than that of the document. Ideally you should provide your audience with the language support it needs. In reality this will very likely be restricted, depending on the available technologies, the language skills of the staff involved in selection and cataloguing, and cost considerations. |
Background
|
|
Multilinguality: praxis, trends and developments There are two basic issues relating to multilingual access:
A lot of research has been going on in these areas for some time, especially in the retrieval of documents in languages other than that used for the query (cross-language information retrieval) (Oard, 1997). An overview of projects and demonstration systems can be viewed on the Web (compiled by Oard: http://www.ee.umd.edu/medlab/mlir/systems.html). Nevertheless, existing gateways in general do not yet have much to offer in terms of multilingual support. Quite a few gateways - at least those not based in the UK or the US - do have a bilingual interface, usually in the language of the country where the gateway is maintained and in English, but more sophisticated facilities, such as multilingual search and/or browse support, are not often available. The main conclusion from a review conducted as part of the DESIRE I project in 1997 (Worsfold et al., 1997) was that there was considerable inconsistency in the way existing services dealt with language issues. Not only did different gateways vary in their policies, there was also a lot of inconsistency within individual gateways. For example, titles are sometimes displayed in the language of the resource, and sometimes only in English, and when resources are available in more than one language this is only sometimes mentioned. Some Internet search engines also offer a form of multilingual support, such as interfaces in various languages, localised search by country (usually based on domain name), or automatic translation (such as Alta Vista's Babelfish, based on the Systran translation system). The services hardly ever describe the extent of their provisions in a detailed way, so it is difficult to assess what exactly they have to offer. However, recent developments in the standardisation of metadata and resource description formats, electronic messaging and WWW technology can provide a solid basis for multilinguality in information gateways. 
The European Multilingual Community The number of indigenous European languages, according to CEN TC 304, is 160. The Internet European multilingual community uses more than 30 languages, represented by many character sets with different repertoires and encodings. A property common to all of them is the use of the character-box (or glyph-box) representation of single-byte character sets (SBCS), i.e. each character uses one displayable position. In this they differ from some other languages used outside Europe. Most of the European languages use the Latin script, which consists of the 26 basic characters of the English alphabet (A through Z) in upper and lower case. Some languages, such as French, Spanish or Icelandic, need some additional characters, as well as a number of characters that are composed from the basic ones and the diacritical marks specified in a few basic ISO standards (such as ISO 6937). Fourteen diacritical marks, commonly called 'accent marks', which permit the support of nearly 200 diacritical combinations, complete the set for European languages. [Demchenko] The repertoires of the official European languages of the members of the European Union (EU) are specified in ISO 8859-1, while the repertoires of Central and Eastern European languages using the Latin alphabet are specified in ISO 8859-2. The Greek alphabet is specified in ISO 8859-7 and the Cyrillic alphabet used in Europe is specified in ISO 8859-5. The most widely used operating systems, such as UNIX and Microsoft Windows, use their own character set encodings (e.g. Windows Code Pages 1250-58 or ANSI) for support of the European languages, including the Cyrillic languages (Russian, Ukrainian, Belarusian, Bulgarian, etc.) in CP1251 [Freed]. The de facto standards for mail and news exchange, as well as for WWW information, in the Russian- and Ukrainian-speaking communities are KOI8-R (RFC 1489) and KOI8-U (RFC 2319). 
These different character set encodings implemented in different operating systems are the main source of problems in accessing Internet/WWW content with client software running on these systems. |
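As an illustration of the problem just described, the same byte value denotes different characters under different single-byte encodings. The snippet below is a minimal sketch (not from the handbook) using Python's built-in codecs to decode one byte under three of the encodings mentioned above:

```python
# The same byte (0xE4) decoded under three single-byte character sets.
data = bytes([0xE4])

# In ISO 8859-1 (Western European) this byte is 'ä'.
latin1 = data.decode("iso8859-1")
# In ISO 8859-5 (Cyrillic) the same byte is 'ф'.
cyrillic = data.decode("iso8859-5")
# In KOI8-R (RFC 1489), the de facto Russian encoding, it is 'Д'.
koi8 = data.decode("koi8-r")

print(latin1, cyrillic, koi8)  # → ä ф Д
```

This is why a document served without a correct charset label can render as gibberish: the client has no way to know which of these interpretations the author intended.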
Issues for Gateway Managers
|
|
Gateway managers will be confronted with various choices relating to the language support of the service they want to provide. These choices between monolingual and multilingual support present themselves at many different levels:
|
1. Scope and selection policy
|
||||
Gateway managers will not be able to avoid language issues when trying to determine the scope and coverage of their service. They will need to decide whether to select all relevant documents, independently of their language, or to restrict the scope of the service to documents in one language or a number of specified languages. The following questions will have to be asked - and answered!
The choices made in this area directly determine the skills required of the staff responsible for selecting and/or cataloguing the resources, as well as the choice of the relevant authoring and access tools and software. For example, creating an information gateway that includes resources in all European languages would require input from a team who between them had mastered all those languages. If the cataloguing is done by a separate team, this team would also have to consist of people with various language skills. Not many gateways will be able to manage such broad coverage with an in-house team. A distributed model - as opposed to a centralised model - could offer a solution, drawing on a multinational team, located in various countries, who provide their input via the WWW. In this case a multilingual development framework needs to be implemented, based on standards in resource description formats (metadata) and information retrieval and exchange. SOSIG provides an interesting case study of such a model. As the core team of SOSIG consisted of native speakers of English with no other language skills, SOSIG created a system whereby European correspondents suggest resources in a number of other languages to SOSIG staff. Problems with this approach are that the service is dependent on the goodwill of unpaid staff and that communication takes place (almost) exclusively in a virtual environment.
|
2. Data presentation and resource description formats
|
||||||
A multilingual gateway would require the WWW software lying behind the gateway to cope with multilingual data handling, search, retrieval and display. Existing standards and recommendations provide a framework for multilingual support in data communications and in information resource description formats and metadata. A model for multilingual support in Internet protocols and applications is defined in RFC 2130. It is implemented both in interactive applications, such as the WWW, and in non-interactive applications, such as electronic mail. The basis for interoperability in these applications is character set encoding (charset), which uses registered MIME (Multipurpose Internet Mail Extension) charset names, and language tagging, which uses registered language values or names according to RFC 1766 or ISO 639. The HTTP protocol, on which the WWW is based, includes information about the type of the transferred information and the character encoding for text-based information, for example: Content-Type: text/html; charset=euc-jp The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document, for example: Content-Language: sv If no Content-Language is specified, the default is that the content is intended for all language audiences. It is also recommended to include information about the character encoding being used in the META information of the HTML document: <META http-equiv="Content-Type" Content="text/html; charset=euc-jp"> Based on the exchange of information between client (browser) and server (HTTP server) it is possible to provide character encoding and language negotiation between the information provider and the requester with regard to the accepted and preferred formats of the resources. Recent developments in XML provide facilities for defining/labelling the language of a whole document, entity or item by including language attributes in the corresponding tag. For example:
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p> Although the default XML Character Set Encodings are UTF-8 and UTF-16 (which are encodings for ISO 10646 or UNICODE), specific encodings for XML documents can be defined in the initial XML declaration for the whole document or entity (which can be regarded as a separately stored part of the whole document), for example:
<?xml version="1.0" encoding="UTF-8"?> Dublin Core, as a particular realisation of metadata resource description, provides possibilities for defining the language of the intellectual content of the resource, of the record, and of the labelling of particular fields, by means of assigning language attributes to the relevant Dublin Core field.
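To act on the charset parameter discussed above, a server or metadata harvester first has to parse it out of the Content-Type value. A minimal sketch, using Python's standard email.message machinery, which understands MIME parameter syntax (the function name is our own):

```python
from email.message import Message

def charset_of(content_type):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")

print(charset_of("text/html; charset=euc-jp"))  # → euc-jp
print(charset_of("text/plain"))                 # → None (no charset declared)
```

A client that receives no charset parameter must fall back on a default or on guesswork, which is exactly the failure mode the standards discussed above are designed to avoid.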
|
3. Metadata and cataloguing rules
|
||||||||||
If you enable the end-user to specify preferred languages, the search mechanism can return matches for resources that are in a language the user can read. Sometimes you also need to provide a selection of character set encodings so that resources are correctly (i.e. readably) displayed to the user. The latter is especially important for communities that use multiple character set encodings, i.e. charsets. Such selections can be provided as part of the client browser and WWW server negotiation if they are defined by modern standards and supported by modern multilingual client/server software. For this to be possible the record must contain the appropriate information. In other words, in order to be able to provide this option, some investment in multilingual development software/authoring tools and effort on the cataloguing side is necessary. Traditional library practice is to create one record for one resource. On the Internet the question is what exactly constitutes a resource - the granularity issue. This is also relevant to language issues. Do you include only complete versions of the document, or do you also register parts of a site that are available in another language? If so, how substantial does the translated section have to be? A related issue is whether to create a separate record for each language version. For books this has been traditional practice; the translation of a book will get its own cataloguing record. For the Internet environment, it may be worthwhile to store information about different language versions in one record, as long as the fields relating to one version are linked in some way. It will be less labour-intensive to keep one record up to date, and there is no need to maintain a system of cross-references between language versions in order to keep track of different versions of one document. 
Some services only mention the language of the resource in the free-text description, not in a separate field, and often this is not done very consistently within one service. This means that the user may search on the word 'Swedish' in the description field and will thus find resources noted as 'Available in Swedish', but no separate formal support for searching on language will be possible, as the system has no properly encoded language information available on which to base such facilities. To be properly handled by different software, language and character set encoding should be incorporated into metadata and resource description formats explicitly and in a correctly formalised way. The chosen metadata format will have to be able to accommodate this language information. For example, both the Dublin Core element set and ROADS enable the storage of language information in a separate, repeatable element or field. ROADS allows the labelling of different variants of informative fields expressed in different languages. Dublin Core provides a mechanism to define the language of the content of a particular field as an attribute of that field. XML-encoded DC (or RDF in general) can use an XML language attribute and character set encoding (on XML and DC, see above). The metadata largely determine the search support that you will be able to provide. The more sophisticated your metadata set, and the more consistent the cataloguing practice, the more advanced the information retrieval options you will be able to support. On the other hand, 'garbage in = garbage out'. Two of the most widely used protocols for library and general network information retrieval, HTTP and Z39.50, allow language and character set encoding negotiation for each particular communication (HTTP: RFC 2616; Z39.50-LANG). The general scheme for such negotiation is as follows:
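By way of illustration, the step in which the server picks a language the client will accept can be sketched as below. This is a simplified sketch of Accept-Language handling with q-values only; it ignores wildcards and language-tag prefix matching, which the full HTTP algorithm supports:

```python
def negotiate_language(accept_language, available):
    """Pick the best available language for an Accept-Language header value.

    Parses the client's ranked preferences (with optional q-values) and
    returns the highest-ranked language the server can supply, or None.
    """
    prefs = []
    for part in accept_language.split(","):
        item = part.strip()
        if ";q=" in item:
            tag, q = item.split(";q=")
            prefs.append((float(q), tag.strip()))
        else:
            prefs.append((1.0, item))  # no q-value means quality 1.0
    prefs.sort(key=lambda p: -p[0])    # highest preference first
    for _, tag in prefs:
        if tag in available:
            return tag
    return None

print(negotiate_language("sv, en;q=0.8, de;q=0.5", ["en", "de"]))  # → en
```

Here the client prefers Swedish but accepts English and German; since the server has no Swedish version, the next-ranked acceptable language, English, is served.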
Note that the language and character set encoding negotiation provided at communication protocol level should normally coincide with the corresponding information at document level (i.e. in the document itself). If this is not the case, the client may have problems in reading the requested information. It is the responsibility of the WWW server or database administrator to ensure that such a facility is implemented. Multilingual issues in cataloguing: 1. Cataloguing of the title. Normally the title will be catalogued in the language of the resource. Titles for the same resource in other languages may be catalogued in an 'alternative title' field labelled with a language/variant label or attribute defining the language of the content. Some information gateways put alternative titles in the same field, separated by '=' or another symbol. It is recommended, however, to encode alternative titles in a separate field, with a language attribute or label, because this allows for more sophisticated handling of alternative titles in the search interface.
2. Language information in description/annotation. In the free-text description the language(s) in which the resource is available may be mentioned. This has some major disadvantages, because it is hard to guarantee consistency of practice and it does not offer a basis for specifying language in the search process.
Another issue is the language of the descriptions themselves. There are several possibilities; the language of the description could be:
Descriptions in more than one language will of course multiply the necessary effort. A description in the language of the resource may be an option in a distributed model, with an international team of people, without sufficient language skills in a common second language such as English, who select and catalogue resources in various languages. It may, however, be confusing for the user to be confronted with descriptions in various languages. Descriptions in a commonly used language such as English can give users information about documents in languages they cannot read. 3. A separate language field. The language of the resource may be recorded in a separate field, preferably in a standardised format, e.g. ISO 639 or RFC 1766. This facilitates search support for queries that specify the language of the resource. If different language versions are combined in one record, the alternative fields should be labelled so that they are linked to the title version that they belong to and the correct version of the title may be displayed to the user. This practice is recommended instead of only mentioning the language(s) of the resource in a free-text description. 4. URIs. Where there is one record for different language versions, the URIs of all available language versions may be listed. In this case there should be some labelling of the URIs to link them to the title version to which they belong. Another option is to give just one URI, that of the home page, and let users choose their preferred language by using the language switch in the document. This will require less effort in creating the record and less maintenance; there can be only one possible 'dead link' instead of two or more. On the other hand, different language versions will then be presented as equal, and it will be impossible to say which is the main version.
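Points 3 and 4 can be combined in a single record that carries a separate, repeatable language field plus linked title and URI variants per language version. The sketch below is illustrative only; the field names, the example record and the search function are invented, not a prescribed format:

```python
# One record for a resource available in two language versions, with the
# title and URI variants keyed (linked) by language, plus a separate,
# repeatable language field for formal search support.
record = {
    "title": {"en": "Annual Report", "sv": "Årsredovisning"},
    "uri": {"en": "http://example.org/en/", "sv": "http://example.org/sv/"},
    "language": ["en", "sv"],  # ISO 639 codes in a dedicated field
}

def versions_in(records, lang):
    """Return (title, uri) pairs for records available in language `lang`."""
    hits = []
    for rec in records:
        if lang in rec["language"]:
            hits.append((rec["title"][lang], rec["uri"][lang]))
    return hits

print(versions_in([record], "sv"))  # → [('Årsredovisning', 'http://example.org/sv/')]
```

Because the language codes live in their own field, a query restricted to Swedish finds the record formally, and because the variants are keyed by the same codes, the Swedish title is displayed with the Swedish URI.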
|
4. Searching and browsing
|
|||||||
Cross-language information retrieval (CLIR) is the possibility of formulating queries in a natural language and retrieving documents in languages other than the language used for the query. The main approaches are defined (by Peters & Picchi, 1997) as:
* In this approach large collections of texts are analysed to extract the information needed to construct application-specific translation methods; this usually involves vector space and probabilistic techniques. The first two approaches are the most relevant for information gateways: 1. Text translation via machine translation techniques For cross-language information retrieval, machine translation of the documents does not seem to be the most realistic option, because of the costs (and the fact that some aspects of it, such as the treatment of word order, are redundant for CLIR). More feasible is the translation of the query into the language(s) of the documents. Retrieved documents may then be translated for the user, if required, a service that Alta Vista currently provides. It would be possible to add this service to an information gateway. Although the results of machine translation are far from perfect, readers may prefer a flawed translation of a document they cannot read to none at all. 2. Knowledge-based techniques The first attempts involved matching the query to the document using machine-readable dictionaries, but the best results have been achieved with thesaurus-based approaches. The drawback is that thesaurus construction and maintenance are expensive, and training is required for optimum usage. In the case of thesaurus-based controlled vocabulary indexing and searching, a set of monolingual thesauri is used which all map to a common system of concepts. Instead of the labour-intensive manual assignment of thesaurus terms by indexers, research is being carried out in the area of (semi-)automatic assignment of terms. Thesauri may also form the basis for more complex cross-language free-text searching, where the query must be mapped to possible terms in the language(s) of the documents. ISO 5964 recognises three approaches to the construction of multilingual thesauri:
Although some gateways use thesauri for subject access (OMNI) or to provide the user with additional assistance in the choice of search terms (SOSIG), little or no use has been made by gateways of the potential of using a thesaurus for multilingual retrieval. 3. Classification schemes If resources are classified using the numerical code from a classification scheme which is available in more than one language, this enables language-independent searching as well as the possibility of offering a browsing structure in more than one language.
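The language-independent browsing that a multilingual classification scheme enables can be sketched as follows; the codes are loosely modelled on Dewey-style numbers and the captions are abbreviated for illustration:

```python
# Resources are classified by numeric code; only the browse captions are
# language-specific, so the same classified collection can be browsed
# through an interface in any language for which captions exist.
CAPTIONS = {
    "330": {"en": "Economics", "de": "Wirtschaft"},
    "370": {"en": "Education", "de": "Bildung"},
}

def browse_heading(code, ui_lang):
    """Caption for one classification code in the chosen interface language."""
    return CAPTIONS[code][ui_lang]

print(browse_heading("330", "de"))  # → Wirtschaft
```

Searching on the code "330" retrieves the same resources whatever the interface language, because the code itself carries no language.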
When choosing a classification scheme for your service, consider:
4. Keywords Keywords may be added to the resource description in any language. Here too, a consistent policy may enhance retrieval possibilities. A number of options are possible:
Keywords may be chosen from an uncontrolled keyword list or from a controlled vocabulary; when available in more than one language this will provide opportunities for searching documents in various languages by means of a query in one language. The user should be made aware of the available options.
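A controlled vocabulary available in more than one language supports exactly this kind of cross-language matching: monolingual terms map to shared concept identifiers, so a query in one language can match documents indexed in another. The toy sketch below illustrates the principle; all terms and concept codes are invented:

```python
# A miniature multilingual controlled vocabulary: each (language, term)
# pair maps to a language-independent concept identifier.
CONCEPTS = {
    ("en", "unemployment"): "C042",
    ("fr", "chômage"): "C042",
    ("en", "housing"): "C117",
    ("fr", "logement"): "C117",
}

def cross_language_match(query_lang, query_term, doc_lang, doc_terms):
    """True if the query denotes the same concept as any document keyword."""
    concept = CONCEPTS.get((query_lang, query_term))
    if concept is None:
        return False
    return any(CONCEPTS.get((doc_lang, t)) == concept for t in doc_terms)

# An English query finds a document indexed only with French keywords.
print(cross_language_match("en", "unemployment", "fr", ["chômage"]))  # → True
```

The expensive part in practice is not this lookup but building and maintaining the mapping itself, as noted above for thesauri.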
|
5. The user interface
|
||||
A monolingual user interface will probably be in the language of your primary audience or in a language familiar to a broad audience, such as English. The advantage of this is that it will require less effort to maintain, but you will exclude users who are not familiar with your chosen language. In the case of an academic audience, you may usually assume a certain proficiency in English, but a broader audience may not have those language skills. If the interface is in the national language only, you narrow your target audience to one language community, whose size depends on the number of native speakers and of others with a certain level of proficiency in that language. Providing an interface in more than one language means that you will reach a broader audience, but you will have to put more effort into maintaining your service. The target audience that you wish to serve will be of major importance when choosing the interface language(s). Another issue to consider is whether you are willing and able to match your multilingual interface with multilingual search support. For instance, if you provide a browsing structure based on a classification scheme which is available in one language only, do you want to put effort into translating the scheme into another language used in your interface? In general users should be made aware of the consequences of the way they formulate their queries. This is easier said than done, if you want to avoid extensive help files or cluttered interfaces. For example: a simple query (all fields) in French may retrieve a document with the specified word in the title, but it will not result in any hits in the description field if the language used for the description is English. As is well known, users are not very keen on reading help pages, so the search interface design should aim to present the language options in a clear and intuitive way.
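The pitfall in the example above - a French query matching the title but never the English description - can be made concrete with a small sketch (the record contents are invented):

```python
# A record whose title is in the resource language (French) but whose
# description was catalogued in English.
record = {
    "title": "Économie européenne",
    "description": "Overview of the European economy.",
}

def simple_search(rec, term):
    """Case-insensitive all-fields match, as in a naive 'simple query'."""
    return any(term.lower() in value.lower() for value in rec.values())

print(simple_search(record, "économie"))    # → True  (hit in the title)
print(simple_search(record, "économique"))  # → False (description is English)
```

A French-speaking user has no way of knowing that the second query failed only because of the cataloguing language, which is why the interface should make such consequences visible.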
|
General conclusions
|
|
Multilinguality is a complex issue. Although a lot of technology has become available in recent years, many problems have yet to be solved. In most cases gateways will not be able to provide more than very basic facilities if they need to keep costs within acceptable limits. However, from the above it should be clear that putting some effort into making consistent choices - based on user needs - concerning such issues as scope and selection policy, metadata and cataloguing, classification and subject indexing, and the use of appropriate technologies, may enhance the language support you will be able to provide in your service; it will also allow you to project a clearer picture to your users of what your gateway is about. Any extra facilities will have their costs, though, in terms of extra initial effort, maintenance, required staff skills and so on, and it is up to you to decide whether the user benefits outweigh the efforts necessary to provide them. General recommendations
|
Glossary
|
|
CEN - European Committee for Standardisation |
References
|
|
DutchESS, http://www.konbib.nl/dutchess/
EuroWordNet, http://www.hum.uva.nl/~ewn/
Jyväskylä Virtual Library, http://www.jyu.fi/library/virtuaalikirjasto/engroads.htm
SOSIG, http://www.sosig.ac.uk/
Unicode Consortium, http://www.unicode.org
H. Alvestrand, RFC 1766, 'Tags for the Identification of Languages' (UNINETT, March 1995).
G. Clavel et al., CoBRA+ Working Group on Multilingual Subject Access: Final Report (Bern, 9 March 1999).
Y. Demchenko, i18n and Multilingual Support in Internet Mail Standards: Overview.
Encoding Dublin Core Metadata in HTML (Internet Draft).
Extensible Markup Language (XML) 1.0 (W3C Recommendation, 10 February 1998).
The ISO 8859 Character Sets.
ISO 639, 'Code for the representation of names of languages'.
ISO/IEC 10646-1:1993(E), 'Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane', JTC1/SC2 (1993).
J. Knight, Internationalization in the DESIRE Project.
D. W. Oard, 'Serving Users in Many Languages: Cross-Language Information Retrieval for Digital Libraries', D-Lib Magazine (December 1997).
D. W. Oard, Cross-Language Information Retrieval Resources (Overview).
C. Peters & E. Picchi, 'Across Languages, Across Cultures: Issues in Multilinguality and Digital Libraries', D-Lib Magazine (May 1997).
RFC 2413, 'Dublin Core Metadata for Resource Discovery'.
RFC 2616, 'Hypertext Transfer Protocol -- HTTP/1.1'.
The Unicode Standard, Version 2.0 (Unicode Consortium; Reading, Mass.: Addison-Wesley Developers Press, 1996).
C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin & P. Svanberg, RFC 2130, 'Report from the IAB Character Set Workshop' (April 1997).
E. Worsfold et al., Developing Multilingual Subject Gateways (an issues paper written as part of the DESIRE Cataloguing Project).
F. Yergeau, RFC 2279, 'UTF-8, a Transformation Format of Unicode and ISO 10646' (January 1998).
|
Credits
|
|
Chapter authors: Yuri Demchenko, Marianne Peereboom |
2.13. Co-operation between gateways |
||||
|
Introduction
|
|
The Internet offers great potential for co-operation between gateway services, since it allows geographically distributed databases and people to communicate with one another and to work together to build integrated services. Co-operation between gateways is increasingly being seen as a strategy for:
There are a number of different models for collaborative work, and, as gateways are still a relatively new type of information service, there is still much scope for exploring the potential of co-operation. Those running gateways should consider the benefits of, and opportunities for, co-operation with other gateways. |
Strategic advantages of co-operation
|
|
Why should a gateway consider co-operation with other gateways? Enhancing Internet resource discovery for end-users The development of a myriad of information gateways on the Web is, ironically, making it increasingly difficult for users to search the Internet effectively. Many gateways claim to offer a 'one-stop shop' for finding information, and this may work for certain users; however, other users will benefit from searching more than one gateway. With lots of independent and uncoordinated gateways, this can involve making a series of searches in a number of services, all of which have different interfaces and ways of working. Not easy! Collaboration can help gateways to offer integrated services for end-users. The advantages of this for users (depending on the co-operative model used) may include:
Improving the efficiency and sustainability of gateway services As more organisations invest in building gateway services, more opportunity for collaborative work arises. Collaboration can help organisations to develop their gateways more efficiently and effectively. It can also help them to sustain the gateways in the longer term. The advantages of co-operation for organisations may include being able to:
All of these factors have the potential to improve the service that an organisation can offer to its target users. For some organisations, there will be a greater imperative for collaboration if they have a remit for creating a more comprehensive service than their resources will allow. This applies particularly to libraries, which are often expected to offer access to large collections, despite having limited resources to build them. Disadvantages of co-operation There can be political or funding issues that rule out co-operation; indeed, in some cases gateways will see competition as a natural alternative to collaboration! Disadvantages of gateway co-operation may include: 1. Extra expense. To make some models of co-operation work, extra effort will be required to set up the necessary systems. For example, to make gateways interoperable, some work needs to be done on making different classification schemes, metadata formats and collection development policies compatible. In the longer term, savings may be made from having co-operative strategies, but the initial setup may be too expensive to consider. 2. Intellectual property rights. The question of ownership of metadata records may stand in the way of co-operation. Gateways may have invested considerable resources in creating records and be unwilling to share them or give them away for free. Intellectual property rights on the Internet are still a new area with some unresolved questions, and gateways would need to investigate these before entering co-operative agreements. 3. Agreeing on aims and objectives. Gateways may have incompatible aims and objectives. Having developed with particular audiences in mind, they may have reservations about the value of co-operation for their users which need to be resolved. There may also be issues for funders or sponsors of gateways who have vested interests which need to be considered. |
Models for co-operation
|
|
In the library world, co-operative agreements that support information search and retrieval are commonplace. For example, national libraries each take responsibility for collecting materials published in their country and then offer users access to these collections via inter-library loans. Another example is the sharing of cataloguing effort, where groups of libraries work together to create union catalogues and where the catalogue records are shared and re-used by many libraries, regardless of which library actually created the record. This co-operation enables libraries to:
Such co-operation translates well into the Internet environment and the development of information gateways. Collaboration is particularly pertinent to organisations with a remit for providing access to scientific, cultural and educational resources on a large scale. A number of different models for co-operation between gateways exist.
Co-operative agreements for metadata records
Gateways can create co-operative agreements regarding metadata records, covering both their creation and their use.
Co-operative agreements for creating metadata records
Gateways can share the effort required to create metadata records by dividing responsibilities. For example, a group of gateways can agree that each should spend time creating records for different parts of the Internet, each focusing its efforts on records for resources in a particular subject or language, or from a particular country.
Co-operative agreements for using metadata records
Metadata records can be shared and re-used; they need not be confined to the service which created them or to use in only one service. Agreements on intellectual property rights would need to be established, and work is being done in this area, but the potential exists for gateways to create agreements that enable them to offer users access to records created through a distributed network of gateways.
Building integrated services
Co-operation can lead to the development of integrated gateway services, which offer users access to a number of gateways via a single interface. This interface might offer different levels of functionality.
Guiding users to other gateways/mirrors of gateways
The simplest form of co-operation is for gateways to point to other gateways that might support the user group. This may involve offering a set of hyperlinks to other related gateways, or offering mirrors of related gateways where access could be improved by keeping a local copy of the service. Although each of the gateways would have to be searched serially, the user would be alerted to other gateway services which they might not otherwise have found.
Fully integrating distributed gateways into a single service
In some cases it may be easier for users if they can access many gateways simultaneously. A fully integrated service offers users the chance to select a number of gateways and then to cross-search or cross-browse all of them in one go. A single interface offers users a single point of access to distributed gateway services, and in some cases it will not even be necessary to disclose to users that they are searching distributed databases. Gateways may also offer different interfaces to the same collection of metadata records. For example, a shared pool of metadata records can be developed, where each gateway contributes records to the pool but creates its own interface to the data. In this way, different user groups can be offered a tailor-made interface and gateway service. |
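The cross-searching described above can be sketched as a fan-out over the selected gateways, with the results merged behind a single interface. The Python sketch below is illustrative only: the gateway contents, record fields and search interface are invented assumptions, not part of any existing gateway service.

```python
# Minimal sketch of cross-searching distributed gateways through a
# single interface. Each "gateway" here is simulated as an in-memory
# search function over its own pool of metadata records; a real
# integrated service would query remote services over a shared protocol.

def make_gateway(records):
    """Build a search function over one gateway's metadata records."""
    def search(term):
        term = term.lower()
        return [r for r in records
                if term in r["title"].lower()
                or term in r["description"].lower()]
    return search

# Two invented gateways with invented records (placeholder URLs).
social_gateway = make_gateway([
    {"title": "Social Research Update",
     "description": "Electronic journal on sociology research methods",
     "url": "http://example.org/sru"},
])
engineering_gateway = make_gateway([
    {"title": "Engineering Methods Guide",
     "description": "Annotated guide to research methods for engineers",
     "url": "http://example.org/emg"},
])

def cross_search(term, gateways):
    """Query every selected gateway and merge results, de-duplicating
    by URL, so the user sees one combined result list."""
    seen, merged = set(), []
    for gateway in gateways:
        for record in gateway(term):
            if record["url"] not in seen:
                seen.add(record["url"])
                merged.append(record)
    return merged

results = cross_search("methods", [social_gateway, engineering_gateway])
print(len(results))  # one matching record from each gateway
```

The merge step is where the interface can hide the distributed nature of the service: the user issues one query and receives one de-duplicated list, regardless of how many gateways contributed records.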
Interoperability issues
|
|
Co-operation between gateways raises a number of interoperability issues. In the field of Internet resource discovery the term 'interoperability' refers to 'the transparent searching and retrieval of data from diverse systems and in different metadata formats' (Day, 1999). A lot of research and development has been done on how gateways can be made to interoperate, and this has highlighted the areas where standards are needed. For gateways to co-operate they will need to work at the technical, data and organisational levels, and they will need to agree on common standards: for example, the metadata format used, the rules by which records are created, and the protocols used for searching the distributed services.
A fuller description of interoperability issues is given in the 'Interoperability' chapter in this handbook. However, this overview highlights some of the issues that are being tackled by existing gateways in the co-operative work described in the following sections. |
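As a concrete illustration of one such agreement, gateways that settle on a shared metadata format can exchange records that every partner understands without conversion. The record below is a hedged sketch using element names from the Dublin Core vocabulary; the resource and all its values are invented, and real gateways may equally agree on other formats such as ROADS templates.

```python
# Illustrative only: a metadata record expressed with Dublin Core
# element names. The resource described is invented (placeholder URL);
# the point is that partners sharing one vocabulary can re-use each
# other's records directly.
record = {
    "Title": "Social Science Research Methods Guide",
    "Creator": "Example University Library",
    "Subject": "social science; research methods",
    "Description": "An annotated guide to online research methods resources.",
    "Identifier": "http://example.org/methods-guide",
    "Language": "en",
    "Format": "text/html",
}

# Because the element names are agreed in advance, any partner gateway
# can index this record without format conversion.
print(sorted(record))
```

Agreeing the vocabulary is only part of the task; partners also need shared cataloguing rules so that, for instance, every gateway records language codes or subject terms in the same way.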
Practical demonstrations of co-operative work
|
|
Libraries and other organisations still have a lot of work to do on the political and organisational issues involved in co-operative work. However, a number of gateway projects are now able to demonstrate some of the ways in which issues of technical and data interoperability can be solved. This section highlights a few examples of how gateways are co-operating in practical terms. These are ordered from examples of low-level co-operation, which is relatively easy to implement, to high-level co-operation, which requires agreements for a national or international strategy.
|
Key initiatives in gateway co-operation to date
|
|
There is still much potential for co-operative strategies to be developed, particularly within the library community, but some important initiatives in gateway co-operation are already under way.
|
Recommendations
|
|
Libraries, research organisations and educational establishments which are investing in the development of large-scale information gateways would be well advised to work together to create a co-operative strategy. Together they could provide the resources and expertise required to build a comprehensive collection of metadata records describing large numbers of the high quality resources available on the Internet. Integrated services could offer users access to resources from many countries, on many subjects and in many languages. An integrated service could offer users a valuable alternative to other Internet search tools such as search engines and directories, which are often either indiscriminate, pointing to resources of unknown quality, or popularity-driven, pointing to resources that are recreational rather than educational. An international network of information gateways could form the Internet equivalent of an academic research and education library, where users could go to locate high quality resources with confidence. This vision relies on co-operation and we hope that libraries and educational organisations will rise to the challenge. |
Glossary
|
|
cross-browsing - browsing across Web pages that contain resources from more than one gateway |
References
|
|
Biz/ed, http://www.bized.ac.uk
CrossROADS, http://www.ukoln.ac.uk/metadata/roads/crossroads/
DEF Project, http://www.deflink.dk/english/def.ihtml
DESIRE, http://www.desire.org/
EELS, http://www.ub.lu.se/eel/
EEVL, http://www.eevl.ac.uk/
IMesh, http://www.desire.org/html/subjectgateways/community/imesh/
ISAAC, http://scout.cs.wisc.edu/research/index.html
Pinakes, http://www.hw.ac.uk/libWWW/irn/pinakes/pinakes.html
ROADS, http://www.ilrt.bris.ac.uk/roads/
Scout Report Signpost, http://www.signpost.org/signpost/
SOSIG, http://www.sosig.ac.uk/
R. Heery, A. Powell & M. Day, CrossROADS and Interoperability, Ariadne, issue 14
M. Day, ROADS Interoperability Guidelines (1999) |
Credits
|
|
Chapter author: Emma Place |
Last updated : 26 April 00 |
Contact Us © 1999-2000 DESIRE |