DESIRE Information Gateways Handbook
2.3. Metadata formats

In this chapter...
 
  • why create metadata records?
  • types of metadata attributes
  • standard metadata formats
  • choosing metadata attributes and formats for your gateway
  • format conversion and future proofing
Introduction
 

Information gateways are characterised by their creation of third-party metadata records - individual descriptions of Internet resources held in a database that have separate fields for different attributes of the resources, such as title, author, URL etc. These resource descriptions are used to:

  • help users learn more about the Internet resources (from a trusted third-party)
  • support information search and retrieval

Gateways adopt the approach where metadata is created by a third party, i.e. an independent subject specialist or information professional, rather than by the creator of the resource. This enables the quality control for which gateways are renowned: the resource descriptions all follow a standard format and are generated manually (at least in part), producing high-quality metadata that benefits from human semantic judgements about the nature and origin of the resources.

The metadata created by gateways is their greatest asset - adding value to the Internet resources by creating independent, standardised third-party descriptions.

The decision of which metadata format to use is an important one, as it affects the searching capabilities of the gateway and the value of the descriptions to end-users. The creation of metadata will be one of the most time-consuming tasks in running a gateway, so a balance between value and cost may be required when deciding on a format.

This chapter will introduce some of these issues and provide some background information that information gateway managers will need to consider when choosing a metadata format for their gateway.


Why create metadata records?
 

Information gateways are services that give access to networked resources in particular subject areas, linguistic domains, and so on. Many Internet portals simply consist of static Web pages containing lists of hyperlinks, perhaps with annotations. However, this approach has distinct disadvantages:

  • the portal can be browsed, but with no database it cannot be searched effectively
  • maintaining the portal is time consuming as all edits and additions require manual changes to the HTML

Gateways take advantage of database technologies, which overcome both these problems but require that a standard format be used for creating and storing the resource descriptions. Metadata formats are structured formats for Internet resource descriptions. For gateways, the metadata formats are the forms or templates that cataloguers fill in to create a resource description.

The use of metadata by an information gateway has many benefits over the simple HTML list approach, for example:

  • the metadata has structure and so can form the basis of far more advanced search facilities within a gateway (e.g. fielded searching, such as searching by title or author)
  • the metadata can be converted to other formats or be otherwise persuaded to interoperate with different search and retrieve protocols
  • it is easier to maintain a database of resource descriptions than a large number of HTML files. Administrative metadata can also be used to record when resources need to be re-evaluated or removed from the database
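
As a rough illustration of the first benefit, the sketch below contrasts a fielded search (title only) with a search across every field. The records, field names and the fielded_search function are hypothetical and are not taken from any particular gateway software.

```python
# Hypothetical records; the field names are illustrative, not a prescribed schema.
records = [
    {"title": "Southampton Oceanography Centre",
     "author": "Bruce Dupee",
     "description": "An introduction to the services provided by the centre."},
    {"title": "Marine Geology Index",
     "author": "A. Cataloguer",
     "description": "Links to oceanography and seafloor research."},
]


def fielded_search(records, field, term):
    """Return records whose named field contains the term (case-insensitive)."""
    needle = term.lower()
    return [r for r in records if needle in r.get(field, "").lower()]


# Search restricted to the title field.
title_hits = fielded_search(records, "title", "oceanography")

# Unstructured search: the term may appear in any field.
any_hits = [r for r in records
            if any("oceanography" in value.lower() for value in r.values())]
```

Here the title search returns only the first record, while the any-field search returns both. That distinction is not available when descriptions exist only as free text on an HTML page.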

Metadata attributes
 

Gateway staff will need to agree on the attributes of an Internet resource that they wish to describe. Metadata attributes can be grouped into various kinds according to their use within the gateway. They might include:

Descriptive

Descriptive metadata contain information which may be usefully returned from a search of the gateway. A user may be able to decide from this information whether it is worth spending time looking at the resource itself.

  • title
  • short title (e.g. an acronym of the full title)
  • alternative title (e.g. title of resource in another language)
  • subtitle
  • description
  • URI (or other location)
  • author
  • language
  • character set encoding
  • organisation - either creating or hosting the resource
  • medium (e.g. text/images/audio/video)
  • type of resource (using types appropriate to your gateway)
  • physical medium
  • copyright owner
  • availability (is payment or registration needed?)
  • software required for access (e.g. specific browsers, MIDI software)
  • quality rating
  • intended audience (e.g. undergraduate level)

Subject

Subject metadata can facilitate effective searching. They can also be used to organise the browsing structure of your gateway. A fuller discussion can be found in the chapter on subject indexing and classification.

Cross reference
Subject indexing and classification

  • keywords
  • classification code
  • classification system - must accompany classification code!
  • terms from thesauri
  • subject headings

Administrative

Administrative metadata are intended primarily to assist the gateway staff in maintaining the gateway. They are of less concern to users and may not be visible to them; however, they can be used, for example, to check that resource descriptions are still current.

  • resource maintainer
  • date of addition of resource to gateway
  • date record was last updated
  • date resource was last changed
  • review-by date
  • expiry date (e.g. of a conference announcement)
  • submitter of resource
  • cataloguer of resource
  • origin of record (if gateway has collaborators)
  • rights ownership
E X A M P L E

ROADS templates contain relatively simple administrative metadata attributes like the following:

To-Be-Reviewed-Date:
Record-Last-Verified-Email:
Record-Last-Verified-Date:
Comments:
Record-Last-Modified-Date:
Record-Last-Modified-Email:
Record-Created-Date:
Record-Created-Email:


Consideration of which particular administrative functions are required and an assessment of which particular administrative metadata elements are needed will be an important part of choosing (or adapting) a metadata format for use in a particular information gateway.
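
One way administrative metadata can support such functions is sketched below: a small routine that lists records whose review date has passed. This is only a sketch. It assumes records are held as dictionaries keyed by ROADS-style attribute names and that To-Be-Reviewed-Date uses the same RFC 822 style date as the Record-Last-Modified-Date values shown elsewhere in this chapter. The handles and the due_for_review function are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def due_for_review(records, now):
    """Return handles of records whose To-Be-Reviewed-Date is on or before `now`."""
    due = []
    for record in records:
        raw = record.get("To-Be-Reviewed-Date")
        if not raw:
            continue  # no review date recorded for this resource
        if parsedate_to_datetime(raw) <= now:
            due.append(record["Handle"])
    return due


# Hypothetical records for illustration only.
records = [
    {"Handle": "rec-1", "To-Be-Reviewed-Date": "Wed, 12 May 1999 18:24:49 +0000"},
    {"Handle": "rec-2", "To-Be-Reviewed-Date": "Fri, 12 May 2000 18:24:49 +0000"},
    {"Handle": "rec-3"},
]
now = datetime(1999, 12, 1, tzinfo=timezone.utc)
overdue = due_for_review(records, now)
```

A routine of this kind could feed a cataloguer's work list or trigger reminder emails. Which of these is appropriate depends on the administrative functions the gateway has decided it needs.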

Core metadata

The possible metadata fields listed above are by no means exhaustive, but including them all would require considerable effort both in initial cataloguing and in keeping records up to date. Not all of them might be appropriate to your gateway.

Attempts have been made to define standards for a 'core' of metadata which should be regarded as a bare minimum. One such standard is the Dublin Core.

E X A M P L E

Dublin Core currently involves 15 core elements:

  1. Title
  2. Author or Creator
  3. Subject and Keywords
  4. Description
  5. Publisher
  6. Other Contributor
  7. Date
  8. Resource Type
  9. Format
  10. Resource Identifier
  11. Source
  12. Language
  13. Relation
  14. Coverage
  15. Rights Management

http://purl.oclc.org/dc/about/element_set.htm


ROADS offers a number of metadata templates designed for different types of Internet resources. Each template contains attributes specific to the type of Internet resource. For example, the template for describing a mailarchive will have a different set of fields from the template for describing a Web document. ROADS also maintains a 'template registry' where the metadata fields used in the various kinds of ROADS templates are recorded. This ensures that ROADS services are potentially interoperable in this area. New fields can be nominated for addition to the registry.

E X A M P L E

ROADS offers metadata formats for the following types of Internet resource:

ROADS template-types:

COLLECTION - experimental
DATASET
DOCUMENT
DUBLINCORE
EVENT - experimental
IMAGE
MAILARCHIVE
PROJECT
SERVICE
SOFTWARE
SOUND
TRAINING MATERIALS
USENET
VIDEO

http://www.ukoln.ac.uk/metadata/roads/templates/


Choosing metadata attributes
 

You should think carefully about which metadata attributes your gateway is going to use, and their format, when you first set up the gateway. If you do not, you may find yourself constrained by the absence of useful metadata, or have to add a new metadata field or convert an existing field to a different format when you already have several thousand resources in your database. Moreover, decisions about metadata will in turn affect the design of your interface (especially the parts of it used for cataloguing and/or submitting new resources for consideration).

Cross reference
Cataloguing

Which metadata fields could be usefully searched on by your users?

You should consider your potential user community and also the nature of the resources which your gateway will cover. For example, if your gateway is intended to cover only geographically local resources in one language, a 'language' field will not be very informative unless your gateway is going to be cross-searched with others elsewhere.

And how are they going to search them?

This will affect not only what metadata fields you provide but also the cataloguing rules you adopt. For example, if you are ranking search results by the frequency of occurrence of the search term, you may wish to make descriptions similar in length; otherwise resources with long descriptions may be more likely to be returned high up the order.
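
The effect of description length on frequency-based ranking can be seen in a small sketch. The two descriptions are invented, and the length-normalised score is just one possible adjustment, not a method prescribed by any gateway software.

```python
import re


def words(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"\w+", text.lower())


def raw_count(text, term):
    """Number of times the term occurs as a whole word."""
    return words(text).count(term.lower())


def normalised_score(text, term):
    """Occurrences per word, so that longer descriptions gain no automatic advantage."""
    tokens = words(text)
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0


# Invented descriptions for illustration only.
short_desc = "Oceanography research centre"
long_desc = ("A large site about marine science with pages on oceanography, "
             "ships, geology and education, plus further oceanography links "
             "and an oceanography library")

raw_short = raw_count(short_desc, "oceanography")
raw_long = raw_count(long_desc, "oceanography")
norm_short = normalised_score(short_desc, "oceanography")
norm_long = normalised_score(long_desc, "oceanography")
```

On raw counts the long description ranks first. Once the count is divided by description length, the short, focused description ranks first. Cataloguing rules that keep descriptions to a similar length achieve a comparable effect without changing the ranking code.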

Cross reference
Subject indexing and classification

Which metadata fields will be displayed to the users of the gateway?

Will they need to be converted from the form in which they are stored and if so does an easy way of converting them exist?

Which metadata fields will be used for housekeeping by the gateway staff and how?

Metadata can supply information for partially automating this otherwise laborious aspect of gateway management. For example, you can have an automatic email sent to maintainers of resources occasionally to ask whether they have made any changes, or set a web-page tracking tool to monitor changes to resources.

Cross reference
Collection management

Which if any are optional?

If you are collaborating (or thinking of it), which metadata fields will be shared with your collaborators? Are they likely to want extra information, such as language, which you would not otherwise include in your metadata? You will need to use the same schemes for e.g. classification or have a usable crosswalk to convert between schemes. You should also think about the issue of copyright.

Cross reference
Co-operation between gateways, Interoperability

Are you going to display your metadata in the same format as that in which you store it?

If not, you will need a way of converting between formats.

Can any of the software you are using generate useful metadata?

For example, ROADS automatically records when a template was last updated. You may wish to use in addition software for creating metadata (see below). Harvesting software, if used, may also be able to harvest metadata.

Cross reference
Harvesting, indexing and automated metadata collection

Who will generate metadata fields (and which ones)?

Metadata may be supplied by:

  • information providers
  • gateway users
  • cataloguers for the gateway
  • subject editors for the gateway
  • core gateway staff
  • another gateway working in collaboration with you
  • automatic generation by software

How much cross-checking will there be? (Time will need to be allowed for this).

If you are allowing gateway users or information providers to submit resources, what information should they supply?

What information may they also supply optionally? How important is it that (for example) descriptions or keywords are consistent across the gateway? If this is important, can you supply cataloguing rules or other guidance to help information providers and others who are submitting resources? How much effort can be expended on editing their contributions, given that gateway users and information providers cannot be compelled to follow your cataloguing rules?

Cross reference
Working with information providers

How might you ensure that information such as dates is in a consistent format? Possible methods include:

  • pulldown menus on forms
  • authority files
  • cataloguing rules
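
Software checks can complement these methods. The sketch below normalises dates typed in a few different styles to a single YYYY-MM-DD form and rejects anything else. The list of accepted input styles is an assumption made for illustration, including the rule that day precedes month in slash-separated dates. A real gateway would set this in its cataloguing rules.

```python
from datetime import datetime

# Input styles a cataloguer might plausibly type; this list is an assumption.
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def normalise_date(value):
    """Return the date as YYYY-MM-DD, or raise ValueError if it is not recognised."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next accepted style
    raise ValueError(f"Unrecognised date: {value!r}")


iso = normalise_date("1999-06-08")
slashed = normalise_date("08/06/1999")
written = normalise_date("8 June 1999")
```

All three inputs are stored as the same value, so date searching and sorting behave consistently whoever entered the record. Month names are matched using the default (English) locale.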

Cross reference
Cataloguing

In what language are your metadata records going to be kept?

If this is different from the language of some of your resources, are you going to make any provision for searching in that language (e.g. an 'alternative title' field)?

Cross reference
Multi-lingual issues


Standard metadata formats
 

Information gateway managers will need to make decisions about which metadata format (or formats) to use within their service at a very early stage of its development. At present, however, the existence of a large and varied range of metadata formats and initiatives complicates these decisions.

It is worth remembering also that the choice of metadata formats will often be influenced by other factors, both technological and social. For example, an information gateway that wishes to use the ROADS software toolkit with little modification will currently need to use the ROADS template format, or something very similar to it. Again, where gateway cross-searching or interoperability is seen to be important, there may be technical reasons why one format may have advantages over another.

The nature of metadata development means that at any one time there are likely to be a variety of formats that could be chosen as the basis of an information gateway. For example, a review of metadata formats undertaken under DESIRE I identified and described over twenty formats that were in use (or under development) in 1996 (Dempsey et al., 1997). In order to help analyse the different metadata formats described in the review, the DESIRE I study produced a typology of metadata based upon their underlying complexity.

Band One [simple]: full text indexes
  • Proprietary formats

Band Two: simple structured generic formats
  • Proprietary formats
  • Dublin Core
  • ROADS templates
  • LDIF

Band Three [complex]: more complex structure, domain specific
  • FGDC
  • MARC

Band Three [complex]: part of a larger semantic framework
  • TEI headers
  • EAD
  • CIMI

Figure 1. Typology of metadata formats (adapted from Dempsey and Heery, 1998).


Choosing a metadata format
 

Choosing a format from the variety of existing ones will depend upon various factors. In general, current information gateways tend to use relatively simple generic formats with some structure ('Band Two' formats such as ROADS templates or Dublin Core). These formats have the twin advantages of simplicity, which means that they are relatively easy to create and maintain, and the existence of some structure, which facilitates both interoperability and format conversion. However, in particular circumstances there may be good arguments for basing an information gateway on more complex formats ('Band Three' formats such as MARC or TEI headers) if this offers some competitive advantage to the gateway. For example, the USMARC format has been used for the cataloguing of Internet resources in the InterCat project and it would be possible to set up MARC-based information gateways. However, the use of these more complex formats may have implications for the level of expertise (technical and other) that would be required for cataloguing and may have other costs.

As noted before, the choice of a particular format may be dictated by technological or social factors. For example, particular gateway software may dictate the use (or non-use) of particular formats. Information gateways that, for example, are running the ROADS software without much modification will need either to use one of the existing templates defined by the ROADS project or to create new (and similar) templates in the form of attribute-value pairs.

Example format 1: Dublin Core

The Dublin Core (DC) is the result of an international and interdisciplinary initiative to define a core set of metadata elements for electronic resources, primarily for resource discovery on the Internet. DC was initially conceived as a simple format that could be used for author-generated descriptions of Web resources. However, the format has also attracted the attention of resource description professionals from a variety of communities such as libraries, museums, archives and government agencies.

E X A M P L E

Example of a DC based gateway

EdNA (Education Network Australia):

EdNA - an information gateway for Australian education resources - uses a metadata standard that is based on the DC element set. The owners of documents are encouraged to embed metadata within their documents where it can be read by the EdNA resource harvester and transferred to the EdNA database.


The format has been developed by means of a series of invitational workshops, the first being held in Dublin, Ohio in March 1995. The workshop series and related work has resulted in the definition of fifteen core metadata elements as RFC 2413 (Weibel et al., 1998). These elements are intended to be repeatable and extensible in any application.

The initial focus of DC was the Web, so the initiative has concentrated on the production of draft guidance for the encoding of DC elements, first in HTML (Kunze, 1999) and more recently in XML/RDF (e.g. Miller, Miller and Brickley, 1999).

E X A M P L E

Example of DC metadata embedded in HTML

<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Title" content="Southampton Oceanography Centre (SOC)">
<meta name="DC.Creator" content="Bruce Dupee (b.dupee@soc.soton.ac.uk)">
<meta name="DC.Subject" content="oceanography, marine, technology, geology, seafloor, education, science, research, ships, vessels">
<meta name="DC.Description" content="An introduction to the services provided by the Southampton
Oceanography Centre - a joint venture between the University of Southampton and the Natural Environment
Research Council. Includes information on internal departments and divisions, and the National Oceanographic Library">
<meta name="DC.Publisher" content="NERC Computer Services">
<meta name="DC.Date" scheme="WTN8601" content="1999-06-08">
<meta name="DC.Type" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="7985 bytes">
<meta name="DC.Identifier" content="http://www.soc.soton.ac.uk/">

Metadata created by DC-dot, a service that will retrieve a Web page and automatically generate Dublin Core metadata, either as HTML <META> tags or as RDF/XML, suitable for embedding in the page header.
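
Embedded metadata of this kind can be read back by software, which is the principle behind harvesters such as the one used by EdNA. The sketch below uses Python's standard html.parser module to collect DC.* META tags from a page. It is a minimal illustration, not a description of how any existing harvester works, and the DCMetaParser class is hypothetical.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs; elements may repeat."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()  # "DC.Title" becomes "title"
            self.elements.setdefault(element, []).append(content)


page = """<html><head>
<meta name="DC.Title" content="Southampton Oceanography Centre (SOC)">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="7985 bytes">
<meta name="keywords" content="not Dublin Core">
</head><body></body></html>"""

parser = DCMetaParser()
parser.feed(page)
dc = parser.elements
```

Values are kept in lists because DC elements are repeatable, as with the two DC.Format tags above. META tags without the DC prefix are ignored.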


Example format 2: ROADS templates

ROADS templates are a development of the IAFA templates originally developed for anonymous FTP archives (Deutsch et al., 1994). IAFA templates are a simple text-based metadata format consisting of predefined sets of attribute-value pairs. Templates exist for a number of different resource types, but the templates most commonly used in existing ROADS-based gateways are those designated SERVICE, DOCUMENT and MAILARCHIVE.

E X A M P L E

Example of part of a ROADS SERVICE template

Template-Type: SERVICE
Handle: 840738289-29226
Title: Southampton Oceanography Centre
URI-v1: http://www.soc.soton.ac.uk/
Admin-Email-v1: webmaster@mail.soc.soton.ac.uk
Publisher-Name-v1: University of Southampton
Publisher-Postal-v1: Southampton Oceanography Centre, University of Southampton, Waterfront Campus, European Way, Southampton SO14 3ZH, United Kingdom
Publisher-City-v1: Southampton
Publisher-Country-v1: UK
Publisher-Phone-v1: +44 (0)1703 596666
Description: An introduction to the services provided by the Southampton Oceanography Centre - a joint venture between the University of Southampton and the Natural Environment Research Council. Includes information on internal departments and divisions, and the National Oceanographic Library
Keywords: Southampton Oceanography Centre; Natural Environment Research Council; NERC;
Subject-Descriptor-v1: 551.46
Subject-Descriptor-Scheme-v1: DDC21
Record-Last-Modified-Date: Wed, 12 May 1999 18:24:49 +0000
Record-Last-Modified-Email: cataloguer@subject-gateway.ac.uk
Record-Created-Date: Wed, 12 May 1999 18:24:49 +0000
Record-Created-Email: cataloguer@subject-gateway.ac.uk
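
Because the format is a list of attribute-value pairs, it is simple to process. The following sketch parses lines of the form shown above into a dictionary. It assumes one attribute per line and ignores any continuation-line or multi-record conventions that the ROADS software itself may support. The parse_template function is hypothetical and is not part of ROADS.

```python
def parse_template(text):
    """Parse 'Attribute: value' lines into a dict of attribute -> list of values."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split at the first colon only, so URLs in the value stay intact.
        attribute, separator, value = line.partition(":")
        if not separator:
            continue  # not an attribute-value line
        record.setdefault(attribute.strip(), []).append(value.strip())
    return record


sample = """Template-Type: SERVICE
Handle: 840738289-29226
Title: Southampton Oceanography Centre
URI-v1: http://www.soc.soton.ac.uk/
Subject-Descriptor-v1: 551.46
Subject-Descriptor-Scheme-v1: DDC21
"""
record = parse_template(sample)
```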


Format conversion
 

One of the advantages of using well-defined and structured metadata formats is that this allows conversion into other formats when necessary. This is useful in two main circumstances:

  1. When a gateway wants to change to using a different metadata format. For example, a gateway that currently uses a custom-built database management system with a Web interface might want to run the ROADS software to take advantage of cross-searching facilities. The gateway's existing records would therefore need to be converted into ROADS templates. These types of conversion will be required periodically as information gateway software and its associated metadata evolve.
  2. To aid interoperability.

Format conversion is facilitated by the creation of crosswalks (or mapping tables) between metadata formats. Crosswalks can be used as the basis for the production of a specific conversion program or for the production of search systems that would permit the interrogation of heterogeneous metadata formats. A number of metadata format crosswalks have been published. One of the earliest DC-based crosswalks mapped Dublin Core to USMARC (Caplan and Guenther, 1996) and other crosswalks exist for other formats including Text Encoding Initiative (TEI) headers, ROADS templates and a variety of MARC formats, including the Universal MARC format (UNIMARC). A collection of metadata mappings is maintained on the UKOLN Web site (Day, 1996).
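
In software, a crosswalk can be as simple as a lookup table from the element names of one format to those of another. The sketch below applies such a table to a ROADS-style record. The mapping shown is purely illustrative and is not one of the published crosswalks mentioned above. The apply_crosswalk function is likewise hypothetical.

```python
# Illustrative mapping only: not one of the published crosswalks cited in the text.
ROADS_TO_DC = {
    "Title": "Title",
    "Description": "Description",
    "Keywords": "Subject",
    "URI-v1": "Identifier",
    "Publisher-Name-v1": "Publisher",
}


def apply_crosswalk(record, crosswalk):
    """Map source attributes to target elements; return (converted, unmapped)."""
    converted, unmapped = {}, []
    for attribute, value in record.items():
        target = crosswalk.get(attribute)
        if target is None:
            unmapped.append(attribute)  # report what the crosswalk cannot carry over
        else:
            converted.setdefault(target, []).append(value)
    return converted, unmapped


roads_record = {
    "Title": "Southampton Oceanography Centre",
    "URI-v1": "http://www.soc.soton.ac.uk/",
    "Publisher-Name-v1": "University of Southampton",
    "Handle": "840738289-29226",
}
dc_record, not_mapped = apply_crosswalk(roads_record, ROADS_TO_DC)
```

Returning the unmapped attributes alongside the converted record makes any loss of information visible. This matters when converting from a richer format to a simpler one.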

Cross reference
Interoperability

E X A M P L E

Examples of metadata conversion projects

Nordic Metadata Project

The Nordic Metadata Project produced a variety of tools designed to aid the wider utilisation of Dublin Core (Hakala et al., 1998). The toolkit included a utility called d2m, a Dublin Core to MARC converter that converts Dublin Core metadata embedded in HTML into a variety of Nordic MARC formats and USMARC.

BIBLINK project

The BIBLINK project developed a custom-built software system (the BIBLINK Workspace) which converts metadata produced by publishers into the UNIMARC format for use by participating national bibliographic agencies (Day, Heery and Powell, 1999). The UNIMARC records can in turn be converted into other formats (usually MARC-based) used by these national bibliographic agencies, who can then enhance them for inclusion in their national bibliography and (possibly) for returning this enhanced record to the publisher. The metadata conversion process in the BIBLINK Workspace uses metadata crosswalks produced for the project by UKOLN (e.g. Day, 1998a).


Future proofing
 

Any choices concerning metadata will need to take into account possible future developments. The gateway may decide to expand by including new types of descriptions (possibly for new types of resource such as images or multimedia) or to include additional metadata (such as descriptions aimed at alternative audiences, rights metadata, digital preservation data). At the simplest level, updates and extensions to existing metadata element sets need to be accommodated. The gateway may want to ensure that:

  • metadata creation tools can be easily extended to deal with new elements and new formats
  • the system has sufficient flexibility to allow a variety of formats to be imported and exported

Within the lifetime of the gateway, it may have to migrate to a different system which will require different metadata formats, whether these are new versions of existing formats or completely different. Re-structuring the metadata can be done more efficiently if the gateway follows some general guidelines for the content of metadata. Such guidelines might include recommendations that:

  • metadata formats and rules for content are agreed among collaborating gateways (this means that gateways can share costs of converting their data)
  • gateways implement local usages by means of local processing rather than by incorporating them into the data (for example, adding punctuation and other presentational enhancements by software processing rather than by storing it as part of the data)
  • there are as few local variants to standard metadata formats as possible. (For example, variant element names can be displayed using local processing rather than by storing non-standard element names.)
  • gateways collaborate with one another so that migration can take advantage of economies of scale
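
The recommendation to handle local usages by local processing can be sketched as follows: records are stored with standard element names, and local display labels are substituted only when a record is rendered. The labels, the sample record and the render function are all hypothetical.

```python
# Hypothetical local display labels; stored records keep the standard element names.
LOCAL_LABELS = {"Creator": "Author", "Identifier": "Web address"}


def render(record, labels=LOCAL_LABELS):
    """Format a record for display, substituting local labels at output time only."""
    lines = []
    for element, value in record.items():
        lines.append(f"{labels.get(element, element)}: {value}")
    return "\n".join(lines)


stored = {
    "Title": "Southampton Oceanography Centre",
    "Creator": "Bruce Dupee",
    "Identifier": "http://www.soc.soton.ac.uk/",
}
display = render(stored)
```

Because the stored record is never altered, it can be exported or migrated using the standard element names without first undoing local changes.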

Conclusions
 

Choosing a metadata format is one of the most important decisions that needs to be made when setting up an information gateway. It is vital that the format is able to work with the software that forms the basis of the gateway service and it should also contain all fields (including administrative metadata) that have been identified as appropriate for the service in question (or the format should be extensible). It is possible that ongoing changes in technologies may require periodic conversion of the gateway database into new formats. This process will require the production of metadata crosswalks and/or format conversion programs.


References
 

BIBLINK, http://hosted.ukoln.ac.uk/biblink/

d2m, http://www.bibsys.no/meta/d2m/

DC-dot, http://www.ukoln.ac.uk/cgi-bin/dcdot.pl

Dublin Core, http://purl.oclc.org/dc

EdNA, http://www.edna.edu.au/EdNA/

InterCat, http://purl.org/net/intercat

ROADS, http://www.ilrt.bris.ac.uk/roads/

P. L. Caplan & R. S. Guenther, 'Metadata for Internet resources: the Dublin Core Metadata Element Set and its mapping to USMARC', Cataloging and Classification Quarterly 22 (3/4) (1996), 43-58.

M. Day, Interoperability between metadata formats (Bath: UKOLN, 1996).
http://www.ukoln.ac.uk/metadata/interoperability/

M. Day, Mapping BIBLINK Core (BC) to UNIMARC. BIBLINK project document (Bath: UKOLN, 10 September 1998).
http://hosted.ukoln.ac.uk/biblink/wp10/bc-unimarc.html

M. Day, R. Heery & A. Powell, 'National bibliographic records in the digital information environment: metadata, links and standards', Journal of Documentation 55 (1) (1999), 16-32.

L. Dempsey & R. Heery, 'Metadata: a current view of practice and issues', Journal of Documentation 54 (2) (1998), 145-172.

L. Dempsey, R. Heery, M. Hamilton, D. Hiom, J. Knight, T. Koch, M. Peereboom & A. Powell, A review of metadata: a survey of current resource description formats (DESIRE deliverable D3.2 (1), March 1997).
http://www.ukoln.ac.uk/metadata/desire/overview/

P. Deutsch, A. Emtage, M. Koster & M. Stumpf, Publishing information on the Internet with Anonymous FTP (Internet-Draft, September 1994).
http://info.webcrawler.com/mak/projects/iafa/iafa.txt

J. Hakala, P. Hansen, O. Husby, T. Koch & S. Thorborg, The Nordic Metadata Project: final report (Helsinki: Helsinki University Library, July 1998).
http://linnea.helsinki.fi/meta/nmfinal.htm

R. Heery, 'Review of metadata formats', Program 30 (4) (1996), 345-373.

R. Iannella & D. Campbell, The A-Core: metadata about content metadata (Internet-Draft, 21 June 1999).
http://metadata.net/admin/draft-iannella-admin-01.txt

J. Kunze, Encoding Dublin Core Metadata in HTML (Internet-Draft, 25 May 1999).
http://www.ietf.org/internet-drafts/draft-kunze-dchtml-01.txt

O. Lassila & R. Swick, eds., Resource Description Framework (RDF) model and syntax specification (W3C Working Draft, 1999).
http://www.w3.org/TR/WD-rdf-syntax/

Making of America project, The Making of America II testbed project white paper (Version 1.03, March 16 1998).
http://sunsite.berkeley.edu/MOA2/wp-v1_03.html

E. Miller, P. Miller & D. Brickley, eds., Guidance on expressing the Dublin Core within the Resource Description Framework (RDF) (Dublin Core Metadata Initiative, Draft Proposal, 1999).
http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/

S. Weibel, J. Kunze, C. Lagoze & M. Wolf, RFC 2413, Dublin Core metadata for resource discovery (Internet Engineering Task Force, Network Working Group, September 1998).
ftp://ftp.isi.edu/in-notes/rfc2413.txt

S. Weibel, 'The State of the Dublin Core Metadata Initiative', D-Lib Magazine 5 (4) (April 1999).
http://www.dlib.org/dlib/april99/04weibel.html

S. L. Weibel & C. Lagoze, 'An element set to support resource discovery: the state of the Dublin Core', International Journal on Digital Libraries, 1(2) (January 1997), 176-186.


Credits
 

Chapter author: Michael Day

With contributions from: Rachel Heery, Emma Place and Virginia Knight



Last updated : 20 April 00
© 1999-2000 DESIRE