Search | Help |
2.3. Metadata formats |
||||
|
Introduction
|
|
Information gateways are characterised by their creation of third-party metadata records - individual descriptions of Internet resources held in a database that have separate fields for different attributes of the resources, such as title, author, URL etc. These resource descriptions are used to:
Gateways adopt the approach where metadata is created by a third party ie. an independent subject specialist or information professional, rather than the creator of the resource. This enables the quality control for which gateways are renowned - the resource descriptions all assume a standard format and are generated manually (at least in part) to enable high quality metadata that benefits for semantic judgements about the nature and origin of the resources. The metadata created by gateways is their greatest asset - adding value to the Internet resources by creating independent, standardised third-party descriptions. The decision of which metadata format to use is an imporatnt one as it impacts on the searching capabilities of the gateway and the value of the descriptions to the end-users. The creation of metadata will be one of the most time-consuming tasks in running a gateway and so a balance between value and cost may be required in deciding on a format. This chapter will introduce some of these issues and provide some background information that information gateway managers will need to consider when choosing a metadata format for their gateway. |
Why create metadata records?
|
|
Information gateways are services that give access to networked resources in particular subject areas, linguistic domains, and so on. Many Internet portals simply comprise of sets of Web pages with lists of hyperlinks on a static Web page, perhaps with annotations, however, this approach has distinct disadvantages:
Gateways take advantage of database technologies which gets over both these problems, but requires that a standard format be used for creating and storing the resource descriptions. Metadata formats are structured formats for Internet resource descriptions. For gateways, the metadata fomats are the forms or templates that need to be filled in by the cataloguers to create a resource description. The use of metadata by an information gateway has many benefits over the simple HTML list approach, for example:
|
Metadata attributes
|
||||||||||
Gateways staff will need to agree on the attributes of an Internet resource that they wish to describe. Metadata can be grouped into various kinds according to their use within the gateway. They might include: Descriptive Descriptive metadata contain information which may be usefully returned from a search of the gateway. A user may be able to decide from this information whether it is worth spending time looking at the resource itself.
Subject Subject metadata can facilitate effective searching. They can also be used to organise the browsing structure of your gateway. A fuller discussion can be found in the
Administrative Administrative metadata are intended primarily to assist the gateway staff in maintaining the gateway. They are of less concern to users and may not be visible to them; however, they can be used, for example, to check that resource descriptions are still current.
Consideration of which particular administrative functions are required and an assessment of which particular administrative metadata elements are needed will be an important part of choosing (or adapting) a metadata format for use in a particular information gateway. Core metadata The possible metadata fields listed above are by no means exhaustive, but including them all would require considerable effort both in initial cataloguing and in keeping records up to date. Not all of them might be appropriate to your gateway. Attempts have been made to define standards for a 'core' of metadata which should be regarded as a bare minimum. One such standard is the Dublin Core.
ROADS offers a number of metadata templates designed for different types of Internet resources. Each template contains attributes specific to the type of Internet resource. For example, the template for describing a mailarchive will have a different set of fields from the template for describing a Web document. ROADS also maintains a 'template registry' where the metadata fields used in the various kinds of ROADS templates are recorded. This ensures that ROADS services are potentially interoperable in this area. New fields can be nominated for addition to the registry.
|
Choosing metadata attribites
|
|
You should think carefully about which metadata attributes your gateway is going to use, and their format, when you first set up the gateway. If you do not, you may find yourself constrained by the absence of useful metadata, or have to add a new metadata field or convert an existing field to a different format when you already have several thousand resources in your database. Moreover, decisions about metadata will in turn affect the design of your interface (especially the parts of it used for cataloguing and/or submitting new resources for consideration). Which metadata fields could be usefully searched on by your users? You should consider your potential user community and also the nature of the resources which your gateway will cover. For example, if your gateway is intended to cover only geographically local resources in one language, a 'language' field will not be very informative unless your gateway is going to be cross-searched with others elsewhere. And how are they going to search them? This will affect not only what metadata fields you provide but also the cataloguing rules you adopt. For example, if you are ranking searches by the frequency of the occurrence of the search term, you may wish to make descriptions similar in length, otherwise resources with long descriptions may be more likely to returned high up the order.
Which metadata fields will be displayed to the users of the gateway? Will they need to be converted from the form in which they are stored and if so does an easy way of converting them exist? Which metadata fields will be used for housekeeping by the gateway staff and how? Metadata can supply information for partially automating this otherwise laborious aspect of gateway management. For example, you can have an automatic email sent to maintainers of resources occasionally to ask whether they have made any changes, or set a web-page tracking tool to monitor changes to resources. Which if any are optional? If you are collaborating (or thinking of it), which metadata fields will be shared with your collaborators? Are they likely to want extra information, such as language, which you would not otherwise include in your metadata? You will need to use the same schemes for e.g. classification or have a usable crosswalk to convert between schemes. You should also think about the issue of copyright.
Are you going to display your metadata in the same format as that in which you store it? If not, you will need a way of converting between formats. Can any of the software you are using generate useful metadata? For example, ROADS automatically records when a template was last updated. You may wish to use in addition software for creating metadata (see below). Harvesting software, if used, may also be able to harvest metadata.
Who will generate metadata fields (and which ones?). Metadata may be supplied by:
How much cross-checking will there be? (Time will need to be allowed for this). If you are allowing gateway users or information providers to submit resources, what information should they supply? What information may they also supply optionally? How important is it that (for example) descriptions or keywords are consistent across the gateway? If this is important, can you supply cataloguing rules or other guidance to help information providers and others who are submitting resources? How much effort can be expended on editing their contributions, given that gateway users and information providers cannot be compelled to follow your cataloguing rules?
How might you ensure that information such as dates is in a consistent format? Possible methods include:
In what language are your metadata records going to be kept? If this is different from the language of some of your resources, are you going to make any provision for searching in that language (e.g. an 'alternative title' field)? |
Standard metadata formats
|
|||||||||||||||||
Information gateway managers will need to make decisions about which metadata format (or formats) to use within their service at a very early stage of its development. At present, however, the existence of a large and varied range of metadata formats and initiatives complicates these decisions. It is worth remembering also that the choice of metadata formats will often be influenced by other factors, both technological and social. For example, an information gateway that wishes to use the ROADS software toolkit with little modification will currently need to use the ROADS template format, or something very similar to it. Again, where gateway cross-searching or interoperability is seen to be important, there may be technical reasons why one format may have advantages over another. The nature of metadata development means that at any one time there are likely to be a variety of formats that could be chosen as the basis of an information gateway. For example, a review of metadata formats undertaken under DESIRE I identified and described over twenty formats that were in use (or under development) in 1996 (Dempsey et al., 1997). In order to help analyse the different metadata formats described in the review, the DESIRE I study produced a typology of metadata based upon their underlying complexity.
Figure 1. Typology of metadata formats (adapted from Dempsey and Heery, 1998). |
Choosing a metadata format
|
||||||||||
Choosing a format from the variety of existing ones will depend upon various factors. In general, current information gateways tend to use relatively simple generic formats with some structure ('Band Two' formats such as ROADS templates or Dublin Core). These formats have the twin advantages of simplicity, which means that they are relatively easy to create and maintain, and the existence of some structure, which facilitates both interoperability and format conversion. However, in particular circumstances there may be good arguments for basing an information gateway on more complex formats ('Band Three' formats such as MARC or TEI headers) if this offers some competitive advantage to the gateway. For example, the USMARC format has been used for the cataloguing of Internet resources in the InterCat project and it would be possible to set up MARC-based information gateways. However, the use of these more complex formats may have implications for the level of expertise (technical and other) that would be required for cataloguing and may have other costs. As noted before, the choice of a particular format may be dictated by technological or social factors. For example, particular gateway software may dictate the use (or non-use) of particular formats. Information gateways that, for example, are running the ROADS software without much modification will need either to use one of the existing templates defined by the ROADS project or to create new (and similar) templates in the form of attribute-value pairs. Example format 1: Dublin Core The Dublin Core (DC) is the result of an international and interdisciplinary initiative to define a core set of metadata elements for electronic resources, primarily for resource discovery on the Internet. DC was initially conceived as a simple format that could be used for author-generated descriptions of Web resources. However, the format has also attracted the attention of resource description professionals from a variety of communities such as libraries, museums, archives and government agencies.
The format has been developed by means of a series of invitational workshops, the first being held in Dublin, Ohio in March 1995. The workshop series and related work has resulted in the definition of fifteen core metadata elements as RFC 2413 (Weibel et al., 1998). These elements are intended to be repeatable and extensible in any application. The initial focus of DC was the Web, so the initiative has concentrated on the production of draft guidance for the encoding of DC elements, first in HTML (Kunze, 1999) and more recently in XML/RDF (e.g. Miller, Miller and Brickley, 1999).
Example format 2: ROADS templates ROADS templates are a development of the IAFA templates originally developed for anonymous FTP archives (Deutsch et al., 1994). IAFA templates are a simple text-based metadata format consisting of predefined sets of attribute-value pairs. Templates exist for a number of different resource types, but the templates most commonly used in existing ROADS-based gateways are those designated SERVICE, DOCUMENT and MAILARCHIVE.
|
Format conversion
|
||||
One of the advantages of using well-defined and structured metadata formats is that this allows conversion into other formats when necessary. This is useful in two main circumstances:
Format conversion is facilitated by the creation of crosswalks (or mapping tables) between metadata formats. Crosswalks can be used as the basis for the production of a specific conversion program or for the production of search systems that would permit the interrogation of heterogeneous metadata formats. A number of metadata format crosswalks have been published. One of the earliest DC-based crosswalks mapped Dublin Core to USMARC (Caplan and Guenther, 1996) and other crosswalks exist for other formats including Text Encoding Initiative (TEI) headers, ROADS templates and a variety of MARC formats, including the Universal MARC format (UNIMARC). A collection of metadata mappings is maintained on the UKOLN Web site (Day, 1996).
|
Future proofing
|
|
Any choices concerning metadata will need to take into account possible future developments. The gateway may decide to expand by including new types of descriptions (possibly for new types of resource such as images or multimedia) or to include additional metadata (such as descriptions aimed at alternative audiences, rights metadata, digital preservation data). At the simplest level, updates and extensions to existing metadata element sets need to be accommodated. The gateway may want to ensure that:
Within the lifetime of the gateway, it may have to migrate to a different system which will require different metadata formats, whether these are new versions of existing formats or completely different. Re-structuring the metadata can be done more efficiently if the gateway follows some general guidelines for the content of metadata. Such guidelines might include recommendations that:
|
Conclusions
|
|
Choosing a metadata format is one of the most important decisions that needs to be made when setting up an information gateway. It is vital that the format is able to work with the software that forms the basis of the gateway service and it should also contain all fields (including administrative metadata) that have been identified as appropriate for the service in question (or the format should be extensible). It is possible that ongoing changes in technologies may require periodic conversion of the gateway database into new formats. This process will require the production of metadata crosswalks and/or format conversion programs. |
References
|
|
BIBLINK, http://hosted.ukoln.ac.uk/biblink/ d2m, http://www.bibsys.no/meta/d2m/ DC-dot, http://www.ukoln.ac.uk/cgi-bin/dcdot.pl Dublin Core, http://purl.oclc.org/dc EdNA, http://www.edna.edu.au/EdNA/ InterCat, http://purl.org/net/intercat ROADS, http://www.ilrt.bris.ac.uk/roads/ P. L. Caplan & R. S. Guenther, 'Metadata for Internet resources: the Dublin Core Metadata Element Set and its mapping to USMARC', Cataloging and Classification Quarterly 22 (3/4) (1996), 43-58. M. Day, Interoperability between metadata formats (Bath: UKOLN, 1996). M. Day, Mapping BIBKLINK Core (BC) to UNIMARC. BIBLINK project document (Bath: UKOLN, 10 September 1998). M. Day, R. Heery & A. Powell, 'National bibliographic records in the digital information environment: metadata, links and standards', Journal of Documentation 55 (1) (1999), 16-32. L. Demspey & R. Heery, 'Metadata: a current view of practice and issues', Journal of Documentation 54 (2) (1998), 145-172. L. Demspey, R. Heery, M. Hamilton, D. Hiom, J. Knight, T. Koch, M. Peereboom & A. Powell, A review of metadata: a survey of current resource description formats (DESIRE deliverable D3.2 (1), March 1997). P. Deutsch, A. Emtage, M. Koster & M. Stumpf, Publishing information on the Internet with Anonymous FTP (Internet-Draft, September 1994). J. Hakala, P. Hansen, O. Husby, T. Koch & S. Thorborg, The Nordic Metadata Project: final report (Helsinki: Helsinki University Library, July 1998). R. Heery, 'Review of metadata formats', Program 30 (4) (1996), 345-373. R. Iannella & D. Campbell, The A-Core: metadata about content metadata (Internet-Draft, 21 June 1999). J. Kunze, Encoding Dublin Core Metadata in HTML (Internet-Draft, 25 May 1999). O. Lassila & R. Swick, eds., Resource Description Framework (RDF) model and syntax specification (W3C Working Draft, 1999). Making of America project, The Making of America II testbed project white paper (Version 1.03, March 16 1998). E. Miller, P. Miller & D. Brickley, eds., Guidance on expressing the Dublin Core within the Resource Description Framework (RDF) (Dublin Core Metadata Initiative, Draft Proposal,1999). S. Weibel, J. Kunze, C. Lagoze & M. Wolf, RFC 2413, Dublin Core metadata for resource discovery (Internet Engineering Task Force, Network Working Group, September 1998). S. Weibel, 'The State of the Dublin Core Metadata Initiative', D-Lib Magazine 5 (4) (April 1999). S. L. Weibel & C. Lagoze, 'An element set to support resource discovery: the state of the Dublin Core', International Journal on Digital Libraries, 1(2) (January 1997), 176-186. |
Credits
|
|
Chapter author: Michael Day |
<< P R E V I O U S | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | N E X T >> |
Go to the table of contents |
Return to: Handbook Home DESIRE Home |
Search | Full Glossary | All References Last updated : 20 April 00 |
Contact Us © 1999-2000 DESIRE |