3.6 Interoperability - DESIRE Information Gateways Handbook

3.6. Interoperability

In this chapter...
why interoperability is important for information gateways
the role of protocols such as LDAP, Whois++ and Z39.50
interoperability between metadata formats, metadata crosswalks and metadata registries
content issues: cataloguing rules and classification schemes

Introduction

No single information gateway will be able to describe each and every relevant Internet resource, even if it is limited to a relatively small subject area. Therefore, as the Internet continues to grow, gateways will need to co-operate (and interoperate) with each other to create distributed systems with wide geographical and linguistic coverage. Place (1999) suggests that the international library community is well placed to take up this challenge. She also notes that a collaborative network known as IMesh will provide an open forum for exchanging ideas and technology.

Indeed, the consistent use of existing standards and technologies already permits a large amount of inter-gateway collaboration. Considerable technical effort has gone into building interoperability between search protocols and metadata formats, and into developing gateway software that can cross-search more than one gateway.

E X A M P L E

IMesh

IMesh provides an open forum and mailing list for exchanging ideas and technologies for promoting information gateways.

This chapter will not explain in technical detail how to implement interoperability features in a gateway, but will provide an overview of the various issues surrounding gateway interoperability.

Background

In a computer science context the term 'interoperability' is used to refer to the transparent management of different applications and software. In an information gateway context, interoperability generally means one of two things:

being able to search, browse and retrieve information from distributed gateways based on (broadly) the same technologies, protocols and metadata formats
being able to search, browse and retrieve information from distributed gateways based on a variety of software solutions, search and retrieve protocols and metadata formats

These two different challenges require slightly different solutions. Where the same protocols and metadata formats are in use, ensuring interoperability is usually a matter of making sure that each gateway is set up in a consistent manner and has the correct interfaces. For example, it should be relatively easy to ensure that all services based on the Whois++ search and retrieve protocol (e.g. services based on the ROADS software toolkit) can be cross-searched. Interoperability, in these circumstances, becomes less of a technical problem and more a matter of the consistent use of metadata formats and their related content standards (e.g. cataloguing and subject indexing).

Where services are based on a variety of protocols and metadata formats, however, these non-technical problems remain - indeed, they are usually more difficult to solve - but additional technical layers will also need to be developed, involving the production of inter-protocol gateways, 'middleware' and metadata crosswalks.

In practice, however, information gateways tend to be based on a relatively small number of technologies, protocols and metadata formats, at least when compared with the whole information universe. This means that any work carried out on integrating several selected protocols and formats will be applicable in a number of different situations.

Information gateways and interoperability

Ensuring that information gateways are interoperable will generally require the consistent application of available standards. There are four main 'standards-based' factors affecting interoperability among information gateways:

the use of different search and retrieve (or indexing) protocols
the use of different metadata formats
differences in cataloguing standards
differences in subject indexing schemes

Protocols

Interoperability among information gateways requires the consistent use of relevant protocols. The most relevant protocols for gateways are LDAP, Whois++ and Z39.50.

The Lightweight Directory Access Protocol (LDAP)

LDAP (cf. e.g. RFC 2251) was developed as a simple alternative to the ISO X.500 protocol, a directory access protocol designed for providing access to distributed information about people (names, email addresses, telephone numbers, etc). Accordingly, most existing applications of LDAP are so-called 'white pages' services. However, there is no reason why LDAP cannot be used for other services, including information gateways.

E X A M P L E

The Isaac Network

The Isaac Network - an initiative of the Internet Scout Project based in the Computer Sciences Department at the University of Wisconsin-Madison - is using an LDAP directory for Dublin Core metadata records about resources (Roszkowski and Lukas, 1998; Lukas and Roszkowski, 1999).

Whois++

The Whois++ protocol was originally developed for directory services, to operate as a simple (template-based), distributed and extensible information lookup service (RFC 1835). Because its architecture is extensible, however, its developers expected it to find applications in a number of other information service areas. Whois++ also defines a general architecture for indexing distributed databases and applies that architecture to link multiple Whois++ servers into a distributed, searchable wide-area directory service (RFC 1913). Unlike other directory protocols (e.g. X.500 or LDAP), Whois++ does not require a hierarchical representation of the data space; instead, servers 'refer' clients to other servers in a Whois++ 'mesh' (RFC 1914). Queries are routed through this mesh based on 'forward knowledge' held by one server about another. In Whois++, this forward knowledge is maintained using the Common Indexing Protocol (CIP).

CIP is a protocol used between servers in a network to facilitate query routing, the 'act of redirecting and replicating queries through a distributed database system towards the servers holding the actual results via reference to indexing information' (Allen and Mealling, 1997). It is not part of Whois++ and indeed can be used with other protocols such as LDAP. CIP is based upon the concept of index summaries or centroids. A centroid can be considered as a summary of the structured information in a given server; for example, it could be a simple inverted index of the information contained within a database's templates. This summary can then be used, for example, to route queries within a distributed database.
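The idea of a centroid as an index summary can be illustrated with a minimal Python sketch. It assumes that each record is a simple mapping of attribute names to text values; the function name `build_centroid`, the sample records and the tokenisation rule are all hypothetical and are not taken from the CIP or Whois++ specifications.

```python
from collections import defaultdict

def build_centroid(records):
    """Summarise a server's records as: attribute -> set of distinct index terms.

    This is an illustrative simplification of an index summary, not the
    wire format defined for Whois++ or CIP.
    """
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            for token in value.lower().split():
                token = token.strip(".,;:()'\"")
                if token:
                    centroid[attribute].add(token)
    return dict(centroid)

# Hypothetical records held by one server.
records = [
    {"Title": "Medieval Pottery Archive", "Keywords": "archaeology pottery"},
    {"Title": "Roman Coins Database", "Keywords": "archaeology numismatics"},
]

centroid = build_centroid(records)
```

Note that the summary records only which terms occur under which attribute, not which record contains them, which is what keeps it small enough to pass between servers.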

E X A M P L E

ROADS use of Whois++ and centroids

The ROADS software (from version 1) uses the Whois++ protocol to query (and retrieve information from) distributed servers containing structured descriptions (ROADS templates) of Internet resources. In addition, ROADS (version 2) makes use of the centroid facility of Whois++ to facilitate query routing between servers. It may be worthwhile to describe these technologies in more detail.

In a cross-searching context, a ROADS 'index server' will periodically visit ROADS-based information gateways and generate an index summary (or centroid). The centroid for each service (or server) will contain all relevant index terms in that database, so that an initial search of the index server will determine which of the subject services has information that matches a given query. If desired, the query can then automatically be passed on to all the information gateways whose centroids indicate the existence of relevant index terms and the templates containing them returned for display to the end-user. Demonstrations of ROADS cross-searching are currently available on the Web (ROADS project, 1998), as are more detailed descriptions of the technologies that underlie it (e.g. Knight and Hamilton, 1995; Kirriemuir, et al., 1998).
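The two-step process described above (consult the centroids first, then forward the query only to gateways that may hold matches) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the ROADS implementation: gateways are modelled as in-memory lists, centroids as sets of title words, and all names (`make_gateway`, `route_query`, `cross_search`, the sample gateways) are hypothetical.

```python
def make_centroid(records):
    """Collect every distinct title word held by one gateway."""
    return {word for r in records for word in r["Title"].lower().split()}

def make_gateway(records):
    """Return a search function standing in for a remote gateway."""
    def search(term):
        term = term.lower()
        return [r for r in records if term in r["Title"].lower().split()]
    return search

def route_query(term, centroids):
    """Step 1: ask the index server which gateways may hold matches."""
    term = term.lower()
    return sorted(name for name, terms in centroids.items() if term in terms)

def cross_search(term, centroids, gateways):
    """Step 2: forward the query only to the gateways selected in step 1."""
    hits = []
    for name in route_query(term, centroids):
        hits.extend(gateways[name](term))
    return hits

# Hypothetical gateways and their contents.
HISTORY = [{"Title": "Tudor parish registers"}]
MEDICINE = [{"Title": "Clinical trial registers"}, {"Title": "Anatomy atlas"}]

gateways = {"history": make_gateway(HISTORY), "medicine": make_gateway(MEDICINE)}
centroids = {"history": make_centroid(HISTORY), "medicine": make_centroid(MEDICINE)}
```

In this sketch a search for 'anatomy' is forwarded to one gateway only, whereas a search for 'registers' is forwarded to both.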

ROADS

Z39.50

The Z39.50 protocol (e.g. Library of Congress, 1999) is a standard for information retrieval approved by the National Information Standards Organization (NISO) - a committee accredited by the American National Standards Institute (ANSI). It has also been recognised by the International Organization for Standardization (ISO), where it is known as ISO 23950:1998.

The Z39.50 protocol allows client applications to search databases on remote 'target' servers and to retrieve relevant information. It therefore supports the retrieval of information from distributed remote databases (Turner, 1995). The first applications using it, for example software for distributed searching of library online public-access catalogues, were developed specifically for bibliographic data, but attribute sets can be defined to allow the protocol to work with many other types of data. For example, systems using Z39.50 have been developed for libraries, archives, museums and data archives.

E X A M P L E

The AHDS gateway

The Arts and Humanities Data Service (AHDS) consists of five distributed subject-based service providers which, in addition to their other responsibilities, provide access to descriptions of digital resources in five separate subject domains:

Archaeology Data Service (ADS)
History Data Service (HDS)
Oxford Text Archive (OTA)
Performing Arts Data Service (PADS)
Visual Arts Data Service (VADS)

Each of these services operates within a resource description context specific to its own subject domain. For example, the Oxford Text Archive - a service provider for literary and linguistic texts - would normally describe resources using a metadata format known as 'Text Encoding Initiative (TEI) headers'.

The AHDS has implemented a resource discovery system which provides unified access to these heterogeneous (and distributed) resource descriptions using Dublin Core and a Z39.50 gateway (Miller and Greenstein, 1997). Greenstein and Murray (1997, p. 56) explain:

[The Z39.50-based] software acts as a mediating layer between on the one hand, a World Wide Web interface from which users query a range of different catalogue databases and to which merged result sets are returned to the user, and on the other, the underlying catalogue databases themselves. From the users point of view, this 'middleware' irons out any differences that may exist in the underlying databases (e.g. in their native record structure, query language, and record syntax).
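The mediating role described in this quotation can be sketched in Python. The sketch is not the AHDS software and does not use Z39.50; it only illustrates the pattern of sending one query to several catalogues with different native record structures and returning a merged result set in a common form. The sample records, field names and function names are all hypothetical.

```python
# Two hypothetical catalogues with different native record structures.
TEXT_ARCHIVE = [{"titleStmt": "Beowulf: an edition", "author": "Anonymous"}]
IMAGE_ARCHIVE = [{"caption": "Beowulf manuscript, folio 1", "artist": "Unknown scribe"}]

def search(records, field, term):
    """Case-insensitive substring search over one field of a catalogue."""
    return [r for r in records if term.lower() in r[field].lower()]

# Each backend pairs a native query function with a converter to a
# common (Dublin Core-like) record.
BACKENDS = {
    "texts": (
        lambda term: search(TEXT_ARCHIVE, "titleStmt", term),
        lambda r: {"title": r["titleStmt"], "creator": r["author"]},
    ),
    "images": (
        lambda term: search(IMAGE_ARCHIVE, "caption", term),
        lambda r: {"title": r["caption"], "creator": r["artist"]},
    ),
}

def unified_search(term):
    """Query every backend and merge the converted results."""
    merged = []
    for source, (query, to_common) in BACKENDS.items():
        for native in query(term):
            record = to_common(native)
            record["source"] = source
            merged.append(record)
    return sorted(merged, key=lambda r: r["title"])
```

The caller sees only records with 'title', 'creator' and 'source' keys, whatever the structure of the underlying catalogue.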

AHDS gateway

Z39.50 has not been widely implemented by information gateways. However, there is a wider need to ensure that gateways can interoperate with other resource discovery systems (such as library OPACs and hybrid library systems) and with different metadata formats. For these reasons, projects like ROADS have needed to address issues relating to gateway interoperability with Z39.50.

E X A M P L E

ROADS (Whois++) interaction with Z39.50

Although ROADS databases normally make resource descriptions available using Whois++, the ROADS project realised that in some situations it would be desirable to make such databases available to end-user client and intermediate systems that use the Z39.50 protocol.

Two main approaches were adopted:

A Z39.50 to Whois++ gateway. In this solution, the gateway functions as a Z39.50 server, accepting queries from Z39.50 client systems. It then converts them to Whois++ queries and passes them to the ROADS server. As the ROADS server returns results, they are converted into a suitable format for use by Z39.50 client systems and returned to the client as a Z39.50 results set. A Z39.50 to Whois++ gateway, known as ZEXI, has been developed as part of the ROADS project. It is based on the Isite Information System available from CNIDR. ZEXI returns simple, unstructured text-based records known as SUTRS.
Loading ROADS records into a Z39.50-based database. The second approach involves copying records from a ROADS database into another database that has a Z39.50 interface. Typically, the records will require some form of conversion during the copying procedure. Candidate Z39.50 database systems include Isite and the Zebra System developed by Index Data. The Zebra Z39.50 server can make converted ROADS records available in two structured formats (USMARC and GRS-1) and in an unstructured format (SUTRS).

Documentation (and software) on making ROADS databases accessible using this second approach (the ROADS Z39.50 Plugin) is available from the ROADS project Web pages.
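The conversion step in the second approach can be illustrated with a short Python sketch. It is not the ROADS Z39.50 Plugin. It assumes a template written as 'Attribute: value' lines, with continuation lines indented, and renders selected attributes as plain unstructured text. The attribute names in the sample, the continuation rule and the function names are assumptions made for illustration.

```python
def parse_template(text):
    """Parse 'Attribute: value' lines into a dictionary.

    Assumption: a line beginning with whitespace continues the
    previous attribute's value.
    """
    record = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and current is not None:
            record[current] += " " + line.strip()
        else:
            name, _, value = line.partition(":")
            current = name.strip()
            record[current] = value.strip()
    return record

def to_unstructured(record, fields=("Title", "Description", "URI")):
    """Render selected attributes as plain text, one value per line."""
    return "\n".join(record[f] for f in fields if f in record)

# Hypothetical template for illustration only.
SAMPLE = """Template-Type: DOCUMENT
Title: An example resource
Description: A short description that
 continues on a second line
URI: http://example.org/resource
"""

record = parse_template(SAMPLE)
plain_text = to_unstructured(record)
```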

ROADS Z39.50 plugin

Metadata formats

Metadata crosswalks

Different information gateways will often use different metadata formats. For this reason there is a need for crosswalks (or mappings) between formats that can be used as the basis of interoperable systems (such as middleware) or for conversion programs.

Metadata formats

A number of inter-metadata crosswalks exist, many based on Dublin Core (RFC 2413). Core metadata formats are well placed to act as intermediaries for semantic interoperability between heterogeneous resource description models. Weibel (1997, p. 18) suggests that the promotion of a 'commonly understood set of core descriptors will improve the prospects for cross-disciplinary search by unifying related attributes'. He additionally suggests that an important approach to interoperability in a heterogeneous resource description environment would be to map many description schemas into a common set (such as Dublin Core) which would give users 'a single semantic model for searching'.

A number of Dublin Core (DC) based mappings currently exist; for example, there are important crosswalks from Dublin Core to USMARC (Caplan and Guenther, 1996; Network Development and MARC Standards Office, 1997). Other people and organisations have also produced DC mappings for various other formats including TEI headers, the Nordic MARC formats (as part of the Nordic Metadata Project) and UNIMARC (for project BIBLINK). A collection of these metadata mappings is maintained by Day (1996).

The ROADS project has produced metadata crosswalks between ROADS templates, Dublin Core, SOIF and the USMARC format.
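In software, a crosswalk is often little more than a table of equivalences plus a rule for attributes that have no equivalent. The following Python sketch shows the general shape. The mapping table is illustrative only and is not the published ROADS to Dublin Core crosswalk; the source attribute names and the function name `crosswalk` are assumptions.

```python
# Illustrative mapping only: source attribute -> Dublin Core element.
ROADS_TO_DC = {
    "Title": "title",
    "Description": "description",
    "Keywords": "subject",
    "URI": "identifier",
    "Author-Name": "creator",
    "Language": "language",
}

def crosswalk(record, mapping):
    """Convert a record using a mapping table.

    Returns the converted record (values held in lists, since target
    elements may repeat) and the attributes that had no mapping.
    """
    converted = {}
    unmapped = []
    for attribute, value in record.items():
        target = mapping.get(attribute)
        if target is None:
            unmapped.append(attribute)
        else:
            converted.setdefault(target, []).append(value)
    return converted, unmapped

source = {
    "Title": "An example resource",
    "Keywords": "metadata",
    "Admin-Note": "checked by cataloguer",  # hypothetical local attribute
}
dc_record, left_over = crosswalk(source, ROADS_TO_DC)
```

Reporting the unmapped attributes, rather than silently dropping them, makes the information lost in conversion visible to whoever maintains the crosswalk.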

Metadata Registries

Metadata formats require consistent application. This is particularly a problem with formats that are easily adaptable and extensible, such as ROADS templates or Dublin Core. It would be possible for an information gateway to modify (or customise) a metadata format so much that the service based on it would no longer be interoperable (cross-searchable) with other gateways.

One solution would be to require all gateways to conform to an agreed set of metadata attributes. However, this goes against the very flexibility that gateways require in order to provide a good service to their own users. What is needed is a way of recording current practice so that gateways can modify metadata formats in the knowledge of what other gateways have done and without the problem of 'reinventing the wheel'.

E X A M P L E

The ROADS Template Registry

ROADS templates are defined for 15 different resource types. These are known as template types. Some of these template types (e.g. DOCUMENT, MAILARCHIVE and SERVICE) derive from the original IAFA template specification (Deutsch et al., 1994). Other templates have been developed specifically for ROADS-based services (e.g. PROJECT). At least one of the others (TRAINMAT, for training materials) was independently developed and has been published as RFC 2007.

Each template type has a number of set attributes. Some of these are specific to one template type, others are not. ROADS templates use what the IAFA specification calls 'clusters' to group together information on names, addresses and other contact details. Clusters currently in use describe a USER (an individual) or an ORGANIZATION. ROADS-based services can also add new attributes and create new template types.

Experience with ROADS-based gateways demonstrated a need for a metadata registry. The creation of new template types and the adaptation and extension of existing template types by subject services meant that there was no central location where the latest forms of these could be recorded.

The ROADS Template Registry takes the form of a list of template types, including all metadata attributes that have been proven to be useful. The aim of the registry is to preserve flexibility - to allow the creation of new template types and attributes where necessary - but also to prevent the unnecessary proliferation of template types and attributes and to maintain some level of consistency.

Consistency is extremely important in the context of ROADS cross-searching and interoperability. A ROADS user might, for example, want to create a new template type for recorded music; it would be desirable to base this on an existing template type (e.g. VIDEO) and to use - wherever possible - attributes and clusters that are common to more than one existing template type.

ROADS template registry

What are needed are extensible metadata registries which provide canonical definitions of all elements and also disclose local uses. These registries should be understandable by both humans and machines. ISO/IEC 11179:1997 - Specification and standardization of data elements is a formal standard for expressing the semantics of data elements suitable for registries, but few metadata registries based on this standard currently exist.
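A registry of this kind can be pictured as two linked tables, one of canonical definitions and one of disclosed local uses. The Python sketch below is a minimal illustration of that idea. It does not follow the ISO/IEC 11179 data model, and the definitions, the gateway name and the function names are hypothetical.

```python
# Canonical definitions: template type -> attribute -> definition.
# The wording of the definitions is illustrative only.
REGISTRY = {
    "DOCUMENT": {
        "Title": "Name given to the resource",
        "Description": "Free-text account of the resource",
    },
}

# Local extensions disclosed by gateways: (type, attribute) -> list of uses.
LOCAL_USES = {}

def disclose_local_use(template_type, attribute, gateway, definition):
    """Record that a gateway has added or adapted an attribute."""
    LOCAL_USES.setdefault((template_type, attribute), []).append(
        {"gateway": gateway, "definition": definition}
    )

def lookup(template_type, attribute):
    """Report whether an attribute is canonical, locally used, or unknown."""
    canonical = REGISTRY.get(template_type, {}).get(attribute)
    if canonical is not None:
        return {"status": "canonical", "definition": canonical}
    uses = LOCAL_USES.get((template_type, attribute))
    if uses:
        return {"status": "local", "uses": uses}
    return {"status": "unknown"}

# A hypothetical gateway discloses a local attribute.
disclose_local_use("DOCUMENT", "Reading-Level", "gateway-x", "Intended audience level")
```

A gateway planning a new attribute could call `lookup` first and reuse an existing local definition rather than inventing another.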

E X A M P L E

ISO/IEC 11179 registries

Environmental Data Registry (EDR)

The U.S. Environmental Protection Agency (EPA) developed its Environmental Data Registry (EDR) as a comprehensive and authoritative source of reference information about environmental data. The registry provides information on data names, definitions, formats, and relationships and identifies organisations (or individuals) responsible for the various data. Registered users can also register new data elements in the EDR.

National Health Information Knowledgebase (NHIK)

The Australian Institute of Health and Welfare (AIHW) developed its National Health Information Knowledgebase (NHIK) as an 'electronic repository' for health metadata. Data elements within the Knowledgebase have been documented using ISO/IEC 11179.

NHIK

Content issues

Cataloguing

In practice, interoperability is not just dependent upon consistency in the use of the metadata format itself but is also dependent upon the consistency of the content contained within the format. For example, in the library community the MARC formats specify a framework for the description of bibliographic items while the content of MARC records will often conform to other standards, usually based on one of the International Standard Bibliographic Descriptions (ISBDs) or cataloguing rules derived from them.

For this reason, the formulation of cataloguing guidelines will be an important part of the interoperability strategy of a gateway (e.g. Day, 1998). This will mean taking account of cataloguing practice in other gateways and the production of standardised cataloguing rules, considering such issues as:

chief sources of information
capitalisation
date formats
language codes
formats for personal and corporate names
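Some of these rules can be applied mechanically once they have been agreed. The Python sketch below normalises dates, language names and personal names. The particular conventions chosen (ISO 8601 dates, day-first numeric dates, two-letter language codes, 'Surname, Forenames' order) are assumptions for illustration and are not taken from any gateway's guidelines.

```python
from datetime import datetime

# Assumed input conventions; numeric dates are read as day/month/year.
DATE_PATTERNS = ("%Y-%m-%d", "%d %B %Y", "%d/%m/%Y")
LANGUAGE_CODES = {"english": "en", "french": "fr", "german": "de"}

def normalise_date(text):
    """Return the date in ISO 8601 form (YYYY-MM-DD)."""
    for pattern in DATE_PATTERNS:
        try:
            return datetime.strptime(text.strip(), pattern).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

def normalise_language(text):
    """Return a two-letter language code; raises KeyError if unknown."""
    key = text.strip().lower()
    if key in LANGUAGE_CODES.values():
        return key
    return LANGUAGE_CODES[key]

def normalise_personal_name(name):
    """Return the name in 'Surname, Forenames' order."""
    name = " ".join(name.split())
    if "," in name or " " not in name:
        return name  # already inverted, or a single-word name
    forenames, surname = name.rsplit(" ", 1)
    return f"{surname}, {forenames}"
```

Rules of this kind only help interoperability if co-operating gateways apply the same ones, which is why they belong in shared cataloguing guidelines rather than in local code alone.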

Cataloguing

Subject classifications

Another content-based area where interoperability is likely to become an issue is in the application of subject information in the form of classification schemes and thesaurus terms.

Classification schemes provide an information gateway with a browsing structure. It is possible that two or more distributed gateways could be combined to form a single service. Successful cross-browsing will depend upon the consistent application of the same classification scheme. Therefore, information gateways that want to facilitate cross-browsing should, wherever possible, use the same classification system.

Otherwise, complex mappings will have to be produced to enable conversion between schemes. This may not be too difficult at the higher levels of a universal subject hierarchy but where any detail is involved it will become problematic because of theoretical, conceptual, cultural and practical differences between systems.
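The difference between mapping at higher and lower levels of a hierarchy can be shown with a small Python sketch. Both classification schemes and all of the codes below are invented for illustration. The function falls back to the nearest mapped ancestor when a detailed class has no direct equivalent, and reports whether the match was exact.

```python
# Invented codes: scheme A uses dotted notation, scheme B uses labels.
SCHEME_A_TO_B = {
    "A.10": "B-SCI",
    "A.10.3": "B-SCI-PHY",
    "A.20": "B-MED",
}

def map_class(code, mapping):
    """Map a scheme A code to scheme B.

    Returns (target, exact). If the code itself is unmapped, the nearest
    mapped ancestor is used and exact is False. Returns (None, False)
    when no ancestor is mapped either.
    """
    parts = code.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in mapping:
            return mapping[candidate], candidate == code
        parts.pop()
    return None, False
```

In this sketch the detailed class 'A.10.3.7' can only be placed under the broader 'B-SCI-PHY', so the specificity of the original classification is lost in conversion.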

Subject indexing and classification, Co-operation between gateways

Conclusions

It is important for all information gateways to consider interoperability issues. It is generally agreed that the way forward for information gateways is increased co-operation; successful co-operation will depend upon successful interoperability and on the consistent application of standards regarding such matters as protocols, metadata formats, cataloguing rules and subject classification schemes. Gateways can start to make immediate use of existing tools that promote interoperability and to build the technical links between distributed gateways that will form the basis of any future international co-operation.

Glossary

ADS - Archaeology Data Service
AHDS - Arts and Humanities Data Service
AIHW - Australian Institute of Health and Welfare
ANSI - American National Standards Institute
CIP - Common Indexing Protocol
CNIDR - Center for Networked Information Discovery and Retrieval
EDR - Environmental Data Registry
EPA - Environmental Protection Agency
HDS - History Data Service
IAFA - Internet Anonymous FTP Archive
IEC - International Electrotechnical Commission
IETF - Internet Engineering Task Force
ISBD - International Standard Bibliographic Description
ISO - International Organization for Standardization
LDAP - Lightweight Directory Access Protocol
MARC - Machine-Readable Cataloguing
NHIK - National Health Information Knowledgebase
NISO - National Information Standards Organization
OTA - Oxford Text Archive
PADS - Performing Arts Data Service
RFC - IETF Request for Comments
ROADS - Resource Organisation and Discovery in Subject-based services
SUTRS - Simple Unstructured Text Record Syntax
TEI - Text Encoding Initiative
UNIMARC - Universal MARC format
VADS - Visual Arts Data Service
Whois++ - A 'lightweight' Internet protocol for information retrieval
X.500 - An ISO directory protocol
Z39.50 - An ANSI/NISO developed protocol for information retrieval - also known as ISO 23950

References

AHDS gateway, http://ahds.ac.uk:8080/ahds_live/

EDR, http://www.epa.gov/edr/

IMesh, http://www.desire.org/html/subjectgateways/community/imesh

Isaac Network, http://scout.cs.wisc.edu/research/index.html

NHIK, http://www.aihw.gov.au/services/health/nhik.html

ROADS, http://www.ilrt.bris.ac.uk/roads/

ROADS template registry, http://www.ukoln.ac.uk/roads/templates/

ROADS Z39.50 plugin, http://www.ilrt.bris.ac.uk/roads/software/zplugin/

J. Allen & M. Mealling, The architecture of the Common Indexing Protocol (CIP) (FIND Working Group, Internet-Draft, 18 November 1998).
ftp://ftp.isi.edu/internet-drafts/draft-ietf-find-cip-arch-02.txt

P. L. Caplan & R. S. Guenther, 'Metadata for Internet resources: the Dublin Core Metadata Element Set and its mapping to USMARC', Cataloging and Classification Quarterly 22 nos. 3-4 (1996), 43-58.

M. Day, Mapping between metadata formats (Bath: UKOLN The UK Office for Library and Information Networking, 1996).
http://www.ukoln.ac.uk/metadata/interoperability/

M. Day, ROADS cataloguing guidelines (Bath: UKOLN The UK Office for Library and Information Networking, 1998).
http://www.ukoln.ac.uk/metadata/roads/cataloguing/cataloguing-rules.html

P. Deutsch, A. Emtage, M. Koster & M. Stumpf, Publishing information on the Internet with Anonymous FTP (Internet Engineering Task Force, Internet Draft, September 1994).
http://info.webcrawler.com/mak/projects/iafa/iafa.txt

P. Deutsch, R. Schoultz, P. Faltstrom & C. Weider, RFC 1835, Architecture of the WHOIS++ service (Internet Engineering Task Force, Network Working Group, August 1995).
ftp://ftp.isi.edu/in-notes/rfc1835.txt

P. Faltstrom, R. Schoultz & C. Weider, RFC 1914, How to interact with a Whois++ Mesh (Internet Engineering Task Force, Network Working Group, February 1996).
ftp://ftp.isi.edu/in-notes/rfc1914.txt

J. Foster, M. Isaacs & M. Prior, RFC 2007, Catalogue of network training materials (Internet Engineering Task Force, Network Working Group, October 1996).
ftp://ftp.isi.edu/in-notes/rfc2007.txt

D. Greenstein & R. Murray, 'Metadata and middleware: a systems architecture for cross-domain discovery' in P. Miller & D. Greenstein, eds., Discovering online resources across the humanities: a practical implementation of the Dublin Core (Bath: UKOLN on behalf of the Arts and Humanities Data Service, October 1997), 56-62.
http://ahds.ac.uk/public/metadata/disc_06.html

ISO 23950:1998, Information and documentation - Information retrieval (Z39.50) - Application service definition and protocol specification (Geneva: International Organization for Standardization, 1998).

ISO/IEC 11179:1997, Information technology - Specification and standardization of data elements (Geneva: International Organization for Standardization, 1997).

J. Kirriemuir, D. Brickley, S. Welsh, J. Knight & M. Hamilton, 'Cross-searching subject gateways: the query routing and forward knowledge approach', D-Lib Magazine (January 1998).
http://www.dlib.org/dlib/january98/01kirriemuir.html

J. P. Knight & M. Hamilton, Overview of the ROADS software (LUT CS-TR 1010. Loughborough: Loughborough University of Technology, Department of Computer Studies, 1995).
http://www.roads.lut.ac.uk/Reports/arch/arch.html

Library of Congress, Z39.50 Maintenance Agency [home page], (Washington, D.C.: Library of Congress 1999).

C. Lukas & M. Roszkowski, 'The Isaac Network: LDAP and distributed metadata for resource discovery', Third IEEE Meta-data Conference, National Institutes of Health, Bethesda, Md., USA, 6-7 April 1999.
http://computer.org/conferen/proceed/meta/1999/papers/46/clukas.html

P. Miller & D. Greenstein, Discovering online resources across the humanities: a practical implementation of the Dublin Core (Bath: UKOLN on behalf of the Arts and Humanities Data Service, October 1997).
http://ahds.ac.uk/public/metadata/discovery.html

Network Development and MARC Standards Office, Dublin Core/MARC/GILS Crosswalk (Washington, D.C.: Library of Congress, 4 July 1997).
http://lcweb.loc.gov/marc/dccross.html

E. Place, 'International collaboration on Internet subject gateways', 65th IFLA Council and General Conference, Bangkok, Thailand, 20-28 August 1999.
http://www.ifla.org/IV/ifla65/papers/009-143e.htm

ROADS project, CrossROADS (Bath: UKOLN The UK Office for Library and Information Networking, 1998).
http://roads.ukoln.ac.uk/crossroads/

M. Roszkowski & C. Lukas, 'A distributed architecture for resource discovery using metadata', D-Lib Magazine (June 1998).
http://www.dlib.org/dlib/june98/scout/06roszkowski.html

F. Turner, An overview of the Z39.50 Information Retrieval standard (UDT Occasional Paper, 3. Ottawa: IFLA Universal Dataflow and Telecommunications Core Programme, 1995).
http://www.ifla.org/VI/5/op/udtop3.htm

M. Wahl, T. Howes & S. Kille, RFC 2251, Lightweight Directory Access Protocol (v3) (Internet Engineering Task Force, Network Working Group, December 1997).
ftp://ftp.isi.edu/in-notes/rfc2251.txt

S. Weibel, J. Kunze, C. Lagoze & M. Wolf, RFC 2413, Dublin Core metadata for resource discovery (Internet Engineering Task Force, Network Working Group, September 1998).
ftp://ftp.isi.edu/in-notes/rfc2413.txt

C. Weider, J. Fullton & S. Spero, RFC 1913, Architecture of the Whois++ Index Service (Internet Engineering Task Force, Network Working Group, February 1996).
ftp://ftp.isi.edu/in-notes/rfc1913.txt

Credits
	Chapter author: Michael Day
	With contributions from: Rachel Heery

Last updated: 20 April 00. © 1999-2000 DESIRE