DESIRE Information Gateways Handbook
2.12. Multilingual issues

In this chapter...
 
  • providing a multilingual service
  • technical issues
  • interface issues
  • metadata and cataloguing
  • cross-language information retrieval
Introduction
 

Gateways need to address the language needs of their audiences. Users may want to search a multilingual collection by using queries in one language or to retrieve documents in a number of specific languages, preferably also via an interface in the language of their choice. In some cases they may require some translation or summary in another language than that of the document. Ideally you should provide your audience with the language support it needs. In reality this will very likely be restricted, depending on the available technologies, the language skills of available staff involved in selection and cataloguing and cost considerations.


Background
 

Multilinguality: praxis, trends and developments

There are two basic issues relating to multilingual access:

  • the storing, processing and presentation of information in many languages (this is a question of enabling technology)
  • multilingual search and retrieval

A lot of research has been going on in these areas for some time, especially in the retrieval of documents in languages other than that used for the query (cross-language information retrieval) (Oard, 1997). An overview of projects and demonstration systems can be viewed on the Web (compiled by Oard: http://www.ee.umd.edu/medlab/mlir/systems.html).

Nevertheless, existing gateways in general do not have much to offer yet in terms of multilingual support. Quite a few gateways - at least if they are not based in the UK or the US - do have a bilingual interface, usually the language of the country where the gateway is maintained and English, but more sophisticated facilities, such as multilingual search and/or browse support, are not often available. The main conclusion from a review conducted as part of the DESIRE I project in 1997 (Worsfold et al., 1997) was that there was considerable inconsistency in the way existing services dealt with language issues. Not only did different gateways vary in their policies, there was also a lot of inconsistency within individual gateways. For example, titles are sometimes displayed in the language of the resource, and sometimes only in English, and when resources are available in more than one language this is only sometimes mentioned. Some Internet search engines also offer a form of multilingual support, such as interfaces in various languages, localised search by country usually based on domain name, or automatic translation (such as Alta Vista's Babelfish, based on the Systran translation system). The services hardly ever describe the extent of their provisions in a detailed way, so it is difficult to assess what exactly they have to offer.

However, recent developments in the standardisation of metadata and resource description formats, electronic messaging and WWW technology can provide a solid basis for multilinguality in information gateways.

The European Multilingual Community

The number of indigenous European languages, according to CEN TC 304, is 160. The Internet European multilingual community uses more than 30 languages, represented by many character sets with different repertoires and encodings. A property common to all of them is the use of the character-box (or glyph-box) representation or single-byte character sets (SBCS), i.e. each character uses one displayable position. In this they differ from other languages used outside Europe.

Most of the European languages use the Latin script, which consists of the 26 basic characters of the English alphabet (A through Z) in upper and lower case. Some languages, such as French, Spanish or Icelandic, need some additional characters, as well as a number of characters that are composed from the basic ones and the diacritical marks specified in a few basic ISO standards (such as ISO 6937). Fourteen diacritical marks, commonly called 'accent marks', which permit the support of nearly 200 diacritical combinations, complete the set for European Languages. [Demchenko]
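
The composition of accented letters from a basic character plus a diacritical mark can be observed with Unicode normalisation. The following is only an illustrative sketch using Python's standard unicodedata module; the helper name decompose is an assumption made for this example and does not come from any of the standards mentioned.

```python
import unicodedata

def decompose(text):
    """For each character, list the names of its base letter and combining marks."""
    result = []
    for ch in text:
        parts = unicodedata.normalize("NFD", ch)  # canonical decomposition
        result.append([unicodedata.name(p) for p in parts])
    return result

# U+00E9 (e with acute) is a composed character: base letter plus accent mark
print(decompose("\u00e9"))

# U+00F0 (Icelandic eth) is an additional character with no decomposition
print(decompose("\u00f0"))
```

Characters of the first kind can be built from the basic alphabet and a diacritical mark; characters of the second kind have to be present in the repertoire in their own right.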

The repertoires of the official European languages of the members of the European Union (EU) are specified in ISO 8859-1, while the repertoires of Central and Eastern European languages using the Latin alphabet are specified in ISO 8859-2. The Greek alphabet is specified in ISO 8859-7 and the Cyrillic alphabet used in Europe is specified in ISO 8859-5. The most widely used operating systems, such as UNIX and Microsoft Windows, use their own character set encodings (e.g. Windows Code Pages 1250-58 or ANSI) for support of the European languages, including the Cyrillic languages (Russian, Ukrainian, Belorussian, Bulgarian, etc.) in CP1251 [Freed]. The de facto standards for mail and news exchange as well as for WWW information in Russian and Ukrainian speaking communities are KOI8-R (RFC 1489) and KOI8-U (RFC 2319). These different character set encodings implemented in different operating systems are the main source of problems in accessing Internet/WWW content with client software running on these systems.
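
To make the problem concrete, the sketch below encodes one Russian word in three of the encodings mentioned above and then decodes it with the wrong one. It is an illustration only; the codec names koi8_r, cp1251 and iso8859_5 are the names Python's standard library uses for these encodings, and the sample word is arbitrary.

```python
# One Cyrillic word, written with escapes so the source file itself stays ASCII
word = "\u041f\u0440\u0438\u0432\u0435\u0442"

encodings = ["koi8_r", "cp1251", "iso8859_5"]
encoded = {name: word.encode(name) for name in encodings}

# The same characters become different byte sequences in each encoding
for name, data in encoded.items():
    print(name, data.hex())

# Reading KOI8-R bytes as if they were CP1251 yields unreadable text
garbled = encoded["koi8_r"].decode("cp1251")
print(garbled == word)
```

A client that is not told which encoding the server used has no reliable way of choosing the right decoder, which is why explicit charset labelling matters.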


Issues for Gateway Managers
 

Gateway managers will be confronted with various choices relating to the language support of the service they want to provide. Those choices between monolingual and multilingual support present themselves at many different levels:

  1. Scope and selection policy.
  2. Data presentation and resource description formats.
  3. Metadata and cataloguing rules.
  4. Searching and browsing.
  5. The user interface.

1. Scope and selection policy
 

Gateway managers will not be able to avoid language issues when trying to determine the scope and coverage of their service. They will need to decide whether to select all relevant documents, independently of their language, or to restrict the scope of the service to documents in one language or a number of specified languages. The following questions will have to be asked - and answered!

  • will the service include resources written in more than one language, in any language or in a selection of languages?
  • will the service include documents that require the use of Unicode or ISO 10646 character sets to support multiple languages and scripts in one single document, or is it possible to use single-byte character sets which normally contain characters from specific scripts together with the English alphabet/script (i.e. Latin 1, Latin 2, Cyrillic, Greek, Arabic, etc.)?
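
Whether a given document can live in a single-byte character set can be tested mechanically. The following is a minimal sketch, assuming a candidate list of four ISO 8859 parts (Latin 1, Latin 2, Cyrillic, Greek); the function name usable_charsets and the sample strings are hypothetical.

```python
# Python codec names for ISO 8859-1, -2, -5 and -7
SINGLE_BYTE_CHARSETS = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7"]

def usable_charsets(text, candidates=SINGLE_BYTE_CHARSETS):
    """Return the candidate charsets that can encode every character of text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this repertoire
        usable.append(name)
    return usable

# A Latin-script word with one accented letter
print(usable_charsets("Caf\u00e9"))

# Greek and Cyrillic letters in the same document
print(usable_charsets("\u03b1\u03b2 \u0430\u0431"))
```

An empty result for the second string indicates that none of the listed single-byte sets covers the document, so Unicode/ISO 10646 would be needed.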

The choices made in this area directly determine the skills required of the staff responsible for selecting and/or cataloguing the resources as well as the choice of the relevant authoring and access tools and software. For example, creating an information gateway that includes resources in all European languages would require input from a team who had mastered all those languages between them. If the cataloguing is done by a separate team, this team would also have to consist of people with various language skills. Not many gateways will be able to manage such broad coverage with an in-house team. A distributed model - as opposed to a centralised model - could offer a solution, by getting input from a multinational team, located in various countries, providing their input via the WWW. In this case a multilingual development framework needs to be implemented, based on standards in resource description formats (metadata) and information retrieval and exchange.

SOSIG provides an interesting case study of such a model. As the core team of SOSIG consisted of native speakers of English with no other language skills, SOSIG created a system whereby European correspondents suggest resources in a number of other languages to SOSIG staff. Problems with this approach are that the service is dependent on the goodwill of unpaid staff and that communication takes place (almost) exclusively in a virtual environment.

Cross reference
Distributed cataloguing

  . .   R E M E M B E R
  • the needs of your target audience
  • technical features of the software underlying your service
  • the skills of the staff responsible for selecting and/or cataloguing the resource
  • the model for selection of resources (centralised or distributed), and (related to this) the available possibilities for ensuring the collaboration of staff or correspondents with the needed language skills
  • the possibilities for the implementation of a multilingual development framework based on standards in resource description formats (metadata) and information retrieval and exchange as well as supporting development/authoring software.

2. Data presentation and resource description formats
 

A multilingual gateway would require the WWW software lying behind the gateway to cope with multilingual data handling, search, retrieval and display.

Existing standards and recommendations provide a framework for multilingual support in data communications and information resource description formats and metadata.

A model for multilingual support in Internet protocols and applications is defined in RFC 2130. It is implemented both in interactive applications, such as the WWW, and in non-interactive applications, such as electronic mail. Interoperability in those applications rests on two things: character set encoding (charset), which uses registered MIME (Multipurpose Internet Mail Extensions) types, and language tagging, which uses registered language values or names according to RFC 1766 or ISO 639.

The HTTP protocol, on which the WWW is based, includes information about the type of the transferred information and the character encoding for text-based information, for example:

Content-Type: text/html; charset=euc-jp

The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:

Content-Language: se

If no Content-Language is specified, the default is that the content is intended for all language audiences.

It is also recommended to include information about the character encoding being used in the META information of the HTML document:

<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">

Based on the exchange of information between client (browser) and server (HTTP Server) it is possible to provide character encoding and language negotiation between the information provider and the requester with regard to the accepted and preferred formats of the resources.
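
On the server side, language negotiation amounts to comparing the client's Accept-Language header with the language versions the gateway holds. The sketch below is a simplified illustration, not a complete implementation of the HTTP specification: it ignores the "*" wildcard, and the function names and the default of "en" are assumptions made for this example.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (language, quality) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        quality = 1.0  # a language without a q parameter has full weight
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((lang.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep the order the client sent
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

def choose_language(header, available, default="en"):
    """Pick the first acceptable language that the gateway can actually serve."""
    for lang, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if lang in available:
            return lang
        primary = lang.split("-")[0]  # treat 'en-gb' as matching 'en'
        if primary in available:
            return primary
    return default

print(choose_language("nl, en-GB;q=0.8, de;q=0.5", ["en", "de"]))
```

Here the client prefers Dutch, which the hypothetical gateway does not hold, so the next preference that can be satisfied is used.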

Recent developments in XML provide facilities for defining/labelling the language of the whole document, entity or item by including language attributes in the corresponding tag. For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
<l>Habe nun, ach! Philosophie,</l>
<l>Juristerei, und Medizin</l>
<l>und leider auch Theologie</l>
<l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
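
Software can read these language labels back out of a document. The sketch below uses Python's standard xml.etree.ElementTree module; the wrapper element doc and the function name texts_by_language are assumptions for this illustration, and text following a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def texts_by_language(xml_text):
    """Return (language, text) pairs; children inherit xml:lang from ancestors."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(element, inherited):
        lang = element.get(XML_LANG, inherited)
        text = (element.text or "").strip()
        if text:
            found.append((lang, text))
        for child in element:
            walk(child, lang)

    walk(root, None)
    return found

sample = """<doc xml:lang="en">
  <p>What colour is it?</p>
  <sp xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>
</doc>"""

for lang, text in texts_by_language(sample):
    print(lang, text)
```

The p element carries no attribute of its own and inherits "en" from the enclosing element, while the l element inherits "de" from sp.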

Although the default XML Character Set Encodings are UTF-8 and UTF-16 (which are encodings for ISO 10646 or UNICODE), specific encodings for XML documents can be defined in the initial XML declaration for the whole document or entity (which can be regarded as a separately stored part of the whole document), for example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
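
The effect of such a declaration can be tried out with a small round trip. This is an illustrative sketch using Python's standard xml.etree.ElementTree module; the element name l is borrowed from the example above and the choice of ISO-8859-1 is arbitrary.

```python
import xml.etree.ElementTree as ET

line = ET.Element("l")
line.text = "durchaus studiert mit hei\u00dfem Bem\u00fch'n."

# Serialise with an explicit single-byte encoding; for encodings other than
# UTF-8 and US-ASCII the serialiser writes the XML declaration itself
data = ET.tostring(line, encoding="ISO-8859-1")
print(data)

# The parser reads the declaration and decodes the bytes accordingly
restored = ET.fromstring(data)
print(restored.text == line.text)
```

Without the declaration a parser would assume UTF-8 and would reject or misread the single-byte data.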

Dublin Core, as a particular realisation of metadata resource description, provides possibilities for defining the language of the intellectual content of the resource, the record and the labelling language of particular fields by means of assigning language attributes to the relevant Dublin Core field.

Examples

DC.Language Format

<meta name = "DC.Language"
content = "en">
<meta name = "DC.Language"
scheme = "rfc1766"
content = "en">
<meta name = "DC.Language"
scheme = "ISO639-2"
content = "eng">

<meta name = "DC.Language"
scheme = "rfc1766"
content = "en-US">

<meta name = "DC.Language"
content = "zh">
<meta name = "DC.Language"
content = "ja">
<meta name = "DC.Language"
content = "es">
<meta name = "DC.Language"
content = "de">

<meta name = "DC.Language"
content = "german">
<meta name = "DC.Language"
lang = "fr"
content = "allemand">

Field content language labeling/attributing.

A work in Spanish may be assigned the following metadata:

<meta name = "DC.Language"
scheme = "rfc1766"
content = "es">
<meta name = "DC.Title"
lang = "es"
content = "La Mesa Verde y la Silla Roja">
<meta name = "DC.Title"
lang = "en"
content = "The Green Table and the Red Chair">
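
Metadata encoded in this way can be picked up by harvesting software. The following is a minimal sketch based on Python's standard html.parser module; the class DublinCoreParser and the function title_for are hypothetical names, and the fallback to the first title is an arbitrary design choice for this example.

```python
from html.parser import HTMLParser

class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*"> elements together with their lang and scheme."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lower-cased
        name = attributes.get("name") or ""
        if name.startswith("DC."):
            self.fields.append({
                "name": name,
                "content": attributes.get("content"),
                "lang": attributes.get("lang"),
                "scheme": attributes.get("scheme"),
            })

def title_for(fields, preferred_lang):
    """Return the DC.Title whose lang attribute matches, else the first title."""
    titles = [f for f in fields if f["name"] == "DC.Title"]
    for field in titles:
        if field["lang"] == preferred_lang:
            return field["content"]
    return titles[0]["content"] if titles else None

page = '''
<meta name="DC.Language" scheme="rfc1766" content="es">
<meta name="DC.Title" lang="es" content="La Mesa Verde y la Silla Roja">
<meta name="DC.Title" lang="en" content="The Green Table and the Red Chair">
'''

parser = DublinCoreParser()
parser.feed(page)
print(title_for(parser.fields, "en"))
```

Because each title carries its own lang attribute, the interface can show the title that suits the user instead of always showing the first one.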


  . Tips
  • notwithstanding the future advent of total Unicode support in all system and application software, single-byte character sets will continue to be used as well for a long time. Your software should provide correct support and interoperability for both Unicode and single-byte encodings
  • make sure that your workware (client and server software plus font supplement and cartridges) fully supports all working languages and used character sets at all layers of the multilinguality framework model
  • configure your authoring tools (HTML and XML editors) in such a way that they insert metadata or attributes about language and character set encoding in the document. Don't forget to select proper parameters (character set encoding and Language) when you edit particular documents (in WYSIWYG HTML editors) or include this information when you use text editors for writing HTML documents
  • when you provide multilingual information and use encodings other than US-ASCII or Latin 1 (ISO 8859-1) encoding, it is recommended to provide information as to where users can find or download the necessary fonts
  • be sure that your HTTP server inserts correct information into the HTTP header. Note that different browsers may handle information about character set encoding in HTTP headers and metadata in the HTML document headers in different ways
  • consider providing some basic training on multilingual issues for your core development staff

3. Metadata and cataloguing rules
 

If you enable the end-user to specify preferred languages, the search mechanism can return matches for resources that are in a language the user can read. Sometimes you also need to offer a choice of character set encodings so that the text is displayed correctly (i.e. readably) to the user. This is especially important for communities that use multiple character set encodings (charsets). Such selections can be handled as part of the negotiation between the client's browser and the WWW server, if they are defined by modern standards and supported by modern multilingual client/server software. For this to be possible the record must contain appropriate information. In other words, in order to be able to provide this option, some investment in multilingual development software/authoring tools and effort on the cataloguing side is necessary.

Traditional library practice is to create one record for one resource. On the Internet the question is what exactly constitutes a resource - the granularity issue. This is also relevant to language issues. Do you include only complete versions of the document, or do you also register parts of a site that are available in another language? If so, how substantial does the translated section have to be? A related issue is the problem of whether to create a separate record for each language version. For books this has been traditional practice; the translation of a book will get its own cataloguing record. For the Internet environment, it may be worth while to store information about different language versions in one record, as long as the fields relating to one version are linked in some way. It will be less labour-intensive to keep one record up to date, and there is no need to maintain a system of cross-references between language versions in order to keep track of different versions of one document.

Some services only mention the language of the resource in the free text description of the resource, not in a separate field, and often this is not very consistently done within one service. This means that the user may search on the word 'Swedish' in the description field and will thus find resources of which it is noted that they are 'Available in Swedish', but no separate formal support for searching on language will be possible, as the system has no properly encoded language information available on which to base such facilities.

To be properly handled by different software, language and character set encoding should be incorporated into metadata and resource description formats explicitly and in a correctly formalised way. The chosen metadata format will have to be able to accommodate this language information. For example, both the Dublin Core element set and ROADS enable the storage of language information in a separate, repeatable element or field. ROADS allows the labelling of different variants of informative fields expressed in different languages. Dublin Core provides a mechanism to define the language of the content of a particular field as an attribute of this field. XML encoded DC (or RDF in general) can use an XML language attribute and character set encoding (on XML and DC, see above).

The metadata largely determine the search support that you will be able to provide. The more sophisticated your metadata set, and the more consistent the cataloguing practice, the more advanced the information retrieval options you will be able to support. On the other hand, 'garbage in = garbage out'.

Two of the most widely used protocols for library and general network information retrieval, HTTP and Z39.50, allow language and character set encoding negotiation for each particular communication (HTTP-RFC2616, Z39.50-LANG). The general scheme for such negotiation is as follows:

  • the requester or client (in the case of a WWW browser), sends a list of accepted character set encodings (charsets) and an accepted language priority list together with the URL/URI identifier
  • the server/database returns the resource/document in the requested encoding and language, if it is explicitly labelled

Note that the language and character set encoding negotiated at communication protocol level should normally coincide with the corresponding information at document level (i.e. in the document itself). If this is not the case, the client can have problems in reading the requested information. It is the responsibility of the WWW server or database administrator to ensure that such a facility is implemented.
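
A simple check for agreement between the two levels could look like the sketch below. It is an illustration only: the regular expressions cover the common forms of the Content-Type header and the META element shown earlier, not every legal variant, and all function names are assumptions.

```python
import re

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, or None."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", value, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(html_text):
    """Find the charset declared in a META http-equiv Content-Type element."""
    pattern = r"<meta[^>]*http-equiv\s*=\s*[\"']?content-type[\"']?[^>]*>"
    match = re.search(pattern, html_text, re.IGNORECASE)
    return charset_from_content_type(match.group(0)) if match else None

def charsets_agree(header_value, html_text):
    """True when protocol-level and document-level charset labels coincide."""
    header = charset_from_content_type(header_value)
    meta = charset_from_meta(html_text)
    # A missing label on either side cannot contradict the other
    return header is None or meta is None or header == meta

page = '<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">'
print(charsets_agree("text/html; charset=euc-jp", page))
print(charsets_agree("text/html; charset=iso-8859-1", page))
```

Running such a check over the pages a server delivers is one way for an administrator to find documents whose labels contradict the HTTP header.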

Multilingual issues in cataloguing:

Cross reference
Cataloguing

1. Cataloguing of the title.

Normally the title will be catalogued in the language of the resource. Titles for the same resource in other languages may be catalogued in an 'alternative title' field labelled with a language/variant label or attribute defining the language of the content. Some information gateways put alternative titles in the same field, separated by '=' or another symbol. It is recommended, however, to encode alternative titles in a separate field, with a language attribute or label, because this allows for more sophisticated handling of alternative titles in the search interface.

  . .   R E M E M B E R
  • formalised encoding of alternative title information in the metadata format allows for more sophisticated handling of this information by the software
  • defining a 'main' version and 'alternative' versions of a resource may cause problems, if it is not easy to determine what the main language of the resource is. For instance, what is the main title and what are the alternative titles for a Swiss resource, available in French, German and Italian?
  • giving each language version its own record and cross-referencing the records means more maintenance
  • when putting all the language information in one record, give all variants their own fields with attributes defining language
  • that it is labour-intensive to have to check periodically whether other language versions of the same pages have been added
  • what do you do with bits of a document that are in another language?
  • do you want to translate the title of non-English resources into English?

2. Language information in description/annotation.

In the free-text description the language(s) in which the resource is available may be mentioned. This has some major disadvantages, because it is hard to guarantee consistency of practice and it does not offer a basis to specify language in the search process.

  . .   R E M E M B E R
  • if you decide to adopt this approach, you could determine a default language to minimize effort. For instance, for resources available exclusively in English the language does not need to be mentioned, but an English page also available in French would get: 'Available in English and French.'
  • when storing language information in the description field, structured search support for searching on the language of a resource cannot be provided
  • it is almost impossible to check that the subject specialists/cataloguers consistently mention this information; the DESIRE review [Hiom et al.] indicated that this is not very consistently done

Another issue is the language of the descriptions themselves. There are several possibilities; the language of the description could be:

  • the language of the resource it describes
  • the language of the user interface and primary target audience of your service
  • English as the Internet 'lingua franca'
  • combinations of these, such as English and the language of your target audience

Descriptions in more than one language will of course multiply the necessary effort. A description in the language of the resource may be an option in a distributed model, with an international team of people without sufficient language skills in a common other language such as English, who select and catalogue resources in various languages. It may, however, be confusing to the user to be confronted with descriptions in various languages. Descriptions in a commonly used language such as English can give users information about documents in languages they cannot read.

3. A separate language field.

The language of the resource may be stored in a separate field, preferably in a standardised format, e.g. ISO 639 or RFC 1766. This facilitates search support for queries that specify the language of the resource. If different language versions are combined in one record, the alternative fields should be labelled so that they are linked to the title version that they belong to and the correct version of the title may be displayed to the user.

This practice is recommended instead of only mentioning the language(s) of the resource in a free text description.
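
The difference between a coded language field and a mention in the free-text description can be shown with a few invented records. Everything in the sketch below (record contents, field names, function names) is hypothetical and serves only to illustrate the point.

```python
# Invented records: 'language' holds ISO 639 two-letter codes
records = [
    {"title": "Resource A", "language": ["sv", "en"],
     "description": "Statistics portal. Available in Swedish and English."},
    {"title": "Resource B", "language": ["en"],
     "description": "Study of the Swedish welfare system."},
    {"title": "Resource C", "language": ["sv"],
     "description": "Regional history archive."},
]

def search_by_language(records, code):
    """Structured search: match the coded language field exactly."""
    return [r["title"] for r in records if code in r["language"]]

def search_description(records, word):
    """Free-text search: match a word anywhere in the description."""
    return [r["title"] for r in records
            if word.lower() in r["description"].lower()]

print(search_by_language(records, "sv"))
print(search_description(records, "Swedish"))
```

The free-text search returns Resource B, which is about Sweden but written in English, and misses Resource C, whose cataloguer did not mention the language in the description.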

4. URIs.

In the case where there is one record for different language versions, the URIs of all available language versions may be listed. In this case there should be some labelling of the URIs to link them to the title version to which they belong. Another option is to give just one URI, that of the home page, and let users choose their preferred language by using the language switch in the document. This will require less effort in creating the record and less maintenance; there can be only one possible 'dead link' instead of two or more. But, on the other hand, sometimes different language versions will be presented as equal, and it will be impossible to say which is the main version.
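
One way of labelling URIs so that they stay linked to the right title is to group title, URI and language code per language version inside the record. The sketch below assumes such a structure; the field names, the example.org addresses and the fallback to the first listed version are all hypothetical choices.

```python
# One record holding three language versions of the same (invented) resource
record = {
    "versions": [
        {"lang": "fr", "title": "Titre d'exemple",
         "uri": "http://www.example.org/fr/"},
        {"lang": "de", "title": "Beispieltitel",
         "uri": "http://www.example.org/de/"},
        {"lang": "it", "title": "Titolo di esempio",
         "uri": "http://www.example.org/it/"},
    ]
}

def version_for(record, preferred):
    """Return the first version matching the user's ordered language list."""
    by_lang = {v["lang"]: v for v in record["versions"]}
    for lang in preferred:
        if lang in by_lang:
            return by_lang[lang]
    # No preference can be met: fall back to the first listed version
    return record["versions"][0] if record["versions"] else None

chosen = version_for(record, ["en", "de"])
print(chosen["title"], chosen["uri"])
```

Note that the fallback quietly treats the first listed version as the main one, which is exactly the decision that is hard to justify when all versions are presented as equal.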

  . .   R E M E M B E R
  • the language skills of the staff responsible for cataloguing the resources
  • the way language is supported in the metadata format you are using (for instance Dublin Core, MARC, IAFA)
  • the way language issues are handled in the cataloguing rules you use
  • the search support you want to provide; these requirements must be met by the cataloguing format and rules

4. Searching and browsing
 

Cross-language information retrieval (CLIR) is the possibility of formulating queries in a natural language and retrieving documents in languages other than the language used for the query. The main approaches are defined (by Peters & Picchi, 1997) as:

  1. Text translation via machine translation techniques.
  2. Knowledge-based techniques - these involve the use of multilingual dictionaries, thesauri or general purpose ontologies.
  3. Corpus-based techniques*.

*In this approach large collections of texts are analysed to extract the information needed to construct application-specific translation methods. This usually involves vector space and probabilistic techniques.

The first two approaches are the most relevant for Information Gateways:

1. Text translation via machine translation techniques

For cross-language information retrieval, machine translation of the documents does not seem to be the most realistic option, because of the costs (and the fact that some aspects of it, such as treatment of word order, are redundant for CLIR). More feasible is the translation of the query into the language(s) of the document. Retrieved documents may then be translated for the user, if required, a service that Alta Vista currently provides. It would be possible to add this service to an information gateway. Although results of machine translation are far from perfect, readers may prefer a flawed translation of a document they cannot read to none at all.

2. Knowledge-based techniques

First attempts involved matching the query to the document using machine-readable dictionaries, but the best results have been reached with thesaurus-based approaches. The drawback is that thesaurus construction and maintenance is expensive, and training is required for optimum usage. In the case of thesaurus-based controlled vocabulary indexing and searching, a set of monolingual thesauri is used which all map to a common system of concepts. Instead of the labour-intensive manual assignment of thesaurus terms by indexers, research is being carried out in the area of (semi-)automatic assignment of terms. Thesauri may also form the basis for more complex cross-language free text searching, where the query must be mapped to possible terms in the language(s) of the documents. ISO 5964 recognizes three approaches to the construction of multilingual thesauri:

  1. Ab initio construction, i.e. the establishment of a new multilingual vocabulary without direct reference to the terms or structure of an existing thesaurus.
  2. Translation of an existing monolingual thesaurus.
  3. Reconciliation and merging of existing thesauri in two or more working languages.
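
Whichever construction method is chosen, the retrieval principle described above stays the same: each monolingual thesaurus maps its terms to a common system of concepts, and documents are indexed with those concepts. The sketch below illustrates this under invented data; the terms, the concept identifiers C1 and C2 and the records are hypothetical and not taken from any real thesaurus.

```python
# Hypothetical monolingual thesauri mapping terms to shared concept identifiers
THESAURI = {
    "en": {"unemployment": "C1", "housing": "C2"},
    "nl": {"werkloosheid": "C1", "huisvesting": "C2"},
    "de": {"arbeitslosigkeit": "C1", "wohnungswesen": "C2"},
}

# Invented records, indexed with concept identifiers rather than words
RECORDS = [
    {"title": "Rapport A", "lang": "nl", "concepts": ["C1"]},
    {"title": "Bericht B", "lang": "de", "concepts": ["C1", "C2"]},
    {"title": "Report C", "lang": "en", "concepts": ["C2"]},
]

def translate_term(term, source_lang):
    """Return equivalent terms in the other languages via the shared concept."""
    concept = THESAURI.get(source_lang, {}).get(term.lower())
    equivalents = {}
    if concept is None:
        return equivalents
    for lang, terms in THESAURI.items():
        if lang == source_lang:
            continue
        for candidate, concept_id in terms.items():
            if concept_id == concept:
                equivalents[lang] = candidate
    return equivalents

def search(records, term, query_lang):
    """Find records in any language indexed with the concept behind the term."""
    concept = THESAURI.get(query_lang, {}).get(term.lower())
    if concept is None:
        return []
    return [r["title"] for r in records if concept in r["concepts"]]

print(translate_term("unemployment", "en"))
print(search(RECORDS, "unemployment", "en"))
```

An English query thus retrieves the Dutch and German records without any translation of the documents themselves.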

Cross reference
Subject indexing and classification

  . Tips

EuroWordNet

This project, which ran until June 1999, aimed to develop a general purpose multilingual ontology: a multilingual database which represents basic semantic relations between words in various European languages, with Princeton WordNet 1.5 as its starting point. The basic principle is the construction of monolingual wordnets, which maintain language-specific differences and are mapped to a common top-ontology.


Although some gateways use thesauri for subject access (OMNI) or to provide the user with additional assistance in the choice of search terms (SOSIG), little or no use has been made by gateways of the potential of using a thesaurus for multilingual retrieval.

3. Classification schemes

If resources are classified using the numerical code from a classification scheme which is available in more than one language, this enables language-independent searching as well as the possibility of offering a browsing structure in more than one language.
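
A minimal sketch of this idea follows. The class codes, captions and records are invented for illustration and are not taken from any real classification scheme; the function name browse is likewise an assumption.

```python
# Hypothetical class codes with captions in two languages
CAPTIONS = {
    "A": {"en": "Economics", "nl": "Economie"},
    "B": {"en": "Law", "nl": "Recht"},
}

# Records store only the language-independent code
RECORDS = [
    {"title": "Resource 1", "class": "A"},
    {"title": "Resource 2", "class": "B"},
    {"title": "Resource 3", "class": "A"},
]

def browse(records, lang):
    """Build a browsing structure: caption in the chosen language -> titles."""
    tree = {}
    for record in records:
        code = record["class"]
        # Fall back to the bare code when no caption exists in this language
        caption = CAPTIONS.get(code, {}).get(lang, code)
        tree.setdefault(caption, []).append(record["title"])
    return tree

print(browse(RECORDS, "en"))
print(browse(RECORDS, "nl"))
```

Adding a further interface language then only requires a new set of captions; the records themselves stay untouched.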

E X A M P L E
  • DutchESS offers a browsing structure based on the Nederlandse Basisclassificatie which is available in Dutch and English. A (slightly different) German translation of the same scheme is also available, which would make it easy to add a German interface in the future
  • Jyväskylä Virtual Library offers a browsing structure in Finnish and English (this does not apply to all sections of the distributed Finnish Virtual Library of which the Jyväskylä Virtual Library forms a part)

When choosing a classification scheme for your service, consider:

  • in which languages the classification scheme is available
  • whether it would be feasible to translate the scheme into another language in which it is not currently available but which you require for your service

4. Keywords

Keywords may be added to the resource description in any language. In this case also a consistent policy may enhance retrieval possibilities. A number of options are possible:

  • add keywords in the (primary) language of the service (user interface)
  • add keywords in the language of the document
  • add keywords in English as the Internet 'lingua franca'
  • add keywords in a number of languages

Keywords may be chosen from an uncontrolled keyword list or from a controlled vocabulary; when available in more than one language this will provide opportunities for searching documents in various languages by means of a query in one language. The user should be made aware of the available options.

Cross reference
Subject indexing and classification


5. The user interface
 

A monolingual user interface will probably be in the language of your primary audience or in a language familiar to a broad audience, such as English. The advantage of this is that it will require less effort to maintain, but you will exclude users who are not familiar with your chosen language. In the case of an academic audience, you may usually assume a certain proficiency in English, but a broader audience may not have those language skills. If the interface is in the national language only, this means that you narrow your target audience to one language community, dependent on the number of native speakers and others with a certain level of proficiency in that language.

Providing an interface in more than one language means that you will reach a broader audience, but you will have to put more effort in maintaining your service.

The target audience that you wish to serve will be of major importance when choosing the interface language(s). Another issue to consider is whether you are willing and able to match your multilingual interface with multilingual search support. For instance, if you provide a browsing structure based on a classification scheme which is available in one language only, do you want to put effort into translating the scheme into another language used in your interface?

In general users should be made aware of the consequences of the way they formulate their queries. This is easier said than done, if you want to avoid extensive help files or cluttered interfaces. For example: a simple query (all fields) in French may retrieve a document with the specified word in the title, but it will not result in any hits in the description field, if the language used for the description is English. As is well known, users are not very keen on reading help pages, so the search interface design should aim to present the language options in a clear and intuitive way.

Cross reference
User interface design

  . .   R E M E M B E R
  • the expected language skills of your audience; do you aim to address a well defined language community or do you wish to provide for a broader audience?
  • do you have staff with the necessary skills to translate the interface pages, or are you prepared to meet the extra cost of third party assistance (translation service)
  • are you willing and able to invest in extra creation and maintenance effort for your interface?
  • are you willing and able to match your multilingual interface with multilingual browsing and/or search support?

General conclusions
 

Multilinguality is a complex issue. Although a lot of technology has become available in recent years, many problems have yet to be solved. In most cases gateways will not be able to provide more than very basic facilities if they need to keep costs within acceptable limits. However, from the above it may be clear that putting some effort into making consistent choices - based on user needs - concerning such issues as scope and selection policy, metadata and cataloguing, classification and subject indexing, as well as regarding the use of the appropriate technologies, may enhance the language support you will be able to provide in your service; it will allow you to project a clearer picture to your users of what your gateway is about. Any extra facilities will have their costs, though, in terms of extra initial effort, maintenance, required skills of staff and so on, and it is up to you to decide whether user benefits outweigh necessary efforts to provide them.

General recommendations

  • try to obtain knowledge about the language skills and needs of your audience
  • aim at an integrated and consistent approach to language issues for your gateway. Examples:
    • when your documents are in Danish only, it is probably not worthwhile to provide your users with a bilingual Danish/English interface
    • if you are not going to provide any multilingual search support, should you put effort into a bilingual or multilingual user interface?
    • if your cataloguing system can't handle Japanese, shouldn't you exclude documents in this language from the scope of your service?
    • consider the language skills of the staff responsible for selection and cataloguing when you develop the scope and selection policy of your service.
  • try to balance requirements of effort against expected results and benefits of multilingual support for your users
  • provide your users with information about your language policy, and integrate language-related search options into your query interface design in a clear and unambiguous way
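The recommendation about cataloguing systems that cannot handle a given language can be checked mechanically before a scope policy is fixed. The following is a minimal sketch, assuming the cataloguing system stores text in a single character set such as ISO 8859-1; the function name and the sample titles are hypothetical.

```python
def can_store(text, charset):
    """Return True if every character in text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample titles, keyed by language code.
titles = {
    "da": "Søgning i danske ressourcer",
    "ja": "日本語の資料",
}

# Which titles survive storage in a Latin-1 based system, and which in UTF-8?
latin1_ok = {lang: can_store(title, "iso-8859-1") for lang, title in titles.items()}
utf8_ok = {lang: can_store(title, "utf-8") for lang, title in titles.items()}
```

Under these assumptions the Danish title fits in ISO 8859-1 but the Japanese one does not, while both can be stored as UTF-8. A check of this kind only covers character storage; it says nothing about whether staff can select and catalogue resources in that language.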

Glossary
 

CEN - European Committee for Standardisation
CLIR - Cross Language Information Retrieval
CTE - Content Transfer Encoding
DC - Dublin Core
DutchESS - Dutch Electronic Subject Service
IAB - Internet Architecture Board (formerly Internet Activities Board)
IETF - Internet Engineering Task Force
ISO - International Organization for Standardization
MARC - MAchine Readable Cataloguing. A family of formats based on ISO 2709 for the exchange of bibliographic and other related information in machine readable form.
MIME - Multipurpose Internet Mail Extensions
OMNI - Organising Medical Networked Information (Medical gateway in the UK)
POSIX - Portable Operating System Interface
SBCS - Single-Byte Character Set
SOSIG - The Social Science Information Gateway
Unicode - A universal character encoding for the scripts of the world's principal languages, originally designed as a 16-bit code
UCS - Universal Character Set
UTF - UCS Transformation Format. A family of encodings for ISO 10646 or Unicode
XML - Extensible Markup Language. A lightweight version of SGML designed for use on the Internet.


References
 

DutchESS, http://www.konbib.nl/dutchess/

EuroWordNet, http://www.hum.uva.nl/~ewn/

Jyväskylä Virtual Library, http://www.jyu.fi/library/virtuaalikirjasto/engroads.htm

SOSIG, http://www.sosig.ac.uk/

Unicode Consortium, http://www.unicode.org

H. Alvestrand, RFC 1766, Tags for the Identification of Languages (UNINETT, March 1995).
ftp://ftp.isi.edu/in-notes/rfc1766.txt

G. Clavel et al., CoBRA+ working group on multilingual subject access : Final report (Bern, 9th March 1999).
http://www.bl.uk/information/finrap3.html

Y. Demchenko, i18n and multilingual support in Internet mail Standards. Overview.
http://www.terena.nl/multiling/

Encoding Dublin Core Metadata in HTML (Internet Draft).
http://www.ietf.org/internet-drafts/draft-kunze-dchtml-01.txt

Extensible Markup Language (XML) 1.0 (W3C Recommendation, 10 February 1998).
http://www.w3.org/TR/1998/REC-xml-19980210

The ISO 8859 Character Sets
http://www.terena.nl/multiling/ml-docs/iso-8859.html

ISO 639, 'Code for the representation of names of languages'.

ISO/IEC 10646-1:1993(E), 'Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane' JTC1/SC2 (1993).

J. Knight, Internationalization in the DESIRE project
http://www.roads.lut.ac.uk/DESIRE/DesireI18N.html

D. W. Oard, 'Serving Users in Many Languages : Cross-Language Information Retrieval for Digital Libraries', D-Lib Magazine (December 1997).
http://www.dlib.org/dlib/december97/oard/12oard.html

D. W. Oard, Cross-Language Information Retrieval Resources (Overview).
http://www.ee.umd.edu/medlab/mlir/

C. Peters & E. Picchi, 'Across Languages, Across Cultures : Issues in Multilinguality and Digital Libraries', D-Lib Magazine (May 1997).
http://www.dlib.org/dlib/may97/peters/05peters.html

RFC 2413. Dublin Core Metadata for Resource Discovery
http://www.ietf.org/rfc/rfc2413.txt

RFC 2616. Hypertext Transfer Protocol -- HTTP/1.1
http://www.ietf.org/rfc/rfc2616.txt

The Unicode standard, version 2.0 (Unicode Consortium. Reading, Mass.: Addison-Wesley Developers Press, 1996).

C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin & P. Svanberg, RFC 2130 - Report from the IAB Character Set Workshop (April 1997).
ftp://ftp.isi.edu/in-notes/rfc2130.txt

E. Worsfold et al., Developing multilingual subject gateways (An issues paper written as part of the DESIRE Cataloguing Project)
http://www.sosig.ac.uk/desire/lang/language.html

F. Yergeau, RFC 2279 - UTF-8, a Transformation Format of Unicode and ISO 10646 (January 1998)
ftp://ftp.isi.edu/in-notes/rfc2279.txt


Credits
 

Chapter authors: Yuri Demchenko, Marianne Peereboom

Last updated : 26 April 00
© 1999-2000 DESIRE