The paper highlights several
points where information retrieval could benefit from human
language technology. Although there is an increasing quantity
of multimedia content on the web, the majority of information
is still encoded in natural language(s). Since the global trend
in web language usage is toward publishing in native languages,
web pages in English no longer represent the
majority of web-text. Retrieving information from non-English
pages requires several tools, each of which has to deal with a
particular human language. These tools can be borrowed from the field of
human language technology (HLT) and applied to web-text. Some
areas of HLT that could be used in information retrieval
from web-text are:
- morphological processing: should be able to cope with different
word-forms in a particular language in order to make language-specific
full-text search available
- named entity recognition: should make it possible to retrieve
information on concepts which have fixed names/titles/formulas
(such as personal/institutional/geographical names, temporal expressions,
etc.)
- semantic thesauruses: should make it possible to retrieve information
on the basis of synonymy/proximity of meaning
- machine(-aided) translation: should provide (rough) translations
of web-text.
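
As a toy illustration of the first and third areas above, the following
Python sketch indexes documents by lemma rather than by surface word-form
and optionally expands a query with synonyms. The small lemma and synonym
tables and all function names are hypothetical and hand-made for this
example; a real system would presumably rely on a morphological analyser
and a thesaurus built for the language in question.

```python
from collections import defaultdict

# Hypothetical, hand-made resources for illustration only.
LEMMAS = {"cities": "city", "city": "city", "towns": "town", "town": "town"}
SYNONYMS = {"city": {"town"}, "town": {"city"}}


def lemma(word):
    """Map a word-form to its lemma; unknown forms map to themselves."""
    w = word.lower()
    return LEMMAS.get(w, w)


def build_index(docs):
    """Build an inverted index from lemma to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemma(token)].add(doc_id)
    return index


def search(index, query, use_synonyms=False):
    """Return ids of documents matching any query lemma or, optionally, its synonyms."""
    hits = set()
    for token in query.split():
        base = lemma(token)
        terms = {base}
        if use_synonyms:
            terms |= SYNONYMS.get(base, set())
        for term in terms:
            hits |= index.get(term, set())
    return hits


docs = {1: "old cities of Europe", 2: "a small town", 3: "river valleys"}
index = build_index(docs)
```

With these assumed tables, a query for "city" matches document 1 even
though the text contains only the form "cities", and switching on synonym
expansion additionally retrieves document 2.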