DESIRE Information Gateways Handbook
3.4. Harvesting, indexing and automated metadata collection

In this chapter...
 
  • The technical aspects behind automatic collection of Internet resource descriptions and how to make good use of the results
  • The software used by the DESIRE II project is reviewed - possibilities and limitations
  • Try it for yourself: set up a Harvested Information Gateway! We'll show you how to do it
Introduction
 

This chapter provides a starting point for technical specialists who are considering using harvesting, indexing and automated metadata collection within their information gateway. An information gateway which works like this consists of three separate mechanisms:

  • A robot which collects resource descriptions from the Web according to a set of rules. Care must be taken to ensure that the robot detects and saves any metadata provided within the resource. NetLab develops and maintains a Web harvesting system called Combine.
  • The collected resources must be indexed and made available using a server that can process queries and requests for information retrieval. DESIRE II uses the Zebra search engine from Indexdata which implements the ANSI/NISO Z39.50 search and retrieval protocol.
  • Finally, the indexed resources hosted by the server must be made readily available to the end-users. We therefore need a Web interface that can communicate with the server, i.e. one compliant with the ANSI/NISO Z39.50 protocol, and that can respond to end-users' requests. A few gateways with such an interface exist. We will use the Europagate service provided by dtv.

The main software components used in the DESIRE II project are reviewed. The rest of this chapter describes how to glue the different pieces together into a running environment that can accommodate further development.


Background
 

The core function of an information gateway is to make bibliographic records available for advanced searching. The ANSI/NISO Z39.50 protocol is specifically designed to support very detailed request and retrieval sessions, which is why the DESIRE project uses the Zebra server software, an implementation of that very protocol. Since ANSI/NISO Z39.50 is not widely supported (none of the major Web browsers provides a client) we need to use a gateway. The gateway's main function is to channel requests passed via HTTP to a Z39.50 server and return an appropriate response. It also has to keep track of all the different sessions for all users who access the gateway. Finally, we obviously need a robot to collect the Web resources in the first place. There are many robots available, but we need one that can deal with our particular interest in metadata as well as our need to adjust robot output so that it is easily digestible by the Zebra server. Combine fulfils both these requirements.


Harvesting and Combine
 

The harvesting metaphor was coined because of the strong similarities between the automated collection of Web resources and real-world harvesting. Both of these tasks raise three key issues:

  1. What sort of crop are we interested in and where do we find it?
  2. How do we harvest?
  3. Can we keep the weeds out?

The first question is concerned with how best to discover Internet resources and is primarily a matter of manual selection. Those aspects are described in a separate chapter.

Cross reference
Resource discovery

It does, however, highlight an important problem that begs for computerised support. A harvester works very well on a field of corn but it performs poorly in other contexts, for instance when we're looking for rare mushrooms in a forest. We simply cannot take everything and then sift the mushrooms from the wood, grass and pebbles. A similar line of reasoning applies to a Web robot. It would be a huge waste of time and resources to make a robot crawl around the entire .com domain in order to harvest any page concerning the sale of fountain pens. While it is possible to employ subject specialists to detect valuable Web resources and librarians to catalogue them, such an approach is relatively expensive. For this reason it is tempting to design a Web robot that, when given a promising starting point, is able to select which trails to follow.

E X A M P L E

EELS and All Engineering

An interesting attempt to address these matters has been made within the DESIRE II project. Read about EELS and All Engineering.


The last two questions are easier to approach from the point of view of an information analyst who wishes to design a Web robot so we'll dispense with the agronomics. Instead we shall turn our attention to how the Combine system is designed to serve as an integral part of an information gateway. Combine is an open, metadata-aware system for distributed, collaborative Web indexing and it is freely available. It consists of a scheduler, a couple of robots, and receivers that process and store robot output.

Description of Combine
  1. The scheduler is loaded with a set of nodes called JCFs, each of which contains a URL and some meta information. Depending on a configurable set of internal rules, the scheduler selects the next URL to be processed and launches a robot (harvester).
  2. The robot visits its target server and retrieves data. It is designed to be very polite and well mannered towards the targeted server in order to keep its administrator happy. Data is delivered via a receiver (rd) and written to a depot (hrf) where the parsers can access it.
  3. The parsers are able to detect metadata as well as metadata formats such as Dublin Core. The parsers mark up all detected metadata and hyperlinks in accordance with a special format. Parser output is stored in a tree-like manner directly on the filesystem under the hdb directory (a sketch of what this tree can look like is given below). Hyperlinks that constitute complete URLs can be recycled, thus allowing recursive harvesting of a Web site.
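As a rough illustration only (the directory and file names below are invented; the real layout is generated by Combine and will differ on your system), the hdb tree might look something like this after a small run:

    COMBINE/hdb/
      00/
        0a1b2c3d.rec    # one parsed record per harvested page, with metadata and links marked up
      01/
        4e5f6a7b.rec
      ...

Each '.rec' file holds the parser output for one Web page and is what Zebra will later index.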

You are strongly recommended to visit the Combine home page http://www.lub.lu.se/combine to get a general overview before trying to install and run Combine. Note that some information on the Combine home page may be a bit out of date.

Installing and running Combine

Before you start, make sure you have:

  • a system running your favourite UNIX flavour. Combine has been successfully installed under various versions of Linux and Solaris 2.5 and higher
  • Perl version 5.003 or higher, including the MD5 package
  • gcc 2.7.x or higher, complete with g++ front end and C++ libraries
  • the Berkeley DB system; fetch and install the latest stable version from Sleepycat Software
  • a decent version of make, preferably GNU's
  • created a top level directory within which everything will be built. Call it, for instance, DESIRE2. (A quick way to check these prerequisites from the shell is sketched below.)
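One quick, non-authoritative way to check most of these prerequisites from the shell (assuming GNU-style tools; the exact output will of course vary):

    perl -v                 # should report version 5.003 or later
    perl -MMD5 -e 1         # exits silently if the MD5 package is installed
    gcc --version           # 2.7.x or later
    g++ --version           # the C++ front end must be present
    make --version          # GNU make identifies itself here

    mkdir DESIRE2           # the top level directory everything will be built under
    cd DESIRE2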

Installation

  1. Fetch the latest stable distribution from the Combine home page.
  2. Unpack the tarball; type 'tar xzvf combine-???.src.tgz'.
  3. Enter the unpacked directory, from now on referred to as 'combine-src/'. Type 'cd combine-src/'.
  4. Edit the Makefile. Most users will only need to make three changes (a sketch of a typical edit follows after this list):
      a) Set 'HOME_ALL' to indicate where to build Combine. Make sure that the directory exists. The build directory will be referred to as 'COMBINE/'.
      b) Set 'DB' to the directory where your Berkeley DB system is located.
      c) Uncomment any line concerning your OS under the platform specific section.
  5. Type 'make; make install'.
  6. Everything should go smoothly but don't hesitate to use the mailing list if you have any trouble installing the Combine software.
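As an example only (the paths below are invented and the exact variable lines in your distribution may differ slightly), the three changes from step 4 could end up looking like this, followed by the build itself:

    # in combine-src/Makefile
    HOME_ALL = /home/you/DESIRE2/COMBINE    # where Combine will be built; create this directory first
    DB       = /usr/local/BerkeleyDB        # where your Berkeley DB system is installed

    # platform specific section: uncomment the lines for your OS, e.g. Linux or Solaris

    make; make install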

Configuration

  1. Create a file, say, 'starturls.txt' in your 'COMBINE/etc' directory. Put the URLs you wish to harvest on separate lines in 'starturls.txt'. Remember, Combine supports recursive harvesting so you don't need to provide URLs to all individual pages on a domain.
  2. The Combine system's ability to recursively harvest a Web site poses a problem. We may very well want to restrict our search for Web resources to a specific host or domain or similar. To do this, edit the 'config_allow' and 'config_exclude' files in 'COMBINE/etc/'. The files are configured by means of regular expressions similar to Perl's and they contain a few typical examples.
  3. Edit the file 'COMBINE/etc/combine.conf' and provide the necessary information.
  4. Browse the 'COMBINE/etc/config_binext' and 'COMBINE/etc/config_parsable' files. (An illustrative set of start URLs and filter patterns is sketched below.)
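To make this concrete, here is a purely illustrative example for a hypothetical site www.example.org. The exact file syntax for the allow and exclude rules is shown by the examples already shipped in 'COMBINE/etc/', so treat the patterns below as the kind of Perl-style expressions you would adapt, not as literal file contents:

    # COMBINE/etc/starturls.txt -- one start URL per line
    http://www.example.org/
    http://www.example.org/library/

    # a pattern of the kind used in config_allow, keeping the robot on one host
    ^http://www\.example\.org/

    # a pattern of the kind used in config_exclude, skipping obvious binaries
    \.(gif|jpeg|zip|tar|gz)$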

Running Combine

Note that this example is intended to show what a Combine session looks like and is therefore run by hand.

  1. Type 'cd COMBINE/' since some scripts depend on being run from that directory.
  2. Type 'bin/start-cabin'.
  3. Type 'bin/start-hdb 2', where '2' tells Combine that we want two parsers.
  4. Type 'bin/start-harvester-local 4' to start four harvesters, twice as many as parsers.
  5. Prepare the scheduler. Type 'bin/sd-ctrl.pl open; bin/sd-ctrl.pl pause'.
  6. We're all fired up and ready to feed Combine with input. This is done by piping our URLs in 'COMBINE/etc/starturls.txt' through a set of filters:
      a) The first filter 'bin/selurl.pl' applies the rules in 'config_allow' and 'config_exclude' and it can be omitted.
      b) 'jcf' stands for job control format and it is Combine's internal representation of a URL. Since all URLs must be formatted this way, the filter 'bin/jcf-builder-uniq.pl' is useful.
      c) Finally, we load our jcfs into the scheduler with 'bin/sd-load.pl'.
  7. Let's put it all together:
    'cat etc/starturls.txt | bin/selurl.pl | bin/jcf-builder-uniq.pl | bin/sd-load.pl'
    Note: Only 'bin/sd-load.pl' affects the state of Combine, so don't be afraid to experiment with the others.
  8. Launch Combine with 'bin/await-harvest.pl 1'. The complete session is summarised below.
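Putting the whole session together, using only the commands from the steps above:

    cd COMBINE/
    bin/start-cabin
    bin/start-hdb 2                   # two parsers
    bin/start-harvester-local 4       # four harvesters
    bin/sd-ctrl.pl open; bin/sd-ctrl.pl pause

    # filter the start URLs, convert them to jcfs and load them into the scheduler
    cat etc/starturls.txt | bin/selurl.pl | bin/jcf-builder-uniq.pl | bin/sd-load.pl

    # launch the harvesting run
    bin/await-harvest.pl 1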

Now what?

If everything went fine, there should be a file entry with a 'rec' suffix for each harvested Web page under the 'COMBINE/hdb/' directory. Take some time to browse the directories to see what has happened during your first Combine session. In order to harvest all the interesting links that resulted from this session, simply type:

'bin/new-url.pl | bin/selurl.pl | bin/jcf-builder-uniq.pl | bin/sd-load.pl'
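If you want to follow links several levels deep, this recycling step can simply be repeated. The sketch below assumes, as in step 8 above, that 'bin/await-harvest.pl 1' is used to run each round to completion; the three rounds are an arbitrary choice:

    # illustrative only: recycle newly found links for three further rounds
    for round in 1 2 3
    do
        bin/new-url.pl | bin/selurl.pl | bin/jcf-builder-uniq.pl | bin/sd-load.pl
        bin/await-harvest.pl 1
    done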

People who are more interested in getting things done than in low-level Combine details may irritably ask: 'Isn't there a high-level interface to all this?' Fortunately there is. Browse the HTML document cje/cje.html and find out how to install and run the Combine Job Editor. Note that you need a Web server to take full advantage of this package.

Zebra and Z39.50

Zebra is an indexing system and a retrieval engine attached to a Z39.50 server. The following introduction to Z39.50 comes from a document at Indexdata describing Zebra.

The ANSI/NISO Z39.50-1995 standard presents a model of a very flexible, general-purpose information management and retrieval system. The intent is that this model should be placed 'in front' of new and existing information systems, to provide a uniform interface to client applications. This in turn provides the user with a number of benefits, including a uniform interface to many different kinds of information sources - hopefully tailored exactly to his specific needs by the provider of the client software. Z39.50 allows many different systems to look the same to the individual user, and it allows the individual information system to appear in many different forms, to suit the varying preferences and requirements of the users.

The quotation above gives an idea of what we are after: Zebra will index, and answer Z39.50 queries about, the material that Combine has just fetched from the Web.

Installing and running Zebra

Installation

  1. Get zebra and yaz from Indexdata.
  2. Unpack the tarballs from the DESIRE2 directory.
  3. Installation is simple. Enter each directory and type './configure; make'. Make sure that you build yaz before you build zebra (a typical build session is sketched after this list).
  4. Check your zebra/index/ directory for two executables: zebraidx and zebrasrv.
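A typical build session might look as follows. The tarball names are placeholders for whatever versions you downloaded, and the unpacked directories are assumed to be called yaz/ and zebra/ as in the steps above (newer tarballs may unpack into version-numbered directories):

    cd DESIRE2
    tar xzvf yaz-*.tar.gz       # unpack both distributions
    tar xzvf zebra-*.tar.gz

    cd yaz                      # build yaz first
    ./configure; make
    cd ../zebra                 # then zebra
    ./configure; make
    cd ..

    ls zebra/index/zebraidx zebra/index/zebrasrv    # both executables should now be present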

Configuring and running Zebra

  1. Download the configuration files and unpack them with 'tar xzvf zcfg.tgz'. Enter the new directory zebraindex.
  2. Create a link to the data collected by Combine. Type 'ln -s ../COMBINE/hdb hdb'.
  3. Browse the configuration in zebra.cfg and check all paths. Try to create an index by typing:
    '../zebra/index/zebraidx -c zebra.cfg -g index update hdb >&! index.log'
  4. Start the zebra server (a quick way to test it is sketched below). Type:
    '../zebra/index/zebrasrv -c zebra.cfg tcp:host.domain:1101 &'
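Once zebrasrv is running you can do a quick sanity check with the command line client that comes with YAZ; in current YAZ releases it is called yaz-client, though its name and location may differ in older versions. The database name 'Default' is only Zebra's usual default and may be overridden by your zebra.cfg, and 'harvesting' is just an example search term:

    yaz-client tcp:host.domain:1101    # connect to the server started above

    # then, at the Z> prompt, for example:
    #   base Default
    #   find harvesting
    #   show 1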

The Europa Gateway

Now is a good time to think about how to make our data publicly available. Since none of the most common Web browsers supplies a Z39.50 client, we need a Web interface through which our installation can be queried with HTTP requests. Visit http://europagate.dtv.dk/cgi-bin/egwcgi/80442/tform.egw and complete the first three fields of the form. Leave the others at their default values. Press 'submit'. Now search using the nickname that you just gave your new server. Enjoy!


Core skills
 

Anyone interested in setting up a vanilla-flavoured information gateway should be familiar with UNIX and its development environment in general. Knowledge of Perl-style regular expressions will make things a bit simpler. Programming skills and fluency in Perl are necessary for configuring an information gateway to fit a specific purpose, tuning performance and so on.


Staff effort
 

Anyone who has the core skills listed above will be able to set up and configure a first gateway in under a week. With some experience it could be done in two hours. Experience shows that the maintenance of a gateway takes about four hours a week.



Credits
 

Chapter author: Fredrick Rybarczyk

With contributions from: Andy Powell, Jasper Tredgold

