Search and Catalogue Web agents
This project concerns the design of two agents, a journalist
and a cataloguer: autonomous tools devoted to retrieving, cataloguing and
storing web information in databases.
Working specification
Introduction
The search for information on the World Wide Web is becoming very popular.
The web provides fast and easy access to a huge volume of continuously
growing information. Publishing on the web is easy, and a wealth of organizations
and people are contributing to its growth. Information is typically
scattered piecewise over many locations and in many different formats, and
it changes continuously. Searching for information has become a difficult
task, and automatic search engines such as AltaVista, HotBot or MetaCrawler
have become a point of reference for looking up information
on the web. The press is also present on the web, and the net surfer
encounters many initiatives by different newspapers. Communities
which share the same interests, for instance computer scientists or physicians,
benefit from their own specialized on-line publications. Strides
are being made towards the personalization of newspapers, so that users
can access their customized version of the news. Customized news is built
up for them by servers fed with their profiles. However there remains the
problem of interfacing with multiple sources (different newspapers) in
different formats. There is a need for automatic tools which can, on behalf
of the user:
- retrieve information from different sites
- catalogue the information
- store it in databases, to ease later recovery and browsing.
These automatic tools (agents) should be able to learn of new sources (sites)
of information and acquire new rules for cataloguing information.
One example of the application of these tools is a group of medical
doctors for whom one or more journalist agents retrieve information
from specialised sites. A catalogue agent classifies and structures the
information in order to facilitate later queries to the database. Such
queries ask for references and details pertaining to diseases, new treatments,
where they have been applied, and with what results.
Objectives
The main goal of this work is to develop agents with the following capabilities:
- cooperation with other agents
- searching for and retrieving information
- filtering and classifying information
- converting the information to different formats
- storing the formatted information in a database
Description of the agents
In the project two kinds of agents will be developed: journalist agents
and catalogue agents.
Journalist agents are responsible for retrieving information.
They maintain tables with the addresses (URLs) of previously identified
publications of interest, their periodicity and relevant abstract layout
information. They have a schedule of virtual visits and maintain logs of
each visit. They should have a knowledge base with rules for scheduling
visits and incorporating new URLs into their tables. They should be able
to learn and improve the schedule of visits, and discover which are the
best sites from which to retrieve information. They should be able to interact
with other agents in order to cooperate and to compete with them (weighting
and filtering the information received from other agents according
to their own criteria - what is of interest to an American agent may not
be of equal interest to a Spanish one!).
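The address table and visit schedule described above can be pictured with a minimal Python sketch. The specification does not fix any data model, so the names `SourceEntry`, `JournalistTable`, `due` and `record_visit`, and the choice of a time interval for the periodicity, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class SourceEntry:
    """One row of the address table (hypothetical layout)."""
    url: str
    periodicity: timedelta               # assumed: expected interval between issues
    schema: str                          # assumed: label for the abstract layout
    last_visit: Optional[datetime] = None


@dataclass
class JournalistTable:
    """Address table plus visit log kept by one journalist agent."""
    entries: Dict[str, SourceEntry] = field(default_factory=dict)
    log: List[Tuple[datetime, str]] = field(default_factory=list)

    def add(self, entry: SourceEntry) -> None:
        # Adding an entry with a known URL overwrites the previous row.
        self.entries[entry.url] = entry

    def due(self, now: datetime) -> List[str]:
        """URLs never visited, or whose periodicity has elapsed since the last visit."""
        return sorted(
            e.url
            for e in self.entries.values()
            if e.last_visit is None or now - e.last_visit >= e.periodicity
        )

    def record_visit(self, url: str, when: datetime) -> None:
        # Update the table and append to the log of visits.
        self.entries[url].last_visit = when
        self.log.append((when, url))
```

A rule-based or learning scheduler, as the specification asks for, would replace the fixed comparison in `due` with whatever policy the design team chooses.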
The journalist agents should implement two interfaces:
- Consult Interface (CI)
  The agents should implement procedures which present their tables of
  addresses (URL, periodicity, schema), the plan (and logs) of visits, and
  the place where their information is stored.
- Administering Interface (AI)
  The agents should implement procedures to modify, augment and
  remove the table of addresses (URL, periodicity, schema), the plan (and
  logs) of visits, and the place where the information is stored.
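One possible way to keep the two journalist interfaces separate is to declare each as an abstract class and let an agent implement both. The sketch below is only an assumption about how this might look: the method names, the `Address` tuple encoding and the `InMemoryJournalist` class are hypothetical, and only the address table and storage location are covered, not the editing of plans and logs.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Assumed encoding of one table row: (URL, periodicity in hours, schema name).
Address = Tuple[str, int, str]


class ConsultInterface(ABC):
    """Read-only view of a journalist agent (CI)."""

    @abstractmethod
    def addresses(self) -> List[Address]: ...

    @abstractmethod
    def visit_plan(self) -> List[str]: ...

    @abstractmethod
    def storage_location(self) -> str: ...


class AdministeringInterface(ABC):
    """Operations that change a journalist agent (AI)."""

    @abstractmethod
    def put_address(self, address: Address) -> None: ...

    @abstractmethod
    def remove_address(self, url: str) -> None: ...

    @abstractmethod
    def set_storage_location(self, location: str) -> None: ...


class InMemoryJournalist(ConsultInterface, AdministeringInterface):
    """Toy agent that keeps everything in memory."""

    def __init__(self, storage: str) -> None:
        self._addresses: Dict[str, Address] = {}
        self._storage = storage

    def addresses(self) -> List[Address]:
        return list(self._addresses.values())

    def visit_plan(self) -> List[str]:
        # Naive plan: visit the known URLs in alphabetical order.
        return sorted(self._addresses)

    def storage_location(self) -> str:
        return self._storage

    def put_address(self, address: Address) -> None:
        # Adds a new row or modifies the row with the same URL.
        self._addresses[address[0]] = address

    def remove_address(self, url: str) -> None:
        self._addresses.pop(url, None)

    def set_storage_location(self, location: str) -> None:
        self._storage = location
```

Splitting the interfaces this way lets a client that only needs to consult an agent depend on `ConsultInterface` alone.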
Catalogue agents are responsible for classifying information. They
should have an API which enables them to connect to different databases.
They can have different Document Type Definitions (DTDs) for different
areas of interest, so that each area (for example: medicine) gets its own
syntax (expressed by a DTD) and labelled concepts. The catalogue agents
get the raw information to be classified from the journalist agents. The
catalogue agents could use different algorithms through an API which enables
the incorporation of new algorithms. They should abstract new relations
so as to catalogue the information better. Ideally, the catalogued
information would be independent of the language.
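The idea of incorporating new classification algorithms through an API can be illustrated with a small registry. This is a sketch under stated assumptions: the specification names no concrete API, so `CatalogueAgent`, `register`, `unregister`, `classify` and the toy `keyword_classifier` are all hypothetical.

```python
from typing import Callable, Dict

# Assumed shape of an algorithm: raw text in, category label out.
Classifier = Callable[[str], str]


class CatalogueAgent:
    """Holds a set of named, interchangeable classification algorithms."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, Classifier] = {}

    def register(self, name: str, algorithm: Classifier) -> None:
        # Incorporate a new algorithm, or replace one with the same name.
        self._algorithms[name] = algorithm

    def unregister(self, name: str) -> None:
        self._algorithms.pop(name, None)

    def classify(self, name: str, text: str) -> str:
        # Raises KeyError if no algorithm is registered under this name.
        return self._algorithms[name](text)


def keyword_classifier(text: str) -> str:
    """Toy algorithm: label a text by the first matching keyword."""
    words = text.lower().split()
    if "treatment" in words or "therapy" in words:
        return "treatment"
    if "disease" in words:
        return "disease"
    return "unclassified"
```

For example, after `agent.register("keywords", keyword_classifier)`, the call `agent.classify("keywords", "A new treatment for asthma")` returns `"treatment"`.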
The catalogue agents should implement two interfaces:
- Interface for Queries (IQ)
  The agents should implement procedures which report statistics of
  classification (relations among words learnt, algorithms used, etc.), the
  databases in which they are storing information, the DTDs they are using,
  and the places and agents they are getting information from.
- Administrative Interface (AI)
  The agents should implement procedures to modify, augment and remove
  the algorithms they use; the relations among words; the databases that
  store the information; the DTDs, and the places and agents they are getting
  information from.
Both kinds of agents should cooperate in both directions, i.e. the catalogue
agents get information from the journalist agents, and the journalist agents
can be told by the catalogue agents how useful the retrieved information
was. A logical consequence of this cooperation is the specialization
of pairs of agents with respect to the type of information they are dealing
with (searching and classifying).
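The feedback direction, from catalogue agent to journalist agent, could be sketched as a running usefulness score per site. The specification does not say how usefulness is measured, so the score range of 0 to 1, the averaging rule and the names `SiteRanking`, `report` and `best_first` are assumptions for illustration only.

```python
from typing import Dict, List


class SiteRanking:
    """Average usefulness per site, fed by reports from a catalogue agent."""

    def __init__(self) -> None:
        self._total: Dict[str, float] = {}
        self._count: Dict[str, int] = {}

    def report(self, url: str, usefulness: float) -> None:
        # Assumed convention: usefulness lies between 0 (useless) and 1 (very useful).
        self._total[url] = self._total.get(url, 0.0) + usefulness
        self._count[url] = self._count.get(url, 0) + 1

    def average(self, url: str) -> float:
        n = self._count.get(url, 0)
        return self._total.get(url, 0.0) / n if n else 0.0

    def best_first(self) -> List[str]:
        # Highest average first; ties are broken alphabetically by URL.
        return sorted(self._count, key=lambda u: (-self.average(u), u))
```

A journalist agent could consult `best_first()` when revising its schedule of visits, which is one way the pairing of agents might lead to specialization.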
Technologies involved
The main technologies we assume will be used in the design are the following:
Open Issues
This specification is neither formal nor closed. The team of students should
begin by working on the open issues of the specification. They should take
the appropriate design decisions to achieve a working prototype. This prototype
will not necessarily implement all the aspects mentioned here, but the
design team should be able to explain:
- the main concepts and aspects they focussed on
- how the design was divided among the members of the team
- how the interface specification was arrived at
- how the testing (including plan of testing, testing environment, test cases)
  was carried out.
Cooperation: a must
Cooperation among the agents is a must for the design.
The design team could think about the way they themselves cooperate
and try to understand cooperation among agents in the same terms, so as
to extract conclusions valid for both kinds of cooperation.