Search and Catalogue Web agents


This project concerns the design of two agents, a journalist and a cataloguer: autonomous tools devoted to retrieving, cataloguing and storing web information in databases.

Working specification

Introduction

The search for information on the World Wide Web is becoming very popular. The web provides fast and easy access to a huge and continuously growing volume of information. Publishing on the web is easy, and a wealth of organizations and people are contributing to its growth. The information, however, is scattered piecewise over many locations, in many different formats, and is in continuous change. Searching for it has become a difficult task, and automatic search engines such as AltaVista, HotBot, or MetaCrawler are a point of reference for looking up information on the web.

The press is also present on the web, and the net surfer will encounter many initiatives by different newspapers. Communities which share the same interests, for instance computer scientists or physicians, benefit from their own specialized on-line publications. Strides are being made towards the personalization of newspapers, so that users can access their own customized version of the news, built up for them by servers fed with their profiles. However, there remains the problem of interfacing with multiple sources (different newspapers) in different formats. There is a need for automatic tools which can retrieve, catalogue and store this information on behalf of the user. These automatic tools (agents) should be able to learn of new sources (sites) of information and acquire new rules for cataloguing the information.

One example application of these tools is a group of medical doctors for whom one (or more) journalist agents retrieve information from specialised sites. A catalogue agent classifies and structures the information in order to facilitate later queries to the database. Such queries ask for references and details pertaining to diseases, new treatments, where they have been applied, and with what results.

Objectives

The main goal of this work is to develop agents with the following capabilities:

Description of the agents

In the project two kinds of agents will be developed: journalist agents and catalogue agents.

Journalist agents are responsible for retrieving information. They maintain tables with the addresses (URLs) of previously identified publications of interest, their periodicity and relevant abstract layout information. They have a schedule of virtual visits and maintain logs of each visit. They should have a knowledge base with rules for scheduling visits and incorporating new URLs into their tables. They should be able to learn and improve the schedule of visits, and discover which are the best sites from which to retrieve information. They should be able to interact with other agents in order to cooperate and to compete with them (weighting and filtering the information received from other agents according to their own criteria - what is of interest to an American agent may not be of equal interest to a Spanish one!).
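The address table and visit schedule described above could be sketched as follows. This is a minimal illustration, not part of the specification: the field names, the choice of an in-memory dataclass, and the scheduling rule (next visit = last visit plus periodicity) are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical representation of one row of a journalist agent's address table.
@dataclass
class SourceEntry:
    url: str                                # address of the publication
    periodicity: timedelta                  # how often new issues appear
    layout_schema: str                      # abstract layout information (free-form here)
    last_visit: Optional[datetime] = None   # None means never visited
    visit_log: List[str] = field(default_factory=list)  # log of past visits

def next_visit(entry: SourceEntry, now: datetime) -> datetime:
    """Schedule the next virtual visit from the periodicity and the last visit."""
    if entry.last_visit is None:
        return now  # never visited: go as soon as possible
    return entry.last_visit + entry.periodicity
```

A learning agent would refine this rule over time, e.g. by shortening the periodicity of sources that frequently yield useful material.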

The journalist agents should implement two interfaces:

  1. Consult Interface (CI)

     The agents should implement procedures which present their tables of addresses (URL, periodicity, schema), the plan (and logs) of visits, and where their information is stored.

  2. Administering Interface (AI)

     The agents should implement procedures to modify, augment and remove entries in the table of addresses (URL, periodicity, schema), the plan (and logs) of visits, and the place where the information is stored.
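The two interfaces could be sketched as one class exposing read-only (CI) and modifying (AI) procedures. The method names, the dictionary representation, and the storage-location string are illustrative assumptions, not part of the specification.

```python
# Hypothetical sketch of the journalist agent's two interfaces.
class JournalistAgent:
    def __init__(self):
        self.addresses = {}        # URL -> (periodicity, schema)
        self.visit_plan = []       # scheduled visits
        self.visit_logs = []       # records of completed visits
        self.storage = "db://raw"  # assumed location of the retrieved information

    # --- Consult Interface (CI): read-only views of the agent's state ---
    def get_addresses(self):
        return dict(self.addresses)

    def get_plan_and_logs(self):
        return list(self.visit_plan), list(self.visit_logs)

    def get_storage_location(self):
        return self.storage

    # --- Administering Interface (AI): modify, augment and remove ---
    def add_address(self, url, periodicity, schema):
        self.addresses[url] = (periodicity, schema)

    def remove_address(self, url):
        self.addresses.pop(url, None)

    def set_storage_location(self, location):
        self.storage = location
```

Separating the two interfaces keeps consultation safe for any caller, while administrative operations can later be restricted to authorized agents or users.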

Catalogue agents are responsible for classifying information. They should have an API which enables them to connect to different databases. They can have different Document Type Definitions (DTDs) for different areas of interest, so that each area (for example: medicine) gets its own syntax (expressed by a DTD) and labelled concepts. The catalogue agents get the raw information to be classified from the journalist agents. The catalogue agents could use different algorithms through an API which enables the incorporation of new algorithms. They should abstract new relations in order to catalogue the information better. Ideally, the information catalogued would be independent of the language.
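As a minimal illustration of an area-specific DTD, a sketch for the medicine area might look like the fragment below. The element and attribute names are assumptions chosen to match the example queries (diseases, treatments, where applied, results); the actual DTDs are an open design decision.

```xml
<!-- Hypothetical DTD for catalogued medical items -->
<!ELEMENT catalogue (item*)>
<!ELEMENT item (disease, treatment*, result?)>
<!ATTLIST item source CDATA #REQUIRED>
<!ELEMENT disease (#PCDATA)>
<!ELEMENT treatment (#PCDATA)>
<!ATTLIST treatment applied-at CDATA #IMPLIED>
<!ELEMENT result (#PCDATA)>
```

Each area of interest would carry its own such DTD, so documents catalogued for physicians and for computer scientists can use different labelled concepts while being processed by the same agent machinery.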

The catalogue agents should implement two interfaces:

  1. Interface for Queries (IQ)

     The agents should implement procedures which generate statistics of classification (relations among words learnt, algorithms used, etc.), and report the databases in which they are storing information, the DTDs they are using, and the places and agents they are getting information from.

  2. Administrative Interface (AI)

     The agents should implement procedures to modify, augment and remove the algorithms they use, the relations among words, the databases that store the information, the DTDs, and the places and agents they are getting information from.
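The pluggable-algorithm API and the two catalogue-agent interfaces could be sketched as below. The registry design, method names and statistics format are assumptions for illustration, not part of the specification.

```python
# Hypothetical sketch of a catalogue agent: classification algorithms are
# registered behind a small API so that new ones can be incorporated later.
class CatalogueAgent:
    def __init__(self):
        self.algorithms = {}   # name -> callable(text) -> list of labels
        self.dtds = {}         # area of interest -> DTD identifier
        self.databases = []    # connections to the databases used for storage
        self.sources = []      # journalist agents providing raw information

    # --- Administrative Interface (AI): modify, augment and remove ---
    def register_algorithm(self, name, func):
        self.algorithms[name] = func

    def remove_algorithm(self, name):
        self.algorithms.pop(name, None)

    def set_dtd(self, area, dtd):
        self.dtds[area] = dtd

    # --- Interface for Queries (IQ): classification and statistics ---
    def classify(self, text, algorithm):
        return self.algorithms[algorithm](text)

    def statistics(self):
        return {"algorithms": list(self.algorithms),
                "dtds": dict(self.dtds),
                "databases": list(self.databases)}
```

The same registry pattern would let the team start with a simple keyword classifier and later swap in learning-based algorithms without changing the agent's interfaces.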

Both kinds of agents should cooperate in both directions, i.e. the catalogue agents get information from the journalist agents, and the journalist agents can be told by the catalogue agents about the usefulness of the information they retrieved. A logical consequence of this cooperation is the specialization of pairs of agents with respect to the type of information they deal with (searching and classifying).
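The feedback direction of this cooperation could be sketched as follows: the catalogue agent reports a usefulness score for each source, and the journalist agent reweights its sources accordingly. The moving-average rule and its coefficients are assumptions for illustration only.

```python
# Hypothetical sketch of a journalist agent reweighting its sources from
# usefulness feedback reported by a catalogue agent.
class SourceWeights:
    def __init__(self):
        self.weights = {}  # URL -> usefulness weight in [0, 1]

    def feedback(self, url, usefulness):
        """Blend new feedback into the running weight (exponential moving average)."""
        old = self.weights.get(url, 0.5)  # unknown sources start at a neutral 0.5
        self.weights[url] = 0.8 * old + 0.2 * usefulness

    def best_sources(self, n):
        """Return the n sources currently judged most useful."""
        return sorted(self.weights, key=self.weights.get, reverse=True)[:n]
```

Over many feedback cycles a journalist/catalogue pair would thus specialize: the journalist concentrates its visits on the sources whose material the cataloguer finds most useful.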

Technologies involved

The main technologies we assume will be used in the design are the following:

Open Issues

This specification is neither formal nor closed. The team of students should begin by working on the open issues of the specification. They should take the appropriate design decisions to achieve a working prototype. This prototype will not necessarily implement all the aspects mentioned here, but the design team should be able to explain:

Cooperation: a must

Cooperation among the agents is a must for the design.

The design team could think about the way they themselves cooperate and try to understand cooperation among agents in the same terms, so as to extract conclusions valid for both kinds of cooperation.