Search and Catalogue Web agents


This project concerns the design of two agents, a journalist and a cataloguer: autonomous tools devoted to retrieving, cataloguing and storing web information in databases.

Working specification

Introduction

The search for information on the World Wide Web is becoming very popular. The web provides fast and easy access to a huge and continuously growing volume of information. Publishing on the web is easy, and a wealth of organizations and people are contributing to its growth. The information, however, is scattered piecewise over many locations, in many different formats, and is in continuous change. Searching for it has become a difficult task, and automatic search engines such as AltaVista, HotBot, or MetaCrawler are a point of reference for looking up information on the web.

The press is also present on the web, and the net surfer will encounter many initiatives by different newspapers. Communities which share the same interests, for instance computer scientists or physicians, benefit from their own specialized on-line publications. Strides are being made towards the personalization of newspapers, so that users can access their own customized version of the news, built up for them by servers fed with their profiles. However, there remains the problem of interfacing with multiple sources (different newspapers) in different formats. There is a need for automatic tools which can retrieve, catalogue and store this information on behalf of the user. These automatic tools (agents) should be able to learn of new sources (sites) of information and acquire new rules for cataloguing the information.

One example application of these tools is a group of medical doctors for whom one (or more) journalist agents retrieve information from specialised sites. A catalogue agent classifies and structures the information in order to facilitate later queries to the database. Such queries ask for references and details pertaining to diseases, new treatments, where they have been applied, and with what results.

Objectives

The main goal of this work is to develop agents with the following capabilities:

Description of the agents

In the project two kinds of agents will be developed: journalist agents and catalogue agents.

Journalist agents are responsible for retrieving information. They maintain tables with the addresses (URLs) of previously identified publications of interest, their periodicity and relevant abstract layout information. They have a schedule of virtual visits and maintain logs of each visit. They should have a knowledge base with rules for scheduling visits and incorporating new URLs into their tables. They should be able to learn and improve the schedule of visits, and discover which are the best sites from which to retrieve information. They should be able to interact with other agents in order to cooperate and to compete with them (weighting and filtering the information received from other agents according to their own criteria - what is of interest to an American agent may not be of equal interest to a Spanish one!).
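The address table and visit schedule described above could be sketched as follows. This is a minimal illustration, not part of the specification: the field names, the choice of an in-memory dataclass, and the scheduling rule (next visit = last visit plus periodicity) are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical representation of one row of a journalist agent's address table.
@dataclass
class SourceEntry:
    url: str                                # address of the publication
    periodicity: timedelta                  # how often new issues appear
    layout_schema: str                      # abstract layout information (free-form here)
    last_visit: Optional[datetime] = None   # None means never visited
    visit_log: List[str] = field(default_factory=list)  # log of past visits

def next_visit(entry: SourceEntry, now: datetime) -> datetime:
    """Schedule the next virtual visit from the periodicity and the last visit."""
    if entry.last_visit is None:
        return now  # never visited: go as soon as possible
    return entry.last_visit + entry.periodicity
```

A learning agent would refine this rule over time, e.g. by shortening the periodicity of sources that frequently yield useful material.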

The journalist agents should implement two interfaces:

  1. Consult Interface (CI)

     The agents should implement procedures which present their tables of addresses (URL, periodicity, schema), the plan (and logs) of visits, and where their information is stored.

  2. Administering Interface (AI)

     The agents should implement procedures to modify, augment and remove entries in the table of addresses (URL, periodicity, schema), the plan (and logs) of visits, and the place where the information is stored.
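The two interfaces could be sketched as one class exposing read-only (CI) and modifying (AI) procedures. The method names, the dictionary representation, and the storage-location string are illustrative assumptions, not part of the specification.

```python
# Hypothetical sketch of the journalist agent's two interfaces.
class JournalistAgent:
    def __init__(self):
        self.addresses = {}        # URL -> (periodicity, schema)
        self.visit_plan = []       # scheduled visits
        self.visit_logs = []       # records of completed visits
        self.storage = "db://raw"  # assumed location of the retrieved information

    # --- Consult Interface (CI): read-only views of the agent's state ---
    def get_addresses(self):
        return dict(self.addresses)

    def get_plan_and_logs(self):
        return list(self.visit_plan), list(self.visit_logs)

    def get_storage_location(self):
        return self.storage

    # --- Administering Interface (AI): modify, augment and remove ---
    def add_address(self, url, periodicity, schema):
        self.addresses[url] = (periodicity, schema)

    def remove_address(self, url):
        self.addresses.pop(url, None)

    def set_storage_location(self, location):
        self.storage = location
```

Separating the two interfaces keeps consultation safe for any caller, while administrative operations can later be restricted to authorized agents or users.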

Catalogue agents are responsible for classifying information. They should have an API which enables them to connect to different databases. They can have different Document Type Definitions (DTDs) for different areas of interest, so that each area (for example: medicine) gets its own syntax (expressed by a DTD) and labelled concepts. The catalogue agents get the raw information to be classified from the journalist agents. The catalogue agents could use different algorithms through an API which enables the incorporation of new algorithms. They should abstract new relations in order to catalogue the information better. Ideally, the information catalogued would be independent of the language.
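As a minimal illustration of an area-specific DTD, a sketch for the medicine area might look like the fragment below. The element and attribute names are assumptions chosen to match the example queries (diseases, treatments, where applied, results); the actual DTDs are an open design decision.

```xml
<!-- Hypothetical DTD for catalogued medical items -->
<!ELEMENT catalogue (item*)>
<!ELEMENT item (disease, treatment*, result?)>
<!ATTLIST item source CDATA #REQUIRED>
<!ELEMENT disease (#PCDATA)>
<!ELEMENT treatment (#PCDATA)>
<!ATTLIST treatment applied-at CDATA #IMPLIED>
<!ELEMENT result (#PCDATA)>
```

Each area of interest would carry its own such DTD, so documents catalogued for physicians and for computer scientists can use different labelled concepts while being processed by the same agent machinery.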

The catalogue agents should implement two interfaces:

  1. Interface for Queries (IQ)

     The agents should implement procedures which generate statistics of classification (relations among words learnt, algorithms used, etc.), and report the databases in which they are storing information, the DTDs they are using, and the places and agents they are getting information from.

  2. Administrative Interface (AI)

     The agents should implement procedures to modify, augment and remove the algorithms they use, the relations among words, the databases that store the information, the DTDs, and the places and agents they are getting information from.
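The pluggable-algorithm API and the two catalogue-agent interfaces could be sketched as below. The registry design, method names and statistics format are assumptions for illustration, not part of the specification.

```python
# Hypothetical sketch of a catalogue agent: classification algorithms are
# registered behind a small API so that new ones can be incorporated later.
class CatalogueAgent:
    def __init__(self):
        self.algorithms = {}   # name -> callable(text) -> list of labels
        self.dtds = {}         # area of interest -> DTD identifier
        self.databases = []    # connections to the databases used for storage
        self.sources = []      # journalist agents providing raw information

    # --- Administrative Interface (AI): modify, augment and remove ---
    def register_algorithm(self, name, func):
        self.algorithms[name] = func

    def remove_algorithm(self, name):
        self.algorithms.pop(name, None)

    def set_dtd(self, area, dtd):
        self.dtds[area] = dtd

    # --- Interface for Queries (IQ): classification and statistics ---
    def classify(self, text, algorithm):
        return self.algorithms[algorithm](text)

    def statistics(self):
        return {"algorithms": list(self.algorithms),
                "dtds": dict(self.dtds),
                "databases": list(self.databases)}
```

The same registry pattern would let the team start with a simple keyword classifier and later swap in learning-based algorithms without changing the agent's interfaces.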

Both kinds of agents should cooperate in both directions, i.e. the catalogue agents get information from the journalist agents, and the journalist agents can be told by the catalogue agents about the usefulness of the information they retrieved. A logical consequence of this cooperation is the specialization of pairs of agents with respect to the type of information they deal with (searching and classifying).
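The feedback direction of this cooperation could be sketched as follows: the catalogue agent reports a usefulness score for each source, and the journalist agent reweights its sources accordingly. The moving-average rule and its coefficients are assumptions for illustration only.

```python
# Hypothetical sketch of a journalist agent reweighting its sources from
# usefulness feedback reported by a catalogue agent.
class SourceWeights:
    def __init__(self):
        self.weights = {}  # URL -> usefulness weight in [0, 1]

    def feedback(self, url, usefulness):
        """Blend new feedback into the running weight (exponential moving average)."""
        old = self.weights.get(url, 0.5)  # unknown sources start at a neutral 0.5
        self.weights[url] = 0.8 * old + 0.2 * usefulness

    def best_sources(self, n):
        """Return the n sources currently judged most useful."""
        return sorted(self.weights, key=self.weights.get, reverse=True)[:n]
```

Over many feedback cycles a journalist/catalogue pair would thus specialize: the journalist concentrates its visits on the sources whose material the cataloguer finds most useful.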

Technologies involved

The main technologies we assume will be used in the design are the following:

Open Issues

This specification is neither formal nor closed. The team of students should begin by working on the open issues of the specification. They should take the appropriate design decisions to achieve a working prototype. This prototype will not necessarily implement all the aspects mentioned here, but the design team should be able to explain:

Cooperation: a must

Cooperation among the agents is a must for the design.

The design team could think about the way they themselves cooperate and try to understand cooperation among agents in the same terms, so as to extract conclusions valid for both kinds of cooperation.