The news business is based on producing news items about current events and delivering them to customers. Customers want to receive information about events as soon as they occur. Customers do not want to be bothered with useless information, that is, they want to get information only about events of interest.
The NEWS EU IST project aims to provide solutions that help news agencies overcome limitations in their current workflows and increase their productivity and revenues. To reach this aim, the NEWS project makes use of state-of-the-art Semantic Web technologies.
To apply Semantic Web technologies to the news domain, the NEWS project developed a set of components. One of them is the NEWS ontology, a lightweight RDFS ontology providing a formal model of the domain. Another is an annotation component, which uses natural language processing techniques to provide capabilities such as categorization and named entity extraction.
Within the semantic annotation process, one of the key problems that we found in NEWS was the disambiguation of the entities detected by the natural language processing engine. This engine extracts named entities from the news items, but to allow fine-grained semantic search for users of the NEWS system, these entities have to be matched against instances of the NEWS ontology. That is, the natural language processing engine can detect that a certain occurrence of the text "Bush" represents a person, but we also need to deduce which URI represents this person in the NEWS ontology.
To deal with this problem, the NEWS consortium has developed the IdentityRank algorithm. This algorithm uses all the information provided by the natural language processing engine (categories, entities), together with the news item timestamp, as context for entity disambiguation. It is based on two principles:
- Semantic coherence: Instances typically occur in news items of certain categories, e.g. President Obama in news items of the politics category. The occurrence of a certain instance also gives information about the occurrence of other instances. For example, the Spanish F1 driver Fernando Alonso usually appears in news items where the F1 team Renault is also mentioned.
- News trends: Important events are typically described by several news items covering a certain period of time. For instance, when the former Pope died, news items describing the event were written over several days, most of them including instances such as Vatican or John Paul II.
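The two principles above can be illustrated with a minimal Java sketch. This is not the IdentityRank v1.0 implementation: the class and method names, the `urn:example:` URIs, the unit weights and the seven-day half-life are all assumptions made for illustration. Each candidate instance is scored by how many of the news item's categories and already-identified instances it is known to co-occur with (semantic coherence), plus a bonus that decays with the time since the candidate last appeared (news trends).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; NOT the IdentityRank v1.0 code. All names are hypothetical. */
final class DisambiguationSketch {

    /** A candidate ontology instance with assumed context statistics. */
    static final class Candidate {
        final String uri;                    // hypothetical instance URI
        final Set<String> typicalCategories; // categories it usually appears in
        final Set<String> coOccurringUris;   // instances it usually appears with
        final long lastSeenDay;              // day number of its latest news item

        Candidate(String uri, Set<String> typicalCategories,
                  Set<String> coOccurringUris, long lastSeenDay) {
            this.uri = uri;
            this.typicalCategories = typicalCategories;
            this.coOccurringUris = coOccurringUris;
            this.lastSeenDay = lastSeenDay;
        }
    }

    // Assumed parameters; a real system would tune or learn these.
    static final double TREND_HALF_LIFE_DAYS = 7.0;

    /** Semantic coherence (category and entity overlap) plus a decaying trend bonus. */
    static double score(Candidate c, Set<String> itemCategories,
                        Set<String> itemEntityUris, long itemDay) {
        long categoryHits = itemCategories.stream()
                .filter(c.typicalCategories::contains).count();
        long entityHits = itemEntityUris.stream()
                .filter(c.coOccurringUris::contains).count();
        long ageDays = Math.max(0L, itemDay - c.lastSeenDay);
        double trend = Math.pow(0.5, ageDays / TREND_HALF_LIFE_DAYS);
        return categoryHits + entityHits + trend;
    }

    /** Returns the highest-scoring candidate, or null if the list is empty. */
    static Candidate best(List<Candidate> candidates, Set<String> itemCategories,
                          Set<String> itemEntityUris, long itemDay) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double s = score(c, itemCategories, itemEntityUris, itemDay);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two hypothetical candidates for the surface form "Alonso".
        Candidate driver = new Candidate("urn:example:FernandoAlonso",
                Collections.singleton("sport"),
                Collections.singleton("urn:example:RenaultF1"), 100L);
        Candidate other = new Candidate("urn:example:OtherAlonso",
                Collections.singleton("politics"),
                Collections.<String>emptySet(), 50L);

        // A sports news item that also mentions the Renault team.
        Set<String> categories = new HashSet<>(Arrays.asList("sport"));
        Set<String> entities = new HashSet<>(Arrays.asList("urn:example:RenaultF1"));

        Candidate chosen = best(Arrays.asList(driver, other), categories, entities, 101L);
        System.out.println(chosen.uri);
    }
}
```

Summing the three signals with equal weight is only the simplest way to combine them; the released source code and Javadoc listed below describe how IdentityRank actually does it.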
| Download IdentityRank | IdentityRank source code (v1.0, Java) |
| Javadoc documentation | Javadoc (v1.0) |
| EU-IST project NEWS | NEWS homepage @ UC3M |
| Contact person | Norberto Fernández |
| Institution | Web technologies lab (webTlab), Telematics Engineering Department, Universidad Carlos III de Madrid |