Mass-customizing electronic journals

Vicente Luque Centeno, Mª Carmen Fernández Panadero, Carlos Delgado Kloos,
Andrés Marín López, Carlos García Rubio, Luis Sánchez Fernández, Arturo García Ares

Universidad Carlos III de Madrid
Área Ingeniería Telemática,
Dept. Tecnologías de las Comunicaciones

Avda de la Universidad, 30
Leganés Madrid Spain E-28911
per@it.uc3m.es www.it.uc3m.es/~per

Abstract

The evolution of the WWW has opened the way to putting information at the fingertips of the whole world with very little effort. As the amount of information available grows, there is an ever increasing demand for personalized information. In this paper, we present some ideas that we are developing in the project "El Periotrónico", where we take a new approach to electronic newspapers. We are taking advantage of new Web technologies to personalize both a newspaper's content and interface layout according to users' preferences.

Introduction

The current Web chaos can be considered a consequence of HTML. HTML allows separating the content, presentation (CSS), and behaviour (Java and JavaScript files), but it is not rich enough to describe the logical structure of the document and does not take full advantage of processing capabilities on the client side. New Web technologies like XML, XSL, XLL, DOM, Java, and JavaScript solve some of these problems.

One of our main design decisions is to use XML to define our own markup language, JML (Journalism Markup Language), to properly tag the journal content, its logical structure, and its metadata. This allows new articles to be self-describing, which in turn allows more precise search criteria to be applied.

Likewise, we are defining JPML (Journalism Preferences Markup Language), also based on XML, to specify the user's interests. The reader of a newspaper indicates his/her preferred topics. These are saved in a JPML document so that the system shows him/her only those pieces of JML news that match his/her preferences.

The introduction of metadata into the news media affects the way news is created, selected, and retrieved. Journalists are no longer constrained by the physical amount of space in the printed newspaper and have new ways to present information (multimedia content). Journalists need to indicate the importance level of every news element according to the reader's characteristics. New tags and attributes are needed for highlighting text in a personalized manner, defining target readers, and indicating the expected level of importance assumed by the journalist.

Since journalists have to include JML tags in their articles, a JML editor should be provided for them. New IBM tools that deal with XML include a program that generates an XML editor adapted to a user-defined DTD. The automatically generated JML editor can be extended with JDBC routines that insert the JML documents into a SQL database. The editor can also manage images for illustration and advertising.

The next sections describe the benefits of using an XML technology like XSL in the journal generation process, some journal personalization details used in our project, and JML and JPML as proposed XML applications for news markup and personalization markup. Finally, some details about the joint evolution of Web technology and digital TV are presented, followed by some conclusions and future work.

Journal generation

While XML defines the logical structure of the news, XSL allows specifying the formats for different news. One of the main advantages of using XSL instead of CSS is the possibility of specifying a transformation step before the formatting in order to achieve not only a different format but also a different physical structure. With the same XML document but different style sheets we can generate different versions of the newspaper with structure and formatting properties that match different information spaces (printed version, online version on broadband networks, online version on networks with smaller bandwidth, etc.). In particular, we can convert XML documents to HTML, SMIL, and possibly in the future to MHEG for display via a set-top box on a TV set. After the transformation step, style sheet rules specify the format of the document.

Journal generation from XML format
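
The idea of deriving several physical structures from one logical document can be sketched in a few lines. The following is only an illustration under stated assumptions: plain Python functions stand in for XSL style sheets (no XSL processor is involved), and the function names to_full_html and to_headline_html, as well as the HTML layout they produce, are hypothetical and not part of the project.

```python
import xml.etree.ElementTree as ET

# A JML article like the one shown later in this paper.
JML_SOURCE = """<JML>
  <JML_AUTHOR value="Maruja Torres"/>
  <JML_PLACE value="Madrid"/>
  <JML_DATE value="09-06-1998"/>
  <JML_TITLE>This is the title</JML_TITLE>
  <JML_ABSTRACT>This is the abstract</JML_ABSTRACT>
  <JML_BODY>
    <P importance_level="general">This <B>is</B> the body</P>
  </JML_BODY>
</JML>"""


def to_full_html(jml):
    """Hypothetical broadband version: title, author, abstract, all paragraphs."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = jml.findtext("JML_TITLE")
    ET.SubElement(body, "address").text = jml.find("JML_AUTHOR").get("value")
    abstract = jml.findtext("JML_ABSTRACT")
    if abstract:
        ET.SubElement(body, "p", {"class": "abstract"}).text = abstract
    for p in jml.iter("P"):
        # Inline B/I markup is flattened to plain text to keep the sketch short.
        ET.SubElement(body, "p").text = "".join(p.itertext()).strip()
    return ET.tostring(html, encoding="unicode")


def to_headline_html(jml):
    """Hypothetical low-bandwidth version: title and abstract only."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = jml.findtext("JML_TITLE")
    ET.SubElement(body, "p").text = jml.findtext("JML_ABSTRACT") or ""
    return ET.tostring(html, encoding="unicode")


article = ET.fromstring(JML_SOURCE)
full_version = to_full_html(article)
short_version = to_headline_html(article)
```

Both outputs come from the same source document; only the transformation differs, which is the property the text attributes to XSL.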

Personalization

Personalization applies not only to style and layout, but also to content. We have implemented a personalization agent written in JavaScript that customizes the news content on the client side. Readers can subscribe to different sections. Every section contains a list of references to news articles published in that section. The personalization agent highlights headlines according to the reader's preferences.

Although the reader can specify his/her preferences statically in a form, Web technology allows dynamic personalization too. The object-oriented model of XML documents and the DOM standard provide a suitable basis for structuring information for further processing by languages like Java or JavaScript. Using these languages allows the document to interact with the reader. The system can automatically detect the behaviour of the user and analyze it to dynamically modify the configuration parameters.
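
One possible form of such dynamic personalization is sketched below. It is a minimal illustration, not the agent used in the project (which is written in JavaScript): the class ReadingTracker, its threshold parameter, and the policy of turning frequently read sections into JPML rules are all assumptions made for the example.

```python
from collections import Counter
import xml.etree.ElementTree as ET


class ReadingTracker:
    """Hypothetical helper: counts article reads per section."""

    def __init__(self, threshold=3):
        # Assumed tuning parameter: reads needed before a section is preferred.
        self.threshold = threshold
        self.reads = Counter()

    def record_read(self, section):
        self.reads[section] += 1

    def preferred_sections(self):
        return sorted(s for s, n in self.reads.items() if n >= self.threshold)

    def to_jpml(self):
        """Emit one RULE holding a single section ATOM per preferred section."""
        root = ET.Element("JPML")
        for section in self.preferred_sections():
            rule = ET.SubElement(
                root, "RULE", {"description": "learned from reading history"})
            ET.SubElement(rule, "ATOM", {"key": "section", "value": section})
        return ET.tostring(root, encoding="unicode")


tracker = ReadingTracker(threshold=3)
for section in ["sports", "finances", "sports", "culture", "sports"]:
    tracker.record_read(section)
learned = tracker.to_jpml()
```

Here only the section read three times ends up in the generated preferences; a real agent would presumably also age out old observations.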

Journalism Markup Language (JML)

The purpose of this markup language is to properly tag the journal's contents and its metadata so that four different aims can be achieved.

  1. News articles can be self-describing, so that they are properly handled in the personalization process.
  2. The news archive can be searched by combining matching criteria, so that results are refined rather than including every news article that merely contains the search term somewhere in its text.
  3. Different style rules can be applied to the same document, so that the same document can be viewed with a different layout in a personalized manner.
  4. Journalists require a method for indicating the importance level they consider every news element might have, maybe depending on the kind of reader. New tags and attributes for highlighting text in a personalized manner, defining target readers and indicating the expected level of importance assumed by the journalist are needed.

The first listing below shows a reduced version of the JML DTD grammar, and the second shows a small example of a news article tagged in JML.

<!ELEMENT JML (JML_AUTHOR, JML_PLACE?, JML_DATE?,
	JML_TITLE, JML_ABSTRACT?, JML_BODY)>

<!ELEMENT JML_AUTHOR EMPTY>
<!ATTLIST JML_AUTHOR value CDATA #IMPLIED>

<!ELEMENT JML_PLACE EMPTY>
<!ATTLIST JML_PLACE value CDATA #IMPLIED>

<!ELEMENT JML_DATE EMPTY>
<!ATTLIST JML_DATE value CDATA #IMPLIED>

<!ELEMENT JML_TITLE (#PCDATA)>

<!ELEMENT JML_ABSTRACT (#PCDATA)>

<!ELEMENT JML_BODY (#PCDATA|P)*>

<!ELEMENT P (#PCDATA|B|I)*>
<!ATTLIST P importance_level CDATA #IMPLIED>

<!ELEMENT B (#PCDATA)*>
<!ELEMENT I (#PCDATA)*>
JML DTD grammar
<?xml version="1.0"?>
<!DOCTYPE JML SYSTEM "jml.dtd">
<JML>
  <JML_AUTHOR value="Maruja Torres"/>
  <JML_PLACE value="Madrid"/>
  <JML_DATE value="09-06-1998"/>
  <JML_TITLE>This is the title</JML_TITLE>
  <JML_ABSTRACT>This is the abstract</JML_ABSTRACT>
  <JML_BODY>
        <P importance_level="general">  
        This <B>is</B> the body</P>
  </JML_BODY>
</JML>
Example of JML document
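
To illustrate how self-describing articles allow matching criteria to be combined, the following sketch searches a tiny in-memory archive by metadata instead of by full text. It is an assumption-laden example: the second article, the ARCHIVE list, and the helper names metadata and search are invented for illustration, and exact string equality is only one of many possible matching policies.

```python
import xml.etree.ElementTree as ET

# Two JML articles; the second one is made up for this example.
ARCHIVE = [
    """<JML><JML_AUTHOR value="Maruja Torres"/><JML_PLACE value="Madrid"/>
    <JML_DATE value="09-06-1998"/><JML_TITLE>This is the title</JML_TITLE>
    <JML_BODY><P importance_level="general">This <B>is</B> the body</P>
    </JML_BODY></JML>""",
    """<JML><JML_AUTHOR value="Clark Kent"/><JML_PLACE value="Barcelona"/>
    <JML_TITLE>Another title</JML_TITLE>
    <JML_BODY><P>A report that merely mentions Madrid in passing</P>
    </JML_BODY></JML>""",
]


def metadata(jml):
    """Return the header fields of a JML article as a dictionary."""
    fields = {}
    for tag in ("JML_AUTHOR", "JML_PLACE", "JML_DATE"):
        element = jml.find(tag)
        if element is not None:  # JML_PLACE and JML_DATE are optional
            fields[tag] = element.get("value")
    fields["JML_TITLE"] = jml.findtext("JML_TITLE")
    return fields


def search(archive, **criteria):
    """Return titles of articles whose metadata equals every criterion."""
    hits = []
    for source in archive:
        fields = metadata(ET.fromstring(source))
        if all(fields.get(tag) == wanted for tag, wanted in criteria.items()):
            hits.append(fields["JML_TITLE"])
    return hits


by_place = search(ARCHIVE, JML_PLACE="Madrid")
by_place_and_author = search(ARCHIVE, JML_PLACE="Madrid", JML_AUTHOR="Clark Kent")
```

The second article contains the word "Madrid" in its body, so a plain text search would return it, whereas the query on JML_PLACE does not.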

Journalism Personalization Markup Language (JPML)

JPML has been defined to specify the user's interests. Preferences determine the way headlines are shown (highlighted, collapsed, inline, linked, ...). However, the reader can also make explicit requests that do not match the preferences. The first listing below shows a simple example of a reader's preferences, and the second gives the DTD grammar for this markup language.

<?xml version="1.0"?>
<!DOCTYPE JPML SYSTEM "jpml.dtd">
<JPML>
<RULE>
        <ATOM key="keyword" value="euro"/>
        <ATOM key="section" value="finances" negated="true"/>
</RULE>
<RULE>
        <ATOM key="keyword" value="Real Madrid"/>
        <ATOM key="keyword" value="Champions League"/>
</RULE>
<RULE>
        <ATOM key="author" value="Clark Kent"/>
        <ATOM key="keyword" value="Ecology"/>
</RULE>
</JPML>
JPML example
<!ELEMENT JPML (RULE)*>

<!ENTITY % match "(starts_with|ends_with|substring|fullword|is_equal_to)" >
<!ELEMENT RULE (ATOM)*>
<!ATTLIST RULE
         enabled (true|false) "true"
         description CDATA #IMPLIED
         action CDATA #IMPLIED
>

<!ELEMENT ATOM EMPTY>
<!ATTLIST ATOM
         key CDATA #REQUIRED
         value CDATA #REQUIRED
         ignorecase (true|false) "true"
         ignoreaccents (true|false) "true"
         negated (true|false) "false"
         matching %match; "substring"
>
JPML DTD

The meaning of condition attributes is described below:

Besides that, rules also define the following attributes:
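
As a rough illustration of how the attributes declared in the DTD above might be interpreted, the following sketch evaluates JPML rules against an article. Several points are assumptions rather than part of the definition given here: atoms inside a RULE are combined with AND, enabled rules are combined with OR, a rule without atoms never matches, and an article is represented as a dictionary from keys such as "keyword" or "section" to lists of strings. The action attribute is not interpreted.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Defaults copied from the ATTLIST declarations; ElementTree does not read the DTD.
ATOM_DEFAULTS = {"ignorecase": "true", "ignoreaccents": "true",
                 "negated": "false", "matching": "substring"}


def normalize(text, ignorecase, ignoreaccents):
    if ignoreaccents:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    return text.lower() if ignorecase else text


def compare(candidate, value, matching):
    if matching == "starts_with":
        return candidate.startswith(value)
    if matching == "ends_with":
        return candidate.endswith(value)
    if matching == "fullword":
        return re.search(r"\b" + re.escape(value) + r"\b", candidate) is not None
    if matching == "is_equal_to":
        return candidate == value
    return value in candidate  # "substring", the declared default


def atom_matches(atom, article):
    attrs = {**ATOM_DEFAULTS, **atom.attrib}
    ignorecase = attrs["ignorecase"] == "true"
    ignoreaccents = attrs["ignoreaccents"] == "true"
    value = normalize(attrs["value"], ignorecase, ignoreaccents)
    candidates = [normalize(c, ignorecase, ignoreaccents)
                  for c in article.get(attrs["key"], [])]
    found = any(compare(c, value, attrs["matching"]) for c in candidates)
    return found != (attrs["negated"] == "true")  # negation flips the result


def article_matches(jpml, article):
    """Assumed semantics: atoms in a RULE are ANDed, enabled RULEs are ORed."""
    for rule in jpml.findall("RULE"):
        atoms = rule.findall("ATOM")
        enabled = rule.get("enabled", "true") == "true"
        if enabled and atoms and all(atom_matches(a, article) for a in atoms):
            return True
    return False


PREFERENCES = ET.fromstring("""<JPML>
<RULE>
  <ATOM key="keyword" value="euro"/>
  <ATOM key="section" value="finances" negated="true"/>
</RULE>
<RULE>
  <ATOM key="author" value="Clark Kent"/>
  <ATOM key="keyword" value="Ecology"/>
</RULE>
</JPML>""")

# Hypothetical article descriptions used only to exercise the rules.
politics = {"section": ["politics"], "keyword": ["Euro", "summit"],
            "author": ["Maruja Torres"]}
finances = {"section": ["finances"], "keyword": ["euro"],
            "author": ["Maruja Torres"]}
```

Under these assumptions the first rule selects articles about the euro that appear outside the finances section, so the politics article matches and the finances article does not.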

Integration of Web Technology in Digital Television

There is currently considerable activity around the integration of Web-based technology in digital television. This integration offers advantages both to Internet content providers and to digital television companies. Digital television offers the possibility of integrating audio, video, and data in real time, together with processing capabilities at the customer location by means of set-top boxes. This opens the possibility of offering customers new services, including interactive television. These new services could be based on Web technology. Internet content providers, in turn, gain access to a large number of potential customers. Many of these potential customers do not use the Internet and therefore cannot be reached by that means.

Examples of the applications that could be offered include access to the Internet via digital television, or the use of Web technology for annotating broadcast television. In the first case, part of the bandwidth provided is shared among all the customers to access the Internet. In the second case, Web technology gives the enhanced television contents a certain degree of interactivity.

Among the different initiatives currently under way with respect to "television and Web", we can cite the following. MHEG is a family of ISO standards that deal with the coding of hypermedia contents. It includes the definition of multimedia objects, a declarative language for presenting multimedia contents, and a scripting language for data processing in MHEG applications.

ATVEF is an industry group that includes many companies interested in interactive television, among them CNN, Disney, Intel, and Microsoft. ATVEF is developing a specification that supports the presentation of so-called "HTML-enhanced" television contents. The specification is composed of announcements of the programming, triggers that define the actions to take and the location of the contents, and the multimedia contents themselves.

Finally, the World Wide Web Consortium has created a "Television and the Web" Interest Group. Inside this interest group, several activities around the integration of the Web and digital television are performed.

We are witnessing an explosion of multimedia and interactive services. This is causing a whole plethora of standards to come up, or existing standards to try to adapt to the new media. HTML belongs to the former case, and JPEG and MPEG are examples of the latter. Given this scenario, the need to standardize a higher-level interface than the current standards naturally arises. MHEG, which stands for Multimedia and Hypermedia information coding Expert Group, is a set of standards under development by ISO that addresses the specification of platform-independent applications consisting of multimedia objects. Specifically, it focuses on:

An MHEG application consists mainly of declarative code that describes the objects that make it up. The application code is stored in servers that hand it to requesting clients. Neither the models nor the applications that are likely to make use of MHEG objects are defined by the standards. Possible scenarios include periodic broadcasting of Near Video on Demand or on-demand downloading of an electronic education application. The encoding of multimedia content is not part of the standards either; it is assumed that existing formats such as MPEG or AVI will be used.

On the client side, an MHEG engine parses the declarative code, produces the required on-screen presentation, and handles all user interaction. This engine should be supported on machines with minimal resources, such as set-top boxes. This is the reason why CPU-hungry tasks such as 3D imaging have been left out of the initial standard. The low-resource constraint implies that MHEG is not restricted to Web browsers, but is instead intended to serve as a basic form of encoding multimedia/hypermedia presentations to be transferred between pairs of heterogeneous machines, one acting as the server and the other(s) acting as the client(s).

MHEG shares with HTML the declarative approach, but while HTML is inherently a document description language, MHEG takes on the job of describing multimedia/hypermedia applications.

Similar standardization efforts have been made by other groups. This is the case of SMIL, which is an application of XML targeted at the synchronization of video and audio. MHEG should eventually emerge as the leading technology in the field of multimedia presentations.

MHEG's presence is imminent in the field of interactive TV, as it has been adopted by DAVIC. DAVIC is an international industry consortium whose purpose is to establish a common set of standards and protocols for the emerging digital interactive television.

The MHEG standard specifies the following notations to represent application components:

Future work

The WWW evolves very fast. Though JML, our XML language for journalism, is currently only present on our server side, it seems clear that XML browsers for the Internet will appear within a few months. That will be the moment to use a style sheet so that JML can be directly visualized in the client's browser instead of being transformed into HTML on the server side. Backward compatibility will be achieved by maintaining an HTML version for older browsers but, for these readers, personalization services will lose the benefits of XML.

The recent DOM standardization is also a very important milestone that will allow software agents implemented in JavaScript to run in any browser without depending on platform details, with much more flexibility and portability than current Dynamic HTML.

Acknowledgements

The work reported in this paper has been partially funded by the project TEL97-0788 of the Spanish CICYT. We wish to acknowledge fruitful discussions with our colleagues Peter T. Breuer, Pilar Diezhandino, Tony Hernández, Natividad Martínez, Tomás Nogales, A. Rodríguez de las Heras and Luis Sánchez of the Universidad Carlos III de Madrid. Useful assistance has been provided by El País Digital and Fundesco.

References