Mass-customizing electronic journals

Introduction

The current Web chaos can be considered a consequence of HTML. HTML allows content, presentation (CSS), and behaviour (Java and JavaScript files) to be separated, but it is not rich enough to describe the logical structure of a document and does not take full advantage of the processing capabilities available on the client side. New Web technologies such as XML, XSL, XLL, DOM, Java, and JavaScript solve some of these problems.

One of our main design decisions is to use XML to define our own markup language, JML (Journalism Markup Language), to tag the journal content, its logical structure, and its metadata. This makes news articles self-describing, which in turn allows more precise search criteria to be applied.

Likewise, we are defining JPML (Journalism Preferences Markup Language), also based on XML, to specify the user's interests. The reader of a newspaper indicates his or her preferred topics. These are saved in a JPML document so that the system shows the reader only those pieces of JML news that match his or her preferences.

The introduction of metadata into the news media affects the way news is created, selected, and retrieved. Journalists are no longer constrained by the physical amount of space in the printed newspaper and have new ways to present information (multimedia content). Journalists need to indicate the importance level of every news element according to the reader's characteristics. New tags and attributes are needed for highlighting text in a personalized manner, defining target readers, and indicating the expected level of importance assumed by the journalist.

Since journalists have to include JML tags in their articles, a JML editor should be provided for them. New IBM tools that deal with XML include a program that generates an XML editor adapted to a user-defined DTD. The automatically generated JML editor can be extended with JDBC routines that insert the JML documents into a SQL database. The editor can also manage images for illustration and advertising.

The next sections describe the benefits of using an XML technology such as XSL in the journal generation process, some journal personalization details used in our project, and JML and JPML as proposed XML applications for news markup and personalization markup. Finally, we present some details about the joint evolution of Web technology and digital TV, together with conclusions and future work.

Journalism Markup Language (JML)

The purpose of this markup language is to properly tag the journal's contents and its metadata so that four different aims can be achieved.

News articles can be "self-describing" so that they are properly handled in the personalization process.
The news archive can be searched by combining matching criteria to produce refined results, rather than returning every news article that merely contains the search term somewhere in its text.
Different style rules can be applied to the same document, so that the same document can be viewed with a different layout in a personalized manner.
Journalists require a method for indicating the importance level they consider each news element to have, possibly depending on the kind of reader. New tags and attributes are needed for highlighting text in a personalized manner, defining target readers, and indicating the expected level of importance assumed by the journalist.

The first figure below shows a reduced version of the JML DTD grammar, and the second shows a small example of a news article tagged in JML.

<!ELEMENT JML (JML_AUTHOR, JML_PLACE?, JML_DATE?,
	JML_TITLE, JML_ABSTRACT?, JML_BODY)>

<!ELEMENT JML_AUTHOR EMPTY>
<!ATTLIST JML_AUTHOR value CDATA #IMPLIED>

<!ELEMENT JML_PLACE EMPTY>
<!ATTLIST JML_PLACE value CDATA #IMPLIED>

<!ELEMENT JML_DATE EMPTY>
<!ATTLIST JML_DATE value CDATA #IMPLIED>

<!ELEMENT JML_TITLE (#PCDATA)>

<!ELEMENT JML_ABSTRACT (#PCDATA)>

<!ELEMENT JML_BODY (#PCDATA|P)*>

<!ELEMENT P (#PCDATA|B|I)*>
<!ATTLIST P importance_level CDATA #IMPLIED>

<!ELEMENT B (#PCDATA)*>
<!ELEMENT I (#PCDATA)*>

JML DTD grammar

<?xml version="1.0"?>
<!DOCTYPE JML SYSTEM "jml.dtd">
<JML>
  <JML_AUTHOR value="Maruja Torres"/>
  <JML_PLACE value="Madrid"/>
  <JML_DATE value="09-06-1998"/>
  <JML_TITLE>This is the title</JML_TITLE>
  <JML_ABSTRACT>This is the abstract</JML_ABSTRACT>
  <JML_BODY>
        <P importance_level="general">  
        This <B>is</B> the body</P>
  </JML_BODY>
</JML>

Example of JML document
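
As an illustration of how such a self-describing article could be consumed by a personalization component, the following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The function name read_article and the dictionary layout are assumptions made for this example, not part of JML. The DOCTYPE line is omitted from the sample because ElementTree does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# The JML example from the figure above, without the DOCTYPE declaration.
JML_SAMPLE = """<?xml version="1.0"?>
<JML>
  <JML_AUTHOR value="Maruja Torres"/>
  <JML_PLACE value="Madrid"/>
  <JML_DATE value="09-06-1998"/>
  <JML_TITLE>This is the title</JML_TITLE>
  <JML_ABSTRACT>This is the abstract</JML_ABSTRACT>
  <JML_BODY>
    <P importance_level="general">
    This <B>is</B> the body</P>
  </JML_BODY>
</JML>"""


def read_article(jml_text):
    """Return the metadata and paragraphs of one JML article as a dict.

    Hypothetical helper: the returned layout is chosen for this sketch only.
    """
    root = ET.fromstring(jml_text)

    def attr_value(tag):
        # JML_AUTHOR, JML_PLACE and JML_DATE carry their data in a "value" attribute.
        element = root.find(tag)
        return element.get("value") if element is not None else None

    paragraphs = []
    for p in root.iter("P"):
        # itertext() flattens inline B and I markup; split/join normalizes whitespace.
        text = " ".join("".join(p.itertext()).split())
        paragraphs.append((p.get("importance_level"), text))

    return {
        "author": attr_value("JML_AUTHOR"),
        "place": attr_value("JML_PLACE"),
        "date": attr_value("JML_DATE"),
        "title": root.findtext("JML_TITLE"),
        "abstract": root.findtext("JML_ABSTRACT"),
        "paragraphs": paragraphs,
    }


article = read_article(JML_SAMPLE)
print(article["author"], "-", article["title"])
for level, text in article["paragraphs"]:
    print(f"[{level}] {text}")
```

Keeping the importance level next to each paragraph lets a later step decide, per reader, which paragraphs to show or highlight.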

Journalism Personalization Markup Language (JPML)

JPML has been defined to specify the user's interests. Preferences determine the way headlines are shown (highlighted, collapsed, inline, linked, ...). However, the reader can also make explicit requests that do not match the preferences. The first figure below shows a simple example of a reader's preferences, and the second specifies the DTD grammar for this markup language.

<?xml version="1.0"?>
<!DOCTYPE JPML SYSTEM "jpml.dtd">
<JPML>
<RULE>
        <ATOM key="keyword" value="euro"/>
        <ATOM key="section" value="finances" negated="true"/>
</RULE>
<RULE>
        <ATOM key="keyword" value="Real Madrid"/>
        <ATOM key="keyword" value="Champions League"/>
</RULE>
<RULE>
        <ATOM key="author" value="Clark Kent"/>
        <ATOM key="keyword" value="Ecology"/>
</RULE>
</JPML>

JPML example

<!ELEMENT JPML (RULE)*>

<!ENTITY % match "(starts_with|ends_with|substring|fullword|is_equal_to)" >
<!ELEMENT RULE (ATOM)*>
<!ATTLIST RULE
         enabled (true|false) "true"
         description CDATA #IMPLIED
         action CDATA #IMPLIED
>

<!ELEMENT ATOM EMPTY>
<!ATTLIST ATOM
         key CDATA #REQUIRED
         value CDATA #REQUIRED
         ignorecase (true|false) "true"
         ignoreaccents (true|false) "true"
         negated (true|false) "false"
         matching %match; "substring"
>

JPML DTD

The meaning of condition attributes is described below:

key: defines the name of the metadata field to which the matching criterion is applied. Possible values for this attribute are title, section, author, keywords, date, source, ...
value: defines a value specified by the user that is compared against the value of the key metadata field.
ignorecase: defines whether values are folded to uppercase before being compared, removing the need for an exact case match.
ignoreaccents: defines whether orthographic accents should be ignored in the comparison.
negated: reverses the condition.
matching: defines the criterion that is applied between the value and the key's value. Although the least restrictive criterion, "substring", is the default, other criteria can be specified, such as "fullword" for matching whole words, or "starts_with" and "ends_with", which require the specified value to be found at the beginning or the end of the metadata field. This allows people to search for and/or highlight headlines whose author is Clark Kent, whose title starts with Clinton, or whose keywords contain Iraq.
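
The attribute semantics above can be sketched as a single comparison function. This is only an illustrative Python sketch under stated assumptions: the names strip_accents and atom_matches are hypothetical, accents are removed through Unicode decomposition, and a "whole word" is taken to be a span delimited by regular-expression word boundaries. The source does not prescribe any of these choices.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining marks so that accented and plain letters compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def atom_matches(field_value, value, matching="substring",
                 ignorecase=True, ignoreaccents=True, negated=False):
    """Evaluate one ATOM against the text of the metadata field named by its key.

    The keyword defaults mirror the defaults declared in the JPML DTD.
    """
    if ignoreaccents:
        field_value, value = strip_accents(field_value), strip_accents(value)
    if ignorecase:
        # Fold both sides to uppercase, as described for the ignorecase attribute.
        field_value, value = field_value.upper(), value.upper()

    if matching == "starts_with":
        result = field_value.startswith(value)
    elif matching == "ends_with":
        result = field_value.endswith(value)
    elif matching == "fullword":
        # Assumption: whole words are delimited by regex word boundaries.
        pattern = r"\b" + re.escape(value) + r"\b"
        result = re.search(pattern, field_value) is not None
    elif matching == "is_equal_to":
        result = field_value == value
    elif matching == "substring":
        result = value in field_value
    else:
        raise ValueError(f"unknown matching mode: {matching}")

    # The negated attribute reverses the outcome of the condition.
    return not result if negated else result


print(atom_matches("Clinton visits Madrid", "clinton", matching="starts_with"))
print(atom_matches("Crisis política en Europa", "politica"))
print(atom_matches("The European market", "euro", matching="fullword"))
```

With the default "substring" criterion the last call would succeed, since "euro" occurs inside "European"; choosing "fullword" is what rules it out.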

Besides that, rules also define the following attributes:

enabled: for enabling/disabling the rule.
description: a short description of the rule, supplied by the user.
action: the action to be performed when the rule is activated. Possible values for this attribute are highlight, iconify, hide, open in full window.
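
A possible way to turn a JPML document into decisions is sketched below in Python. Several points are assumptions made for the example rather than statements from the JPML definition: a rule is taken to fire only when it is enabled and all of its atoms hold (a conjunction), an article is represented as a plain dictionary of metadata fields, and atom_holds is a deliberately simplified check (case-insensitive substring plus negation). The sample rules and articles are invented. Because xml.etree.ElementTree does not read the DTD, the attribute defaults are applied by hand.

```python
import xml.etree.ElementTree as ET

# Defaults copied from the JPML DTD; ElementTree does not apply them itself.
ATOM_DEFAULTS = {"ignorecase": "true", "ignoreaccents": "true",
                 "negated": "false", "matching": "substring"}

# Invented sample: one active rule and one disabled rule.
JPML_SAMPLE = """<JPML>
<RULE description="Euro news outside finances" action="highlight">
  <ATOM key="keyword" value="euro"/>
  <ATOM key="section" value="finances" negated="true"/>
</RULE>
<RULE enabled="false" action="hide">
  <ATOM key="author" value="Clark Kent"/>
</RULE>
</JPML>"""


def load_rules(jpml_text):
    """Parse a JPML document into a list of rule dictionaries."""
    rules = []
    for rule in ET.fromstring(jpml_text).iter("RULE"):
        atoms = [{**ATOM_DEFAULTS, **atom.attrib} for atom in rule.iter("ATOM")]
        rules.append({
            "enabled": rule.get("enabled", "true") == "true",
            "description": rule.get("description"),
            "action": rule.get("action"),
            "atoms": atoms,
        })
    return rules


def atom_holds(atom, article):
    """Simplified check: case-insensitive substring match plus negation only."""
    field = article.get(atom["key"], "")
    found = atom["value"].lower() in field.lower()
    return found != (atom["negated"] == "true")


def actions_for(article, rules):
    """Collect the actions of every enabled rule whose atoms all hold (assumed AND)."""
    return [rule["action"] for rule in rules
            if rule["enabled"] and all(atom_holds(a, article) for a in rule["atoms"])]


rules = load_rules(JPML_SAMPLE)
sports_piece = {"keyword": "euro, football", "section": "sports",
                "author": "Clark Kent"}
finance_piece = {"keyword": "euro, inflation", "section": "finances",
                 "author": "Maruja Torres"}
print(actions_for(sports_piece, rules))
print(actions_for(finance_piece, rules))
```

Returning a list of actions rather than a single one leaves open how conflicts between rules (for example, highlight and hide on the same headline) should be resolved.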

Integration of Web Technology in Digital Television

There is currently considerable activity around the integration of Web-based technology in digital television. This integration offers advantages both to Internet content providers and to digital television companies. Digital television makes it possible to integrate audio, video, and data in real time, and it provides processing capabilities at the customer's location by means of set-top boxes. This opens the possibility of offering customers new services, including interactive television. These new services could be based on Web technology. Internet content providers, in turn, gain access to a large number of potential customers, many of whom do not use the Internet and therefore cannot be reached by that means.

Examples of the applications that could be offered include access to the Internet via digital television and the use of Web technology for annotating broadcast television. In the first case, part of the available bandwidth is shared among all the customers to access the Internet. In the second case, Web technology gives the enhanced television contents a certain degree of interactivity.

Among the different initiatives currently being carried out with respect to "television and Web" we could cite the following. MHEG is a family of ISO standards that deal with the coding of hypermedia contents. It includes the definition of multimedia objects, a declarative language for presenting multimedia contents, and a scripting language for data processing in MHEG applications.

ATVEF is an industry group that includes many companies interested in interactive television, among them CNN, Disney, Intel, and Microsoft. ATVEF is developing a specification that supports the presentation of so-called "HTML-enhanced" television contents. It is composed of announcements of the programming, triggers that define the actions to take and the location of the contents, and the multimedia contents themselves.

Finally, the World Wide Web Consortium has created a "Television and the Web" Interest Group. Inside this interest group, several activities around the integration of the Web and digital television are performed.

We are witnessing an explosion of multimedia and interactive services. This is causing a whole plethora of standards to come up, or existing standards to try to adapt to the new media. HTML belongs to the former case, and JPEG and MPEG are examples of the latter. Given this scenario, the need to standardize a higher-level interface than the current standards naturally arises. MHEG, which is an acronym for Multimedia and Hypermedia information coding Expert Group, is a set of standards under development by ISO that address the specification of platform-independent applications consisting of multimedia objects. Specifically, it focuses on:

Synchronization in space and time of these multimedia objects
User interaction via links and user interface elements such as menus, buttons and text entry fields

An MHEG application consists mainly of declarative code that describes the objects that make it up. The application code is stored on servers that hand it to requesting clients. Neither the models nor the applications that are likely to make use of MHEG objects are defined by the standards. Possible scenarios include periodic broadcasting of Near Video on Demand or on-demand downloading of an electronic education application. The encoding of multimedia content is not part of the standards either; it is assumed that existing formats such as MPEG or AVI will be used.

On the client side, an MHEG engine parses the declarative code, produces the required on-screen presentation, and handles all user interaction. This engine should be supported on machines with minimal resources, such as set-top boxes. This is the reason why CPU-hungry tasks such as 3D imaging have been left out of the initial standard. The low-resource constraint implies that MHEG is not restricted to Web browsers; instead, it is intended to serve as a basic form of encoding multimedia/hypermedia presentations to be transferred between heterogeneous machines, one acting as the server and the other(s) acting as the client(s).

MHEG shares with HTML the declarative approach, but while HTML is inherently a document description language, MHEG takes on the job of describing multimedia/hypermedia applications.

Similar standardization efforts have been made by other groups. This is the case of SMIL, an application of XML targeted at the synchronization of video and audio. We expect MHEG to eventually emerge as the leading technology in the field of multimedia presentations.

MHEG's presence in the field of interactive TV is imminent, as it has been adopted by DAVIC. DAVIC is an international industry consortium whose purpose is to establish a common set of standards and protocols for the emerging digital interactive television.

The MHEG standard specifies the following notations to represent application components:

ASN.1 - this notation was the first one to be developed. Although application components are unambiguously expressed in ASN.1, it is not considered friendly enough to be read by humans, so the following alternate notation was developed.
Textual Notation - this notation was developed to overcome the problems with ASN.1. It does not add new features; there is a one-to-one mapping between the ASN.1 and textual notations.
XML - currently under development, this notation is targeted at the Web world. It is expected to attract a wider user community due to the growing acceptance of XML. An earlier effort to define an SGML-based encoding was cancelled due to lack of resources.
