XML Headlines

The Journal's Latest Web Effort

James E. Gaskin
Special To Inter@ctive Week
 
08/03/1998
Interactive Week from ZDWire
Copyright (c) 1998 ZD Inc. All Rights Reserved.

By just about any measure, The Wall Street Journal Interactive Edition is a success. It's one of the leading online publications, with more than 200,000 paid subscribers generating more than 40 million page impressions per month. It's also one of the few online publications that makes money, charging print subscribers $29 and nonprint subscribers $49 for a year's subscription.

The Wall Street Journal Interactive Edition (www.wsj.com) also is a success when it comes to implementing the latest Web publishing tools.

The 2-year-old Interactive Edition is posted using a complex system based on JavaScript utilities, PERL scripts, Standard Generalized Markup Language (SGML) and a highly customized version of Microsoft Corp.'s Word 6.0. To this technology soup, the Interactive Edition recently added the new eXtensible Markup Language (XML).

The Interactive Edition's combination of technologies, particularly XML, gives the online editors unsurpassed flexibility and enables the publication's readers to download the day's top business stories to ever more portable devices, including 3Com Corp.'s PalmPilot.

In addition to document publishing tools, the Interactive Edition makes use of online audio and video technologies and plans to expand the site's customization features -- allowing readers to better tailor the content to their needs.

But, then, innovation isn't new to The Wall Street Journal. After all, says Alan Karben, associate director of interactive development at the Interactive Edition, "the Journal has been doing hypertext for over 50 years, with the summaries on Page 1 leading to more detailed stories inside."

However, being on the front lines -- whether to cover news or implement new technology -- is never easy.

"We'll try 10 new technologies to find four real good options," says Neil Budde, Interactive Edition editor.

Indeed, the Interactive Edition was not put off by the relative immaturity of XML, which was approved as a standard by the World Wide Web Consortium (www.w3.org) less than a year ago, or the current lack of vendor support for the nascent markup language.

"Living with something for two years makes you want faster and better," Budde says from behind his large wooden desk, the top of which is empty except for a black keyboard and a flat-screen monitor.

Front Lines And Headlines

Necessity was -- and continues to be -- a driving force at the Interactive Edition. When the newspaper first went online, HyperText Markup Language (HTML) editors were hard to find, and journalists had little or no experience with HTML.

"We made the definite decision to hire journalists, not HTML jockeys," says Managing Editor Rich Jaroslovsky.

The in-house developers took the tool with which reporters were most familiar, Microsoft Word, and began programming.

"We get Microsoft Word to do 95 percent to 98 percent of what's needed for generic SGML output," Karben says. This is done through macros and small utility programs written in Microsoft's own Basic scripting language for Word. "We call the process of turning Word-created text into Web content DJML," he says, referring to the nickname -- the Dow Jones Markup Language -- the newspaper has given to its online publishing tool kit.

Documents created in Word are saved in the Rich Text Format (RTF), which essentially is plain text with lots of structured formatting codes hidden from the writer.

JavaScript conversion programs touch almost every story, handling tasks such as formatting the byline to match The Wall Street Journal's print edition. PERL scripts and applications provide sidebars and better formatting for printed output.

The addition of XML -- which offers a way to describe new formatting methods for entire documents, rather than using a limited set of format codes as in HTML -- allows developers to create a "tag set," or labels, which define information within a file, separate from how the information is displayed.

The Interactive Edition uses common tags, such as headline, byline and company name, in almost every story. Writers click icons within the Interactive Edition's customized version of Microsoft Word, and the appropriate XML tags are inserted into the story automatically.
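
To make the idea of a tag set concrete, here is a minimal sketch of what a tagged story might look like and how a program could read the labels without caring how the story is displayed. The element names (story, headline, byline, company, body, p) and the sample text are assumptions made for illustration; the article does not give the actual DJML tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names and invented sample text. The article says headline,
# byline and company name are tagged, but not what the real elements are called.
STORY_XML = """
<story>
  <headline>Chip Maker Posts Record Profit</headline>
  <byline>Jane Roe</byline>
  <body>
    <p><company>Example Corp.</company> reported record quarterly profit.</p>
    <p>Analysts had expected less.</p>
  </body>
</story>
"""


def describe(xml_text):
    """Pull the labeled fields out of a story, ignoring presentation."""
    root = ET.fromstring(xml_text)
    return {
        "headline": root.findtext("headline"),
        "byline": root.findtext("byline"),
        # Every <company> element, wherever it appears in the story.
        "companies": [c.text for c in root.iter("company")],
    }


if __name__ == "__main__":
    print(describe(STORY_XML))
```

Nothing in the markup says how large the headline should be or where the byline goes; those decisions are left to whatever program formats the story later.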

The Konstructor Suite from OmniMark Technologies Corp. (www.omnimark.com) helps convert the RTF files, relying on the special codes and instructions, into XML text. XML is an extremely simple dialect of SGML, optimized for Web use. XML enables generic SGML -- a decade-old standard used to present complex documents electronically -- to be processed on Web servers as simply as with HTML.

"OmniMark interacts with legacy database and existing systems, using XML and SGML parsing on lots of text," says Andrew Kowal, product manager at OmniMark. "The OmniMark batch application process takes articles stored in SGML and spits out HTML Web pages for the Journal's Interactive [Edition] server," or, he says, for whatever format is necessary -- such as for the tiny Web clients in handheld computers.

Indeed, the flexibility of XML was demonstrated recently when The Wall Street Journal arranged to have its news presented to PalmPilot users via the AvantGo format.

AvantGo Inc. (www.avantgo.com) provides information to the new breed of palm-size computers. Subscribers download information formatted for the limited Web browser abilities in these handheld devices to their PCs, then sync them to their palm computers for later reading.
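
The one-source, many-outputs approach described above can be sketched in a few lines. This is not the OmniMark process or the AvantGo format, neither of which the article details; it is a rough Python illustration, with hypothetical element names and invented sample text, of how one tagged story could feed both a full Web page and a trimmed-down version for a small screen.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical markup and invented sample text, for illustration only.
STORY_XML = """<story>
  <headline>Chip Maker Posts Record Profit</headline>
  <byline>Jane Roe</byline>
  <body>
    <p>Example Corp. reported record quarterly profit.</p>
    <p>Analysts had expected less.</p>
  </body>
</story>"""


def to_html(root):
    """Full Web rendering: headline, byline and every paragraph."""
    paras = "".join(f"<p>{escape(p.text or '')}</p>" for p in root.iter("p"))
    return (
        f"<h1>{escape(root.findtext('headline', ''))}</h1>"
        f"<address>By {escape(root.findtext('byline', ''))}</address>"
        f"{paras}"
    )


def to_handheld(root, max_paras=1):
    """Compact plain-text rendering: headline plus the first few paragraphs."""
    lines = [root.findtext("headline", "").upper()]
    lines += [(p.text or "").strip() for p in list(root.iter("p"))[:max_paras]]
    return "\n".join(lines)


if __name__ == "__main__":
    story = ET.fromstring(STORY_XML)
    print(to_html(story))
    print(to_handheld(story))
```

Both functions read the same parsed story; only the output rules differ, which is why adding a new delivery format need not mean touching the stored articles.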

XML also allows intelligent searching within the Interactive Edition archive. Searching based on XML tags provides better information by treating the tagged data separately from other data.

"If you want to see all the stories written by Rich Jaroslovsky, rather than stories that mention him by name, our search uses the byline XML tag," Karben says. Similarly, searching for the company name tag of IBM Corp. will return stories about IBM, but not stories in which IBM is mentioned in passing.

Full-text search functions are provided by Verity Inc. (www.verity.com) on one of four full-time servers running the text search engine.
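
The difference between searching a tag and searching the whole text can be shown with a small sketch. It does not use Verity's engine or the Journal's archive; the two stories, the author names and the element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two invented stories: one written by John Doe, one that only mentions him.
STORIES = [
    "<story><byline>John Doe</byline><body>Markets rose.</body></story>",
    "<story><byline>Jane Roe</byline><body>John Doe spoke at a panel.</body></story>",
]


def by_author(stories, name):
    """Tag-based search: keep stories whose byline element equals the name."""
    return [s for s in stories if ET.fromstring(s).findtext("byline") == name]


def mentions(stories, name):
    """Naive full-text search: keep stories containing the name anywhere."""
    return [s for s in stories if name in "".join(ET.fromstring(s).itertext())]


if __name__ == "__main__":
    print(len(by_author(STORIES, "John Doe")))
    print(len(mentions(STORIES, "John Doe")))
```

Here by_author returns only the first story, while mentions returns both, because the second story names John Doe in its body text but not in its byline.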

Next Step: Browsers

The next leap forward for XML will be in the next browser releases from Netscape Communications Corp. and Microsoft, Karben says.

"Direct XML support in the browser makes it possible to harness the intelligence built into the data right on the desktop," he says, adding that the appropriate style sheets and scripting commands must be included with the data.

Customization will be expanded, Budde says, with the Interactive Edition offering more displays tied to individual subscribers' interests. "Different users may see different front pages in the future, with highlights for that person," he says. "This may mean we make bits and pieces of the pages, and let the server assemble them on the fly."

What's next?

"Dynamic HTML [DHTML] and more XML use are coming," Karben says. "We want things both faster and more stable." DHTML is a series of platform- and language-neutral interfaces that allow Web programmers to design highly stylized pages that can respond to commands of individual users.

Feedback from subscribers tells Budde's group they're doing well, but readers are using the site in ways no one internally ever considered. "We found that many readers turn from the print edition to read the same article on the Interactive Edition," Budde says. "They tell us it's easier to print from the Web than to cut out and keep articles from the print edition. We never thought of that."

Insite

COMPANY: The Wall Street Journal Interactive Edition

LOCALE: New York

MISSION: Be the best news site on the Web

PROJECT LEADERS: Neil Budde, Editor; Rich Jaroslovsky, Managing Editor

HARDWARE: Sun Microsystems Inc. servers

SOFTWARE: Microsoft Corp.'s Word 6.0; Netscape Communications Corp.'s SuiteSpot Enterprise Server with Publishing Suite; OmniMark Technologies Corp.'s Konstructor; and Verity Inc.'s search capabilities

TACTICS: Use resources of the print edition and the latest technologies -- including eXtensible Markup Language -- to deliver the day's business stories

BOTTOM LINE: It's not just The Wall Street Journal on the Web, it's The Wall Street Journal of Web sites.

[Photo caption: What's News? Budde (far right) and Karben (below) pounce on Web tools. Webmasters disguised as editors.]