Behind the Scenes at the WSJ Interactive Edition

By Liora Alschuler
 
04/01/1997
The Seybold Report on Internet Publishing
(COPYRIGHT 1997 Seybold Publications Inc.)
Copyright 1997 Information Access Company.
All rights reserved.

Visiting the Dow Jones offices where editors create the Interactive Edition of the Wall Street Journal, we found a good example of how to build a contemporary editorial system for online news publishing. Mixing off-the-shelf and home-grown components, the system shows how structured markup and WYSIWYG text editing can fit together in the automation of an online newspaper.

Perhaps history's lead for the Wall Street Journal Interactive Edition will always be that it was the first online paper to give nothing away free, but when we visited its offices recently we found more of interest than just the price of admission. Behind the toll gate, Dow Jones has built an editorial production system that is easy to use, gives a high degree of control to writers and editors, carries over to the little flickering screen a good measure of the elegance of print typography, and builds a war chest of reusable content that will last long after the landfills have turned the paper's folios into a reusable substance.

The Interactive Journal, as its staff call it, has been shaped by two forces: the print paper and the long history of online publishing by Dow Jones. On the one hand is the legacy of the print edition, with its delicate type and high (some say highest) premium on the quality of the written word. The retro refusal to go the way of USA Today must strike a responsive chord in readers: the paper was one of the few major dailies to increase its circulation in 1996. On the other hand, Dow Jones Newswires puts a premium on instant delivery of data, with little regard for presentation and a minimum of editorial intervention. The crew that started the Interactive Journal has roots in both the paper and online database publishing.

One of the key, early decisions was to send Alan Karben, then a graduate student, to the GCA's annual sgml conference in 1993. On his first business trip, Karben, in his words, "got both the practical and the religious sides of SGML." On his return, he found a receptive audience. Neil Budde, editor of the Interactive Journal, had come to the project from Dow Jones News Retrieval. Budde immediately saw the benefit of searching for a byline inside a dedicated byline element instead of within generic formatting tags. Managing editor Rich Jaroslovsky, who spent 18 years as a reporter and editor and filed a front-page column from Washington for nine years, liked the idea of structured markup as long as his writers and editors would not need to jump through hoops to make it work.

Given this mandate, Karben, who now works for Dow Jones full time, created an innovative, sleek editorial system, one that designers of Web publishing tools and systems would do well to examine.

As readers of Seybold publications are aware, the bedrock of sgml is the separation of content and structure from the codes that specify format and representation. It is a separation not often made in newspapers. While the Interactive Journal editorial system leverages this in many ways (and the staff received a payback for it almost immediately), the customized version of Word renders a nearly wysiwyg version of how the story will look online. The composition end of the system uses article metadata such as placement and article type to impose hundreds of variations in style. Writers can preview how the piece will look in any section of the paper at any time before filing it. The end result is that the structured markup with layered templates and style sheets gives greater control over the final html than would be possible working in html directly.

A look at the product

From the outset, Budde and Jaroslovsky never intended to duplicate the print product. Their objective was to use the online medium to bring the paper's editorial excellence and breadth of coverage to a new audience and to enhance and expand their coverage in ways appropriate to the new medium. They wanted to create an online newspaper, not a library of articles, so the look and feel of the Interactive Edition has been a primary concern from the beginning. They have demanded that, to the extent possible, the visual presentation not be sacrificed for the ease of batch composition.

Starting in mid-1993, the team spent about a year planning for delivery through proprietary client software. In early 1995, the group shifted its plans to the Web while work on the editorial and archival system continued uninterrupted. There was some temptation at the time to move the entire project to html, but they maintained their belief in the long-term advantages of sgml. A prototype publication, called Money and Investing Update, focused primarily on breaking business news and updated market information, was launched in July 1995, and the full Interactive Journal was inaugurated on April 29, 1996.

Standard pages, with personal options. Online subscribers open the daily editions at the front page, where the familiar "What's News" summaries are hyperlinked to the full stories. Each section (Front, Marketplace, Money, Sports) has its own "front page" with submenu and summaries linked to associated articles.

Often a summary can show up on several pages, with each one linked to a single master version of the full article. A simple, hyperlinked table of contents gives an overview of the paper.

The "interactive" portion of the title refers to the Personal Journal, the Portfolio, the online discussion groups and other customizable areas of the paper. In the Personal Journal, the subscriber can set up a profile of stories ranked by interest according to key words, company names and Journal features. Selecting the Personal Journal displays the list of recent articles that match the profile. Note that articles come from both Dow Jones and the Wall Street Journal, including its European and Asia editions, as well as from special Interactive Journal features.

Subscribers also have the option of setting up a portfolio that tracks and reports on up to 30 stocks and mutual funds. Other sections of the paper come directly from data feeds. In addition to a 14-day text-search archive of Journal and Dow Jones stories, subscribers to the Interactive Journal get access to the Dow Jones News Retrieval database, although direct links between the Journal articles and the database are limited.

24 hours, 365 days, 20 markets. Unlike print editions, for which writers and editors have one, maybe two deadlines a day, the Interactive Journal is in a near-constant state of renewal. It goes through a complete roll-over in the early hours of the morning, when the front page banner date changes and multiple stories are swapped in and out. But this new "edition" is never static. There is an ongoing rolling in of content, as third-shift editors insert news on the Far Eastern markets, which are in full swing while New York sleeps. News that breaks in the morning that won't appear on newsstands for 24 hours is brought online as quickly as possible.

While some online papers shy away from scooping their paper counterpart, the Interactive Journal is delighted to get the news out fast. At times, the Interactive Journal will develop and publish a story while the paper reporters are still hours away from filing parallel stories. Once the print story is complete, the Interactive Journal will replace its original coverage with the later story.

The Interactive Journal is an international service, with a staff of about 40 reporting and updating stories on 20 global markets every day. While the print Journal draws from the same sources, its news hole for global markets is only 20-25 inches and is localized for each edition; print subscribers rarely see the full scope of coverage.

System design

There are three primary components to the Interactive Journal's editorial system:

* Microsoft Word customized with macros, templates and keyboard shortcuts for assigning stories and summaries to sections of the publication;

* Edition Maintenance, a database application that keeps track of all of the pieces associated with each day's edition and positions stories and summaries in each edition; and

* A series of conversion routines that take the rtf through two styles of sgml, parse it, archive the sgml and apply html formatting to create the final output sent to the Web server.

In addition to these components, there is an underlying database, called Copy Flow, that tracks slug, story type, section desk and revision times and manages check-in/check-out. Copy Flow was designed under the auspices of the Dow Jones Global News Management System Team, which includes integrator EDS, for the print edition, but the Interactive Journal is the first group to use it in production. (The print publication staff will make the migration at some point, when Dow Jones replaces the current IMOS and CSI systems with the new one under development.)

Structured editing with feedback. Writers and editors work in a heavily customized, keyboard-friendly version of Microsoft Word. They use templates with paragraph and in-line styles named for the type of content they contain, and they make limited use of hidden text for layout instructions. Users can show or hide tags.

The word processor is optimized for fast keyboard entry of precise, structured markup. To link to a profile of a company mentioned in text, for example, the writer highlights the name in text and uses a keystroke combination or a toolbar icon to invoke the Link To Snapshot dialog. The writer inserts the ticker symbol (if there is one) and accepts or changes the significance ranking, which can determine the relevance ranking a particular story receives in a search. A similar dialog speeds byline entry. Writers can link their articles to other current stories, archived stories, urls or other points within the current story. Comments between writers and editors are stripped out before the story is archived and published.

Story metadata are entered into a Document Attributes dialog that classifies the story and captures the information required for routing it into a user's Personal Journal. As the writer selects categories, starting with Section on the left, Page and Type are populated with defaults. The writer can accept, augment or override these defaults. Writers who know the two-letter industry codes for their beat can use the type-in Rapid Industry Entry field. This information is exported with the rtf within the Summary Info carried by all Word documents.

Writers are more likely to add information (Industry Type) that is not part of their narrative if it has a direct bearing on the usability of the story. Links to company profiles, which are prominent in the online edition, are used frequently, while glossary entries, which won't be implemented until browsers have an easy way to do popup windows, are largely ignored.

The on-screen format in Word mimics the look of the final html pages. Karben has supplied an additional element of feedback reminiscent of Passage Systems' Passage Pro. Writers can click a preview button and get a very close simulation of how the story will look in a browser, with live icons and links between articles and summaries.

Tables, according to Karben, were the trickiest part of converting to sgml. The rtf table model is a limiting factor, but Karben has given writers the ability to put a paragraph or series of paragraphs within a cell, including soft returns, pictures and pictures with captions. A true sgml editor might provide additional capability, such as nested lists, but given the Journal's limited use of tables, so far Word has been sufficient.

Placing stories. When the author is done with a piece, it is exported to Edition Maintenance, where it shows up on the list of Ready to Place articles. Edition Maintenance consists of a Visual Basic front end on top of a UniSQL database server. A slot editor then places the article in the Working Edition according to section, page, and column. Placing the story is a drag-and-drop operation onto the hierarchical tree of the Working Edition. The slot editor can preview the Working Edition and prepublish an individual page, section or the entire edition.

When the slot editor is ready to update the online edition, he publishes it and it becomes the current online version. Once a day, when printed Journal content is published for the first time, the slot editor works from the print lineup. During most hours, the slot editor works directly in Edition Maintenance. In the evening, when a huge mass of material is moving through the system, a news assistant has physical control of placement.

SGML in two steps. Dragging a story from the Ready-to-Place list to the hierarchical edition structure triggers the first conversion from rtf to a simple sgml document type dubbed Dow Jones Markup Language or DJML-Lo. Parsing errors detected during the OmniMark conversion are reported to the editor. Karben uses conversion software from OmniMark Technologies to rewrite some error messages for nontechnical users. (Some contain his pager number.) In all cases, the parser returns the line of text that caused the error.

* * *

Sample parse error (Courtesy Alan Karben)

One line of this article did not translate into "Valid" DJML.

By Christina Binkley

In this part of the document, you are not allowed to have a BREAK element.

* * *
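The rewriting of parser messages for nontechnical users can be sketched roughly as follows. This is only an illustration in Python, not Karben's OmniMark code: the raw message format matched by NOT_ALLOWED and the function name friendly_error are assumptions, since the article shows only the finished message.

```python
import re

# Hypothetical raw parser message format; the actual OmniMark messages
# are not shown in the article.
NOT_ALLOWED = re.compile(r'element "(?P<name>[A-Z0-9]+)" not allowed here')


def friendly_error(raw_message, offending_line):
    """Turn a raw parse error into a message an editor can act on.

    The offending line of text is always echoed back, as the article
    says the parser does in all cases.
    """
    match = NOT_ALLOWED.search(raw_message)
    if match:
        detail = (
            "In this part of the document, you are not allowed "
            "to have a %s element." % match.group("name")
        )
    else:
        # Unknown message: pass it through rather than guess.
        detail = "The converter reported: " + raw_message
    return "\n\n".join([
        'One line of this article did not translate into "Valid" DJML.',
        offending_line,
        detail,
    ])


print(friendly_error('element "BREAK" not allowed here', "By Christina Binkley"))
```

Messages that do not match a known pattern are passed through unchanged, which keeps the slot editor informed even when no friendly wording has been written for that case.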

Immediately following the conversion to DJML-Lo is a second conversion to DJML-Hy (for HyTime, the ISO standard for sgml linking and entity management). DJML-Hy substitutes entities (e.g., %plus;) for special characters (+) and ids for path names and removes all tag minimization. Karben chose HyTime conventions because they describe one-to-many links and make it easier to manage entities. The subobjects in the Edition Maintenance application are taken directly from the sgml entity files.

Converting and parsing a document takes about 10 seconds on the Sun Solaris server. The most common error is the accidental deletion of a portion of the hidden text, which makes it impossible to supply a complete set of tags during conversion. Karben reports that about three to four parse errors occur in a typical 24-hour period and that most can be handled routinely by the slot editor.

One change Karben would make if he were redesigning the system today would be to invoke this conversion on the move into Ready To Place instead of during placement in the edition. Earlier conversion would ensure that the original author was on hand to correct the error and would speed up the placement process. Karben noted that the general design principle ought to be to convert to sgml as early as possible and to convert out of it as late as possible.

Karben said he was "looking for what was neat in the logical sense and was not jaded by what was available in current tools." He was confident that he could build anything required to "make the data do wonders." He hopes that the xml standard will encourage the creation of new tool sets that take advantage, as he has, of sgml/HyTime linking mechanisms. (For more on XML, see story on page 3.)

The resulting sgml archive, stored in the UniSQL database, contains the DJML-Hy markup but excludes the temporal pieces of the story such as requests for comments and pointers to online discussions. While most of the paper's pages are placed on the Web server as html, the individual views of the paper, such as the Personal Journal, are created dynamically from the sgml whenever the user invokes the custom pages.

On the preprocessing, authoring side, the conversion is done using programming tools from OmniMark Technologies. On the searching and Personal Journal side, Karben uses the set of Perl libraries put into the public domain by David Megginson of the University of Ottawa (www.uottawa.ca/~dmeggins). These libraries read the output of nsgmls, the command-line application built on the SP parser from James Clark (www.jclark.com). The implementation required an enhancement in SP for HyTime entity management, which Clark provided.

DJML does not use the News Industry Transfer Format (nitf) document type created by the International Press Telecommunications Council (www.iptc.org/iptc/). Nitf was designed for transmission between news agencies and as such has no element names for article, page and section. Karben believes that as long as Dow Jones can translate to and from the industry-standard document type, it loses nothing by using its own document type definition, created with help from sgml consulting firm Martin Hensel Corporation.

At present, the online editorial system loses all coding from the print side and from the wire services. When the wire services implement an sgml header, Karben's group will be able to take direct advantage of this, yet still add its own tags. It would be a tremendous advantage to know bylines, company names and headlines instead of using macros to make a best guess, which is what they are forced to do now. Karben's advice to developers is to think of their own content and to take advantage of the great translation tools available and the ease of translation between different forms of sgml. Ideally, they should capture at the time of creation everything needed to describe all the content of the article.

Turning words into home pages

With richly tagged sgml files as his source, Karben creates html with the use of down-translation scripts, also written in OmniMark. Rendering from an sgml source yields consistent html formatting without tedious hand tagging or complex manipulations. The translation software, written by Karben, knows the page, article and intended placement and applies one of ten common templates. Each template has html boilerplate that supplies header, footer, gifs and other standard page features. In this way, the editorial system renders 350 distinct article types, all within an easily controlled stylistic vocabulary. On an editor's preview screen, a single summary is rendered three different ways according to placement.

A single template can render a wide variety of styles contoured to fit the specific section and page. Each template has variables for section name, color scheme and other distinctive features. The conversion program pulls the context-appropriate values from three tables that correspond to the article type, page and section. The page type (e.g., Review & Outlook) determines formatting characteristics, such as column separators, column widths, headline suppression, column and summary logos, and whether ads are allowed. The article type (e.g., Heard on the Street) determines icons, headline size, logos and other characteristics.
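The table-driven approach described above can be sketched roughly as follows. This is an illustration in Python rather than the OmniMark scripts the Journal actually uses, and every table entry, variable name and html fragment in it is hypothetical; the article names the three tables but does not give their contents.

```python
from string import Template

# Hypothetical lookup tables keyed by section, page type and article type.
SECTION_TABLE = {
    "Front": {"section_name": "Front", "color": "#FFFFFF"},
}
PAGE_TABLE = {
    "Review & Outlook": {
        "column_width": 300,
        "show_headline": True,
        "ads_allowed": False,
    },
}
ARTICLE_TABLE = {
    "Heard on the Street": {"icon": "heard.gif", "headline_size": 4},
}

# One shared template; the variables are filled from the three tables.
PAGE_TEMPLATE = Template(
    '<BODY BGCOLOR="$color">\n'
    '<IMG SRC="$icon"> <FONT SIZE=$headline_size>$headline</FONT>\n'
    "<TABLE WIDTH=$column_width><TR><TD>$body</TD></TR></TABLE>\n"
    "$ad_slot</BODY>"
)


def render(section, page, article_type, headline, body):
    """Merge the three context tables and fill the shared template."""
    values = {}
    values.update(SECTION_TABLE[section])
    values.update(PAGE_TABLE[page])
    values.update(ARTICLE_TABLE[article_type])
    # Page-level switches decide whether optional pieces appear at all.
    values["headline"] = headline if values["show_headline"] else ""
    values["body"] = body
    values["ad_slot"] = "<!-- ad -->\n" if values["ads_allowed"] else ""
    return PAGE_TEMPLATE.substitute(values)


print(render("Front", "Review & Outlook", "Heard on the Street",
             "Sample headline", "Sample body text"))
```

Because the styling lives in the tables, a change such as a new color scheme for a section means editing one table row rather than touching every page.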

Karben explains that keeping the number of basic templates small makes the system easy to update and maintain. Recently, the paper contracted with the DoubleClick Network to manage all of its online advertising placement. Replacing the local ads with the DoubleClick source was accomplished in less than a day for all areas of the paper.

Editors can customize the layout of special features with a specialty template. They also have write permission for the published html pages, but rarely use it. When a writer wants something new, Karben reports, it is usually for the sake of consistency, such as a new separator or the application of asterisks or dashes to match an existing feature.

Rendering from sgml also means that the layout can associate a picture with its caption and move and place the two as a single unit.

Unfortunately, even with this attention to detail in formatting, we find the screen remains a tedious medium for extensive reading, and the online paper's mimicking of the print paper's column widths accentuates that weakness. The Interactive Journal does use Cascading Style Sheets, so readers can benefit wherever CSS is supported, but at present the browsers do not provide a convenient mechanism for subscribers to override the styles with their own, such as wider column widths.

The searching advantage. Consistency and automation are not the only benefits of table-driven markup from a rich source; searching is also improved. Karben provided this example.

Compare a byline in DJML:

* * *

<BYLINE>By <AUTHOR>Mark Robichaux</AUTHOR>

<CREDIT>Staff Reporter of <TITLE>The Wall Street Journal</TITLE></CREDIT></BYLINE>

* * *

with the html it gets converted to:

* * *

<B>By M<FONT size=-1>ARK</FONT>
R<FONT size=-1>OBICHAUX</FONT></B><BR>
<FONT size=-1>Staff Reporter of</FONT>
<FONT size=-1>T</FONT><FONT size=-2>HE</FONT>
<FONT size=-1>W</FONT><FONT size=-2>ALL</FONT>
<FONT size=-1>S</FONT><FONT size=-2>TREET</FONT>
<FONT size=-1>J</FONT><FONT size=-2>OURNAL</FONT>

* * *

which renders:

By MARK ROBICHAUX
Staff Reporter of THE WALL STREET JOURNAL

Rendering from an sgml source means that author Mark Robichaux is searchable as a single string, which may not be possible if the source is littered with formatting codes. It also means that once browsers get hip to more of the finer points of rendering, the small capitals can be applied with a single format command, as they would be in any reasonably adept composition system (or even Microsoft Word).
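The difference is easy to demonstrate with the two samples above. The following is a minimal sketch in Python; the function names are ours, and a regular expression stands in for the real sgml parser the Journal uses, which is adequate only for a fragment this simple.

```python
import re

# The two byline samples quoted above.
DJML = (
    "<BYLINE>By <AUTHOR>Mark Robichaux</AUTHOR>\n"
    "<CREDIT>Staff Reporter of <TITLE>The Wall Street Journal</TITLE>"
    "</CREDIT></BYLINE>"
)
HTML = (
    "<B>By M<FONT size=-1>ARK</FONT>\n"
    "R<FONT size=-1>OBICHAUX</FONT></B><BR>\n"
    "<FONT size=-1>Staff Reporter of</FONT>\n"
    "<FONT size=-1>T</FONT><FONT size=-2>HE</FONT>\n"
    "<FONT size=-1>W</FONT><FONT size=-2>ALL</FONT>\n"
    "<FONT size=-1>S</FONT><FONT size=-2>TREET</FONT>\n"
    "<FONT size=-1>J</FONT><FONT size=-2>OURNAL</FONT>"
)


def find_authors(djml):
    """Return the contents of every AUTHOR element (regex stand-in for a parser)."""
    return re.findall(r"<AUTHOR>(.*?)</AUTHOR>", djml, flags=re.S)


def naive_text_search(markup, name):
    """Case-insensitive substring search over the raw markup."""
    return name.lower() in markup.lower()


print(find_authors(DJML))                         # ['Mark Robichaux']
print(naive_text_search(HTML, "Mark Robichaux"))  # False
```

In the html, the font tags that fake the small capitals split the name into fragments, so a plain string search misses it. In the DJML, the name is intact and is also labeled as an author, so a search can distinguish stories by Robichaux from stories that merely mention him.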

Conclusions

The Wall Street Journal Interactive Edition has created an editorial system in which the professionals at the center of the enterprise, the journalists and editors, can continue to work much as they always have. Differences in editorial requirements, such as the need for tighter classification of stories, are imposed by the new medium, not by the technology used to create it. It is pleasing to see a system that uses structured markup and yet stays within the reach of an editorial staff that considers itself the best in the world and doesn't give a hoot for whatever is under the hood. For writers and editors, the greatest difference lies in the tempo and timing of their day: they are always on deadline.

As writers work, the Web technology is mostly hidden, but they can be called on for front-line troubleshooting. Budde and Jaroslovsky compare this stage of technology and interface design to the Model T stage of the automobile: "You didn't have to be a mechanic to drive, but it helped if you could open the hood and tell the carburetor from the gear box."

The Interactive Journal extends the Journal's rich typographic tradition into a new medium without creating a facsimile of the print edition. The firm has achieved this with a composition system driven by structured markup and batch composition that nevertheless preserves much of the interactivity of desktop publishing. Writers and editors determine the final look of a story by the way they define it and place it in the edition, and they can check their results with an immediate preview.

The control over the final look of the page through predetermined, context-driven templates has been achieved without sacrificing the ability to tweak it as needed and without compromising the paper's ability to take advantage of improvements to the Web browsers that ultimately format the material for the reader. The structured approach does impose some controls compared to the freedom of building pages from scratch, but Budde and Jaroslovsky want their writers to focus on words and logical links, not dropped caps. Given that journalists' overriding concern is getting the words right and only when that is accomplished (on deadline) do they care if the headline fits, we think the Interactive Journal has provided a good balance among ease of entry, control over hyperlinking and metadata, and fidelity to the composed screen.

Little downside. There is a price, however, for in-house development. Karben has stretched Word Basic to its limits, and admitted he is probably doing a little more than he should do with it. He's found that Word Basic cannot nest more than four subroutines and that there are limits to the macro interpreter. He can't create a popup dialog box from a popup and, of course, can't catch parsing errors until they are further downstream.

More importantly, Dow Jones is now responsible for maintaining and upgrading the one-off configuration management applications developed by EDS (Copy Flow and Edition Maintenance), in contrast with Word, which is already two upgrades past the 6.0 version in use at the Interactive Journal.

Another area that might be improved is indexing and retrieval. The current keyword entry with the Verity search engine seems more primitive than it need be, given the sophistication of the paper and archive. A Personal Journal view pulls from both Dow Jones News Retrieval and the Interactive Journal, but there is insufficient filtering of duplicates, so the reader can be inundated with multiple copies of the same story. It also would be nice to see Dow Jones leverage its own classification scheme by giving the reader direct access to the same categories in conducting searches, or to make use of sgml tags in the searches. Lastly, it also would be nice if the user could point to a story of interest and say, "Give me more that are like this," a technique that other engines offer. The new version of Verity, Karben hopes, will allow tighter coupling with the sgml source markup, and, should the paper switch to another indexing tool, the change could be made without disrupting other aspects of the system.

Strong upside: SGML payback. In this era of Quark pagination it is highly unusual for a newspaper editorial system to generate an sgml repository. But without the constraints of editing to fit, the Interactive Journal's experience clearly shows that it can pay off for online news publications. The rich markup aids the automation and consistency of the formatting process, and makes it much easier to support alternative presentations. For example, Interactive Journal subscribers can get personalized content "pushed" onto their screens using an After Dark screen saver that is updated automatically. The translation from DJML to the HTMx format used by After Dark for its screen savers is quick and simple.

Karben has also worked on enhancements that make it easier to use the Interactive Journal with pwWebSpeak's nonvisual web browser. All he had to do was insert tags around article summaries to give subscribers client-side control over the audio format of these navigational links.

Equally important, Dow Jones has since ported the sgml to several spin-off products and media, deriving additional revenue at a lower production cost than would otherwise be possible. For the past few weeks, the Interactive Journal has delivered its content around the clock for broadcast over the PointCast Network. Recently PointCast upgraded its client; an Interactive Journal channel is included.

Budde and Jaroslovsky decided to stick with sgml as long as they could make it work and make an editorial system that their editors could live with, that fit within the established workflow. They credit Alan Karben for continuing to find creative ways to do so. They have pressured both sides: Karben to come up with less intrusive technical fixes, and the editorial staff to slow down and add the information required to take advantage of the power of the new medium. In this way, they avoid two of the usual downsides to sgml: the awkwardness of the editorial environment and the disconnect between the writer's screen and the final layout.

Having already ported to several browsers, the team is ready for the next media platform, whether it is airplane seat-back displays or heads-up eyeglasses. When the time comes, Alan Karben is confident that they can get there from sgml.

Interactive Journal Editorial System Components

Editorial platform: Microsoft Windows 95 or 3.1

Editorial workflow: Copy Flow, inhouse system written by EDS

Placement: Edition Maintenance, inhouse application written by EDS in Visual Basic on top of a UniSQL database server

Text editor: Microsoft Word, 16-bit, heavily customized using Word Basic

Conversion and validation: OmniMark from OmniMark Technologies; SP, nsgmls from James Clark; Perl libraries from David Megginson

Ninety Percent Empty or Two-Thirds Full?

The week we began this story, a New York Times headline read "700 Newspapers to Read Online; Only One Charges for Everything." The point of the article was that with the imposition of the universal toll, the audience for the Wall Street Journal Interactive Edition had largely evaporated, going from roughly 700,000 free-loaders to 70,000 paid subscribers. Even more alarming, the Times reported that some online subscribers had pulled their print subscriptions, thus "cannibalizing" the audience for the print edition.

Rather than look at this as a Niagara-like fall-off, Interactive Journal editors Neil Budde and Rich Jaroslovsky present an alternate view: Not all of the 700K were readers; many were surfers. On a really good day, before the toll gate went up over the entrance, the site had about 45,000 visitors. Today, it has about 30,000 a day, turning a 90% reduction into a 33% reduction. (The Times, in contrast, has about 1 million registered users, about 70,000 of whom use the paper daily.)
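The two percentages come from applying the same calculation to different baselines. A quick check in Python, using only the figures quoted above (the function name is ours):

```python
def percent_drop(before, after):
    """Percentage reduction from before to after, rounded to a whole percent."""
    return round(100 * (before - after) / before)


# Registered users before the toll gate versus paid subscribers after it.
registered_drop = percent_drop(700_000, 70_000)

# Daily visitors on a really good day before, versus a typical day now.
daily_drop = percent_drop(45_000, 30_000)

print(registered_drop, daily_drop)  # 90 33
```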

Jaroslovsky suggests yet another way to look at the same figures: If the 700,000 are viewed not as subscribers to a free service but as recipients of a direct mail promotional piece, the 10% who paid represent a triumph of marketing.

The editors see the Interactive Journal as neither a cannibal nor clone of the print parent. They want to create a publication of substance that will attract new readers and extend the overall franchise of the paper. The profile of the Interactive Journal reader is younger and more high-tech than the audience for the print Journal. As long as both publications continue to grow overall, they see no need to be concerned. Jaroslovsky's E-mail indicates that there are new print subscribers being generated from the online product, though not yet as many as drop print once they get the online product.

Still, looking at the daily profit and loss statement, the Interactive Journal, a separate business unit within Dow Jones Interactive Publishing, is not in the black. It may seem anomalous to demand a direct payback on the basis of one product when it is, at the same time, building an archive that will continue to spin off products and new media deliverables. The expectation at Dow Jones, however, is that both print and the Interactive Journal will be profitable, each in its own sphere and each catering to distinct, albeit overlapping, audiences.

The Wall Street Journal is not eager to replace the print product (among other reasons, the paper owns its printing plants), but if paper does go away, management doesn't expect this change to happen overnight. Whatever the mix of readership in coming years, as long as they continue to expand, not shrink, the franchise, it will be to the company's benefit.