Smart Content

Since the nineties, bold promises have been made about XML and what it could make possible in the realm of content reuse and automation. In some areas, such as technical publications, this has happened. But when it comes to mainstream content creation by business users and knowledge workers, XML is no more prevalent than it was 20 years ago. The time has come to make XML a universal format for content creation. Enter Smart Content.


What is Smart Content?

Smart Content is the foundation of Quark’s Content Automation platform. It is an open, customer-configurable XML-based content schema that allows non-technical authors to create reusable content components in a familiar, intuitive environment. These components can then be dynamically assembled for multi-channel publishing, including print, web and mobile.

Smart Content is best deployed when the content type has characteristics that include one or more of the following:

  • High volume of similar documents
  • High volume of revisions
  • Frequently repeated creation processes
  • Government or corporate regulated documents
  • High possibility to reuse content across multiple documents
  • Integration of data into the content
  • Translated to multiple languages
  • Delivered in multiple formats
  • Delivered with multiple different presentation styles

What Other XML Schemas Exist?

There are many XML schemas for authoring and publishing in the marketplace. Some are very generic and some are industry-specific. Interestingly, even HTML has an XML-based formulation, called “XHTML.” Other popular XML schemas include:

  • DITA
    One of the most popular XML schemas for technical document authoring and publishing. It was originally developed at IBM and later moved to OASIS as an industry standard for technical publications. More on DITA later.
  • SPL - Structured Product Labeling
    Used in the United States for submitting drug labeling information to the FDA for approval prior to releasing a new drug or packaging to market. Supported for authoring through the SPL Accelerator for Quark XML Author.
  • MSP
    Used in the United States, Australia, and other partner countries for capturing and sharing intelligence research at the Department of Homeland Security and across nearly every department of the US government. Supported for authoring through Quark Pubs-XML Accelerator.
  • DocBook
    A precursor to DITA, used heavily in technical publications and reference books.

There are many more, and some companies define their own custom schema from scratch, which is a great deal of work and is difficult and expensive to do well.

So, if there are many XML document schemas available, why did Quark create the Smart Content schema? The story begins with a short history of XML.

What’s Wrong With XML?

XML for document production was first adopted by the technical publications industry. It is heavily used in computer software and hardware documentation, complex discrete manufacturing, and some process manufacturing, where the content is ultimately published as print, PDF, HTML, and Help system formats such as HTMLHelp, MSHelp, EclipseHelp, and WebHelp. The most widely used document XML schemas, including the very popular DITA schema, were created by and for the technical publications industry.

The result is that these schemas are extremely powerful tools, but are also extremely complex. To steal a quote from a Quark professional services partner, “DITA is great if your authors can think like programmers.” That’s perfect for technical authors who are, by nature of their jobs, highly technical and well trained. They are also full-time authors.

But for business-critical communications, for example documents written by financial and legal analysts or product marketing teams, it is unreasonable to think that these part-time authors can or want to “think like programmers.”

What makes these authoring schemas hard? They are often overly restrictive. At Quark, many of our early adopters who used one of these schemas complained that the simple task of cutting and pasting content from one area of a document to another was blocked by the application. Why was it blocked? Take the following simple example of a title and a paragraph (we’re showing the XML tags, but remember that most XML authoring tools try to hide the tags).

<title>How to Make</title>
<para>Begin with the ingredients from the <keyword>Thanksgiving Recipe</keyword>.</para>

If the user selects and copies the phrase “the <keyword>Thanksgiving Recipe</keyword>.” and pastes it after “Make” in the <title>, the authoring tool might block the paste because the controlling schema doesn’t allow <keyword> inside a <title> element. That’s frustrating and, worse, the reason for the failed paste is often hidden from the user. They can’t figure out why it’s blocked, so they think the tool is broken.
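
The blocked paste can be illustrated with a minimal Python sketch. It assumes a hypothetical, two-entry content model (ALLOWED_CHILDREN) standing in for a real schema, and a made-up helper name, check_paste. Real authoring tools validate against the full schema, but the principle is the same: the pasted fragment carries an element that the target does not permit.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: which child elements each parent may contain.
ALLOWED_CHILDREN = {
    "title": set(),        # plain text only
    "para": {"keyword"},   # text plus inline <keyword>
}

def check_paste(parent_tag, fragment_xml):
    """Return (ok, reason) for pasting fragment_xml into parent_tag."""
    # Wrap the fragment so mixed text and elements parse as one tree.
    wrapper = ET.fromstring(f"<wrapper>{fragment_xml}</wrapper>")
    allowed = ALLOWED_CHILDREN.get(parent_tag, set())
    for child in wrapper:
        if child.tag not in allowed:
            return False, f"<{child.tag}> is not allowed inside <{parent_tag}>"
    return True, "ok"

copied = "the <keyword>Thanksgiving Recipe</keyword>."
print(check_paste("title", copied))  # rejected, with a reason
print(check_paste("para", copied))   # accepted
```

Returning the reason alongside the verdict is the point of the sketch: a tool that surfaces it gives the author something to act on instead of a silent failure.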

This example highlights one of the major challenges for any XML authoring tool vendor, and especially for Quark, which is targeting non-technical authors: introducing rules and content structure to users who have years of experience using free-form tools.

Additionally, creating a user experience that manages and exposes those rules to the user, without making the tool overly complex, is extremely difficult. That’s why the user experience of so many XML authoring products resembles a programmer’s development environment more than a word processing tool.

This challenge is worth tackling because of the incredible value of applying content automation to business-critical communications. Generally, the automation value proposition is relatively simple to describe:

  • Automation lowers costs, improves quality, and shortens time-to-market
  • For automation to succeed it requires that the inputs are valid and expected: “Garbage in, Garbage out,” as the saying goes.

So for Content Automation to succeed, the input (the authored, narrative content) must be expected and validated. That’s where XML is powerful, because it is easy to validate and forces authors to create only what is expected. But it is also where XML authoring tools cause the most problems, because they are a departure from free-form word processing tools.

Business users, part-time authors, and subject matter experts who have used a free-form word processing tool such as Microsoft Word or Google Docs for their entire careers have expectations about how fast they can write and how much freedom (often total freedom) they have in how they write their documents. Switching these authors to a controlled, “structured” content authoring tool that limits what they can do presents a significant challenge. The more prescriptive and restrictive the XML schema is, the bigger the gap between the authors’ expectations and their experience with authoring XML.

Resolving that challenge is what led Quark to develop the Smart Content schema.

Smart Content Schema in Detail

For the XML-savvy, the Smart Content schema borrows ideas from many other XML implementations including, importantly, the idea of content types, sometimes called content classes or information architectural forms. The core idea is relatively simple: there is a set of fundamental types of content, and all other content can be described as belonging to one of these root classes. For those familiar with DITA, another way to describe this would be “specialization” of one of those root classes. The concept of root classes and class hierarchies is common in computer programming, biology, physics, mathematics and more.

The value of root classes and class hierarchies is that a system that knows how to process the root element can provide basic processing of any specialization of that root without previously knowing anything about the specific specialization.

This is less complicated than you might think. As a simple example, if the system knows that all <para> elements should be presented with a blank line above and a blank line below, then when it processes content that includes <para type="blockquote"> it will at least get right that a Block Quote should have a blank line above and below. There are many other processing rules, presentation rules, and user interactions that can be applied to all content of similar types. The “specialization” comes from the fact that a system can also add new and unique processing, such as right and left indents for presenting a Block Quote.
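
A minimal Python sketch of this fallback behaviour follows. The rule tables, the indent values, and the "pullquote" type are assumptions made up for illustration and are not taken from the Smart Content schema.

```python
import xml.etree.ElementTree as ET

# Rules keyed by root element name; they apply to every specialization.
ROOT_RULES = {"para": {"blank_line_above": True, "blank_line_below": True}}

# Extra rules keyed by (element name, type attribute); values are illustrative.
SPECIALIZED_RULES = {("para", "blockquote"): {"indent_left": 4, "indent_right": 4}}

def presentation_rules(element):
    """Start from the root-class rules, then layer on any specialized ones."""
    rules = dict(ROOT_RULES.get(element.tag, {}))
    rules.update(SPECIALIZED_RULES.get((element.tag, element.get("type")), {}))
    return rules

known = ET.fromstring('<para type="blockquote">Quoted text</para>')
unknown = ET.fromstring('<para type="pullquote">A type the system has never seen</para>')

print(presentation_rules(known))    # root rules plus the block quote indents
print(presentation_rules(unknown))  # root rules only: still a usable paragraph
```

The unknown type still receives the blank lines above and below, which is the basic processing that the root class guarantees.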

What are some of these root classes? Smart Content represents these in different categories, and here is a table that compares some of the terminology that Smart Content, HTML and DITA use:

Content Type | Smart Content     | HTML                    | DITA
-------------|-------------------|-------------------------|------------------------
Sections     | section           | div                     | topic
Blocks       | p                 | p                       | p
In-lines     | tag               | em, strong, etc.        | phrase
Lists        | ul, ol            | ul, ol                  | list type="type"
Tables       | tables            | table                   | table
Images       | image             | img                     | image
Media        | Media             | video, object           | object
Metadata     | XML meta fragment | tag attribute = "value" | tag attribute = "value"

How specialization of these root content types is handled in each markup language is one of the important differences:

In HTML, specialization of a root HTML tag is usually done to drive the CSS formatting or to trigger tag-specific JavaScript, and is most often encoded using a “class” attribute such as:

<div class="Navigation">…</div>

However, in HTML there are very few rules about where and how you can use a given tag, and there are no rules on the value of the “class” attribute, so HTML is actually very free-form and not useful for authoring high-value communications content, though it is great for presentation in a web page or mobile application.

In DITA, specialization of a root DITA element such as <topic> is encoded like this:

<concept class="- topic/topic concept/concept">…</concept>

Though the class attribute has an apparently redundant value, it’s easy to identify the goal, which is that the element “concept” is of the class “topic” and therefore should be treated as a topic except where specific processing for concept has been defined.
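
To make the intent of that class value concrete, here is a small Python sketch that reads it as an ancestry chain and falls back to the nearest ancestor with a handler. The function names and the HANDLERS table are hypothetical, and real DITA processors typically match on the class attribute in stylesheets instead of code like this.

```python
def class_chain(class_attr):
    """Split a DITA-style class value into element names, most general first."""
    tokens = class_attr.split()[1:]  # drop the leading "-" or "+" marker
    return [token.split("/")[1] for token in tokens]

# Hypothetical handler table: only the root class "topic" has processing defined.
HANDLERS = {"topic": lambda: "generic topic processing"}

def process(class_attr):
    """Try the most specific name first, then fall back toward the root."""
    for name in reversed(class_chain(class_attr)):
        if name in HANDLERS:
            return HANDLERS[name]()
    return "unhandled"

print(class_chain("- topic/topic concept/concept"))
print(process("- topic/topic concept/concept"))
```

Because no handler is registered for "concept", the element is processed as a plain topic, which is the behaviour the class attribute is designed to allow.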

In Smart Content, specialization is encoded like this:

<section type="purpose">…</section>

This is very similar to the HTML method for specialization, but has very specific implementation rules so that, for example, authoring a Standard Operating Procedure document can limit each document to one and only one “purpose” and that purpose must be after the title of the document. HTML doesn’t limit the use of or even validate the value of class attributes.
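
The kind of rule described here can be sketched in Python as follows. The <document> wrapper element, the check_sop name, and the reading of “after the title” as “anywhere after the title” are all assumptions for illustration. They do not describe how Smart Content itself expresses or enforces the rule.

```python
import xml.etree.ElementTree as ET

def check_sop(doc_xml):
    """Return a list of rule violations for a hypothetical SOP document."""
    children = list(ET.fromstring(doc_xml))
    purposes = [i for i, c in enumerate(children)
                if c.tag == "section" and c.get("type") == "purpose"]
    titles = [i for i, c in enumerate(children) if c.tag == "title"]
    errors = []
    if len(purposes) != 1:
        errors.append(f"expected exactly one purpose section, found {len(purposes)}")
    if not titles:
        errors.append("document has no title")
    elif purposes and purposes[0] < titles[0]:
        errors.append("purpose section must come after the title")
    return errors

valid = '<document><title>SOP</title><section type="purpose"/><section type="scope"/></document>'
doubled = '<document><title>SOP</title><section type="purpose"/><section type="purpose"/></document>'
print(check_sop(valid))    # no violations
print(check_sop(doubled))  # reports the duplicate purpose
```

Note that the check reads the type attribute, not just the element name, which is exactly what a plain HTML class attribute never gets.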

It’s worth highlighting that in HTML and Smart Content, the element name is always the root of the class. It is

<section type="mySection">

and not

<mySection class="section">

DITA users and other XML experts might ask, “Why not use the DITA method for defining specializations?” The full answer is complex, but the simple answer is directly related to the difficulties described earlier in providing good authoring usability including support for gross-edits by cut and paste across one or more documents.

Nearly all available XML parsing tools validate the structure of a document based on element names (valid structure means that all the elements used are allowed by the schema and appear in a valid order), and they typically ignore attribute values when doing so. By using the HTML style of element specialization, Smart Content can enable gross-edits with a positive user experience. The user can cut and paste an element and, after the paste, added processing can either silently correct the type attribute or, if more than one valid choice exists, prompt the author to choose a valid type.

While there are many other reasons for how the Smart Content schema is architected, this ability to fall back to processing based on the root class is one of the biggest and most valuable.
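
A rough Python sketch of that post-paste step is shown below. The ALLOWED_TYPES table, the section type names, and the resolve_pasted_type helper are invented for illustration. How Quark’s tools actually implement this step is not described in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: section types allowed under each parent section type.
ALLOWED_TYPES = {
    "sop": ["purpose", "scope", "procedure"],
    "appendix": ["notes"],
}

def resolve_pasted_type(parent_type, pasted):
    """Return (action, choices) for a <section> pasted under parent_type."""
    allowed = ALLOWED_TYPES[parent_type]
    current = pasted.get("type")
    if current in allowed:
        return "keep", [current]          # already valid here
    if len(allowed) == 1:
        pasted.set("type", allowed[0])    # only one option: correct silently
        return "corrected", allowed
    return "ask_author", allowed          # several options: let the author pick

pasted = ET.fromstring('<section type="purpose">Why this exists</section>')
print(resolve_pasted_type("appendix", pasted), pasted.get("type"))
print(resolve_pasted_type("sop", ET.fromstring('<section type="summary"/>')))
```

In every branch the element name stays <section>, so the paste is structurally valid by element name, and only the type attribute needs attention afterwards.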

Even though the Smart Content Schema is relatively new in XML schema terms, its development has been grounded in years of XML, content authoring and publishing expertise by Quark and our customers and partners. The schema is being successfully used by a number of customers in industries such as finance, energy, manufacturing and government. We welcome feedback on the schema and plan in the future to make the specifications widely available for other companies to use.

To find out more about implementing a Smart Content solution, check out Quark Author. Quark Author is the Web-based content creation software that, together with Quark Publishing Platform, offers subject matter experts and non-technical writers an intuitive online authoring experience for rapidly creating, previewing, publishing and reusing content.