Round-Tripping Specifications

Bob Stayton, Sagehill Enterprises
Steve Ball, Explain

Revision history
1.8  2008-05-22  SRB  Updated for current implementation.
1.7  2008-02-22  SRB  Added edition.
1.6  2007-10-19  SRB  Added keyword.
1.5  2007-01-05  SRB  Reduce emphasis on WordML, add support for OpenOffice.
1.4  2005-11-11  SRB  Added bibliography.
1.3  2005-10-31  SRB  Added mediaobjectco, imageobjectco, programlistingco, areaspec, area, calloutlist.
1.2  2005-10-13  SRB  Version prior to using revhistory.

This document specifies how DocBook elements are mapped to paragraph and character styles in a word processor. The specifications are used to write conversions between DocBook XML and word processor XML formats, such as Microsoft's WordProcessingML (WordML), OpenOffice's OpenDocument and Apple's Pages.
Introduction

Microsoft Word 2003 introduced WordProcessingML (WordML), an XML vocabulary for Word documents. Since then, other popular word processors that use XML as their data representation have become available, namely Apple's Pages and OpenOffice. Once a Word (or OpenOffice or Pages) document is available as XML, it becomes possible to convert a word processing document to DocBook and vice versa using XSL transformations. Such conversions enable the following:

- DocBook content creators can write in their familiar word processing application, rather than learning a new XML editing application.
- DocBook XML documents can be styled for output using the typesetting features of the word processor.

Word processors have a simple, flat data model: documents consist of paragraphs (and tables), and paragraphs contain text and character spans. All word processors allow styles to be associated with paragraphs and spans.

This specification describes how DocBook elements map to a set of paragraph and character styles. It defines a specific set of style names for which a Word style template can be created. The style names are also used in XSLT template match patterns for conversion.

Although originally targeted at MS Word, the system has subsequently been extended to other word processors, notably Apple's Pages and OpenOffice.
Project goals

The goal of this project is to enable a word processor, such as, but not limited to, Microsoft Word, to be used with DocBook files. The specific goals include:

- Enable authoring of basic DocBook documents in the word processor.
- Enable importing of basic DocBook XML documents into the word processor.

To meet these goals, the project provides a toolkit that can be immediately put to use. The kit includes:

- Templates for Microsoft Word, Apple Pages and OpenOffice with formatting styles attached to the style names.
- XSLT stylesheets that convert a word processing document that is authored with the corresponding template into a DocBook XML file.
- XSLT stylesheets that convert a DocBook document into a word processing document that can be opened in a word processor.
Why basic DocBook?

This project will never be able to support all DocBook elements and structure. Take, for example, the address element. This element can be used both as a block element for metadata and as a phrase-level element in a block parent, such as the affiliation element. To make matters worse, it can itself contain phrase-level markup, such as personname. No word processor allows character styles to be nested.

The project will initially focus on a basic set of commonly used DocBook elements in order to create a useful editing environment that utilises a word processor with DocBook.

One problem facing this conversion project is the sheer number of DocBook elements: over 400 in DocBook 5.0. To support DocBook structural models, several of the elements require more than one paragraph or character style. This would lead to a very long and unwieldy list of styles in the word processor interface, which would make authoring less efficient and discourage users.

Accordingly, this project assumes that authors who need the full set of DocBook elements and structures will use an XML authoring tool that better supports them. This project is focused on authors who wish to write basic DocBook documents using a word processor. Because Microsoft Word is so widespread, it is hoped that this project will help many new DocBook users get started with familiar tools. They can then graduate to more advanced tools as their needs develop.
Project Non-Goals

The following goals are not in the scope of this project:

- Support of versions of Word that do not feature reading/writing WordML (XML). That is, all versions prior to Word 11 (Office 2003).
- Support of arbitrarily defined styles. This system may expect certain styles to be defined in a particular fashion (in particular, those defining the title of components and divisions).
Mapping elements to styles

Although WordML, OpenDocument and DocBook are all XML, there are several challenges when trying to convert between them.

The basic problem in mapping paragraph/character styles to DocBook elements is that word processor documents support far less structure than DocBook. DocBook permits nesting of elements within other elements, providing multiple levels of context for each element.

Word's only structural feature is the outlining mode. In Word outlining, certain paragraph styles are assigned outline levels. When a user applies those styles, they effectively create logical structure in the Word document. Unfortunately, Word itself attempts to automatically determine which paragraphs are headings, rendering this method unreliable.

Instead of relying on Word's built-in outlining mode, this system uses only the names of paragraph styles to determine document structure. Certain heuristics are applied to build the DocBook element structure from the (relatively flat) word processing structure. Titles and other features are used to mark the beginning of a structure, and all paragraphs following that are included in that structure until the beginning of the next structure is found. That is, the beginning of one structure marks the end of the previous structure. Problems may arise when a structure should end but there is no word processor feature that marks the endpoint. In that case, an empty paragraph is used to mark the end of the structure.

Nesting of block elements is another commonly used feature of DocBook. It is not possible to use Word's outline mode for blocks if it is being used for components and sections. So in this specification, nesting of block elements is indicated by adding a number suffix to a style. A paragraph with style orderedlist2 is therefore considered to be contained within a preceding paragraph with style orderedlist1 or itemizedlist1. Where appropriate in the word processor, paragraph indent levels are used to visually indicate nesting of blocks.
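The "beginning of one structure marks the end of the previous structure" heuristic can be illustrated with a short Python sketch. This is not the project's XSLT implementation. The function name build_structure, the TITLE_DEPTH table, the fixed book root and the input format of (style, text) pairs are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical depth table: a paragraph with one of these title styles
# opens a new structure at the given depth.
TITLE_DEPTH = {"chapter-title": 1, "sect1-title": 2, "sect2-title": 3}


def build_structure(paragraphs):
    """Turn a flat list of (style, text) pairs into nested elements."""
    root = ET.Element("book")  # assumed root, chosen for the example
    stack = [(0, root)]        # (depth, element) for each open structure
    for style, text in paragraphs:
        if style in TITLE_DEPTH:
            depth = TITLE_DEPTH[style]
            # Starting a structure closes every open structure at the
            # same depth or deeper.
            while stack[-1][0] >= depth:
                stack.pop()
            name = style[: -len("-title")]  # "sect1-title" -> "sect1"
            element = ET.SubElement(stack[-1][1], name)
            info = ET.SubElement(element, "info")
            ET.SubElement(info, "title").text = text
            stack.append((depth, element))
        else:
            # Any other paragraph joins the innermost open structure.
            ET.SubElement(stack[-1][1], "para").text = text
    return root


if __name__ == "__main__":
    flat = [
        ("chapter-title", "First chapter"),
        ("para", "Chapter text."),
        ("sect1-title", "A section"),
        ("para", "Section text."),
        ("chapter-title", "Second chapter"),
    ]
    print(ET.tostring(build_structure(flat), encoding="unicode"))
```

The stack holds only the currently open structures, so a second chapter-title paragraph closes both the open sect1 and the first chapter without needing any explicit end marker.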
Nesting of inline DocBook elements is particularly difficult to support because word processors do not nest character styles. That means a nested inline would require a separate character style to indicate the parent-child relationship. Given the large number of possible combinations, a prohibitively large number of character styles would have to be created. In this project, nesting of character styles is not supported. Nested inlines being imported from DocBook will be converted to a sequence of single-name character styles, where possible, or rejected.

In many cases, DocBook structure can be derived from the flat sequence of paragraphs based on sibling relationships. For example, when a paragraph styled as para is followed by a paragraph styled as itemizedlist1, the conversion to DocBook will output a para element and then start an itemizedlist element, with the second paragraph as its first listitem. All itemizedlist1 paragraphs that follow without interruption are inserted into the same itemizedlist element.

Some combinations of elements cannot be supported (at least not with the techniques described in this document). An example is informalexample and its permitted content: there is no title to mark the beginning of the element, there is no marker for the end of the element, and there are too many parent-child combinations to reasonably define style names.

The design principles used in this project for selecting paragraph/character style names are as follows:

- Where Word (or OpenOffice or Pages), by default, has a style or feature that corresponds directly to a DocBook element, that style or feature will be used (and documented in this document). For example, the Normal paragraph style maps to a DocBook para element, and a Word table (w:tbl) maps to a DocBook table. (In some cases Word may possess a feature that does not function in an acceptable manner, for example lists. In these cases the feature is avoided and a workaround is provided.)
- Paragraph and character style names will match DocBook element names as much as possible. This will enable authors to learn DocBook element names and help debug problems with conversion.
- A style may indicate a parent-child relationship, but the paragraph for such an element may only occur after a paragraph that denotes the beginning of the parent structure. In this case the element name is used as the style name. For example, a personblurb paragraph may only occur after an author, editor or othercontrib paragraph. If a paragraph occurs without the appropriate preceding paragraph, then an error is signalled.
- Some styles may also indicate a parent-child relationship where either the parent structure is ambiguous or the paragraph starts the parent structure. For example, chapter-title indicates that the paragraph is a title element whose DocBook parent is a chapter element.
- Some style names are simplified to make them easier to use in the word processor. For example, a paragraph in an orderedlist requires three elements in DocBook: orderedlist, listitem, and para. The paragraph style name in Word is shortened from orderedlist-listitem-para to just orderedlist. In the case of lists (see below), the list level is appended, which is why this example becomes orderedlist1 for a first-level list.
- Style names with a number suffix indicate a nesting level, as described above.
- Style names with continue indicate that the paragraph is part of the preceding element. For example, a para paragraph is used for a single-paragraph para element, and it causes any preceding list to be closed. If a list item in the preceding list is to contain more than one paragraph, then the subsequent paragraphs in the word processor document must use the para-continue style.
- Character styles map to elements that are children of the element for the paragraph, hence there is no need to encode parent-child relationships.
For example, a surname character style in an author paragraph becomes a surname child element of the author element.

- Empty paragraphs and empty character-style spans are ignored. This can be useful for ending structures.
- The first paragraph style in the word processor document is used to define the root element of the DocBook document. For example, if the document starts with book-title, then the DocBook document will have a book element as its root element. All the rest of the document content will be contained in that root element.
- Sequential structures are coalesced into a single parent element. For example, a sequence of itemizedlist1 paragraphs becomes a single itemizedlist element with several listitem element children.

DocBook to Paragraph/Character Styles

Components and sections

| DocBook element | Style(s) | Comments |
|---|---|---|
| book/info/title | book-title | |
| book/info/subtitle | book-subtitle | |
| book/info/titleabbrev | book-titleabbrev | |
| chapter/info/title | chapter-title | Assigned Word outline level 1. |
| chapter/info/subtitle | chapter-subtitle | |
| chapter/info/titleabbrev | chapter-titleabbrev | |
| appendix/info/title | appendix-title | Assigned Word outline level 1. |
| preface/info/title | preface-title | Assigned Word outline level 1. |
| article/info/title | article-title | Assigned Word outline level 1. |
| article/info/subtitle | article-subtitle | |
| article/info/titleabbrev | article-titleabbrev | |
| bibliography/info/title | bibliography-title | Assigned Word outline level 1. |
| bibliography/bibliodiv/info/title | bibliodiv-title | |
| biblioentry/title | biblioentry-title | Metadata elements after the biblioentry-title paragraph become part of the biblioentry. |
| glossary/info/title | glossary-title | Assigned Word outline level 1. |
| index/info/title | index-title | Assigned Word outline level 1. |
| part/info/title | part-title | |
| section | | Unnumbered section elements are translated into their equivalent numbered paragraph style. Sections 6 levels and deeper are reported as an error. |
| sect1/info/title | sect1-title | Assigned Word outline level 2. |
| sect1/info/subtitle | sect1-subtitle | |
| sect2/info/title | sect2-title | Assigned Word outline level 3. |
| sect2/info/subtitle | sect2-subtitle | |
| sect3/info/title | sect3-title | Assigned Word outline level 4. |
| sect3/info/subtitle | sect3-subtitle | |
| sect4/info/title | sect4-title | Assigned Word outline level 5. |
| sect4/info/subtitle | sect4-subtitle | |
| sect5/info/title | sect5-title | Assigned Word outline level 6. |
| sect5/info/subtitle | sect5-subtitle | |
| simplesect/info/title | simplesect-title | |
| simplesect/info/subtitle | simplesect-subtitle | |
| bridgehead | bridgehead | |

Metadata elements

| DocBook element | Style(s) | Comments |
|---|---|---|
| abstract/title | abstract-title | |
| abstract/para | abstract | |
| affiliation | affiliation | |
| address | address | |
| author | author | |
| date | date | |
| edition | edition | |
| legalnotice | legalnotice | |
| pubdate | pubdate | |
| publisher/publishername | publisher | |
| publisher/address | publisher-address | |
| revhistory/revision | revision | |

Block-level elements

| DocBook element | Style(s) | Comments |
|---|---|---|
| para | para, Normal | Any Word paragraph with style Normal will also be converted to a para element. |
| formalpara/title | formalpara-title | |
| formalpara/para | formalpara | |
| simpara | simpara | |
| note/title | note-title | |
| note/para | note | Consecutive paragraphs with style note after the first note are to be treated as part of the same note element. That is, consecutive notes are coalesced. The note may or may not have a title. |
| caution/title | caution-title | |
| caution/para | caution | Consecutive cautions are coalesced. |
| warning/title | warning-title | |
| warning/para | warning | Consecutive warnings are coalesced. |
| important/title | important-title | |
| important/para | important | Consecutive importants are coalesced. |
| tip/title | tip-title | |
| tip/para | tip | Consecutive tips are coalesced. |
| itemizedlist/listitem/para | itemizedlist1, itemizedlist2, itemizedlist3, itemizedlist4 | A number suffix indicates a nesting level within other lists. |
| orderedlist/listitem/para | orderedlist1, orderedlist2, orderedlist3, orderedlist4 | |
| listitem/para[position() != 1] | para-continue | This paragraph is included in the immediately preceding listitem. |
| example/title | example-title | All content following the title is included in the example element. The end of the example content is marked by a caption paragraph or an empty paragraph if there is no caption. |
| figure/title | figure-title | All content following the title is included in the figure element. Metadata must immediately follow the title. The end of the figure content is marked by a caption paragraph or an empty paragraph if there is no caption. |
| informalfigure/mediaobject/imageobject/imagedata/@fileref | informalfigure-imagedata, caption | The content of the imageobject-imagedata paragraph is taken as the URI for the image. Metadata may immediately follow the paragraph. |
| mediaobject/imageobject/imagedata/@fileref | imageobject-imagedata, caption | The content of the imageobject-imagedata paragraph is taken as the URI for the image. May be followed by a caption style paragraph. Metadata may immediately follow the paragraph, before the caption, if any. |
| table | Word table, caption | |
| table/title | table-title, caption | Metadata may immediately follow the paragraph. |
| informaltable | Word table | A table with no title immediately preceding it. |
| caption | caption | |
| literallayout | literallayout | Inside a literallayout paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter). |
| programlisting | programlisting | Inside a programlisting paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter). Tabs are not supported. |
| blockquote/title | blockquote-title | Must immediately precede a blockquote paragraph in Word. |
| blockquote/para | blockquote | |
| blockquote/attribution | blockquote-attribution | Must immediately follow a blockquote paragraph in Word. |
| bibliomisc | bibliomisc | |

Non-DocBook elements

| DocBook element | Style(s) | Comments |
|---|---|---|
| xi:include | xinclude | The content of the paragraph becomes the value of the href attribute. |

Inline elements

| DocBook element | Style(s) | Comments |
|---|---|---|
| emphasis | emphasis | |
| emphasis/@role="bold" | emphasis-bold | |
| emphasis/@role="underline" | emphasis-underline | |
| footnote | Word footnote | |
| link | link | In Word, hyperlink properties identify the DocBook linkend. |
| releaseinfo | releaseinfo | |
| surname | surname | Character style. Must occur in an appropriate parent paragraph, such as author or editor. |
| firstname | firstname | Character style. Must occur in an appropriate parent paragraph, such as author or editor. |
| orgname | orgname | |
| keywordset/keyword | keyword | Paragraph style. Consecutive keyword elements are merged into a single keywordset parent element. Words (phrases) within a paragraph separated by commas become individual keyword elements. |
| citetitle | citetitle | |
| city | city | |
| contrib | contrib | |
| country | country | |
| email | email | |
| fax | fax | |
| honorific | honorific | |
| jobtitle | jobtitle | |
| lineage | lineage | |
| orgdiv | orgdiv | |
| otheraddr | otheraddr | |
| othername | othername | |
| phone | phone | |
| pob | pob | |
| postcode | postcode | |
| shortaffil | shortaffil | |
| state | state | |
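As an illustration of how the list styles in the mapping might be interpreted, the following Python sketch coalesces consecutive list paragraphs, nests lists by their number suffix, and attaches para-continue paragraphs to the preceding listitem. It is a sketch under assumptions rather than the project's XSLT: the function name convert_blocks, the section wrapper element, and the rule that a change of list type at the same level starts a new list are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches the list styles from the mapping, e.g. "orderedlist2".
LIST_STYLE = re.compile(r"^(itemizedlist|orderedlist)([1-4])$")


def convert_blocks(paragraphs):
    """Convert flat (style, text) pairs into nested DocBook list markup."""
    root = ET.Element("section")  # assumed wrapper, for the example only
    stack = []  # one [list_element, current_listitem] entry per open level
    for style, text in paragraphs:
        match = LIST_STYLE.match(style)
        if match:
            kind, level = match.group(1), int(match.group(2))
            if level > len(stack) + 1:
                raise ValueError(f"{style} has no enclosing list paragraph")
            del stack[level:]  # close any lists nested deeper than this one
            # Assumption: a different list type at the same level
            # starts a new list instead of continuing the open one.
            if len(stack) == level and stack[-1][0].tag != kind:
                stack.pop()
            if len(stack) < level:
                container = root if level == 1 else stack[-1][1]
                stack.append([ET.SubElement(container, kind), None])
            item = ET.SubElement(stack[-1][0], "listitem")
            ET.SubElement(item, "para").text = text
            stack[-1][1] = item
        elif style == "para-continue":
            if not stack:
                raise ValueError("para-continue without a preceding list item")
            ET.SubElement(stack[-1][1], "para").text = text
        else:
            # A para (or Normal) paragraph closes any preceding list.
            stack.clear()
            ET.SubElement(root, "para").text = text
    return root


if __name__ == "__main__":
    flat = [
        ("para", "Introduction."),
        ("itemizedlist1", "First item."),
        ("para-continue", "More about the first item."),
        ("orderedlist2", "A nested step."),
        ("itemizedlist1", "Second item."),
    ]
    print(ET.tostring(convert_blocks(flat), encoding="unicode"))
```

Raising an error for a level-2 paragraph with no open level-1 list follows the rule that a paragraph appearing without its required preceding paragraph is an error.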
Proposed Additions - not yet implemented

| DocBook element | Style(s) | Comments |
|---|---|---|
| variablelist/varlistentry/term | variablelist1-term, variablelist2-term, variablelist3-term, variablelist4-term | A variablelist in Word should be a sequence of alternating paragraphs styled as variablelistN-term and variablelistN. |
| variablelist/varlistentry/listitem/para | variablelist1, variablelist2, variablelist3, variablelist4 | Consecutive paragraphs are coalesced. |
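Since this mapping is only proposed, the following Python sketch shows just one possible reading of it, restricted to the first nesting level. The function name build_variablelist is hypothetical, and the sketch assumes that "coalesced" means consecutive variablelist1 paragraphs become para children of the same listitem.

```python
import xml.etree.ElementTree as ET


def build_variablelist(paragraphs):
    """Build a variablelist from alternating term and body paragraphs."""
    variablelist = ET.Element("variablelist")
    entry = None     # the varlistentry currently being filled
    listitem = None  # its listitem, created when the first body arrives
    for style, text in paragraphs:
        if style == "variablelist1-term":
            entry = ET.SubElement(variablelist, "varlistentry")
            ET.SubElement(entry, "term").text = text
            listitem = None
        elif style == "variablelist1":
            if entry is None:
                raise ValueError("variablelist1 paragraph without a term")
            if listitem is None:
                listitem = ET.SubElement(entry, "listitem")
            # Assumed meaning of "coalesced": consecutive body
            # paragraphs share one listitem.
            ET.SubElement(listitem, "para").text = text
        else:
            raise ValueError(f"unexpected style in variablelist: {style}")
    return variablelist


if __name__ == "__main__":
    flat = [
        ("variablelist1-term", "para"),
        ("variablelist1", "Maps to a DocBook para element."),
        ("variablelist1-term", "note"),
        ("variablelist1", "Maps to a DocBook note element."),
    ]
    print(ET.tostring(build_variablelist(flat), encoding="unicode"))
```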
Attributes

Attributes are a feature of DocBook XML that has no direct counterpart in Word. XML attributes are encoded in Word comments (annotations). Some dummy text (just a space, using a character style that includes the hidden property) anchors the comment. Within the comment text, character styles are used to indicate attribute names and values (these must be paired). This approach keeps the attributes separate from the main body and allows multiple attributes to be encoded.

A disadvantage of this approach is that a paragraph may be related to more than one element, but the attributes are associated with only one element (by default the parent). For example, a section may have an attribute, as may its title child element, but only a single paragraph (with paragraph style sect1-title) represents both elements. Any attribute defined in a comment would be associated with the sect1 element.

Pages does not have annotations, so the character styles attribute-name and attribute-value are used.
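The pairing requirement can be sketched in Python as follows. The input is assumed to be a list of (character style, text) runs taken from a comment, or from a Pages paragraph. The style names attribute-name and attribute-value are the ones given for Pages; using the same names for Word comments is an assumption here, as are the function names pair_attributes and apply_to_parent.

```python
import xml.etree.ElementTree as ET


def pair_attributes(runs):
    """Pair attribute-name runs with the attribute-value runs that follow."""
    attributes = {}
    pending_name = None
    for style, text in runs:
        if style == "attribute-name":
            if pending_name is not None:
                raise ValueError(f"attribute {pending_name!r} has no value")
            pending_name = text.strip()
        elif style == "attribute-value":
            if pending_name is None:
                raise ValueError(f"value {text!r} has no attribute name")
            attributes[pending_name] = text
            pending_name = None
        # Runs with any other style (such as separating spaces) are skipped.
    if pending_name is not None:
        raise ValueError(f"attribute {pending_name!r} has no value")
    return attributes


def apply_to_parent(element, runs):
    """Attach the decoded attributes to the parent element, e.g. sect1."""
    element.attrib.update(pair_attributes(runs))
    return element


if __name__ == "__main__":
    comment_runs = [
        ("attribute-name", "role"),
        ("attribute-value", "overview"),
    ]
    sect1 = apply_to_parent(ET.Element("sect1"), comment_runs)
    print(ET.tostring(sect1, encoding="unicode"))
```

The attributes are applied to the sect1 element and not to its title, which matches the default described above for a sect1-title paragraph.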