XHTML-docx roundtrip: content tracking
September 8th, 2014 by Jason
There are a couple of common use cases for docx4j’s XHTML import capability:
The first is enabling a webapp with HTML reporting to output/export reports in Word’s docx format. With docx4j, you can get really nice results doing this, especially if your XHTML has @class values which map to Word styles.
The second – to support web based editing – is the subject of this post. In a full incarnation, the vision is:
- be able to edit the content in Word or in the web browser (using an XHTML editor such as CKEditor)
- track chunks of content, perhaps for workflow/approval processes, version control, or re-use
docx4j can help you with this vision in a Java or .NET (eg C#) environment.
Web based XHTML editing is well understood, so here I’ll focus on tracking chunks of content.
In XHTML, it’s straightforward. You can add div elements (eg <div id="contentXYZ">) to your heart’s content. And you can nest them (think book, chapter, section, sub-section).
How to track that ID to or from docx format?
The answer: content controls.
Bookmarks are another possibility, but I wouldn’t recommend them for this purpose, because it is easy for a user to delete them, or inadvertently insert extra bookmarks. They lack the rich features of content controls (eg locking), and aren’t very “XMLy” (they are pairs of start and end point tags which create additional challenges).
So, back to content controls.
Content controls are analogous to divs. They have IDs; you can nest them; etc.
Content controls aside, the docx file format is flat. It’s a sequence of paragraphs and tables. The only nesting is inside tables, whose cells contain paragraphs (and possibly nested tables).
So, all we need to do is convert divs to content controls, and vice versa.
This post tells you how to do that with docx4j.
XHTML to docx (div to content control)
For XHTML to docx, you use docx4j-ImportXHTML.
div to content control support was added after 3.2.0’s release, in this commit. So for now, you need to build from source, or use a nightly build.
Once you have that, to use it, do something like:
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
XHTMLImporter.setDivHandler(new DivToSdt());
That implementation will convert div elements to content controls, and place @id and @class values into the content control’s w:tag, for example “class=class1&id=myid”.
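To make the w:tag format concrete, here is a minimal sketch in plain Java of how such a value could be put together from a div's attributes. It is only an illustration of the string format shown above, not DivToSdt's actual code: the class and method names (TagValueSketch, toTagValue) are hypothetical, and skipping empty attributes is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class TagValueSketch {

    // Hypothetical helper: joins attribute name/value pairs into a
    // query-string style value such as "class=class1&id=myid".
    // Assumption: attributes with no value are simply left out.
    static String toTagValue(Map<String, String> attrs) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            String value = e.getValue();
            if (value != null && !value.isEmpty()) {
                joined.add(e.getKey() + "=" + value);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so class comes before id.
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "class1");
        attrs.put("id", "myid");
        System.out.println(toTagValue(attrs)); // prints class=class1&id=myid
    }
}
```

Note the sketch does no escaping, so an id or class containing & or = would need extra care.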
You can extend DivToSdt with any extra functionality/logic you might require, such as locking the content control for editing/deletion.
docx to XHTML (content control to div)
The content control to div functionality has been present for a lot longer.
For that, you use docx4j to generate XHTML output in the usual way, but first you invoke SdtWriter.registerTagHandler.
See the sample DivRoundtrip.java for a fully worked example of divs to content controls, then back to divs again.
The tag handler concept is to treat the content of the w:tag like an HTTP query string (key value pairs).
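As a rough illustration of the query string idea, the following plain Java sketch splits a w:tag value into key value pairs. It does not use docx4j, and it is not how SdtWriter itself parses the tag; the names TagQuerySketch and parse are hypothetical, and treating a key without "=" as having an empty value is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TagQuerySketch {

    // Hypothetical helper: splits "class=class1&id=myid" into
    // {class=class1, id=myid}, preserving the order of the keys.
    static Map<String, String> parse(String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        if (tag == null || tag.isEmpty()) {
            return result;
        }
        for (String pair : tag.split("&")) {
            if (pair.isEmpty()) {
                continue; // ignore stray "&&"
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                result.put(pair, ""); // assumption: bare key means empty value
            } else {
                result.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> keys = parse("class=class1&id=myid");
        System.out.println(keys.get("id"));          // prints myid
        System.out.println(keys.containsKey("class")); // prints true
    }
}
```

With the tag in this form, asking "does this content control have an id key?" is just a map lookup, which is the question a tag handler registered for a specific key depends on.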
A tag handler is registered for a specific key (eg ‘id’, ‘class’) or the wildcards (‘*’, ‘**’), and will only execute if the key is found in the w:tag.
For this example, we want our tag handler to insert a div depending on both class and id keys, so we register it as ‘*’ (we don’t want 2 handlers, which might result in 2 divs).
If you need a tag handler which is always applied, register it with a double asterisk ‘**’. See the SdtWriter source code for definitive behaviour.