docx4j contains classes which represent key parts of a WordprocessingML docx.
For example, we have a paragraph class to represent the p element; another to represent a run, etc.
Each class knows how to unmarshal its docx XML representation, and marshal it again.
It creates specialised objects for the things it knows how to handle (for example, the paragraph content collection contains run objects). Where we have no strongly typed object for a piece of XML, the class simply stores that XML node, so that it can be round-tripped.
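To make the hand-coded approach concrete, here is a minimal sketch of the idea using only the DOM API that ships with the JDK. It is not docx4j's actual code: the class name ParagraphSketch, its nested Run class, and the roundTrip helper are all hypothetical names invented for this illustration. Known w:r children become typed objects, and everything else is kept as a raw node and written back untouched.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not a docx4j class.
class ParagraphSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Typed stand-in for w:r. It models text only, so run properties
    // would be lost here; a fuller class would keep them as stored nodes too.
    static final class Run {
        String text;
        Run(String text) { this.text = text; }
    }

    // Paragraph content in document order: a Run, or an opaque DOM Node.
    final List<Object> content = new ArrayList<>();

    // Build typed objects for what we recognise, store the rest as-is.
    static ParagraphSketch unmarshal(Element p) {
        ParagraphSketch para = new ParagraphSketch();
        NodeList children = p.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean isRun = child.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(child.getNamespaceURI())
                && "r".equals(child.getLocalName());
            if (isRun) {
                para.content.add(new Run(child.getTextContent()));
            } else {
                para.content.add(child); // unknown: keep the node itself
            }
        }
        return para;
    }

    // Write typed objects back as XML and copy stored nodes verbatim.
    Element marshal(Document doc) {
        Element p = doc.createElementNS(W_NS, "w:p");
        for (Object item : content) {
            if (item instanceof Run) {
                Element r = doc.createElementNS(W_NS, "w:r");
                Element t = doc.createElementNS(W_NS, "w:t");
                t.setTextContent(((Run) item).text);
                r.appendChild(t);
                p.appendChild(r);
            } else {
                p.appendChild(doc.importNode((Node) item, true));
            }
        }
        return p;
    }

    // Parse a w:p, set every run's text to newText, and serialise again.
    static String roundTrip(String xml, String newText) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document in = db.parse(new InputSource(new StringReader(xml)));
        ParagraphSketch para = unmarshal(in.getDocumentElement());
        for (Object item : para.content) {
            if (item instanceof Run) {
                ((Run) item).text = newText;
            }
        }
        Document out = db.newDocument();
        out.appendChild(para.marshal(out));
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        tf.transform(new DOMSource(out), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:bookmarkStart w:id=\"0\" w:name=\"b\"/>"
            + "<w:r><w:t>old</w:t></w:r></w:p>";
        System.out.println(roundTrip(xml, "new"));
    }
}
```

In this sketch the bookmark element is never interpreted, yet it survives the edit because the stored node is copied back on marshal. Writing and maintaining this kind of code for every element is the work we hoped a binding framework could take over.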
Instead of coding these classes one by one by hand, we wanted to see whether one of the Java-XML data binding frameworks could make our lives easier.
Given there is a standard for doing this (JSR 222 – The Java Architecture for XML Binding), we tried the JAXB reference implementation (JAXB 2.1.5).
The JAXB web presence leaves a lot to be desired. I’ll write a post on that shortly.
Having said that, I’m quite impressed with the spec and the reference implementation.
You feed your schema into xjc, and it generates Java classes.
The @XmlAnyElement annotation allows unknown elements to be round-tripped, mimicking our existing code.
Why would there be any unknown elements, you ask?
The answer is that we are using a subset of the wml.xsd schema from TC45. So there can be a lot of stuff in a docx document which falls outside the subset.
There are a number of reasons we are using a subset:
- running the entire schema through xjc produces lots of errors, first in the parsing phase and, once you overcome those, in the compilation stage
- more importantly, we’re unlikely to ever implement the entire WordML spec. So it makes sense to work with the subset of key features which are on our roadmap.
- you have to add annotations to the schema to ensure the resulting Java classes use names which make sense (this is called customizing the binding).
Anyway, this approach seems to work well. That is:
- the JAXB version can read a Word document, edit it, and save it again, and Word 2007 can consume the result. See sample.java
- the resulting classes can be made quite intuitive (though there is more tweaking to do)
- unknown elements can be round-tripped
The JAXB version of docx4j is in subversion at the following branch:
http://dev.plutext.org/trac/docx4j/browser/branches/jaxb
You can’t just check out the branch and use it right now, since the classes need to be generated first. There are maybe 50 generated classes, but I have only committed 3 of them.
Where to from here?
If this approach continues to look promising, we are likely to move the JAXB code into the trunk, and upgrade plutext-server and docx4all to use it.