Jan 11 2008

docx4j license change

A note for the record that we’ve changed the docx4j license from the GPL v3 to the Affero General Public License v3.   All users of which we are aware are happy with this change.

The logic for the change is the same as the logic for licensing plutext-server under the Affero GPL: to ensure that people who use docx4j in a SaaS environment are treated the same as people who distribute docx4j to end users.

Licensing docx4j under an Apache-style license also has its attractions – let us know if this would make a difference to you.

Jan 06 2008

OOXML, boolean values and binding

ST_OnOff is used extensively in the XML Schema. Here is the openiso.org link (nice resource!).

Basically, it is used for things which should use the built in boolean schema data type:

This simple type specifies a set of values for any binary (on or off) property defined in a WordprocessingML document.

For example, the b (bold) element has an attribute @val of type ST_OnOff.

There are several problems with how this is done.

The first is that its possible values are “on, 1, or true”. OOXML should just use the XSD boolean data type, which doesn’t allow “on” (or “off”). For related comments, see here, here, and here. Denmark and France seem to be the strongest advocates of the use of xsd:boolean, and I hope they get their way.

The second is that it is left to the specification text to say that if the attribute is omitted, its value is implied to be true. That should be expressed as part of the schema.

For CT_OnOff, it would be:

<xsd:complexType name="BooleanDefaultTrue">
  <xsd:attribute name="val" type="xsd:boolean" default="true" />
</xsd:complexType>

I don’t think Denmark or anyone else made this second point.
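
To make the first two problems concrete, here is a minimal Java sketch of what a consumer has to hand-code while the schema uses ST_OnOff: accept the extra lexical forms, and supply the implied default when the attribute is missing. The class name OnOff and its parse method are hypothetical (they are not part of docx4j or JAXB), and the false-side values (off, 0, false) are simply the counterparts of the three values listed above.

```java
/** Hypothetical helper: interprets an ST_OnOff-style attribute value. */
final class OnOff {
    private OnOff() {}

    /**
     * Returns the boolean meaning of the attribute's lexical value.
     * A null argument stands for "attribute omitted", which the
     * specification text says implies true.
     */
    static boolean parse(String val) {
        if (val == null) {
            return true; // e.g. <w:b/> with no val attribute
        }
        switch (val.trim()) {
            case "true":
            case "1":
            case "on":   // not a legal xsd:boolean value
                return true;
            case "false":
            case "0":
            case "off":  // not a legal xsd:boolean value
                return false;
            default:
                throw new IllegalArgumentException("Not an on/off value: " + val);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(null)); // true
        System.out.println(parse("on")); // true
        System.out.println(parse("0"));  // false
    }
}
```

With xsd:boolean and default="true" in the schema, a binding framework could take care of both the parsing and the default, and none of this would need to be written by hand.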

The schema we are using in docx4j to generate classes uses these sorts of definitions instead of ST_OnOff or CT_OnOff.  For CT_OnOff, this results in a BooleanDefaultTrue type, which is used in fields like (for bold):

protected BooleanDefaultTrue b;

Which brings me to the third problem with ST_OnOff (and the schema in general), which is that it generates ugly code in JAXB and other binding frameworks (presumably .NET included). The built-in schema data types produce much nicer code.

As a general remark, running the schema through JAXB is a good way to find places where the schema can be improved. Schema design goals should include:

  1. that it can be processed out of the box by binding frameworks (since that makes it easier for people to pick up a schema and start using it). [This is not currently the case]
  2. that the schema be expressed in such a way as to generate the simplest code.

Dec 22 2007

docx4j trunk now uses JAXB

10 days ago, we created a proof of concept for using JAXB on a subset of wml.xsd (one of the OpenXML schema files).

We’ve declared that a success, and moved it from a branch into the trunk of docx4j. Here be the generated classes.

plutext-server has now been migrated to use it.

And Jo is working with it as he codes docx4all.

So we’re pretty committed at this point!

We’re tidying up bits of the object model as we go (ie editing our xsd to generate Java that we like). So far, paragraphs (p, pPr, r, rPr, t) and structured document tags (sdt, sdtPr, sdtContent) have had our attention.

We’re also making a few changes to the generated classes, so we need to think about how best to prevent those changes from getting lost when the classes are re-generated. There’s a bit of support in XJC for this, and diff may come in handy, but I’d love to hear best practices.

What we have now is an object model for key pieces of the Main Document part (document.xml), in package name org.docx4.jaxb.document. Next cab off the rank is the Styles part, which we’ll put in org.docx4.jaxb.styles.

Dec 17 2007

“View Page Source” from within Word 2007

When developing software which uses WordprocessingML, you often need to look at the XML.

Wouter’s Package Explorer is a great way to do this, particularly if you want to look at an existing file.

Wouldn’t it be great (well, at least a little bit useful), if you could look at the WordML for a document from within Word? Then you could quickly see the WordML produced when you do something in Word (format some text, create a table, add a comment etc).

ActiveDocument.WordOpenXML provides the OpenXML corresponding to the document. plutext-client-word2007 uses this extensively in C#.

Anyway, we can also use it in VB from within Word to open that in an Internet Explorer window, syntax highlighted and with collapsible sections (similar to IE’s default stylesheet for XML documents).

The result:

word2007-viewpagesource.png

The very straightforward code to do this can be cut and pasted from here (use the “download in other formats” links at the bottom of the page). In Word, choose Developer > Visual Basic to open Word 2007’s Visual Basic IDE. You can then just paste the code into a new module. Create or open a document, then run the VB. That’s all there is to it.

I specifically chose to do it using VB and not VSTO, so you don’t need Visual Studio installed to get this running.

Also I cobbled this code together quickly, and I know it can be improved. If you’d like me to incorporate your improvements, please feel free to send them in!

Dec 12 2007

Running a community – lessons from jaxb.dev.java.net

As described in my last post, we’re experimenting with using JAXB to unmarshal/marshal docx documents.

The specification is thorough, and the reference implementation (v2.1.5) seems to work well.

Unfortunately, the same can’t be said of jaxb.dev.java.net.

Given that one of my hats is to develop a community around the plutext projects, I’m trying to be aware of what helps or hinders this process.

So in the spirit of constructive criticism (I’d really like to see momentum grow around JAXB-RI), here are some observations:

  1. there are at least two places to go to for discussion (the mailing lists, and the Metro and JAXB forum). Where should you post? Which is going to get the better response? Why two options? In this case, the forum seems more active.
  2. it’s much harder than it needs to be to get the source code. There is no anonymous CVS (or SVN) access. You need to be registered, and to have applied for the Observer role. Then the instructions omit the cvs login step. Eventually it worked, but in the meantime, it took a bit of digging to find a link to the zipped up sources. There are outdated blog entries to disregard along the way.
  3. once you do have the source code, and given that JDK 1.6 introduced JAXB 2.0 in rt.jar, there should be prominent instructions for using 2.1 in Eclipse (ie use JDK 1.5)
  4. I couldn’t find JAXB 2.1.5 in Maven repositories. Again, outdated blog entries.
  5. the website is pretty slow

Now, none of these problems will stop the determined user. But I’m sure their cumulative effect is to make many others give up.

For those like me who try to get a quick sense of how active a project is by looking at the volume of traffic on the mailing list or forum before making any further commitment, problem #1 above amounts to bad marketing if nothing else.

This is a pity, because as I said, JAXB 2.1.5 is good stuff.

Dec 12 2007

Docx4j branch: Using JAXB to unmarshal OOXML to Java

docx4j contains classes which represent key parts of a WordprocessingML docx.

For example, we have a paragraph class to represent the p element; another to represent a run, etc.

Each class knows how to unmarshal its docx XML representation, and marshal it again.

It will create specialised objects for things it knows how to handle (for example the paragraph content collection contains run objects). For XML we don’t have strongly typed objects for, the class will simply store that XML node, so that it can be round tripped.
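
The pattern just described can be sketched in plain Java with the JDK’s DOM API. This is an illustration only, not docx4j’s actual code: the class names Run, Paragraph and RoundTripDemo are hypothetical, and for the purposes of the sketch the paragraph understands only the r element, so everything else (here pPr) is treated as unknown and stored as a raw node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class RoundTripDemo {
    /** Parses one w:p element, unmarshals it, and marshals it into a fresh document. */
    static Element roundTrip(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document in = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Paragraph p = Paragraph.unmarshal(in.getDocumentElement());
            Document out = factory.newDocumentBuilder().newDocument();
            return p.marshal(out);
        } catch (Exception e) {
            throw new IllegalStateException("Could not round trip paragraph", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + Paragraph.W_NS + "\">"
                + "<w:pPr><w:jc w:val=\"center\"/></w:pPr>"
                + "<w:r><w:t>Hello</w:t></w:r></w:p>";
        Element p = roundTrip(xml);
        System.out.println(p.getFirstChild().getLocalName());  // pPr
        System.out.println(p.getLastChild().getTextContent()); // Hello
    }
}

/** Hypothetical typed object for a run (the r element); keeps only its text. */
final class Run {
    final String text;
    Run(String text) { this.text = text; }
}

/** Hypothetical paragraph: Run objects where we understand the XML, raw nodes otherwise. */
final class Paragraph {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Children in document order: each entry is a Run or an untouched DOM Element. */
    final List<Object> content = new ArrayList<>();

    static Paragraph unmarshal(Element p) {
        Paragraph result = new Paragraph();
        for (Node n = p.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments between elements
            }
            Element e = (Element) n;
            if (W_NS.equals(e.getNamespaceURI()) && "r".equals(e.getLocalName())) {
                // Simplification: a real run class would also keep rPr and so on.
                result.content.add(new Run(e.getTextContent()));
            } else {
                result.content.add(e); // not understood: store the node as-is
            }
        }
        return result;
    }

    Element marshal(Document doc) {
        Element p = doc.createElementNS(W_NS, "w:p");
        for (Object child : content) {
            if (child instanceof Run) {
                Element r = doc.createElementNS(W_NS, "w:r");
                Element t = doc.createElementNS(W_NS, "w:t");
                t.setTextContent(((Run) child).text);
                r.appendChild(t);
                p.appendChild(r);
            } else {
                // Copy the stored node into the output document unchanged.
                p.appendChild(doc.importNode((Element) child, true));
            }
        }
        return p;
    }
}
```

Keeping a single ordered list for both typed objects and raw nodes means the unknown markup goes back out in the position it came in, which matters for WordprocessingML, where element order is significant.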

Instead of coding these classes one by one by hand, we wanted to see whether one of the Java-XML data binding frameworks could make our lives easier.

Given there is a standard for doing this (JSR 222 – The Java Architecture for XML Binding), we tried the JAXB reference implementation (JAXB 2.1.5).

The JAXB web presence leaves a lot to be desired. I’ll write a post on that shortly.

Having said that, I’m quite impressed with the spec and the reference implementation.

You feed your schema into xjc, and it generates Java classes.

The @XmlAnyElement annotation allows unknown elements to be round tripped, mimicking our existing code.

Why would there be any unknown elements you ask?

The answer is that we are using a subset of the wml.xsd schema from TC45. So there can be a lot of stuff in a docx document which falls outside the subset.

There are a number of reasons we are using a subset:

  1. running the entire schema through XJC produces lots of errors, both in the parsing phase, and once you overcome those, in the compiling stage
  2. more importantly, we’re unlikely to ever implement the entire WordML spec. So it makes sense to work with the subset of key features which are on our roadmap.
  3. you have to add annotations to the schema to ensure the resulting Java classes use names which make sense (this is called customizing the binding).

Anyway, this approach seems to work well. That is:

  • the JAXB version can read a Word document, edit it, and save it again, and Word 2007 can consume the result. See sample.java
  • the resulting classes can be made quite intuitive (though there is more tweaking to do)
  • unknown elements can be round-tripped

The JAXB version of docx4j is in subversion at the following branch:

http://dev.plutext.org/trac/docx4j/browser/branches/jaxb

You can’t just check out the branch and use it right now, since classes need to be generated. There are maybe 50 generated, but I have only committed 3 of them.

Where to from here?

If this approach continues to look promising, we are likely to move the JAXB code into the trunk, and upgrade plutext-server and docx4all to use it.

Dec 06 2007

docx4all now in subversion

I’m excited to say that today Jojada uploaded his work to date on docx4all to subversion.

Docx4all is our open source word processor which uses OOXML WordprocessingML as its native document format. Like our other projects, we’re releasing it under a GPL (in this case v3).

We intend it to run wherever Swing runs, and both from the desktop and within a web browser.

Docx4all is a thoroughly modern Swing application, in that it’s built on JavaFX Script and the Swing App Framework.

Here is a screenshot of a simple document rendered in it, running on Ubuntu:

docx4all v0.1 screenshot

It’s very early days yet. As you can see from the screenshot, docx4all can render simple paragraph content. But you can’t actually edit yet. That will change before Christmas.

The philosophy we’re taking is that if docx4all encounters any WordML markup which it doesn’t understand, it should preserve it (i.e. round trip it).

You can see in the screenshot that sectPr currently falls into that category. As I said, it’s very early days!

But we wanted to get docx4all out there, so that anyone who’d like to work on it is able to get started.

The wiki contains instructions for building docx4all. Let us know how you go in the forums.

Dec 05 2007

Why are we doing this, anyway?

The plutext solution enables many users to work on the one Word document at the same time.

Why would you want to do that?

The way we put it in the Wiki:

  1. Get documents finished ahead of deadline. Sales proposals, contracts, reports. Our focus is real time simultaneous collaboration – two or more people working on the document at the same time.
  2. Plutext allows you to continue to use Microsoft Word as your editing environment. You know how to use Word (at least until you installed Office 2007 anyway..).
  3. So you can format the document using Microsoft Word. If you did your collaboration in Google Docs, chances are you’ll have to bring it back into Word to make it pretty. Our collision handling is nicer too.
  4. Work offline. It’s Word, after all.
  5. Word’s docx is our native document format. So there is 100% fidelity. No numbering going haywire.
  6. Complete version history / audit trail.
  7. Don’t have Word? Coming soon … Use docx4all, our WYSIWYG docx editor – on a Mac, on Linux etc.
  8. Oh, and it’s open source. All GPL 3 (Affero GPL 3 in the case of the server side bits). Use our server (developers only for now), or build your own.

Dec 05 2007

wiki content – getting started with the Plutext Word 2007 add-in

A quick post to flag that there is some good content in the wiki now to help developers get started with the Word 2007 client:

Nov 29 2007

plutext-client-word2007 source code released (GPL 3)

I’ve uploaded the plutext-client-word2007 source code to subversion.  Let me say up front that right now, this is for developers, not end users.  Here are instructions for setting up your Visual Studio 2005 environment.

This is the add-in for Word 2007 which lets you collaborate with other people on a docx document. I’ll post some screenshots to the website tomorrow.

At the moment, you need a plutext-server (BYO or use our development server). It also helps to have some collaborators, though you can open the document twice in Word and collaborate with yourself. All collaborators currently have to be using Word 2007 with the add-in – a Word 2003 add-in is currently under development.