Oct 29 2011

Hello Maven Central

As of version 2.7.1, docx4j (a library for manipulating Word docx, PowerPoint pptx, and Excel xlsx XML files in Java) and all its dependencies are available from Maven Central.

This makes it really easy to get going with docx4j.  With Eclipse and m2eclipse installed, you just add docx4j, and you’re done.  No need to mess around with manually installing jars, setting class paths etc.

This post demonstrates that, starting with a fresh OS (Win 7 is used, but these steps would work equally well on OSX or Linux).

Step 1 – Install the JDK

For the purposes of this article, I used JDK 7, but docx4j works with Java 6 and 1.5.

Step 2 – Install Eclipse Indigo (3.7.1)

I normally download the version for J2EE developers. Unzip it and run eclipse.

Step 3 – Install m2eclipse.

In Eclipse, click Help > Install New Software.

Type “http://download.eclipse.org/technology/m2e/releases” in the “Work with” field as shown:

then follow the prompts.

Step 4 – Create your Maven project

In Eclipse, File > New > Project.., then choose Maven project

You should see:

Check “Create a simple project (skip archetype selection)” then press next.

Allocate group and artifact id (what you choose as your artifact id will become the name of your new project in Eclipse):

Press finish

This will create a project with directories using Maven conventions:

(Note: If your starting point is a new or existing Java project in Eclipse, you can right click on the project, then choose Configure > Convert to Maven project)

Step 5 – Add docx4j to your POM

Double Click on pom.xml

Next click on the dependencies tab, then click the “add dependency” button, and enter the docx4j coordinates as shown in the image below:

The result is this pom:


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>mygroup</groupId>
  <artifactId>myartifact</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
  	<dependency>
  		<groupId>org.docx4j</groupId>
  		<artifactId>docx4j</artifactId>
  		<version>2.7.1</version>
  	</dependency>
  </dependencies>
</project>

Ctrl-S to save it.

m2eclipse may take some time to download the dependencies.

When it has finished, you should be able to see them:

Step 6 – Create HelloMavenCentral.java

If you made a Maven project as per step 4 above, you should already have src/main/java on your build path.

If not, create the folder and add it.

Now add a new class:

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class HelloMavenCentral {

	public static void main(String[] args) throws Exception {
		
		WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
		
		wordMLPackage.getMainDocumentPart()
			.addStyledParagraphOfText("Title", "Hello Maven Central");

		wordMLPackage.getMainDocumentPart().addParagraphOfText("from docx4j!");
				
		// Now save it 
		wordMLPackage.save(new java.io.File(System.getProperty("user.dir") + "/helloMavenCentral.docx") );
		
	}	
}

Step 7 – Click Run

When you click run, all being well, a new docx called helloMavenCentral.docx will be saved.

You can open it in Word (or anything else which can read docx), or unzip it to inspect its contents.
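
Since a docx is an ordinary zip archive, the "unzip and inspect" step can also be done from Java. The following is a minimal sketch using only the JDK's java.util.zip classes. The class name ListDocxParts is made up for this illustration (it is not part of docx4j), and the default path assumes the file was saved to the working directory as in Step 6.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of docx4j): lists the parts inside a docx,
// relying only on the fact that a docx is a zip archive.
public class ListDocxParts {

    // Returns the name of every entry in the given zip/docx file.
    public static List<String> listParts(File docx) throws IOException {
        List<String> names = new ArrayList<String>();
        ZipFile zip = new ZipFile(docx);
        try {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        } finally {
            zip.close();
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: with no argument, look for the file written in Step 6.
        String path = args.length > 0
                ? args[0]
                : System.getProperty("user.dir") + "/helloMavenCentral.docx";
        for (String name : listParts(new File(path))) {
            System.out.println(name);
        }
    }
}
```

Run against the generated file, this should print entries such as [Content_Types].xml and word/document.xml.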

Step 8 – Adding docx4j.properties

One final thing. If you plan on creating documents from scratch using docx4j, it is useful to set paper size etc. via docx4j.properties. Put something like the following on your classpath:

# Page size: use a value from org.docx4j.model.structure.PageSizePaper enum
# eg A4, LETTER
docx4j.PageSize=LETTER
# Page margins: use a value from org.docx4j.model.structure.MarginsWellKnown enum
docx4j.PageMargins=NORMAL
docx4j.PageOrientationLandscape=false

# Slide size: use a value from org.pptx4j.model.SlideSizesWellKnown enum
# eg A4, LETTER
pptx4j.PageSize=LETTER
pptx4j.PageOrientationLandscape=false

# These will be injected into docProps/app.xml
# if docx4j.App.write=true
docx4j.App.write=true
docx4j.Application=docx4j
docx4j.AppVersion=2.7.1
# of the form XX.YYYY where X and Y represent numerical values

# These will be injected into docProps/core.xml
docx4j.dc.write=true
docx4j.dc.creator.value=docx4j
docx4j.dc.lastModifiedBy.value=docx4j

#
#docx4j.McPreprocessor=true

# If you haven't configured log4j yourself
# docx4j will autoconfigure it.  Set this to true to disable that
docx4j.Log4j.Configurator.disabled=false
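
If the settings don't seem to take effect, a quick check is to confirm the file is actually visible to your code. The sketch below assumes docx4j.properties is looked up as a classpath resource (in a Maven project, src/main/resources is the conventional place for such files). CheckDocx4jProperties is a hypothetical helper written for this post, not a docx4j class, and it only reads the file with java.util.Properties; it says nothing about how docx4j itself interprets the values.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Properties;

// Hypothetical helper (not part of docx4j): confirms that docx4j.properties
// can be found on the classpath and prints a couple of the values in it.
public class CheckDocx4jProperties {

    // Parses properties text; split out so it can be exercised without a file.
    public static Properties parse(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = CheckDocx4jProperties.class.getClassLoader()
                .getResourceAsStream("docx4j.properties");
        if (is == null) {
            System.out.println("docx4j.properties not found on the classpath");
            return;
        }
        try {
            // .properties files are conventionally ISO-8859-1 encoded
            Properties props = parse(new InputStreamReader(is, "ISO-8859-1"));
            System.out.println("docx4j.PageSize = "
                    + props.getProperty("docx4j.PageSize"));
            System.out.println("docx4j.PageMargins = "
                    + props.getProperty("docx4j.PageMargins"));
        } finally {
            is.close();
        }
    }
}
```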

And that’s it. For more information on docx4j, see our Getting Started document.


Aug 13 2011

OpenDoPE Word Add-In source code released

The source code for the OpenDoPE Word Add-In developer edition is available at last at http://opendope.codeplex.com/

(A binary download has been available for 10 months or so now)

OpenDoPE stands for Open Document Processing Ecosystem; it’s a standards-based approach to document automation / document assembly.

Fundamentally, it is a set of conventions for doing document assembly using Open XML (the ISO-standard Microsoft Word docx file format), specifically, its content control databinding architecture.

Its real attraction is that it enables users to do document production without getting locked in to some vendor’s proprietary file format: in adopting OpenDoPE, you aren’t making any commitment above and beyond continued use of the docx file format, and a conventional approach to use of its content controls.

For further details, see the OpenDoPE website.

docx4j can combine an XML data file with an OpenDoPE docx template for you; the point of the OpenDoPE Word Add-In is to help your authors with the initial step of creating OpenDoPE docx templates.

The Word Add-In is relatively straightforward; it uses VSTO (Visual Studio Tools for Office).  You’ll need Visual Studio (2010) and basic C# skills to modify it.

The point of releasing the source code is to make it easy for developers to contribute back fixes and enhancements (which has worked really well for docx4j), or extend the Add-In to create their own specialised authoring tool.  The source code is in Mercurial, which – because of its distributed nature – should facilitate the latter especially.

The source code for the OpenDoPE Word Add-In (developer edition) is dual licensed, the primary license being GPL v2.

The Add-In is made possible because of the availability of the SharpDevelop “Avalon” and XML editor components.  Thanks guys!

Aug 06 2011

docx4j has a new home

For reasons best known (or only known) to Google, dev.plutext.org has never been on the first page of results when you search for “docx java”, despite all the relevant posts in our forums over more than 3 years.

I can only think Google doesn’t at all like a hostname other than “www”.

So I’ve moved everything to www.docx4java.org

This shouldn’t impact you (other than having to find this new site, and update any bookmarks) unless you are using svn and have docx4j checked out.

If you have the docx4j repository checked out, you’ll want to do something like:

 svn switch --relocate http://dev.plutext.org/svn/docx4j/trunk/docx4j http://www.docx4java.org/svn/docx4j/trunk/docx4j

If you are on Windows and using TortoiseSVN, use Tortoise’s “relocate” command (not its “switch” command).

That should make your SVN checkout work again.

There may be various broken or outdated links on the website.  I guess I’ll fix these over time.

If you encounter any other issues, then please post to http://www.docx4java.org/forums/announces/docx4j-has-a-new-home-t815.html

Jul 08 2011

docx4j 2.7.0 released

I’m pleased to announce the release today of docx4j 2.7.0.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx.  It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET.  It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

  • Improvements to Maven build
  • ContentAccessor interface
  • AlteredParts: identify parts in this pkg which are new or altered; Patcher
    which adds new or altered parts.
  • Support for .glox SmartArt package (/src/glox/)
  • JAXB RI 2.2.3 compatibility
  • OpenDoPE support improvements

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Checkout the source from SVN (use the pom.xml file to satisfy the dependencies eg with m2eclipse, or download them from one of the links above)

Maven: Please see forum for details (since XML doesn’t paste nicely here right now).

Dependency changes

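Antlr is now required for OpenDoPE processing; this gives us better XPath processing.  The required jars are antlr-runtime-3.3.jar and stringtemplate-3.2.1.jar (the two dependencies introduced with the release candidate).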

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

A request to docx4j users

If you are happily using docx4j, it would be great if you could reply to this post with some words of recommendation for others who might be wondering whether docx4j is a good choice. I know there are thousands of you out there :-)

Some users have been kind enough to make such statements already; these may be found on the trac homepage.

Of course, there are a number of other ways you can contribute back.  Please consider doing so, especially if you think you might find yourself looking for support from volunteers in the docx4j forums.

Jun 28 2011

Feedback on docx4j 2.7.0 release candidate?

docx4j 2.7.0 release candidate is now available at http://dev.plutext.org/docx4j/docx4j-2.7.0-rc1.jar

This will form the basis of the 2.7.0 release. In fact, unless there are significant issues over the next week or so, this will become the 2.7.0 release! So please try it out and report back, positive or negative…

It is mainly a maintenance release, but things of note include:

* Improvements to Maven build

* ContentAccessor interface

* AlteredParts: identify parts in this pkg which are new or altered; Patcher
which adds new or altered parts.

* Support for .glox SmartArt package (/src/glox/)

* JAXB RI 2.2.3 compatibility

For contributors to this release and a more complete list of changes, please see http://dev.plutext.org/svn/docx4j/trunk … README.txt

There are 2 new dependencies (required for OpenDoPE processing): antlr-runtime-3.3.jar and stringtemplate-3.2.1.jar. For convenience, copies of these can be found in the same dir as the rc jar.

Thanks very much to everyone who contributed to this release (candidate!).


Nov 20 2010

Microsoft’s data binding patent

I just stumbled across
United States Patent 7730394, Data binding in a word-processing application

It’s Microsoft’s patent on data bound content controls.

It’s a useful description of how it works.

I’m not sure it’s worthy of a patent though.  They reference a lot of prior art, but not my March 2004 paper “XForms for Contract Semantics”, which contains the following binding example:

<p>
In consideration of the payment of <xforms:output ref="lineitems/item/price"/>, <xforms:output ref="supplier"/> agrees to deliver
a <xforms:output ref="lineitems/item/name"/> to <xforms:output ref="customer"/> on or before <xforms:output ref="deliverydate"/>.
</p>

Interestingly to me, Wolters Kluwer referenced my paper in their “Document creation system” patent, but that’s a side note.

I’m a big fan of data-bound content controls.

So much so, in fact, that I’d like to see the same stuff included in ODF and implemented in OpenOffice .. umm .. maybe I mean LibreOffice these days!

That would obviously be more likely if Microsoft didn’t lodge patents for stuff like this.  Who can blame them, you might say, with things like i4i happening to them?  Well, my response is that they should be using their considerable corporate muscle to lobby for patent reform.  In the absence of such efforts, you can only conclude that the innovation-inhibiting patent system suits Microsoft, even though they take the odd hundred million dollar hit from it.

Nov 19 2010

docx4j v2.6.0 released

I published docx4j 2.6.0 yesterday.

For details, see the forum. This post introduces TraversalUtil, which makes it easier for you to find and change the bits of a docx you want to manipulate.

If you are working with an existing docx, you often need to get a particular bit of the document, and change it somehow.

If you know you want to change the 6th paragraph, say, that’s easy.

But what if you want to find all occurrences of some item, which could occur at various levels of the hierarchy (for example, paragraphs can appear not just in the document body, but also within table cells, and in content controls)?

docx4j offers a couple of different tools to make this easy.

XPath

XPath is a succinct way to select the things you need to change.

Happily, from docx4j 2.5.0, you can use XPath to select JAXB nodes:

MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

String xpath = "//w:p";

List<Object> list = documentPart.getJAXBNodesViaXPath(xpath, false);

These JAXB nodes are live, in the sense that if you change them, your document changes.

There is a limitation however: the XPath expressions are evaluated against the XML document as it was when first opened in docx4j.  You can update the associated XML document once only, by passing true into getJAXBNodesViaXPath. Updating it again (with current JAXB 2.1.x or 2.2.x) will cause an error.

To work around this bug in JAXB, you can marshal it, and then unmarshal the result using either:

    public org.docx4j.wml.Document unmarshal( java.io.InputStream is ) 

    public org.docx4j.wml.Document unmarshal(org.w3c.dom.Element el) 

Both of those will re-create the binder.

This is not the most efficient approach, so consider voting for JAXB bug 459.

But now we have an alternative…

TraversalUtil

New to docx4j 2.6.0 is a class TraversalUtil, which is a general approach for traversing the JAXB object tree in the main document part (though it can also be applied to headers, footers etc).

For example, to get a list of hyperlinks, you can do something like:

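import java.util.ArrayList;
import java.util.List;

import org.docx4j.TraversalUtil;
import org.docx4j.TraversalUtil.CallbackImpl;
import org.docx4j.wml.P;

// Walk the object tree beneath "paragraphs", handing each object to the finder
PHyperlinkFinder finder = new PHyperlinkFinder();
new TraversalUtil(paragraphs, finder);
// finder.links now holds every hyperlink that was encountered

static class PHyperlinkFinder extends CallbackImpl {

    List<P.Hyperlink> links = new ArrayList<P.Hyperlink>();

    @Override
    public List<Object> apply(Object o) {

        // Keep the hyperlinks; ignore everything else
        if (o instanceof P.Hyperlink)
            links.add((P.Hyperlink) o);

        return null;
    }
}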

This approach is used extensively in the MergeDocx extension I discussed in my previous post. It is now also the basis of the OpenMainDocumentAndTraverse sample, so see that for another example of how to use it.

The example above simply finds relevant bits of the docx; you could also modify the objects encountered if you want.
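
To make the find-and-modify idea concrete without needing docx4j on the classpath, here is a self-contained sketch of the same callback pattern. Node, Callback, walk and sampleDocument are hypothetical stand-ins invented for this illustration; they are not docx4j classes, and this is not how TraversalUtil is implemented. The sketch only shows the shape of the technique: one depth-first walk, with a callback that picks out the objects it cares about wherever they sit in the hierarchy and changes them in place.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained illustration of callback-based traversal.
// Node, Callback and walk are hypothetical stand-ins, NOT docx4j classes.
public class TraversalSketch {

    // A toy document node: a kind ("p", "tbl", "hyperlink", ...), text, children.
    public static class Node {
        public final String kind;
        public String text;
        public final List<Node> children = new ArrayList<Node>();

        public Node(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        public Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Invoked once for every node encountered during the walk.
    public interface Callback {
        void apply(Node node);
    }

    // Depth-first walk: visit the node, then each of its children in order.
    public static void walk(Node node, Callback callback) {
        callback.apply(node);
        for (Node child : node.children) {
            walk(child, callback);
        }
    }

    // Builds a small tree with hyperlinks at two different depths:
    // one directly in a body paragraph, one in a paragraph inside a table cell.
    public static Node sampleDocument() {
        Node body = new Node("body", "");
        body.add(new Node("p", "").add(new Node("hyperlink", "docx4j")));
        Node cellParagraph = new Node("p", "").add(new Node("hyperlink", "forum"));
        body.add(new Node("tbl", "").add(new Node("tc", "").add(cellParagraph)));
        return body;
    }

    public static void main(String[] args) {
        final List<Node> links = new ArrayList<Node>();
        walk(sampleDocument(), new Callback() {
            public void apply(Node node) {
                if ("hyperlink".equals(node.kind)) {
                    links.add(node);                      // find it...
                    node.text = node.text.toUpperCase();  // ...and modify it in place
                }
            }
        });
        for (Node link : links) {
            System.out.println(link.text);
        }
    }
}
```

The point of the design is that the caller never needs to know where in the tree a hyperlink can occur; the walk reaches every level, and the callback decides what to keep or change.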

Nov 14 2010

Merging Word documents

I’ve written a utility to merge docx documents in Java.  “Merge” as in concatenate/join/append, as opposed to diff/merge (although docx4j does include code to do a diff, if you are looking for that instead).

With the utility, you can take 2 or more Word documents, and join them into one.

Edit Feb 2014. MergeDocx is now part of Plutext’s Docx4j Enterprise Edition.

As Eric White’s blog explained:

This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs. For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid. You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.

With this utility, merging/concatenating documents is as easy as invoking the method:

public WordprocessingMLPackage merge(List<WordprocessingMLPackage> wmlPkgs)

In other words, you pass a list of docx, and get a single new docx back.

Edit March 2014. You can try the MergeDocx and/or MergePptx functionality via the demo webapp.

This utility takes care of the niggly edge cases for you:

You can also use my MergeDocx utility to process a docx which is embedded as an altChunk.

Without this utility, you had to rely on Word to convert the altChunk to normal content.

That meant you had to round trip your docx through Word, before docx4j could create a PDF or HTML out of it.

Now you don’t.

To process the w:altChunk elements in a docx, you invoke:

public WordprocessingMLPackage process(WordprocessingMLPackage srcPackage)

You pass in a docx containing altChunks, and get a new docx back which doesn’t.

But wait a minute .. if you can merge Word documents using this tool, why would you ever put an altChunk (containing a docx, as opposed to HTML) into the docx in the first place?

Ordinarily you wouldn’t, you’d just merge with this tool instead.  But there are at least 2 possibilities:

  • some upstream process put the altChunk there, and now you want to process it in docx4j
  • OpenDoPE.  The Open Document Processing Ecosystem convention is being extended in a v2.3 to allow other documents to be injected, and a natural thing is to convert an injection instruction to an altChunk.  Edit Feb 2014: docx4j 3.0.1 can also bind an XML element containing a base64 encoded docx, inserting it into the docx as an AltChunk.  MergeDocx can then convert that content into “real” docx content, suitable for including in a table of contents, or generating HTML or PDF.  The binding is two-way, so user edits in Word can be injected back into the XML (eg for persisting to a database).

There is one place my code differs significantly from how Word processes an altChunk, and that is in section handling.  When Word processes an altChunk, it seems to largely remove sectPr.  So for example, columns will disappear.  But it also might merge headers, so the resulting header contains stuff from the headers of both documents!  My code doesn’t do that: by default, it includes each section, and headers go with sections.

Mar 18 2010

makeofficebetter.com shut down

In the months since August 2009, interested users submitted ideas to makeofficebetter.com

The Microsoft employees who ran that site have shut it down.  Not stopped accepting new submissions, but shut it down entirely.

As a result, all the community submitted data is lost to the community. Or has been taken from us, since it is no longer shared.

Mar 10 2010

Why did Google acquire Docverse?

Office 2010 Tech Guarantee (TG) will defer $300M-$350M of revenue from Q3: people who would otherwise buy in Q3 will instead wait until the TG is available.

People have been asking me why Google bought Docverse.

Surely, Google already has the collaboration smarts?  After all, it was Google Docs which made document collaboration mainstream.  And it is Google Wave which is arguably now taking it to the next level.  Google also employs Neil Fraser, and it recently bought Etherpad.

So what does Docverse give them?  And why pay so much?

It’s not about getting the people – Docverse is a small team – although additional engineers with domain knowledge are surely nice to have.

What this is about is taking away the reasons for upgrading to Office 2010, and more particularly, Sharepoint 2010.  Any business which takes Sharepoint 2010 is making a commitment to Microsoft technology for the next decade or so, which effectively shuts Google enterprise products out, and might even lead these customers to use IIS etc for their consumer web sites (which would also be bad for Google).

So Google is doing what it can to give businesses reason to stop and think.

In the 6 months ended 31 December 2009, Microsoft’s Business Products Division had revenue of $9.149 billion, and operating income of $5.867 billion.  Office is responsible for around 90% of that.

What would it be worth to Google, if it could put a 5% dent in those figures? 5% of $9 billion is $450 million. 0.5% is $45 million.

Put one way, if the people responsible for just 0.5% of Office purchase decisions look at Google + Docverse and say “hey, we can stick with the version of Office we’ve got; we don’t need to buy Office 2010 and Sharepoint to do real time collaboration”, then the Docverse acquisition has made sense for Google.

But really, it’s about the larger ecosystems, not just the Office purchase.  An Office purchase is a commitment to Windows on the client, and possibly Windows on the server.  And it has network effects along the supply chain (people you exchange documents with).  So preventing an Office purchase frees up a lot of other spend.

Now, Google needs to prove that with Google you get:

  • the ability to keep using your existing Microsoft Office (Docverse’s contribution)
  • real-time collaboration (without Office 2010 or Sharepoint 2010)
  • web-based editing if/when you need it

Docverse gives Google slick looking Add-Ins for Word, Powerpoint and Excel.

Time is of the essence.  Office 2010 will be launched for businesses on May 12, and available online/retail in June.

The add-ins are worth a few months’ head start.  (So maybe it is about the people after all?)

Now Google needs to integrate Docverse into Google Apps.  Rip/replace of the existing Docverse back-end (and probably much of their Word Add-In, since it sends the whole document every time you save, not just the diffs – something Plutext has had right since the beginning) will take a while.  However, the rip/replace isn’t necessary for a rudimentary integration into Google Docs.  What is critical is to make Docverse’s server-side differencing work on Google scale and interoperate with the Google Docs webapp.  Same for presentations and slides.

It’ll be interesting to see how quickly this can be done.