Thanks for emailing me your docm.
I don't think there is anything to worry about. In summary, 3 things explain the file size differences:
1. differences in zip implementation (Microsoft versus Java)
2. namespaces
3. mc:AlternateContent handling
I will explain these in turn, but first, my methodology. I used docx4j to open and save your document, so I had INPUT.docm and OUTPUT.docm to compare. I renamed them to .zip, and opened each in a zip tool (t-zip). It was then easy to see the size of each file.
I encourage you to do the same. You may find some things that my quick analysis missed.
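If you would rather script the comparison than click through a zip tool, here is a minimal sketch using only java.util.zip. A docm is already a zip archive, so no renaming is needed. The class name ZipSizeReport and the "word/" prefix are my own choices for illustration, not part of docx4j.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of docx4j): sums actual and packed sizes per folder.
class ZipSizeReport {

    /** Returns {actual bytes, packed bytes} summed over entries whose name starts with prefix. */
    static long[] totals(File pkg, String prefix) throws IOException {
        long size = 0, packed = 0;
        try (ZipFile zip = new ZipFile(pkg)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                // Entry names inside the archive use forward slashes, e.g. word/document.xml
                if (e.isDirectory() || !e.getName().startsWith(prefix)) continue;
                size += e.getSize();             // uncompressed size
                packed += e.getCompressedSize(); // size as stored in the archive
            }
        }
        return new long[] { size, packed };
    }

    public static void main(String[] args) throws IOException {
        // Pass the files to compare, e.g. INPUT.docm OUTPUT.docm
        for (String name : args) {
            long[] t = totals(new File(name), "word/");
            System.out.printf("%s word/**: actual %d bytes, packed %d bytes%n", name, t[0], t[1]);
        }
    }
}
```

Running it against INPUT.docm and OUTPUT.docm should give the same kind of actual/packed figures as the zip tool shows.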
1. differences in zip implementation (Microsoft versus Java)
For IN word\**, actual size is ~609K, packed size ~390K
For OUT word\**, actual size is ~623K, packed size ~365K
So you can see that the OUT produced by docx4j is actually bigger uncompressed (~623K versus ~609K), although it packs smaller (~365K versus ~390K) thanks to the Java zip implementation. It is probably bigger because of namespaces, see 2 below.
2. namespaces
docx4j (JAXB) always writes all namespaces in the relevant JAXB context, which makes the file a bit bigger, for example:
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:ns6="http://schemas.openxmlformats.org/schemaLibrary/2006/main" xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:ns8="http://schemas.openxmlformats.org/drawingml/2006/chartDrawing" xmlns:dgm="http://schemas.openxmlformats.org/drawingml/2006/diagram" xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture" xmlns:ns11="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:dsp="http://schemas.microsoft.com/office/drawing/2008/diagram" xmlns:ns13="urn:schemas-microsoft-com:office:excel" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:ns17="urn:schemas-microsoft-com:office:powerpoint" xmlns:odx="http://opendope.org/xpaths" xmlns:odc="http://opendope.org/conditions" xmlns:odq="http://opendope.org/questions" xmlns:odi="http://opendope.org/components" xmlns:odgm="http://opendope.org/SmartArt/DataHierarchy" xmlns:ns24="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns:ns25="http://schemas.openxmlformats.org/drawingml/2006/compatibility" xmlns:ns26="http://schemas.openxmlformats.org/drawingml/2006/lockedCanvas">
as compared with:
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
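To see how much each part is affected, you could count the namespace declarations on its root element. The following is a rough sketch using the StAX API that ships with Java; the class and method names are hypothetical, and it only looks at the root start tag.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (not part of docx4j): counts xmlns declarations on the root element.
class RootNamespaces {

    static int countDeclarations(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                // The first START_ELEMENT event is the root element
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    // Number of namespaces declared on this element only
                    return r.getNamespaceCount();
                }
            }
            return 0;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Calling countDeclarations on the content of word/document.xml from each file gives a quick measure of how many declarations each writer emitted on the root.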
3. mc:AlternateContent handling
Where docx4j encounters Word 2010 Microsoft-specific extensions which aren't part of the ECMA/ISO base standard, it falls back to
In this situation, you'll see something like:
19.05.2012 08:21:01 *INFO * Part: /word/charts/chart1.xml (Part.java, line 150)
19.05.2012 08:21:01 *INFO * JaxbXmlPart: encountered unexpected content; pre-processing (JaxbXmlPart.java, line 243)
19.05.2012 08:21:01 *WARN * XSLTUtils: Found some mc:AlternateContent (XSLTUtils.java, line 16)
19.05.2012 08:21:01 *WARN * XSLTUtils: Selecting c:style (XSLTUtils.java, line 16)
19.05.2012 08:21:01 *DEBUG* RelationshipsPart: Loading part /word/charts/chart1.xml (RelationshipsPart.java, line 376)
The fallback content inside the mc:AlternateContent will be used and the Word 2010 content dropped. In effect, the docx becomes a Word 2007 docx. In the case of the document you provided, this affects the 4 charts (though they still all end up bigger, because of the namespaces).
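If you want to find out in advance which parts contain mc:AlternateContent, a sketch along the following lines should do. It uses only the JDK (java.util.zip and StAX) and matches on the markup-compatibility namespace URI that appears in the Word-written root element above. The class name AlternateContentScan is my own, and this is not how docx4j itself does its pre-processing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (not part of docx4j): reports parts containing mc:AlternateContent.
class AlternateContentScan {

    static final String MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /** Counts mc:AlternateContent elements in one XML part. */
    static int count(InputStream part) {
        try {
            XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(part);
            int n = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && MC_NS.equals(r.getNamespaceURI())
                        && "AlternateContent".equals(r.getLocalName())) {
                    n++;
                }
            }
            return n;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the package to scan, e.g. INPUT.docm
        try (ZipFile zip = new ZipFile(args[0])) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                if (e.isDirectory() || !e.getName().endsWith(".xml")) continue;
                try (InputStream in = zip.getInputStream(e)) {
                    int n = count(in);
                    if (n > 0) System.out.println(e.getName() + ": " + n);
                }
            }
        }
    }
}
```

For the document discussed here I would expect the chart parts to show up in the output, since those are the ones the log complains about.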