Hi, I'm using the MergeDocx extension to merge a collection of Word files into one. This all works great and the final file looks fine, except that I run into memory issues when merging large collections of documents (100 or so). So I decided to take a look at the underlying 'bundle' for the merged file in smaller, successful merges.
When I open the zip file and navigate to the 'word' subdirectory, I see a load of files named thus:
document.xml
endnotes.xml
fontTable.xml
-- so far so good. But then I see:
foot.xml
foote.xml
foote2.xml
footer.xml
footer1.xml
... and on to footer115.xml
This pattern is repeated for all the header files (i.e. 'head', 'heade', 'heade2' etc.) too. Once the file has been opened and saved in MS Word, the underlying xml structure makes more sense. The odd files -- 'foot', 'foote', 'foote2', 'footer' -- all disappear, and the number of footer files is reduced from 115:
footer1.xml
through to
... footer69.xml
In addition, the file's size is reduced considerably once saved in Word (in my example, the merged file is 389 KB, which drops to 237 KB after being opened and saved in Word 2010).
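In case anyone wants to reproduce the comparison, here is a rough sketch of how the parts under 'word' could be tallied for the merged file and for the copy re-saved in Word. It only uses java.util.zip from the JDK, not docx4j or MergeDocx, and the class and method names (DocxPartCounter, countParts) are made up for illustration. It groups part names by their stem with trailing digits stripped, so 'foote2.xml' counts under 'foote' and 'footer115.xml' under 'footer'.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of docx4j or MergeDocx.
public class DocxPartCounter {

    // Counts the .xml parts directly under word/, grouped by file name
    // with the extension and any trailing digits removed.
    static Map<String, Integer> countParts(String docxPath) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (ZipFile zip = new ZipFile(docxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (!name.startsWith("word/") || !name.endsWith(".xml")) {
                    continue;
                }
                String rest = name.substring("word/".length());
                if (rest.contains("/")) {
                    continue; // ignore subfolders such as word/theme/
                }
                String stem = rest.substring(0, rest.length() - ".xml".length())
                                  .replaceAll("\\d+$", "");
                counts.merge(stem, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more .docx paths, e.g. the merged file and the Word-saved copy.
        for (String path : args) {
            System.out.println(path + " -> " + countParts(path));
        }
    }
}
```

Running it against both files should make it easy to see which stems ('foot', 'foote', 'head', 'heade' and so on) exist only in the freshly merged package.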
Does anyone have any insight into this? Has anyone seen this before? I believe these somewhat odd headers and footers also mean that the FOP-based PDF generation in docx4j sucks up reams of memory as it tries to resolve all the resulting rIds. When I tried generating a PDF from the merged file using the Apache xdocreport code base, it couldn't do anything at all and reported an exception about running out of IDs.
I wonder whether I am doing something wrong in my initial document generation (not sure what!), or whether something strange is happening in the merge. For what it's worth, I checked the source documents for a typical merge scenario, and they all have normal footerX.xml and headerX.xml files. The strange ones only seem to appear and proliferate as part of the merge process.