Merge strangeness

by **benpoole** » Mon Sep 30, 2013 5:06 pm

Hi, I'm using the MergeDocx extension to merge a collection of Word files into one. This all works great and the final file looks fine, except that I am experiencing memory issues when merging large collections of documents (100 or so). So I decided to take a look at the underlying 'bundle' (the zip package) for the merged file in smaller, successful merges.

When I open the zip file and navigate to the 'word' subdirectory, I see a load of files named thus:

document.xml
endnotes.xml
fontTable.xml

-- so far so good. But then I see:

foot.xml
foote.xml
foote2.xml
footer.xml
footer1.xml
... and on to footer115.xml

This pattern is repeated for all the header files (i.e. 'head', 'heade', 'heade2' etc.) too. Once the file has been opened and saved in MS Word, the underlying xml structure makes more sense. The odd files -- 'foot', 'foote', 'foote2', 'footer' -- all disappear, and the number of footer files is reduced from 115:

footer1.xml
through to
... footer69.xml

In addition, the file's size is reduced considerably once saved in Word (in my example, the merged file is 389 KB, which reduces to 237 KB when opened and saved in Word 2010).
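For anyone wanting to make the same before/after comparison without unzipping by hand, here is a minimal Python sketch that counts the header and footer parts in a docx package and totals their uncompressed size. It uses only the standard `zipfile` module; the function name `summarize_header_footer_parts` and the file-name pattern are my own assumptions, based on the names listed above, and are not part of docx4j or MergeDocx.

```python
import re
import zipfile
from collections import Counter

# Assumed pattern: matches word/header*.xml and word/footer*.xml, plus the
# truncated names seen in the merged file (foot.xml, foote2.xml, head.xml ...).
# It deliberately does not match word/footnotes.xml.
PART_RE = re.compile(r"^word/(head|foot)(?:er?)?\d*\.xml$")


def summarize_header_footer_parts(docx):
    """Return (Counter of parts per kind, total uncompressed bytes).

    `docx` may be a path or a file-like object, as accepted by zipfile.ZipFile.
    """
    counts = Counter()
    total_bytes = 0
    with zipfile.ZipFile(docx) as zf:
        for info in zf.infolist():
            match = PART_RE.match(info.filename)
            if match:
                counts[match.group(1)] += 1   # key is "head" or "foot"
                total_bytes += info.file_size  # uncompressed size of the part
    return counts, total_bytes
```

Running it on the merged file and again on the copy saved from Word should show the drop in footer parts described above (the paths are whatever your own files are called).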

Does anyone have any insight into this? Has anyone seen this before? I believe these somewhat odd headers and footers also mean that the FOP-based PDF generation in docx4j sucks up reams of memory as it tries to resolve all the rIds that result. When I tested generating a PDF from the merged file using the Apache xdocreport code base, that couldn't do anything at all, reporting an exception relating to running out of IDs.

Am I doing something wrong in my initial document generation I wonder (not sure what!), or is there something strange happening in the merge? For what it's worth, I checked out the source documents for a typical merge scenario, and they all have normal footerX.xml and headerX.xml files -- the strange ones only seem to appear and proliferate as part of the merge process.

by **jason** » Mon Sep 30, 2013 7:49 pm

Hi Ben, if you are merging 100 documents, then you'll generally be creating 100 sections, each of which can have, say, 3 header files and 3 footer files (depending on your header/footer settings). MergeDocx support is generally via email, so let's take this offline.

If you use docx4j to create PDF output, you'll get an fo:page-sequence for each docx section, and conditional-page-master-reference elements, and simple-page-master elements, for each distinct header/footer in each section. See docx4j's docs/headers_footers.docx for more explanation. This may be where your memory usage is coming from; see also http://apache-fop.1065347.n5.nabble.com ... 38355.html

So the solution is probably to look at your documents, and see whether the number of sections can be reduced.
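As a rough way to check how many sections a document actually contains, the following Python sketch counts the `w:sectPr` elements in `word/document.xml`, along with the `w:headerReference` and `w:footerReference` elements attached to them. It relies only on the standard library and the WordprocessingML main namespace; the function names are hypothetical, and it assumes no tracked section changes (a `w:sectPr` nested inside `w:sectPrChange` would be counted as well).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_sections(document_xml):
    """Return (number of w:sectPr elements, number of header/footer references)."""
    root = ET.fromstring(document_xml)
    sections = root.findall(f".//{{{W_NS}}}sectPr")
    refs = 0
    for sect in sections:
        # Each section may reference up to three headers and three footers
        # (default, first page, even pages).
        refs += len(sect.findall(f"{{{W_NS}}}headerReference"))
        refs += len(sect.findall(f"{{{W_NS}}}footerReference"))
    return len(sections), refs


def count_sections_in_docx(path):
    """Convenience wrapper: read word/document.xml out of a docx package."""
    with zipfile.ZipFile(path) as zf:
        return count_sections(zf.read("word/document.xml"))
```

With 100 merged documents that each keep 3 headers and 3 footers, this could report on the order of 100 sections and up to 600 references, which is the multiplication that drives the page-master count in the PDF output.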

cheers .. Jason

by **benpoole** » Tue Oct 01, 2013 6:33 am

Thanks Jason, I'd missed that in the docs, will check it out. I suspect there's some scope for rationalisation of headers and footers, e.g. ignore them in all "subsidiary" documents and just bring them in on merge. I shall have a think!
