Heap space use on load docx

by **guille** » Wed Dec 02, 2020 6:31 am

Hi

We are usind docx4j for replacing variable replacement in docx files. Until now has worked fine in production for several months. But today we hit this error:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
at org.docx4j.openpackaging.io3.stores.ZipPartStore.getBytesFromInputStream(ZipPartStore.java:191)
at org.docx4j.openpackaging.io3.stores.ZipPartStore.<init>(ZipPartStore.java:154)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:559)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:410)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:376)
at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:341)
at org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(WordprocessingMLPackage.java:180)

The file is a 4.0MB word with several images. When loading it the Heap of the JVM sky rockets from 100Mb to more than 300Mb. 200MB heap for loading a 4Mb file?

Getting the heap from my local machine:

Heap Before loading file

Heap after loading

Thats 700Mb of byte[] created when loading a 4Mb file!!

The calling code. We arenusing Spring Resource class. Files are red from a remote server, so they are not local files.:

Code: Select all: private byte[] replaceVars(Resource word, HashMap<String, String> mappings) { if (logger.isDebugEnabled()) { logger.debug("replaceVars(Resource word={}, HashMap<String,String> mappings={}) - start", word.getFilename(), mappings); } ByteArrayOutputStream baos = new ByteArrayOutputStream(); InputStream is = null; try { is = word.getInputStream(); WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is); VariablePrepare.prepare(wordMLPackage); wordMLPackage.getMainDocumentPart().variableReplace(mappings); wordMLPackage.save(baos); return baos.toByteArray(); } catch (Exception e) { logger.error("replaceVars(Resource, HashMap<String,String>)", e); } finally { try { is.close(); baos.close(); } catch (IOException e) { logger.error("replaceVars(Resource, HashMap<String,String>)", e); } } return null; }

There is clearly something odd here. Any hints? I will be debbing the load code to try to figure out what may be happening.

Edit1:
I downloaded docx4j code. This is log of a simple Junit that runs my code with mockito.

Heap size before MB: 238
01-12-2020 18:19:52.684 DEBUG o.g.documents.DocumentServiceImpl.replaceVars - replaceVars(Resource word=null, HashMap<String,String> mappings={fecha=null, secuencial=1}) - start
01-12-2020 18:19:52.684 DEBUG o.g.documents.DocumentServiceImpl.replaceVars - replaceVars(Resource word=null, HashMap<String,String> mappings={fecha=null, secuencial=1}) - start
01-12-2020 18:19:52.700 INFO o.d.o.io3.stores.ZipPartStore.<init> - Start reading zip...
01-12-2020 18:19:52.707 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size before read MB: 238
01-12-2020 18:19:58.811 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size after read MB: 726
Heap total after test MB: 3390

Changing /docx4j-core/src/main/java/org/docx4j/openpackaging/io3/stores/ZipPartStore.java getBytesFromInputStream to this code:

Size can be read from the entry.getSize().

Code: Select all: private byte[] getBytesFromInputStream(InputStream is, int size) throws Exception { if (size == -1) { BufferedInputStream bufIn = new BufferedInputStream(is); ByteArrayOutputStream baos = new ByteArrayOutputStream(); BufferedOutputStream bos = new BufferedOutputStream(baos); int c = bufIn.read(); while (c != -1) { bos.write(c); c = bufIn.read(); } bos.flush(); baos.flush(); //bufIn.close(); //don't do that, since it closes the ZipInputStream after we've read an entry! bos.close(); return baos.toByteArray(); }else { byte[] targetArray = new byte[size]; int read = is.read(targetArray); int offset = read; while (read != -1) { read = is.read(targetArray,offset, size-offset); offset += read; } return targetArray; } }

Get some interesting result, but more testing is needed to ensure is not luck.
01-12-2020 20:14:33.849 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size before read MB: 238
01-12-2020 20:14:33.849 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size before read MB: 238
01-12-2020 20:14:34.608 INFO o.d.o.io3.stores.ZipPartStore.<init> - Heap size after read MB: 398

Will keep looking at this and update here with more info.

Best regards,
Guillermo

by **guille** » Thu Dec 03, 2020 12:10 am

More info:

The change on the ZipPartStore looks to improve performance on loading the file and somewhat heap usage improvement. But heap keeps growing on OpcPackage load.

Adding this on the

Code: Select all: partLoader.getPartByteArra().entrySet().stream().forEach(p->{ log.info("Part {} size {} KB",p.getKey(), p.getValue().getLength() / (1024)); });

I see that each image weights around 33Mb on memory

02-12-2020 10:05:19.915 INFO o.d.o.packages.OpcPackage.lambda$0 - Part word/media/image5.emf size 32603 KB
02-12-2020 10:05:19.916 INFO o.d.o.packages.OpcPackage.lambda$0 - Part word/media/image9.emf size 32761 KB
02-12-2020 10:05:19.916 INFO o.d.o.packages.OpcPackage.lambda$0 - Part word/media/image4.emf size 32682 KB

How can a 4Mb file contain several uncompresed 32Mb images? I am missing something?

by **jason** » Wed Dec 09, 2020 6:21 am

Hi Guillermo,

Could you please share a sample docx which exhibits this behaviour? word/media/image4.emf looks like an EMF containing an image. For the purposes of sharing, you could potentially delete any text in the docx, retaining just the EMF image.

cheers .. Jason

by **guille** » Thu Dec 10, 2020 7:16 am

Hi Jason,

You can download it from: https://coding12.cl/test.docx

by **jason** » Fri Dec 11, 2020 10:51 am

Hi Guillermo,

In your docx, what look like paragraphs of text are actually EMF objects.

I extracted/saved one of them (using unzip, not docx4j).

In line with your observations, this was 32,604KB. When I zipped it up again, it compressed to 267KB.

So what you are seeing is very high compression by zip.

By the way we do have some protection against zip bomb attacks in policePartSize https://github.com/plutext/docx4j/blob/ ... .java#L129

But your content looks legitimate. What is creating these EMF objects? I'd recommend you fix that to emit ordinary docx paragraph content.

cheers .. Jason

by **guille** » Sat Dec 12, 2020 5:51 am

Hi Jason,

The content was created by the client's scanner. Thank you for your support. As it has been only one case I spoke with our client to create lighter docs, as is also unnecessary to have that information on the document it self.

Nevertheless, take a look in my first post to the changes in the ZipPartStore.getBytesFromInputStream function. You can get the entry size from entry.getSize() so you can read the bytes without using the streams and improving performance.

Best regards,
Guillermo

by **jason** » Sat Dec 12, 2020 11:02 am

Thanks, see now https://github.com/plutext/docx4j/commi ... e7cb41770c

Heap space use on load docx

Heap space use on load docx

Re: Heap space use on load docx

Re: Heap space use on load docx

Re: Heap space use on load docx

Re: Heap space use on load docx

Re: Heap space use on load docx

Re: Heap space use on load docx

Who is online