Docx to HTML formatting is lost

by **Prashanth** » Wed Mar 31, 2021 3:26 am

Hello,

I am trying to convert docx to HTML in a web application (PEGA).
I imported the necessary jars based on my search.
Issue 1: Some of the converted HTML loses its formatting.
Issue 2: If I use a byte array instead of a file, the resulting HTML is messed up.

I am not sure where I am going wrong.

Code:

    java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
    java.util.Base64.Encoder encoder = java.util.Base64.getEncoder();

    //Get inputstream from Case Document as a param
    byte[] bs = decoder.decode(Word.getBytes());
    InputStream is = new ByteArrayInputStream(bs);
    WordprocessingMLPackage wordMLPackage = Docx4J.load(is);

    HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
    htmlSettings.setImageDirPath(FilePath);
    htmlSettings.setWmlPackage(wordMLPackage);

    OutputStream os = new ByteArrayOutputStream();
    OutputStream ou = new FileOutputStream(FilePath);
    Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
    Docx4J.toHTML(htmlSettings, ou, Docx4J.FLAG_EXPORT_PREFER_XSL);
    ou.close();

    String result = ou.toString();
    return result;

Generated HTML file: HTML.PNG (26.17 KiB)

HTML as byte-array output: HTML_Stream.PNG (21.6 KiB)

Source content: Source.PNG (10.39 KiB)

by **jason** » Tue Apr 06, 2021 7:59 am

I expect there's something "unusual" about your source docx, so I would need to see it.

Please attach it, or if it is sensitive, anonymise it first using https://github.com/plutext/docx4j/blob/ ... ingle.java
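
Separately from the source document, one detail in the posted code may be worth checking for Issue 2: the returned string comes from `ou.toString()`, and `ou` is the `FileOutputStream`, not the `ByteArrayOutputStream`. The following is a minimal sketch using only the JDK, not a confirmed diagnosis. `StreamToStringDemo` and `writeHtml` are hypothetical names, and `writeHtml` merely stands in for the `Docx4J.toHTML` call; it assumes the converter emits UTF-8 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamToStringDemo {

    // Hypothetical stand-in for the converter: writes UTF-8 encoded HTML to any stream.
    static void writeHtml(OutputStream out) throws IOException {
        out.write("<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8));
    }

    // Collect the output in memory, then decode the bytes with an explicit charset.
    public static String htmlFromMemory() {
        try {
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            writeHtml(os);
            return new String(os.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors the posted code: calling toString() on the FileOutputStream itself.
    public static String htmlFromFileStreamToString() {
        try {
            File tmp = File.createTempFile("demo", ".html");
            tmp.deleteOnExit();
            try (FileOutputStream ou = new FileOutputStream(tmp)) {
                writeHtml(ou);
                // FileOutputStream does not override toString(), so this is
                // the class name plus a hash code, not the HTML that was written.
                return ou.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(htmlFromMemory());
        System.out.println(htmlFromFileStreamToString());
    }
}
```

If this applies, returning the decoded contents of `os` rather than `ou.toString()` would be the thing to try. Decoding with an explicit charset also avoids depending on the platform default, which can garble non-ASCII characters.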
