Numbers not displayed in html

by **sbelt** » Thu Apr 02, 2015 7:04 am

I am using docx4j 3.2.1 to convert .docx files into .html files in real time; in other words, when the web browser requests the file, Tomcat reads the .docx file, uses docx4j to convert it into an HTML string, then streams the content back to the browser. This is running on Linux 3.0.0-26-server. My issue is that although the text looks fine, numbers (phone numbers, zipcodes, addresses, etc.) are displayed as strange, Farsi-looking characters. For example, the .docx file uses Arial to display zipcode 48174, and docx4j converts this to:

<span class="" style="">٤٨١٧٤</span>

Previous posts related to this issue suggested that I might need the mscorefonts, though those posts seemed to describe trouble converting all text - not just numbers. But I installed the msfonts, and I now get:

<span class="" style="font-family: 'Times New Roman';">٤٨١٧٤</span>

So as you can see, a font has been added to the style, but the actual text is still gibberish.

Can anyone suggest what might be wrong, or how I might go about troubleshooting this?

Thanks!
Steve

by **jason** » Thu Apr 02, 2015 8:59 pm

Save the HTML string to a file and have a look at its contents (i.e. instead of looking at it in the browser).

Does it still look weird?
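One way to inspect the string without relying on how a browser or editor renders it is to dump the code points of any non-ASCII digits it contains. The following is only a minimal sketch using the standard Java library; the class and method names (`NonAsciiDigitReport`, `findNonAsciiDigits`) are made up for illustration and are not part of docx4j.

```java
import java.util.ArrayList;
import java.util.List;

class NonAsciiDigitReport {

    /** Lists every non-ASCII decimal digit in the text with its code point, Unicode block and value. */
    static List<String> findNonAsciiDigits(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Character.isDigit is true for decimal digits of any script, not just 0-9
            if (cp > 0x7F && Character.isDigit(cp)) {
                found.add(String.format("U+%04X (%s) = digit %d",
                        cp, Character.UnicodeBlock.of(cp), Character.digit(cp, 10)));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        // The span from the first post, with the digits written as Unicode escapes
        String html = "<span class=\"\" style=\"\">\u0664\u0668\u0661\u0667\u0664</span>";
        findNonAsciiDigits(html).forEach(System.out::println);
    }
}
```

If the report shows characters in the range U+0660 to U+0669 (the Arabic-Indic digits), the numerals were substituted during conversion, and the problem is not a missing font or an encoding issue in the browser.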

by **sbelt** » Fri Apr 03, 2015 1:48 am

Thanks, Jason, for responding.

I hope this achieves the same test you were suggesting: I have attached my Eclipse debugger to the running webapp, and I see that the HTML String created for me by the Docx4J.toHTML() method has the weird characters in it.

I confess I am new to this library. Could it be related to my HTMLSettings or to the Docx4J.FLAG_EXPORT_PREFER_XSL flag?

Steve

by **sbelt** » Fri Apr 03, 2015 3:19 am

I am attaching a simple sample .docx which presents this misbehavior.

Attachment: fragment.docx (sample .docx which obscures the numbers; 13.13 KiB)

by **sbelt** » Fri Apr 03, 2015 4:32 am

I now believe chasing fonts is a red herring: I can reproduce this error on my Windows desktop.

In addition to the source file I provided in my previous post, I am attaching the resulting .html being generated. Below is my code:

Code: Select all

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.net.URL;

    import org.apache.commons.io.FileUtils;
    import org.docx4j.Docx4J;
    import org.docx4j.Docx4jProperties;
    import org.docx4j.convert.out.HTMLSettings;
    import org.docx4j.fonts.BestMatchingMapper;
    import org.docx4j.fonts.Mapper;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

    public class Docx4jTest {

        private static Mapper fontMapper = new BestMatchingMapper();

        public Docx4jTest() throws Exception {
        }

        public String convertDocxToHTML(String path) {
            String html = "";
            try {
                // Load the .docx from the given URL
                URL url = new URL(path);
                WordprocessingMLPackage wordMLPackage = Docx4J.load(url.openStream());

                // Images are written next to the source document, in "<name>_files"
                HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
                htmlSettings.setImageDirPath(path + "_files");
                htmlSettings.setImageTargetUri(path.substring(path.lastIndexOf("/") + 1) + "_files");
                htmlSettings.setWmlPackage(wordMLPackage);

                String userCSS = "html, body, div, span, h1, h2, h3, h4, h5, h6, p, a, img, ol, ul, li, table, caption, tbody, tfoot, thead, tr, th, td "
                        + "{ margin: 0; padding: 0; border: 0;}"
                        + "body {line-height: 1;} ";
                htmlSettings.setUserCSS(userCSS);

                // Convert to HTML in memory
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
                Docx4J.toHTML(htmlSettings, baos, Docx4J.FLAG_EXPORT_PREFER_XSL);
                html = baos.toString();
            } catch (Exception e) {
                System.err.println(e.getMessage());
                html = ""; // reset html - empty string means failure
            }
            return html;
        }

        /**
         * @param args
         */
        public static void main(String[] args) {
            try {
                Docx4jTest docx4j = new Docx4jTest();
                String path = new File("c:\\temp\\fragment.docx").toURI().toString();
                String html = docx4j.convertDocxToHTML(path);
                File file1 = new File("c:\\temp\\test1.html");
                FileUtils.writeStringToFile(file1, html);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

by **jason** » Fri Apr 03, 2015 7:01 pm

docx4j's RunFontSelector class contains a method:

    private String arabicNumbering(String text, BooleanDefaultTrue rtl, BooleanDefaultTrue cs, CTLanguage themeFontLang)

which under certain conditions will convert Western digits (0-9) to Arabic-Indic digits.

It seems it is being overly aggressive.
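To make the effect concrete, here is a minimal sketch of that kind of digit substitution. This is not docx4j's implementation of arabicNumbering (which also takes the rtl, cs and themeFontLang settings into account); the class and method names are hypothetical, and the sketch assumes a plain one-to-one mapping of ASCII digits onto U+0660 to U+0669.

```java
class ArabicIndicDigits {

    /** Replaces ASCII digits 0-9 with the Arabic-Indic digits U+0660..U+0669; other characters are kept. */
    static String toArabicIndic(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            // '0' maps to U+0660, '1' to U+0661, and so on up to '9' -> U+0669
            sb.append(c >= '0' && c <= '9' ? (char) (0x0660 + (c - '0')) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Applied to the zipcode from the first post, this yields the string seen in the generated span
        System.out.println(toArabicIndic("48174"));
    }
}
```

Applying such a mapping unconditionally to left-to-right English text gives exactly the symptom reported above, which is why the substitution needs to be guarded by a check that the text is actually bidirectional.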

Code: Select all

    @@ -396,7 +396,9 @@
         return nullRPr(document, text);
     }
    -text = this.arabicNumbering(text, rPr.getRtl(), rPr.getCs(), themeFontLang);
    +if (pPr!=null && pPr.getBidi()!=null && pPr.getBidi().isVal() ) {
    +    text = this.arabicNumbering(text, rPr.getRtl(), rPr.getCs(), themeFontLang);
    +}

seems to fix it, but I'm not sure yet that this is the correct fix.

by **sbelt** » Sat Apr 04, 2015 2:57 am

Thanks, Jason. Last night I noticed that my failing .docx differs from others in that the /word/settings.xml part of the .docx file uses '<w:themeFontLang w:val="en-US" w:eastAsia="zh-TW" w:bidi="ar-SA"/>' instead of '<w:themeFontLang w:val="en-US"/>'. The fact that your patch involves the 'themeFontLang' variable confirms to me that we are on the same track.
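For anyone who wants to check whether their own documents carry this setting, the following sketch reads word/settings.xml straight out of the .docx zip and reports the w:bidi attribute of w:themeFontLang. It uses only the standard Java library rather than docx4j; the class and method names (`ThemeFontLangCheck`, `bidiLang`, `bidiLangOfDocx`) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ThemeFontLangCheck {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the w:bidi value of the first w:themeFontLang element, or null if there is none. */
    static String bidiLang(InputStream settingsXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed to look elements up by namespace
            Document doc = dbf.newDocumentBuilder().parse(settingsXml);
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "themeFontLang");
            if (nodes.getLength() == 0) {
                return null;
            }
            Element e = (Element) nodes.item(0);
            return e.hasAttributeNS(W_NS, "bidi") ? e.getAttributeNS(W_NS, "bidi") : null;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse settings.xml", ex);
        }
    }

    /** Opens a .docx as a zip file and inspects its word/settings.xml part. */
    static String bidiLangOfDocx(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/settings.xml");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return bidiLang(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            System.out.println(bidiLangOfDocx(args[0]));
        } else {
            // No file given: run against the element quoted in this post
            String xml = "<w:settings xmlns:w=\"" + W_NS + "\">"
                    + "<w:themeFontLang w:val=\"en-US\" w:eastAsia=\"zh-TW\" w:bidi=\"ar-SA\"/>"
                    + "</w:settings>";
            System.out.println(bidiLang(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
        }
    }
}
```

A non-null result such as ar-SA means the document declares a bidirectional theme font language, which is the condition that distinguished the failing file from the working ones here.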

So, I have downloaded the docx4j source, applied your patch, and rebuilt the .jar. I can confirm that this fixes the (mis)behavior I was describing. You say that you are "not sure yet that this is the correct fix." Do you think I should deploy your current fix, or are you still checking for a better, more correct solution?

Thanks for all your help!

Steve
