Content not allowed in prolog

by **avelin** » Wed May 13, 2020 7:26 pm

Hello,

I have the following docx file (content-not-allowed-in-prolog.docx), whose first character is a byte order mark (BOM), i.e. a zero-width no-break space (U+FEFF).

That's why docx4j, which uses the SAXParser, cannot parse it.

If a docx file starts with a special character, such as a BOM, the following code cannot detect the file type correctly, and execution lands in the last else branch (see "Assuming Flat OPC XML").
Usually, such docx files are encoded as UTF-8 with BOM rather than plain UTF-8.

Please advise how to fix this.

Greetings,
Angelina

Code:

```java
private static org.docx4j.openpackaging.packages.OpcPackage load(PackageIdentifier pkgIdentifier, InputStream inputStream, String password) throws Docx4JException {
    BufferedInputStream bis = new BufferedInputStream(inputStream);
    bis.mark(0);
    byte[] firstTwobytes = new byte[2];
    boolean var5 = false;

    int read;
    try {
        read = bis.read(firstTwobytes);
        bis.reset();
    } catch (IOException var7) {
        throw new Docx4JException("Error reading from the stream", var7);
    }

    if (read != 2) {
        throw new Docx4JException("Error reading from the stream (no bytes available)");
    } else if (firstTwobytes[0] == 80 && firstTwobytes[1] == 75) {
        return load(pkgIdentifier, bis, Filetype.ZippedPackage, (String)null);
    } else if (firstTwobytes[0] == -48 && firstTwobytes[1] == -49) {
        log.info("Detected compound file");
        return load(pkgIdentifier, bis, Filetype.Compound, password);
    } else {
        log.info("Assuming Flat OPC XML");
        return load(pkgIdentifier, bis, Filetype.FlatOPC, (String)null);
    }
}
```

by **jason** » Thu May 14, 2020 10:20 am

Clearly we could address this case, but in the meantime, I'm curious: how was the docx created, and what is the source of these files?

A file created with a Microsoft text editor will start with a byte order mark (BOM): http://msdn.microsoft.com/en-us/library ... 01(v=vs.85).aspx

by **avelin** » Thu May 14, 2020 7:56 pm

Hi Jason,

I don't know how the Word files were created. I assume someone had a pretty old MS Office installation on an old Windows machine, and there we go. My task is to parse and analyse a large number of files.
Right now, the only way I can bypass this "Content not allowed in prolog" issue is to open each docx file in MS Word and re-save it, so that it is encoded correctly and the leading BOM character is removed.

It would be great if you had a suggestion or a fix for this issue.

Thank you again for your input and help.

Angelina
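
One possible workaround, sketched here under the assumption that the only problem is a UTF-8 BOM (bytes EF BB BF) sitting in front of an otherwise valid package, is to strip those bytes from the stream before handing it to docx4j. The class and method names below (`BomStripper`, `skipUtf8Bom`) are hypothetical and not part of docx4j. The sketch uses only the JDK:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of docx4j: drops a leading UTF-8 BOM if present.
final class BomStripper {

    private static final int BOM_LENGTH = 3; // UTF-8 BOM is EF BB BF

    private BomStripper() {
    }

    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, BOM_LENGTH);
        byte[] head = new byte[BOM_LENGTH];

        // A single read() may return fewer bytes than requested, so loop
        // until the buffer is full or the stream ends.
        int total = 0;
        while (total < BOM_LENGTH) {
            int n = pushback.read(head, total, BOM_LENGTH - total);
            if (n < 0) {
                break;
            }
            total += n;
        }

        boolean hasBom = total == BOM_LENGTH
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // No BOM: push the bytes back so the caller sees the stream unchanged.
        if (!hasBom && total > 0) {
            pushback.unread(head, 0, total);
        }
        return pushback;
    }
}
```

The returned stream could then be passed to whichever docx4j load method is already in use (for example `WordprocessingMLPackage.load(InputStream)`), so that the two-byte check quoted above sees `PK` first. Whether this is enough for the files in question is untested here; if the BOM was added by a tool that also altered other bytes, the package may still fail to load.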
