I have a docx file (content-not-allowed-in-prolog.docx) whose first character is a byte order mark (BOM), i.e. the zero-width no-break space U+FEFF.
That is why docx4j, which uses the SAXParser, cannot parse it.
If a docx file starts with a special character such as a BOM, the code below cannot detect its type correctly, and execution falls through to the final else branch (see "Assuming Flat OPC XML").
Usually, such docx files are encoded as UTF-8 with BOM rather than plain UTF-8.
Please advise how to fix this.
Greetings,
Angelina
private static org.docx4j.openpackaging.packages.OpcPackage load(PackageIdentifier pkgIdentifier, InputStream inputStream, String password) throws Docx4JException {
    BufferedInputStream bis = new BufferedInputStream(inputStream);
    bis.mark(0);
    byte[] firstTwobytes = new byte[2];
    boolean var5 = false;
    int read;
    try {
        read = bis.read(firstTwobytes);
        bis.reset();
    } catch (IOException var7) {
        throw new Docx4JException("Error reading from the stream", var7);
    }
    if (read != 2) {
        throw new Docx4JException("Error reading from the stream (no bytes available)");
    } else if (firstTwobytes[0] == 80 && firstTwobytes[1] == 75) {
        return load(pkgIdentifier, bis, Filetype.ZippedPackage, (String)null);
    } else if (firstTwobytes[0] == -48 && firstTwobytes[1] == -49) {
        log.info("Detected compound file");
        return load(pkgIdentifier, bis, Filetype.Compound, password);
    } else {
        log.info("Assuming Flat OPC XML");
        return load(pkgIdentifier, bis, Filetype.FlatOPC, (String)null);
    }
}
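
One workaround I could imagine is stripping the BOM from the stream before it reaches the two-byte check above. The sketch below is only an illustration, not part of docx4j: the class name BomSkipper and the method skipUtf8Bom are hypothetical, and it assumes the only problem with the file is a leading UTF-8 BOM (bytes EF BB BF).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not a docx4j class): drops a leading UTF-8 BOM, if any.
public class BomSkipper {

    /**
     * Returns a stream positioned after a leading UTF-8 BOM (EF BB BF).
     * If no BOM is present, the bytes already read are pushed back,
     * so the caller sees the original content unchanged.
     */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        int total = 0;
        while (total < 3) {
            int n = pushback.read(head, total, 3 - total);
            if (n < 0) {
                break;
            }
            total += n;
        }

        boolean hasBom = total == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // No BOM: restore whatever was read so nothing is lost.
        if (!hasBom && total > 0) {
            pushback.unread(head, 0, total);
        }
        return pushback;
    }
}
```

The idea would be to wrap the input first, for example passing BomSkipper.skipUtf8Bom(inputStream) to the load call instead of the raw stream, so that the first two bytes seen by the check are the real start of the content. Whether the rest of the file then loads cleanly is something I have not verified.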