First of all, thank you for creating this incredibly useful library.
I'm trying to convert XHTML to DOCX in a way that preserves the heading styles of my DOCX template. It's almost working: I do get a valid Microsoft Word document with my HTML adequately converted.
My only problem is that, while the headings (h1, h2, h3) are correctly mapped to my template document's heading styles, they don't look like the original ones. They all have an additional "+ Times New Roman, black" specification attached to their style definitions.
Looking at the XML output, I noticed that some paragraphs have an additional "style" definition that looks like this:
<w:rPr>
  <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
  <w:b/>
  <w:color w:val="000000"/>
</w:rPr>
I have tried to filter those items with a custom XSLT, like so:
// Compile the stylesheet, then run it over the main document part,
// capturing the transformed XML in a StringWriter.
TransformerFactory factory = TransformerFactory.newInstance();
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
InputStream docxTransformer = DocxExportController.class.getResourceAsStream("/noFormatting.xslt");
Templates templates = factory.newTemplates(new StreamSource(docxTransformer));
wordMLPackage.getMainDocumentPart().transform(templates, null, result);
using the following XSLT:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:WX="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    xmlns:ext="http://www.xmllab.net/wordml2html/ext"
    xmlns:java="http://xml.apache.org/xalan/java"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    version="1.0"
    exclude-result-prefixes="java msxsl ext o v WX aml w10">
  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no" indent="yes" />
  <!-- doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="w:rPr"></xsl:template>
</xsl:stylesheet>
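To show what this stylesheet does in isolation, here is a minimal sketch that runs the same two templates (identity copy plus an empty match on w:rPr) over a tiny hand-written paragraph, using only the JDK's JAXP classes. The class name StripRunProperties, the strip method, and the sample XML are made up for illustration and are not part of docx4j or of my project.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
public class StripRunProperties {

    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Identity transform plus an empty template that drops every w:rPr element.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='w:rPr'/>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the transformed XML.
    static String strip(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Made-up sample: a heading paragraph with direct run formatting.
        String sample =
            "<w:p xmlns:w='" + W_NS + "'>"
            + "<w:pPr><w:pStyle w:val='Heading1'/></w:pPr>"
            + "<w:r><w:rPr><w:b/><w:color w:val='000000'/></w:rPr>"
            + "<w:t>Title</w:t></w:r></w:p>";
        System.out.println(strip(sample));
    }
}
```

Note that the empty template removes every w:rPr it meets, so any intentional bold or italic in body text would be stripped along with the unwanted heading overrides.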
The output of that transformation does have the w:rPr nodes stripped out. But I can't find a way to "re-serialize" that output into a valid DOCX document. Here's my latest attempt:
WordprocessingMLPackage pkg = new WordprocessingMLPackage();
NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
pkg.addTargetPart( ndp );
// Object unmarshalled = XmlUtils.unmarshalString(sw.toString());
org.docx4j.convert.in.FlatOpcXmlImporter xmlPackage =
    new org.docx4j.convert.in.FlatOpcXmlImporter(new StringBufferInputStream(sw.toString()));
WordprocessingMLPackage transformedWordMLPackage = (WordprocessingMLPackage) xmlPackage.get();
This will trigger an exception:
unexpected element (uri:"http://schemas.openxmlformats.org/wordprocessingml/2006/main", local:"document"). Expected elements are <{http://schemas.microsoft.com/office/2006/xmlPackage}package>,<{http://schemas.microsoft.com/office/2006/xmlPackage}xmlData>
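Judging from the message alone, the importer seems to expect a Flat OPC root element (pkg:package), whereas my transformed output starts at w:document. A minimal sketch for checking the root element of an XML string with plain JDK classes follows; the class RootElementCheck and its methods are hypothetical names, not docx4j API, and the sample strings are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical diagnostic helper, for illustration only.
public class RootElementCheck {

    // Namespace named in the exception message as the expected root namespace.
    static final String PKG_NS =
        "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Returns the root element as "{namespace}localName".
    static String rootName(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for getNamespaceURI/getLocalName
        Element root = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
    }

    // True if the XML string has a pkg:package root element.
    static boolean hasPackageRoot(String xml) throws Exception {
        return ("{" + PKG_NS + "}package").equals(rootName(xml));
    }

    public static void main(String[] args) throws Exception {
        String mainPartOnly =
            "<w:document xmlns:w='http://schemas.openxmlformats.org/"
            + "wordprocessingml/2006/main'><w:body/></w:document>";
        String flatOpc = "<pkg:package xmlns:pkg='" + PKG_NS + "'/>";
        System.out.println(rootName(mainPartOnly) + " -> " + hasPackageRoot(mainPartOnly));
        System.out.println(rootName(flatOpc) + " -> " + hasPackageRoot(flatOpc));
    }
}
```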
I'm not exactly sure what is going on here, or whether I'm taking the right approach. Can something be configured at the API level to disable those pesky style attributes? Is running an XSLT on a DOCX a valid method? If so, are there documented examples of valid XSLT stylesheets for postprocessing? What does the above error mean?
Thank you in advance for your help,
Best Regards,
Candide