I am using docx4j to parse Word documents that contain a number of form fields, each with a unique field name. These documents are questionnaires, and I want to extract the entered values into a database using an automatic import tool. The field names act as unique keys so that each value can be matched to the corresponding database field.
Generally speaking, I have been able to parse these fields (i.e. extract the field name and its value, if there is one) using the following Java code:
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBElement;
import org.docx4j.Docx4J;
import org.docx4j.TraversalUtil;
import org.docx4j.XmlUtils;
import org.docx4j.finders.ClassFinder;
import org.docx4j.model.fields.ComplexFieldLocator;
import org.docx4j.model.fields.FieldRef;
import org.docx4j.model.fields.FieldsPreprocessor;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.*;

WordprocessingMLPackage wordMLPackage = Docx4J.load(word_file);
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

// collected {field name, field value} pairs
List<String[]> lines = new ArrayList<>();

// find fields
ComplexFieldLocator fl = new ComplexFieldLocator();
new TraversalUtil(documentPart.getContent(), fl);

// canonicalise and set up fieldRefs
List<FieldRef> fieldRefs = new ArrayList<>();
for (P p : fl.getStarts()) {
    FieldsPreprocessor.canonicalise(p, fieldRefs);
}

for (FieldRef fr : fieldRefs) {
    // l[0] = field name, l[1] = field value (initialised with the empty string)
    String[] l = new String[2];
    l[1] = "";

    // get "begin" area of field -- it contains the fldChar definition and checkbox values
    R beg = fr.getBeginRun();
    ClassFinder fldchar_finder = new ClassFinder(FldChar.class);
    new TraversalUtil(beg.getContent(), fldchar_finder);
    for (Object fld_o : fldchar_finder.results) {
        FldChar fld = (FldChar) fld_o;
        if (fld.getFldCharType() != STFldCharType.BEGIN) {
            continue;
        }
        ClassFinder ctff_finder = new ClassFinder(CTFFData.class);
        new TraversalUtil(fld, ctff_finder);
        for (Object ctff_obj : ctff_finder.results) {
            if (!(ctff_obj instanceof CTFFData)) {
                continue;
            }
            CTFFData c_d = (CTFFData) ctff_obj;
            for (JAXBElement<?> j_el : c_d.getNameOrEnabledOrCalcOnExit()) {
                if (j_el.getValue() instanceof CTFFName) {
                    l[0] = ((CTFFName) j_el.getValue()).getVal();
                }
                if (j_el.getValue() instanceof CTFFCheckBox) {
                    // prefer the explicit checked state, fall back to the default
                    CTFFCheckBox c = (CTFFCheckBox) j_el.getValue();
                    if (c.getChecked() != null) {
                        l[1] = c.getChecked().isVal() ? "true" : "false";
                    } else if (c.getDefault() != null) {
                        l[1] = c.getDefault().isVal() ? "true" : "false";
                    }
                }
            }
        }
    }

    // get "result" area of field -- it contains the value in case of text fields
    R res = fr.getResultsSlot();
    for (Object o : res.getContent()) {
        // unwrap instead of casting: not every child of a run is a JAXBElement
        Object val = XmlUtils.unwrap(o);
        if (val instanceof Text) {
            // append all values to l[1] as they may be distributed over several elements
            l[1] += ((Text) val).getValue();
        }
    }

    // remove all unnecessary whitespace
    l[1] = l[1].trim();
    System.out.println(l[0] + ": " + l[1]);
    lines.add(l);
}
The above code always extracts the field names. For the field values, however, it only works as long as the result area of the field contains no paragraph element (<w:p>), i.e. it fails when the user has added a paragraph break while filling out the form field. In that case the fr.getResultsSlot() call returns an empty list and no value is extracted.
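To make clearer what behaviour I am after, here is a minimal sketch of the extraction logic I would expect, written against a simplified, flattened token stream rather than against docx4j itself. The Token class, the Kind enum and the extract method are hypothetical names I made up for illustration; they are not part of the docx4j API. The sketch assumes that a field is delimited by begin / separate / end markers, that the result text sits between separate and end, and that this result may span several paragraphs. Nested fields are not handled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldScanSketch {

    // Hypothetical event kinds: the three fldChar types plus text and paragraph boundaries.
    enum Kind { BEGIN, SEPARATE, END, TEXT, PARAGRAPH_BREAK }

    // Hypothetical flattened token. For BEGIN the value is the field name,
    // for TEXT it is the run text; it is unused for the other kinds.
    static final class Token {
        final Kind kind;
        final String value;

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    // Walks the tokens in document order and collects, per field name,
    // all text found between SEPARATE and END, even across paragraph breaks.
    static Map<String, String> extract(List<Token> tokens) {
        Map<String, String> result = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = null;
        boolean inResult = false;

        for (Token t : tokens) {
            switch (t.kind) {
                case BEGIN:
                    name = t.value;
                    value = new StringBuilder();
                    inResult = false;
                    break;
                case SEPARATE:
                    inResult = (name != null);
                    break;
                case TEXT:
                    // text before SEPARATE is the field instruction, so it is skipped
                    if (inResult) {
                        value.append(t.value);
                    }
                    break;
                case PARAGRAPH_BREAK:
                    // a paragraph boundary inside the result becomes a newline
                    if (inResult) {
                        value.append('\n');
                    }
                    break;
                case END:
                    if (name != null) {
                        result.put(name, value.toString().trim());
                    }
                    name = null;
                    inResult = false;
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token(Kind.BEGIN, "Text1"),
            new Token(Kind.TEXT, "FORMTEXT"),
            new Token(Kind.SEPARATE, null),
            new Token(Kind.TEXT, "line one"),
            new Token(Kind.PARAGRAPH_BREAK, null),
            new Token(Kind.TEXT, "line two"),
            new Token(Kind.END, null));
        System.out.println(extract(tokens));
    }
}
```

In other words, I would like the value of a text field to be collected in document order regardless of how many paragraphs the result area covers, instead of getting nothing back as soon as a paragraph break is present.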
My question is: is this a bug or a feature? And what is the correct/recommended way of parsing this kind of form field so that the value is always extracted?
Thanks in advance!
Best regards
Christian