Parsing of form fields in Word document

by **chrisspiel** » Thu Feb 12, 2015 12:52 am

Dear all,

I am using docx4j to parse Word documents that contain a number of form fields, each with a unique field name. These documents are questionnaires, and I want to extract the entered values into a database using an automatic import tool. The field names act as unique keys so that each value can be matched to the corresponding database field.

Generally speaking, I have been able to parse these fields (i.e. extract the field name and its value, if there is one) using the following Java code:

    WordprocessingMLPackage wordMLPackage = Docx4J.load(word_file);
    MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

    // find fields
    ComplexFieldLocator fl = new ComplexFieldLocator();
    new TraversalUtil(documentPart.getContent(), fl);
    {
        // canonicalise and setup fieldRefs
        List<FieldRef> fieldRefs = new ArrayList<>();
        for (P p : fl.getStarts()) {
            FieldsPreprocessor.canonicalise(p, fieldRefs);
        }

        for (FieldRef fr : fieldRefs) {
            // l[0] = field name, l[1] = field value
            String[] l = new String[2];
            // initialize with empty string
            l[1] = "";

            // get "begin" area of field -- it contains fldchar definition and checkbox values
            R beg = fr.getBeginRun();
            ClassFinder fldchar_finder = new ClassFinder(FldChar.class);
            new TraversalUtil(beg.getContent(), fldchar_finder);
            for (Object fld_o : fldchar_finder.results) {
                FldChar fld = (FldChar) fld_o;
                if (fld.getFldCharType() == STFldCharType.BEGIN) {
                    ClassFinder ctff_finder = new ClassFinder(CTFFData.class);
                    new TraversalUtil(fld, ctff_finder);
                    for (Object ctff_obj : ctff_finder.results) {
                        if (ctff_obj instanceof CTFFData) {
                            CTFFData c_d = (CTFFData) ctff_obj;
                            List<JAXBElement<?>> el_list = c_d.getNameOrEnabledOrCalcOnExit();
                            for (JAXBElement<?> j_el : el_list) {
                                if (j_el.getValue() instanceof CTFFName) {
                                    CTFFName n = (CTFFName) j_el.getValue();
                                    l[0] = n.getVal();
                                }
                                if (j_el.getValue() instanceof CTFFCheckBox) {
                                    CTFFCheckBox c = (CTFFCheckBox) j_el.getValue();
                                    if (c.getChecked() != null) {
                                        l[1] = c.getChecked().isVal() ? "true" : "false";
                                    } else if (c.getDefault() != null) {
                                        l[1] = c.getDefault().isVal() ? "true" : "false";
                                    }
                                }
                            }
                        }
                    }
                }
            }

            // get "result" area of field -- it contains value in case of text fields
            R res = fr.getResultsSlot();
            for (Object o : res.getContent()) {
                // not every child of a run is wrapped in a JAXBElement, so check before casting
                Object value = (o instanceof JAXBElement) ? ((JAXBElement<?>) o).getValue() : o;
                if (value instanceof Text) {
                    // add all values to l[1] as they may be distributed over several runs
                    l[1] += ((Text) value).getValue();
                }
            }

            // remove all unnecessary whitespace
            l[1] = l[1].trim();
            System.out.println(l[0] + ": " + l[1]);
            lines.add(l);
        }
    }

The above code always extracts the field names correctly. The field values, however, are only extracted if the result area of the field contains no <p>, i.e. the user has not added a paragraph break while filling out the form field. If the result area does contain a <p>, the result run returned by fr.getResultsSlot() has no content and no value is extracted.

My question is now: Is this a bug or a feature? What is the correct/recommended way of parsing this kind of form field so that the value is always extracted?

Thanks in advance!

Best regards

Christian

by **jason** » Fri Feb 13, 2015 9:55 pm

sample docx?

by **chrisspiel** » Sun Feb 15, 2015 5:16 am

I have added two sample docx files - "working" is without <p>, "not working" contains <p> tags.

Thanks and best regards

Christian

by **cyruswong** » Mon Feb 16, 2015 6:31 pm

Hi Christian and Jason

I am also stuck on the same problem, and I have made another post about it:
docx-java-f6/form-field-extraction-t2078.html

I think it is related to the following XML structure:

    <w:p w:rsidR="0065667C" w:rsidRDefault="00B02AD0" w:rsidP="0065667C">
      <w:r>
        <w:t xml:space="preserve">Q1.</w:t>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="begin">
          <w:ffData>
            <w:name w:val="Q1"/>
            <w:enabled/>
            <w:calcOnExit w:val="0"/>
            <w:textInput/>
          </w:ffData>
        </w:fldChar>
      </w:r>
      <w:bookmarkStart w:id="4" w:name="Q1"/>
      <w:r>
        <w:instrText xml:space="preserve">FORMTEXT</w:instrText>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="separate"/>
      </w:r>
      <w:r w:rsidR="0065667C">
        <w:t>line 1</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0065667C" w:rsidRDefault="0065667C" w:rsidP="0065667C">
      <w:r>
        <w:t>line 2</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0065667C" w:rsidRDefault="0065667C" w:rsidP="0065667C">
      <w:r>
        <w:t>line 3</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0065667C" w:rsidRDefault="0065667C" w:rsidP="0065667C"/>
    <w:p w:rsidR="00D87B64" w:rsidRDefault="0065667C" w:rsidP="0065667C">
      <w:r>
        <w:t>dadsad</w:t>
      </w:r>
      <w:bookmarkStart w:id="5" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="5"/>
      <w:r w:rsidR="00B02AD0">
        <w:fldChar w:fldCharType="end"/>
      </w:r>
      <w:bookmarkEnd w:id="4"/>

"line 1" to "line 3" are the data in the field, each in its own paragraph. Your code will not work when the field value spans multiple lines like this.
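To check whether a given document contains this structure, one option is to count the begin and end field characters in each paragraph. The following is only a rough sketch, not docx4j code: it works on the raw document XML (the word/document.xml entry inside the .docx zip) using the JDK's DOM parser, and the class and method names are made up for illustration. It also treats nested paragraphs (for example in text boxes) the same as top-level ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class CrossParagraphFieldCheck {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the zero-based indices of paragraphs in which a field begins but does not end. */
    static List<Integer> paragraphsWithOpenFields(String documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            List<Integer> result = new ArrayList<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                NodeList fldChars = p.getElementsByTagNameNS(W, "fldChar");
                int open = 0; // fields begun in this paragraph and not yet ended
                for (int j = 0; j < fldChars.getLength(); j++) {
                    String type = ((Element) fldChars.item(j)).getAttributeNS(W, "fldCharType");
                    if ("begin".equals(type)) {
                        open++;
                    } else if ("end".equals(type) && open > 0) {
                        // an "end" with open == 0 closes a field from an earlier paragraph
                        open--;
                    }
                }
                if (open > 0) {
                    result.add(i);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }
}
```

For the XML above, the first paragraph would be reported, because its field begins there and the matching end only appears four paragraphs later.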

by **jason** » Tue Feb 17, 2015 9:14 am

As per the Javadoc in FieldsPreprocessor:

    * Currently the canonicalisation is done at the paragraph level,
    * so it is not suitable for fields (such as TOC) which extend across paragraphs.

The method:

    public static P canonicalise(P p, List<FieldRef> fieldRefs)

should not be used where a field in the P extends into a subsequent P (as is the case with your respective examples, Christian and Cyrus, both of which are FORMTEXT fields).

FieldsPreprocessor as it stands is intended primarily for MERGEFIELD and DOCPROPERTY fields.

A modified design would be required to handle fields which extend across paragraphs.
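One possible shape for such a design is to ignore paragraph boundaries and walk the whole document in order, tracking the begin/separate/end field characters as a small state machine. The sketch below is an illustration only and is not how FieldsPreprocessor works: it bypasses docx4j and reads the raw document XML (the word/document.xml entry of the .docx zip) with the JDK's DOM parser. The class name, method names and the choice to join paragraphs with a newline are all assumptions, and it handles neither nested fields nor checkbox values.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class FormFieldSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Tracks the field currently being read while walking the document. */
    private static final class State {
        String name;                                  // w:name of the current field, or null
        boolean inResult;                             // true between "separate" and "end"
        final StringBuilder value = new StringBuilder();
    }

    /** Returns field name -> entered text; paragraph breaks inside a value become '\n'. */
    static Map<String, String> extract(String documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
            Map<String, String> fields = new LinkedHashMap<>();
            walk(doc.getDocumentElement(), new State(), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    // Depth-first traversal, i.e. document order, regardless of paragraph boundaries.
    private static void walk(Node node, State s, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && W.equals(node.getNamespaceURI())) {
            Element el = (Element) node;
            switch (el.getLocalName()) {
                case "p":
                    // a new paragraph inside a field result is a line break in the value
                    if (s.inResult) s.value.append('\n');
                    break;
                case "t":
                    if (s.inResult) s.value.append(el.getTextContent());
                    return;
                case "fldChar":
                    onFldChar(el, s, out);
                    return;
                default:
                    break;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, s, out);
        }
    }

    private static void onFldChar(Element el, State s, Map<String, String> out) {
        String type = el.getAttributeNS(W, "fldCharType");
        if ("begin".equals(type)) {
            NodeList names = el.getElementsByTagNameNS(W, "name");
            s.name = names.getLength() > 0
                    ? ((Element) names.item(0)).getAttributeNS(W, "val") : null;
            s.value.setLength(0);
            s.inResult = false;
        } else if ("separate".equals(type)) {
            s.inResult = true;
        } else if ("end".equals(type)) {
            // only fields that carry ffData with a name are recorded
            if (s.name != null) out.put(s.name, s.value.toString().trim());
            s.name = null;
            s.inResult = false;
        }
    }
}
```

Calling FormFieldSketch.extract(xml) on a document shaped like Cyrus's example would give one entry for Q1 whose value contains the text of every paragraph between the separate and end characters.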

by **cyruswong** » Wed Feb 25, 2015 8:47 pm

Oh! I see!
Thank you, Jason, for your detailed explanation!
I think I have to move the extraction task out of Java, as it is too complex and would need hard-coded XPath with docx4j or POI. Instead I will fall back to triggering a VBA script on the Windows platform.
