Text extraction

by **hani** » Wed May 27, 2009 2:29 am

Couple of questions before I delve into the source:

- Are there any examples of doing simple text extraction? All I need is a meaningful string of the document contents for indexing purposes.

- Is there a list of runtime dependencies anywhere, assuming all I need is text extraction?

by **jason** » Wed May 27, 2009 6:47 am

If all you want to do is extract the text of the document, I think I would adapt the code in LoadFromZipFile or LoadFromZipNG to just get the main document part, and then use SAX to get the contents of the <w:t> elements.

If you want to ignore things which are tracked as deletions, then you might want to use XSLT instead of SAX.

On this approach, your runtime dependencies will be minimised as well :-)
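
A minimal sketch of the unzip + SAX idea, using only the JDK (no docx4j). It assumes the main document part is stored at word/document.xml, which is the usual location but strictly should be looked up via the package relationships. The class and method names (DocxTextSketch, extractText) are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DocxTextSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Collects character data found inside w:t elements, at any depth. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean inText;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(local)) inText = true;
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!W_NS.equals(uri)) return;
            if ("t".equals(local)) inText = false;
            else if ("p".equals(local)) text.append('\n'); // one line per paragraph
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) text.append(ch, start, length);
        }
    }

    /** Reads the zip stream and parses the (assumed) main document part. */
    static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace + local name
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if ("word/document.xml".equals(e.getName())) {
                    TextHandler handler = new TextHandler();
                    factory.newSAXParser().parse(zip, handler);
                    return handler.text.toString();
                }
            }
        }
        return ""; // no main document part found at the assumed path
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch file.docx");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Because the handler reacts to w:t wherever it occurs, text nested inside w:sdt, w:ins or w:hyperlink is picked up as well as text in plain runs.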

by **hani** » Wed May 27, 2009 12:35 pm

What would be the benefit of using docx4j in this case? Or am I better off just opening the zip and using SAX on the main document XML?

by **hani** » Wed May 27, 2009 1:21 pm

For anyone interested, this is the code I came up with:

    // is: an InputStream over the .docx file
    org.docx4j.openpackaging.packages.Package pkg = new LoadFromZipNG().get(is);
    MainDocumentPart main = (MainDocumentPart)pkg.getParts().get(new PartName("/word/document.xml"));
    Document doc = (Document)main.getJaxbElement();
    StringBuilder body = new StringBuilder();
    // Only top-level paragraphs are visited
    for(Object o : doc.getBody().getEGBlockLevelElts()) {
        if(o instanceof P) {
            P p = (P)o;
            String val = p.toString().trim();
            if(val.length() > 0) {
                body.append(val);
                body.append(' ');
            }
        }
    }
    return body.toString();

It's very slow for some reason (most of the time spent is in opening the package), but at least it works. Thanks!

by **jason** » Wed May 27, 2009 9:29 pm

That will give you the content of all w:p/w:r/w:t, but it would miss a lot of the w:t in:

    <w:p>
      <w:sdt> <w:sdtPr> <w:id w:val="100738110"/> <w:placeholder> <w:docPart w:val="DefaultPlaceholder_22675703"/> </w:placeholder> </w:sdtPr> <w:sdtContent> <w:r> <w:t xml:space="preserve">Hello, this is </w:t> </w:r> <w:proofErr w:type="spellStart"/> <w:r> <w:t>first.</w:t> </w:r> </w:sdtContent> </w:sdt>
      <w:r> <w:t>This</w:t> </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r> <w:t xml:space="preserve"> is a </w:t> </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r> <w:t xml:space="preserve">basic </w:t> </w:r>
      <w:bookmarkStart w:id="1" w:name="basic_test"/>
      <w:bookmarkEnd w:id="1"/>
      <w:commentRangeEnd w:id="0"/>
      <w:r> <w:rPr> <w:rStyle w:val="CommentReference"/> </w:rPr> <w:commentReference w:id="0"/> </w:r>
      <w:r> <w:t xml:space="preserve">test of </w:t> </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r> <w:t>a messy paragraph</w:t> </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r> <w:t xml:space="preserve"> object. </w:t> </w:r>
      <w:sdt> <w:sdtPr> <w:id w:val="100738112"/> <w:placeholder> <w:docPart w:val="E8623F8E883342B2B28AB88DE2A6934A"/> </w:placeholder> </w:sdtPr> <w:sdtContent> <w:r> <w:t>Hello, this is second.</w:t> </w:r> </w:sdtContent> </w:sdt>
      <w:r> <w:t xml:space="preserve"> 1</w:t> </w:r>
      <w:ins w:id="2" w:author="Jason Harrop" w:date="2009-05-21T16:07:00Z"> <w:r> <w:t>1</w:t> </w:r> </w:ins>
      <w:r> <w:t xml:space="preserve"></w:t> </w:r>
      <w:sdt> <w:sdtPr> <w:id w:val="100738113"/> <w:placeholder> <w:docPart w:val="6458D466289846D88E943D8FD605A365"/> </w:placeholder> </w:sdtPr> <w:sdtContent> <w:r> <w:t xml:space="preserve">Hello, this is </w:t> </w:r> <w:proofErr w:type="spellStart"/> <w:r> <w:t>third.</w:t> </w:r> </w:sdtContent> </w:sdt>
      <w:r> <w:t>T</w:t> </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r> <w:t xml:space="preserve"></w:t> </w:r>
      <w:ins w:id="3" w:author="Jason Harrop" w:date="2009-05-21T16:08:00Z"> <w:r> <w:t>2</w:t> </w:r> </w:ins>
      <w:r> <w:t xml:space="preserve">2 </w:t> </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r> <w:t>h</w:t> </w:r>
      <w:sdt> <w:sdtPr> <w:id w:val="100738114"/> <w:placeholder> <w:docPart w:val="C7140505A382445DBF14026910A1DC58"/> </w:placeholder> </w:sdtPr> <w:sdtContent> <w:proofErr w:type="gramStart"/> <w:r> <w:t>Hello</w:t> </w:r> <w:proofErr w:type="spellEnd"/> <w:proofErr w:type="gramEnd"/> <w:r> <w:t xml:space="preserve">, this is </w:t> </w:r> <w:proofErr w:type="spellStart"/> <w:r> <w:t>fourth.</w:t> </w:r> </w:sdtContent> </w:sdt>
      <w:r> <w:t>T</w:t> </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r> <w:t xml:space="preserve"></w:t> </w:r>
      <w:ins w:id="4" w:author="Jason Harrop" w:date="2009-05-21T16:08:00Z"> <w:r> <w:t>3</w:t> </w:r> </w:ins>
      <w:r> <w:t>3</w:t> </w:r>
      <w:bookmarkStart w:id="5" w:name="after_3"/>
      <w:bookmarkEnd w:id="5"/>
      <w:r> <w:t xml:space="preserve"></w:t> </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r> <w:t>h</w:t> </w:r>
      <w:sdt> <w:sdtPr> <w:id w:val="100738115"/> <w:placeholder> <w:docPart w:val="CAD8497DA1C54B8A801994DAE07AE3CA"/> </w:placeholder> </w:sdtPr> <w:sdtContent> <w:proofErr w:type="gramStart"/> <w:r> <w:t>Hello</w:t> </w:r> <w:proofErr w:type="spellEnd"/> <w:proofErr w:type="gramEnd"/> <w:r> <w:t xml:space="preserve">, this is </w:t> </w:r> <w:proofErr w:type="spellStart"/> <w:r> <w:t>fifth.</w:t> </w:r> </w:sdtContent> </w:sdt>
      <w:commentRangeStart w:id="6"/>
      <w:r> <w:t>Th</w:t> </w:r>
      <w:commentRangeEnd w:id="6"/>
      <w:proofErr w:type="spellEnd"/>
      <w:r> <w:rPr> <w:rStyle w:val="CommentReference"/> </w:rPr> <w:commentReference w:id="6"/> </w:r>
      <w:r> <w:t xml:space="preserve"></w:t> </w:r>
      <w:hyperlink r:id="rId5" w:history="1"> <w:r> <w:rPr> <w:rStyle w:val="Hyperlink"/> </w:rPr> <w:t>http://192.168.153.129:8080/</w:t> </w:r> </w:hyperlink>
    </w:p>

Until such time as docx4j is modified so that each element in the paragraph knows how to return its text content, I recommend the unzip + SAX approach (which doesn't use docx4j).

If you want to use docx4j for other reasons, I'd recommend running an XSLT on the main document part (or making the modification described in the previous paragraph - which we'd be happy to accept as a patch).
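
As a rough illustration of the XSLT route, the sketch below applies a small stylesheet with the JDK's javax.xml.transform API to the XML of the main document part, passed in as a string. The stylesheet, the class name DocxXsltSketch and the choice to drop w:del subtrees are assumptions made for the example, not anything docx4j ships.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DocxXsltSketch {
    // Emits the value of every w:t, ends each w:p with a newline,
    // and skips anything inside a tracked deletion (w:del).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='w:del'/>"
      + "<xsl:template match='w:t'><xsl:value-of select='.'/></xsl:template>"
      + "<xsl:template match='w:p'><xsl:apply-templates/><xsl:text>&#10;</xsl:text></xsl:template>"
      + "<xsl:template match='text()'/>"
      + "</xsl:stylesheet>";

    /** Transforms the main document part XML into plain text. */
    static String extractText(String documentXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(documentXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written sample: one run, one insertion, one deletion.
        String sample =
            "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
          + "<w:body><w:p>"
          + "<w:r><w:t>kept </w:t></w:r>"
          + "<w:ins><w:r><w:t>inserted</w:t></w:r></w:ins>"
          + "<w:del><w:r><w:delText>deleted</w:delText></w:r></w:del>"
          + "</w:p></w:body></w:document>";
        System.out.print(extractText(sample));
    }
}
```

The empty text() template suppresses the whitespace between elements, so only text selected explicitly by the w:t template reaches the output.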

by **jason** » Thu May 28, 2009 12:36 am

I've added a method public static void extractText(Object o, Writer w) in http://dev.plutext.org/trac/docx4j/brow ... Utils.java

You can use it to extract the text of any JAXB object (e.g. an org.docx4j.wml.P, or the entire org.docx4j.wml.Document).

It uses SAX.

by **hani** » Thu May 28, 2009 4:38 am

Wonderful, thank you so much!

Any chance of creating a new nightly snapshot? Maven hates me and refuses to build most things I ask it to.

by **jason** » Thu May 28, 2009 10:16 am

hani wrote: Any chance of creating a new nightly snapshot?

Done. 20090528
