Couple of questions before I delve into the source:
- Are there any examples of doing simple text extraction? All I need is to get a meaningful string of the document contents for indexing purposes.
- Is there a list of runtime dependencies anywhere, assuming all I need is text extraction?