Index queries do not provide document content (only a partial
and imprecise reconstruction is performed to show the snippet
text). To access the actual document data, the data extraction
part of the indexing process must be performed (subdocument
access and format translation). This is not trivial in general.
The rclextract module currently provides a single class which
can be used to access the data content for result documents.
Methods
- Extractor(doc)
- An Extractor object is built from a Doc object, output from a query.
- Extractor.textextract(ipath)
- Extract the document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or text/html according to doc.mimetype. The typical use would be as follows:
    qdoc = query.fetchone()
    extractor = recoll.Extractor(qdoc)
    doc = extractor.textextract(qdoc.ipath)
    # use doc.text, e.g. for previewing
- Extractor.idoctofile(ipath, targetmtype, outfile='')
- Extracts the document into an output file, which can be given explicitly or will be created as a temporary file to be deleted by the caller. Typical use:
    qdoc = query.fetchone()
    extractor = recoll.Extractor(qdoc)
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
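Since a temporary file created by idoctofile has to be deleted by the caller, one way to avoid leaving files behind is to wrap the call in a context manager. The following is a minimal sketch and not part of the recoll module: the helper name extracted_file is hypothetical, and it assumes that idoctofile returns the name of the file it created, as in the typical use above.

```python
import contextlib
import os


@contextlib.contextmanager
def extracted_file(extractor, ipath, mimetype):
    """Yield the name of a temporary extracted file, then delete it.

    Hypothetical helper. `extractor` is assumed to behave like the
    Extractor described above: idoctofile(ipath, mimetype), called
    without an explicit outfile, returns the name of a temporary file
    that the caller is responsible for removing.
    """
    filename = extractor.idoctofile(ipath, mimetype)
    try:
        yield filename
    finally:
        # Remove the temporary file even if the body raised an exception.
        # A file that is already gone is not treated as an error.
        with contextlib.suppress(FileNotFoundError):
            os.remove(filename)


# Possible use, given an extractor and a qdoc obtained as shown above:
#
#     with extracted_file(extractor, qdoc.ipath, qdoc.mimetype) as filename:
#         ...  # read or copy the file here; it is deleted afterwards
```

The helper should only be used when no outfile is passed, because a file whose location was chosen explicitly is usually meant to outlive the call.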