From cd11886f6c6c5243b8cfe3dbf84a1c86050cdc79 Mon Sep 17 00:00:00 2001 From: Jean-Francois Dockes Date: Wed, 1 Jun 2016 09:44:11 +0200 Subject: [PATCH] document the python index update interface --- src/doc/user/Makefile | 2 +- src/doc/user/usermanual.html | 1083 +++++++++++++++++++++++----------- src/doc/user/usermanual.xml | 637 +++++++++++++------- 3 files changed, 1180 insertions(+), 542 deletions(-) diff --git a/src/doc/user/Makefile b/src/doc/user/Makefile index 5d1860e5..14b6c9c2 100644 --- a/src/doc/user/Makefile +++ b/src/doc/user/Makefile @@ -19,7 +19,7 @@ commonoptions=--stringparam section.autolabel 1 \ # index.html chunk format target replaced by nicer webhelp (needs separate # make) in webhelp/ subdir -all: usermanual.html usermanual.pdf webh +all: usermanual.html webh usermanual.pdf webh: make -C webhelp diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html index 99ec3fe9..b86a303b 100644 --- a/src/doc/user/usermanual.html +++ b/src/doc/user/usermanual.html @@ -20,8 +20,8 @@ alink="#0000FF">
-

Recoll user manual

+

Recoll user manual

@@ -109,13 +109,13 @@ alink="#0000FF"> multiple indexes
2.1.3. Document types
+ "#idp55164976">Document types
2.1.4. Indexing failures
+ "#idp55184656">Indexing failures
2.1.5. Recovery
+ "#idp55192112">Recovery @@ -390,17 +390,29 @@ alink="#0000FF"> processing
4.3. API
+ "#RCL.PROGRAM.PYTHONAPI">Python API
4.3.1. Interface - elements
+ "#RCL.PROGRAM.PYTHONAPI.INTRO">Introduction
4.3.2. Python + "#RCL.PROGRAM.PYTHONAPI.ELEMENTS">Interface + elements
+ +
4.3.3. Python search interface
+ +
4.3.4. Creating Python + external indexers
+ +
4.3.5. Package + compatibility with the previous + version
@@ -777,8 +789,8 @@ alink="#0000FF"> "link" href="#RCL.SEARCH.COMMANDLINE" title= "3.3. Searching on the command line">command line interface, a Python programming interface, a
-

2.1.3. Document types

+

2.1.3. Document types

@@ -1079,8 +1091,8 @@ indexedmimetypes = application/pdf
-

2.1.4. Indexing +

2.1.4. Indexing failures

@@ -1120,8 +1132,8 @@ indexedmimetypes = application/pdf
-

2.1.5. Recovery

+

2.1.5. Recovery

@@ -4605,9 +4617,8 @@ export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
  • By writing a custom Python program, using the - Recoll Python - API.

    + Recoll Python API.

  • @@ -5807,10 +5818,11 @@ dir:recoll dir:src -dir:utils -dir:common
    -

    Terminology

    The small programs or - pieces of code which handle the processing of the - different document types for Recoll used to be called +

    Terminology

    + +

    The small programs or pieces of code which handle the + processing of the different document types for + Recoll used to be called filters, which is still reflected in the name of the directory which holds them and many configuration variables. They were named this @@ -5820,7 +5832,7 @@ dir:recoll dir:src -dir:utils -dir:common term input handler is now progressively substituted in the documentation. filter is still used in many - places though. + places though.

    Recoll input handlers @@ -6411,8 +6423,8 @@ or

    4.3. API

    + "RCL.PROGRAM.PYTHONAPI" id= + "RCL.PROGRAM.PYTHONAPI">4.3. Python API
    @@ -6422,8 +6434,124 @@ or

    4.3.1. Interface + "RCL.PROGRAM.PYTHONAPI.INTRO" id= + "RCL.PROGRAM.PYTHONAPI.INTRO">4.3.1. Introduction

    +
    +
    +
    + +

    Recoll versions after + 1.11 define a Python programming interface, both for + searching and creating/updating an index.

    + +

    The search interface is used in the Recoll Ubuntu Unity Lens and the + Recoll Web UI. It can + run queries on any Recoll configuration.

    + +

    The index update section of the API may be used to + create and update Recoll + indexes on specific configurations (separate from the + ones created by recollindex). The + resulting databases can be queried alone, or in + conjunction with regular ones, through the GUI or any of + the query interfaces.

    + +

    The search API is modeled along the Python database + API specification. There were two major changes along + Recoll versions:

    + +
    +
      +
    • +

      The basis for the Recoll API changed from Python + database API version 1.0 (Recoll versions up to 1.18.1), + to version 2.0 (Recoll 1.18.2 and later).

      +
    • + +
    • +

      The recoll module + became a package (with an internal recoll module) as of Recoll version 1.19, in order + to add more functions. For existing code, this only + changes the way the interface must be imported.

      +
    • +
    +
    + +

    We will describe the new API and package structure + here. A paragraph at the end of this section will explain + a few differences and ways to write code compatible with + both versions.

    + +

    The Python interface can be found in the source + package, under python/recoll.

    + +

    The python/recoll/ + directory contains the usual setup.py. After configuring the main + Recoll code, you can use + the script to build and install the Python module:

    +
    +            cd recoll-xxx/python/recoll
    +            python setup.py build
    +            python setup.py install
    +          
    +
    + +

    As of Recoll 1.19, + the module can be compiled for Python3.

    + +

The normal Recoll + installer installs the Python2 API along with the main + code. The Python3 version must be explicitly built and + installed.

    + +

    When installing from a repository, and depending on + the distribution, the Python API can sometimes be found + in a separate package.

    + +

    As an introduction, the following small sample will + run a query and list the title and url for each of the + results. It would work with Recoll 1.19 and later. The + python/samples source + directory contains several examples of Python programming + with Recoll, exercising + the extension more completely, and especially its data + extraction features.

    +
    +#!/usr/bin/env python
    +
    +from recoll import recoll
    +
    +db = recoll.connect()
    +query = db.query()
    +nres = query.execute("some query")
    +results = query.fetchmany(20)
    +for doc in results:
    +    print(doc.url, doc.title)
    +
    +
    + +
    +
    +
    +
    +

    4.3.2. Interface elements

    @@ -6434,36 +6562,94 @@ or
    -
    udi
    - -
    -

    An udi (unique document identifier) identifies a - document. Because of limitations inside the index - engine, it is restricted in length (to 200 bytes), - which is why a regular URI cannot be used. The - structure and contents of the udi is defined by the - application and opaque to the index engine. For - example, the internal file system indexer uses the - complete document path (file path + internal path), - truncated to length, the suppressed part being - replaced by a hash value.

    -
    - -
    ipath
    +
    ipath

    This data value (set as a field in the Doc object) is stored, along with the URL, but not indexed by Recoll. - Its contents are not interpreted, and its use is up - to the application. For example, the Recoll internal file system - indexer stores the part of the document access path - internal to the container file (ipath in this case is a list of - subdocument sequential numbers). url and ipath are - returned in every search result and permit access - to the original document.

    + Its contents are not interpreted by the index + layer, and its use is up to the application. For + example, the Recoll file system indexer + uses the ipath to + store the part of the document access path internal + to (possibly imbricated) container documents. + ipath in this case is + a vector of access elements (e.g, the first part + could be a path inside a zip file to an archive + member which happens to be an mbox file, the second + element would be the message sequential number + inside the mbox etc.). url and ipath are returned in every search + result and define the access to the original + document. ipath is + empty for top-level document/files (e.g. a PDF + document which is a filesystem file). The + Recoll GUI knows + about the structure of the ipath values used by the + filesystem indexer, and uses it for such functions + as opening the parent of a given document.

    +
    + +
    udi
    + +
    +

An udi (unique + document identifier) identifies a document. Because + of limitations inside the index engine, it is + restricted in length (to 200 bytes), which is why a + regular URI cannot be used. The structure and + contents of the udi are + defined by the application and opaque to the index + engine. For example, the internal file system + indexer uses the complete document path (file path + + internal path), truncated to length, the + suppressed part being replaced by a hash value. The + udi is not explicit in + the query interface (it is used "under the hood" by + the rclextract + module), but it is an explicit element of the + update interface.

    +
    + +
    parent_udi
    + +
    +

    If this attribute is set on a document when + entering it in the index, it designates its + physical container document. In a multilevel + hierarchy, this may not be the immediate parent. + parent_udi is + optional, but its use by an indexer may simplify + index maintenance, as Recoll will automatically + delete all children defined by parent_udi == udi when the + document designated by udi is destroyed. e.g. if a + Zip archive contains + entries which are themselves containers, like + mbox files, all the + subdocuments inside the Zip file (mbox, messages, message + attachments, etc.) would have the same parent_udi, matching the + udi for the + Zip file, and all + would be destroyed when the Zip file (identified by its + udi) is removed from + the index. The standard filesystem indexer uses + parent_udi.

    Stored and indexed @@ -6478,25 +6664,16 @@ or
    - -

    Data for an external indexer, should be stored in a - separate index, not the one for the Recoll internal file system indexer, - except if the latter is not used at all). The reason is - that the main document indexer purge pass would remove - all the other indexer's documents, as they were not seen - during indexing. The main indexer documents would also - probably be a problem for the external indexer purge - operation.

    -

    4.3.2. Python - interface

    +

    4.3.3. Python + search interface

    @@ -6506,118 +6683,8 @@ or

    4.3.2.1. Introduction

    -
    -
    -
    - -

    Recoll versions - after 1.11 define a Python programming interface, both - for searching and indexing.

    - -

    The search interface is used in the Recoll Ubuntu - Unity Lens and Recoll WebUI.

    - -

    The indexing section of the API has seen little use, - and is more a proof of concept. In truth it is waiting - for its killer app...

    - -

    The search API is modeled along the Python database - API specification. There were two major changes along - Recoll versions:

    - -
    -
      -
    • -

      The basis for the Recoll API changed from - Python database API version 1.0 (Recoll versions up to - 1.18.1), to version 2.0 (Recoll 1.18.2 and - later).

      -
    • - -
    • -

      The recoll module - became a package (with an internal recoll module) as of - Recoll version - 1.19, in order to add more functions. For - existing code, this only changes the way the - interface must be imported.

      -
    • -
    -
    - -

    We will mostly describe the new API and package - structure here. A paragraph at the end of this section - will explain a few differences and ways to write code - compatible with both versions.

    - -

    The Python interface can be found in the source - package, under python/recoll.

    - -

    The python/recoll/ - directory contains the usual setup.py. After configuring the main - Recoll code, you can - use the script to build and install the Python - module:

    -
    -            cd recoll-xxx/python/recoll
    -            python setup.py build
    -            python setup.py install
    -          
    -
    - -

    As of Recoll 1.19, - the module can be compiled for Python3.

    - -

    The normal Recoll - installer installs the Python2 API along with the main - code. The Python3 version must be explicitely built and - installed.

    - -

    When installing from a repository, and depending on - the distribution, the Python API can sometimes be found - in a separate package.

    - -

    The following small sample will run a query and list - the title and url for each of the results. It would - work with Recoll 1.19 - and later. The python/samples source directory - contains several examples of Python programming with - Recoll, exercising the - extension more completely, and especially its data - extraction features.

    -
    -          from recoll import recoll
    -
    -          db = recoll.connect()
    -          query = db.query()
    -          nres = query.execute("some query")
    -          results = query.fetchmany(20)
    -          for doc in results:
    -              print(doc.url, doc.title)
    -        
    -
    -
    - -
    -
    -
    -
    -

    4.3.2.2. Recoll + "RCL.PROGRAM.PYTHONAPI.PACKAGE" id= + "RCL.PROGRAM.PYTHONAPI.PACKAGE">4.3.3.1. Recoll package

    @@ -6632,7 +6699,9 @@ or
  • The recoll module contains functions and classes used to query (or - update) the index.

+ update) the index. This section will only + describe the query part; see further on for the + update part.

  • @@ -6649,8 +6718,8 @@ or

    4.3.2.3. The + "RCL.PROGRAM.PYTHONAPI.RECOLL" id= + "RCL.PROGRAM.PYTHONAPI.RECOLL">4.3.3.2. The recoll module

    @@ -6661,8 +6730,8 @@ or
    Functions
    + "RCL.PROGRAM.PYTHONAPI.RECOLL.FUNCTIONS" id= + "RCL.PROGRAM.PYTHONAPI.RECOLL.FUNCTIONS">Functions
  • @@ -6673,33 +6742,38 @@ or extra_dbs=None, writable = False)
    - The connect() +

    The connect() function connects to one or several Recoll index(es) and returns a Db object. + "literal">Db object.

      -
    • confdir may specify a - configuration directory. The usual defaults - apply.
    • +
    • +

      confdir + may specify a configuration directory. + The usual defaults apply.

      +
    • -
    • extra_dbs is a list of - additional indexes (Xapian - directories).
    • +
    • +

      extra_dbs + is a list of additional indexes (Xapian + directories).

      +
    • -
    • writable decides if we can - index new data through this - connection.
    • +
    • +

      writable + decides if we can index new data through + this connection.

      +
    -
    This call initializes the recoll module, - and it should always be performed before any - other call or object creation. +
    + +

    This call initializes the recoll module, and + it should always be performed before any other + call or object creation.

    @@ -6710,8 +6784,8 @@ or
    Classes
    + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES" id= + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES">Classes
    @@ -6721,8 +6795,8 @@ or
    The + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DB" id= + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DB">The Db class
    @@ -6736,42 +6810,50 @@ or
    Db.close()
    -
    Closes the connection. You can't do - anything with the Db object after this.
    +
    +

    Closes the connection. You can't do + anything with the Db object after this.

    +
    Db.query(), Db.cursor()
    -
    These aliases return a blank Query object for this - index.
    +
    +

    These aliases return a blank Query object for this + index.

    +
    Db.setAbstractParams(maxchars, contextwords)
    -
    Set the parameters used to build snippets - (sets of keywords in context text fragments). - maxchars defines - the maximum total size of the abstract. - contextwords - defines how many terms are shown around the - keyword.
    +
    +

    Set the parameters used to build snippets + (sets of keywords in context text fragments). + maxchars defines + the maximum total size of the abstract. + contextwords + defines how many terms are shown around the + keyword.

    +
    Db.termMatch(match_type, expr, field='', maxlen=-1, casesens=False, diacsens=False, lang='english')
    -
    Expand an expression against the index term - list. Performs the basic function from the GUI - term explorer tool. match_type can be either of - wildcard, - regexp or - stem. Returns a - list of terms expanded from the input - expression.
    +
    +

    Expand an expression against the index + term list. Performs the basic function from + the GUI term explorer tool. match_type can be either of + wildcard, + regexp or + stem. Returns a + list of terms expanded from the input + expression.

    +
    @@ -6781,8 +6863,9 @@ or
    The + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY" + id= + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY">The Query class
    @@ -6799,107 +6882,133 @@ or
    Query.sortby(fieldname, ascending=True)
    -
    Sort results by fieldname, in - ascending or descending order. Must be called - before executing the search.
    +
    +

    Sort results by fieldname, in + ascending or descending order. Must be called + before executing the search.

    +
    Query.execute(query_string, stemming=1, stemlang="english")
    -
    Starts a search for query_string, a - Recoll search - language string.
    +
    +

    Starts a search for query_string, + a Recoll + search language string.

    +
    Query.executesd(SearchData)
    -
    Starts a search for the query defined by - the SearchData object.
    +
    +

    Starts a search for the query defined by + the SearchData object.

    +
    Query.fetchmany(size=query.arraysize)
    -
    Fetches the next Doc objects in the current - search results, and returns them as an array of - the required size, which is by default the - value of the arraysize data member.
    +
    +

    Fetches the next Doc objects in the current + search results, and returns them as an array + of the required size, which is by default the + value of the arraysize data member.

    +
    Query.fetchone()
    -
    Fetches the next Doc object from the current - search results.
    +
    +

    Fetches the next Doc object from the current + search results.

    +
    Query.close()
    -
    Closes the query. The object is unusable - after the call.
    +
    +

    Closes the query. The object is unusable + after the call.

    +
    Query.scroll(value, mode='relative')
    -
    Adjusts the position in the current result - set. mode can be - relative or - absolute.
    +
    +

    Adjusts the position in the current result + set. mode can be + relative or + absolute.

    +
    Query.getgroups()
    -
    Retrieves the expanded query terms as a - list of pairs. Meaningful only after executexx - In each pair, the first entry is a list of user - terms (of size one for simple terms, or more - for group and phrase clauses), the second a - list of query terms as derived from the user - terms and used in the Xapian Query.
    +
    +

    Retrieves the expanded query terms as a + list of pairs. Meaningful only after + executexx In each pair, the first entry is a + list of user terms (of size one for simple + terms, or more for group and phrase clauses), + the second a list of query terms as derived + from the user terms and used in the Xapian + Query.

    +
    Query.getxquery()
    -
    Return the Xapian query description as a - Unicode string. Meaningful only after - executexx.
    +
    +

    Return the Xapian query description as a + Unicode string. Meaningful only after + executexx.

    +
    Query.highlight(text, ishtml = 0, methods = object)
    -
    Will insert <span "class=rclmatch">, - </span> tags around the match areas in - the input text and return the modified text. - ishtml can be set - to indicate that the input text is HTML and - that HTML special characters should not be - escaped. methods - if set should be an object with methods - startMatch(i) and endMatch() which will be - called for each match and should return a begin - and end tag
    +
    +

    Will insert <span "class=rclmatch">, + </span> tags around the match areas in + the input text and return the modified text. + ishtml can be + set to indicate that the input text is HTML + and that HTML special characters should not + be escaped. methods if set should be an + object with methods startMatch(i) and + endMatch() which will be called for each + match and should return a begin and end + tag

    +
    Query.makedocabstract(doc, methods = object))
    -
    Create a snippets abstract for doc (a Doc object) by selecting text - around the match terms. If methods is set, will - also perform highlighting. See the highlight - method.
    +
    +

    Create a snippets abstract for + doc (a + Doc object) by + selecting text around the match terms. If + methods is set, will also perform + highlighting. See the highlight method.

    +
    Query.__iter__() and Query.next()
    -
    So that things like for doc in query: will - work.
    +
    +

    So that things like for doc in query: will + work.

    +
    @@ -6908,23 +7017,30 @@ or
    Query.arraysize
    -
    Default number of records processed by - fetchmany (r/w).
    +
    +

    Default number of records processed by + fetchmany (r/w).

    +
    Query.rowcount
    -
    Number of records returned by the last - execute.
    +
    +

    Number of records returned by the last + execute.

    +
    Query.rownumber
    -
    Next index to be fetched from results. - Normally increments after each fetchone() call, - but can be set/reset before the call to effect - seeking (equivalent to using scroll()). Starts at 0.
    +
    +

    Next index to be fetched from results. + Normally increments after each fetchone() + call, but can be set/reset before the call to + effect seeking (equivalent to using + scroll()). + Starts at 0.

    +
    @@ -6934,9 +7050,9 @@ or
    The - Doc class
    + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC" + id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC"> + The Doc class
    @@ -6972,23 +7088,51 @@ or
    get(key), [] operator
    -
    Retrieve the named doc attribute
    +
    +

    Retrieve the named doc attribute. You can + also use getattr(doc, + key) or doc.key.

    +
    + +
    doc.key = + value
    + +
    +

Set the named doc attribute. You can + also use setattr(doc, + key, value).

    +
    getbinurl()
    -
    Retrieve the URL in byte array format (no - transcoding), for use as parameter to a system - call.
    +
    +

    Retrieve the URL in byte array format (no + transcoding), for use as parameter to a + system call.

    +
    + +
    setbinurl(url)
    + +
    +

    Set the URL in byte array format (no + transcoding).

    +
    items()
    -
    Return a dictionary of doc object - keys/values
    +
    +

    Return a dictionary of doc object + keys/values

    +
    keys()
    -
    list of doc object keys (attribute - names).
    +
    +

    list of doc object keys (attribute + names).

    +
    @@ -6998,9 +7142,9 @@ or
    + "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA"> The SearchData class
    @@ -7031,8 +7175,8 @@ or

    4.3.2.4. The + "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT" id= + "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">4.3.3.3. The rclextract module

    @@ -7053,8 +7197,8 @@ or
    Classes
    + "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES" id= + "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">Classes
    @@ -7064,9 +7208,9 @@ or
    + "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"> The Extractor class
    @@ -7077,22 +7221,24 @@ or
    Extractor(doc)
    -
    An Extractor - object is built from a Doc object, output from a - query.
    +
    +

    An Extractor + object is built from a Doc object, output from a + query.

    +
    Extractor.textextract(ipath)
    - Extract document defined by Extract document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or text/html according to doc.mimetype. The - typical use would be as follows: + typical use would be as follows:

     qdoc = query.fetchone()
     extractor = recoll.Extractor(qdoc)
    @@ -7106,10 +7252,10 @@ doc = extractor.textextract(qdoc.ipath)
                         outfile='')
     
                         
    - Extracts document into an output file, which - can be given explicitly or will be created as - a temporary file to be deleted by the caller. - Typical use: +

    Extracts document into an output file, + which can be given explicitly or will be + created as a temporary file to be deleted by + the caller. Typical use:

     qdoc = query.fetchone()
     extractor = recoll.Extractor(qdoc)
    @@ -7127,9 +7273,9 @@ filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
                   

    4.3.2.5. Example - code

    + "RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE" id= + "RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">4.3.3.4. Search + API usage example
    @@ -7167,26 +7313,281 @@ for i in range(nres):
    + + +
    +
    +
    +
    +

    4.3.4. Creating + Python external indexers

    +
    +
    +
    + +

The update API can be used to create an index from + data which is not accessible to the regular Recoll indexer, or which is structured + in a way that presents difficulties to the Recoll input handlers.

    + +

An indexer created using this API will have + work to do equivalent to that of the Recoll file system + indexer: look for modified documents, extract their text, + call the API to index it, and take care of purging from the + index the data for documents which no longer exist in + the document store.

    + +

The data for such an external indexer should be stored + in an index separate from any used by the Recoll internal file system indexer. + The reason is that the main document indexer purge pass + (removal of deleted documents) would also remove all the + documents belonging to the external indexer, as they were + not seen during the filesystem walk. The main indexer + documents would also probably be a problem for the + external indexer's own purge operation.

    + +

    While there would be ways to enable multiple foreign + indexers to cooperate on a single index, it is just + simpler to use separate ones, and use the multiple index + access capabilities of the query interface, if + needed.

    + +

    There are two parts in the update interface:

    + +
    +
      +
    • +

Methods inside the recoll module allow inserting + data into the index, making it accessible through the + normal query interface.

      +
    • + +
    • +

An interface based on script execution is + defined to allow either the GUI or the rclextract module to access + the original document data for previewing or + editing.

      +
    • +
    +

    4.3.2.6. Compatibility - with the previous version

    + "RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE" id= + "RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE">4.3.4.1. Python + update interface
    -

    The following code fragments can be used to ensure - that code can run with both the old and the new API (as - long as it does not use the new abilities of the new - API of course).

    +

The update methods are part of the recoll module described above. The + connect() method is used with a writable=True parameter to obtain a + writable Db object. The + following Db object + methods are then available.

    -

    Adapting to the new package structure:

    +
    +
    +
    addOrUpdate(udi, doc, + parent_udi=None)
    + +
    +

Add or update index data for a given document. + The udi + string must define a unique id for the document. + It is an opaque interface element and not + interpreted inside Recoll. doc is a Doc object, + created from the data to be indexed (the main + text should be in doc.text). If parent_udi + is set, this is a unique identifier for the + top-level container (e.g. for the filesystem + indexer, this would be the one which is an actual + file).

    +
    + +
    delete(udi)
    + +
    +

Purge the index of all data for udi, and for all documents (if any) + which have a matching parent_udi.

    +
    + +
    needUpdate(udi, + sig)
    + +
    +

    Test if the index needs to be updated for the + document identified by udi. If this call is to be used, + the doc.sig field + should contain a signature value when calling + addOrUpdate(). The + needUpdate() call + then compares its parameter value with the stored + sig for udi. sig is an opaque value, compared + as a string.

    + +

    The filesystem indexer uses a concatenation of + the decimal string values for file size and + update time, but a hash of the contents could + also be used.

    + +

As a side effect, if the return value is false + (the index is up to date), the call will set the + existence flag for the document (and any + subdocument defined by its parent_udi), so that a later + purge() call will + preserve them.

    + +

    The use of needUpdate() and purge() is optional, and the + indexer may use another method for checking the + need to reindex or to delete stale entries.

    +
    + +
    purge()
    + +
    +

Delete all documents that were not touched + during the just finished indexing pass (since + open-for-write). These are the documents for which the + needUpdate() call was not performed, indicating + that they no longer exist in the primary storage + system.

    +
    +
    +
    +
    + +
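The update methods above can be combined into a small indexing loop. The following sketch is illustrative only: the directory being walked, the choice of the file path as udi, and the size+mtime signature scheme are assumptions for this example (the signature scheme mirrors the one described for the filesystem indexer), not requirements of the API.

```python
#!/usr/bin/env python
# Sketch of a minimal external indexer built on the update methods
# described above. Paths, udi and signature choices are illustrative.
import os

def makesig(st):
    # Same scheme as the filesystem indexer: decimal size then mtime.
    return "%d%d" % (st.st_size, int(st.st_mtime))

def index_tree(recoll_module, topdir, confdir):
    # Open the (separate) index for writing.
    db = recoll_module.connect(confdir=confdir, writable=True)
    for root, dirs, files in os.walk(topdir):
        for name in files:
            path = os.path.join(root, name)
            udi = path[:200]          # a udi is limited to 200 bytes
            sig = makesig(os.stat(path))
            if not db.needUpdate(udi, sig):
                continue              # up to date; also flags it as seen
            doc = recoll_module.Doc()
            doc.url = "file://" + path
            doc.mimetype = "text/plain"
            doc.sig = sig
            with open(path, "r") as f:
                doc.text = f.read()
            db.addOrUpdate(udi, doc)
    db.purge()   # drop data for documents not seen during this pass

# Typical use, with the real module and a dedicated configuration:
#   from recoll import recoll
#   index_tree(recoll, "/path/to/notes", "/path/to/notes-confdir")
```

Note how needUpdate() doubles as the existence marker for unchanged documents, so that the final purge() only removes documents which were neither tested nor reindexed.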
    +
    +
    +
    +

    4.3.4.2. Query + data access for external indexers

    +
    +
    +
    + +

Recoll has internal + methods to access document data for its own + (filesystem) indexer. An external indexer needs to + provide data access methods if it requires integration + with the GUI (e.g. the preview function), or support for + the rclextract + module.

    + +

The index data and the access method are linked by + the rclbes (recoll backend + storage) Doc field. You + should set this to a short string value identifying + your indexer (e.g. the filesystem indexer uses either + "FS" or an empty value; the Web history indexer uses + "BGL").

    + +

The link is actually performed inside a backends configuration file (stored + in the configuration directory). This defines the commands + to execute to access data from the specified indexer. + For example, for the mbox indexing sample found in the + Recoll source (which sets rclbes="MBOX"):

    +[MBOX]
    +fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
+makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
    +        
    +
    + +

    fetch and makesig define two commands to execute + to respectively retrieve the document text and compute + the document signature (the example implementation uses + the same script with different first parameters to + perform both operations).

    + +

The scripts are called with three additional + arguments: the udi, + url, and ipath values stored with the document when + it was indexed; they may use any or all of these to perform the + requested operation. The caller expects the result data + on stdout.

    +
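A data access script matching such a backends entry could look like the sketch below. The argument layout (operation name first, then udi, url, ipath) follows the description above; the assumption that the udi is a plain file path is specific to this hypothetical indexer, not a Recoll convention.

```python
#!/usr/bin/env python
# Hypothetical backends data access script, called as:
#   script.py fetch|makesig <udi> <url> <ipath>
# The result is written to stdout, as the caller expects.
import os
import sys

def fetch(udi, url, ipath):
    # Assumption for this sketch: the indexer used the file path as udi.
    with open(udi, "rb") as f:
        return f.read()

def makesig(udi, url, ipath):
    # Same signature scheme as suggested above: decimal size then mtime.
    st = os.stat(udi)
    return ("%d%d" % (st.st_size, int(st.st_mtime))).encode("ascii")

if __name__ == "__main__" and len(sys.argv) >= 5:
    op, udi, url, ipath = sys.argv[1:5]
    data = fetch(udi, url, ipath) if op == "fetch" else makesig(udi, url, ipath)
    getattr(sys.stdout, "buffer", sys.stdout).write(data)
```

Using one script with the operation as its first parameter, as the rclmbox.py sample does, keeps the fetch and makesig entries pointing at a single file.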
    + +
    +
    +
    +
    +

    4.3.4.3. External + indexer samples

    +
    +
    +
    + +

The Recoll source tree has two samples of external + indexers in the src/python/samples directory. The + more interesting one is rclmbox.py, which indexes a directory + containing mbox folder + files. It exercises most features of the update + interface, and includes a data access interface.

    + +

    See the comments inside the file for more + information.

    +
    +
    + +
    +
    +
    +
    +

    4.3.5. Package + compatibility with the previous version

    +
    +
    +
    + +

The following code fragments can be used to ensure + that code can run with both the old and the new API (as + long as it does not use the added capabilities of the new + API, of course).

    + +

    Adapting to the new package structure:

    +
     
     try:
         from recoll import recoll
    @@ -7196,21 +7597,21 @@ except:
         import recoll
         hasextract = False
     
    +      
     
    -

    Adapting to the change of nature of the next Query member. The same test can be - used to choose to use the scroll() method (new) or set the - next value (old).

    -
    +          

    Adapting to the change of nature of the next Query member. The same test can be used to choose between using the scroll() method (new) and setting the next value (old).

    +
     
            rownum = query.next if type(query.next) == int else \
                      query.rownumber
     
    +      
     
    -
    diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index 68eea15a..2aa981aa 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -262,7 +262,7 @@ are other ways to perform &RCL; searches: mostly a command line interface, a - + Python programming interface, a KDE KIO slave module, and @@ -3094,7 +3094,7 @@ MimeType=*/* By writing a custom Python program, using the - Recoll Python API. + Recoll Python API. @@ -3950,7 +3950,7 @@ dir:recoll dir:src -dir:utils -dir:common Writing a document input handler - TerminologyThe small programs or pieces + TerminologyThe small programs or pieces of code which handle the processing of the different document types for &RCL; used to be called filters, which is still reflected in the name of the directory which @@ -3960,7 +3960,7 @@ dir:recoll dir:src -dir:utils -dir:common content. However these modules may have other behaviours, and the term input handler is now progressively substituted in the documentation. filter is - still used in many places though. + still used in many places though. &RCL; input handlers cooperate to translate from the multitude of input document formats, simple ones @@ -4392,83 +4392,26 @@ or - - API + + Python API - - Interface elements - - A few elements in the interface are specific and and need - an explanation. - - - - - udi An udi (unique document - identifier) identifies a document. Because of limitations - inside the index engine, it is restricted in length (to - 200 bytes), which is why a regular URI cannot be used. The - structure and contents of the udi is defined by the - application and opaque to the index engine. For example, - the internal file system indexer uses the complete - document path (file path + internal path), truncated to - length, the suppressed part being replaced by a hash - value. - - - - ipath - - This data value (set as a field in the Doc - object) is stored, along with the URL, but not indexed by - &RCL;. 
Its contents are not interpreted, and its use is up - to the application. For example, the &RCL; internal file - system indexer stores the part of the document access path - internal to the container file (ipath in - this case is a list of subdocument sequential numbers). url - and ipath are returned in every search result and permit - access to the original document. - - - - - Stored and indexed fields - - The fields file inside - the &RCL; configuration defines which document fields are - either "indexed" (searchable), "stored" (retrievable with - search results), or both. - - - - - - Data for an external indexer, should be stored in a - separate index, not the one for the &RCL; internal file system - indexer, except if the latter is not used at all). The reason - is that the main document indexer purge pass would remove all - the other indexer's documents, as they were not seen during - indexing. The main indexer documents would also probably be a - problem for the external indexer purge operation. - - - - - Python interface - - + Introduction &RCL; versions after 1.11 define a Python programming - interface, both for searching and indexing. + interface, both for searching and creating/updating an + index. - The search interface is used in the Recoll Ubuntu Unity Lens - and Recoll WebUI. + The search interface is used in the &RCL; Ubuntu Unity Lens + and the &RCL; Web UI. It can run queries on any &RCL; + configuration. + + The index update section of the API may be used to create and + update &RCL; indexes on specific configurations (separate from the + ones created by recollindex). The resulting + databases can be queried alone, or in conjunction with regular + ones, through the GUI or any of the query interfaces. - The indexing section of the API has seen little use, and is - more a proof of concept. In truth it is waiting for its killer - app... - The search API is modeled along the Python database API specification. 
There were two major changes along &RCL; versions: @@ -4483,10 +4426,9 @@ or - We will mostly describe the new API and package - structure here. A paragraph at the end of this section will - explain a few differences and ways to write code - compatible with both versions. + We will describe the new API and package structure here. A + paragraph at the end of this section will explain a few differences + and ways to write code compatible with both versions. The Python interface can be found in the source package, under python/recoll. @@ -4513,44 +4455,140 @@ or distribution, the Python API can sometimes be found in a separate package. - The following small sample will run a query and list - the title and url for each of the results. It would work with &RCL; - 1.19 and later. The python/samples source directory - contains several examples of Python programming with &RCL;, - exercising the extension more completely, and especially its data - extraction features. - - from recoll import recoll + As an introduction, the following small sample will run a + query and list the title and url for each of the results. It would + work with &RCL; 1.19 and later. The + python/samples source directory contains + several examples of Python programming with &RCL;, exercising the + extension more completely, and especially its data extraction + features. - db = recoll.connect() - query = db.query() - nres = query.execute("some query") - results = query.fetchmany(20) - for doc in results: - print(doc.url, doc.title) - - + +from recoll import recoll + +db = recoll.connect() +query = db.query() +nres = query.execute("some query") +results = query.fetchmany(20) +for doc in results: + print(doc.url, doc.title) +]]> + + + + + Interface elements + + A few elements in the interface are specific and and need + an explanation. + + + + > + ipath + + This data value (set as a field in the Doc + object) is stored, along with the URL, but not indexed by + &RCL;. 
Its contents are not interpreted by the index layer, and + its use is up to the application. For example, the &RCL; file + system indexer uses the ipath to store the + part of the document access path internal to (possibly + imbricated) container documents. ipath in + this case is a vector of access elements (e.g, the first part + could be a path inside a zip file to an archive member which + happens to be an mbox file, the second element would be the + message sequential number inside the mbox + etc.). url and ipath are + returned in every search result and define the access to the + original document. ipath is empty for + top-level document/files (e.g. a PDF document which is a + filesystem file). The &RCL; GUI knows about the structure of the + ipath values used by the filesystem indexer, + and uses it for such functions as opening the parent of a given + document. + + + + + udi + + An udi (unique document + identifier) identifies a document. Because of limitations inside + the index engine, it is restricted in length (to 200 bytes), + which is why a regular URI cannot be used. The structure and + contents of the udi is defined by the + application and opaque to the index engine. For example, the + internal file system indexer uses the complete document path + (file path + internal path), truncated to length, the suppressed + part being replaced by a hash value. The udi + is not explicit in the query interface (it is used "under the + hood" by the rclextract module), but it is + an explicit element of the update interface. + + + + parent_udi + + If this attribute is set on a document when + entering it in the index, it designates its physical container + document. In a multilevel hierarchy, this may not be the + immediate parent. parent_udi is optional, but + its use by an indexer may simplify index maintenance, as &RCL; + will automatically delete all children defined by + parent_udi == udi when the document designated + by udi is destroyed. e.g. 
if a + Zip archive contains entries which are + themselves containers, like mbox files, all + the subdocuments inside the Zip file (mbox, + messages, message attachments, etc.) would have the same + parent_udi, matching the + udi for the Zip file, and + all would be destroyed when the Zip file + (identified by its udi) is removed from the + index. The standard filesystem indexer uses + parent_udi. + + + + Stored and indexed fields + + The fields file inside + the &RCL; configuration defines which document fields are + either "indexed" (searchable), "stored" (retrievable with + search results), or both. + + + + + + + + + Python search interface + + Recoll package The recoll package contains two modules: The recoll module contains - functions and classes used to query (or update) the - index. + functions and classes used to query (or update) the + index. This section will only describe the query part, see + further for the update part. The rclextract module contains - functions and classes used to access document - data. + functions and classes used to access document + data. - + The recoll module - + Functions @@ -4558,32 +4596,32 @@ or connect(confdir=None, extra_dbs=None, writable = False) - The connect() function connects to + The connect() function connects to one or several &RCL; index(es) and returns - a Db object. + a Db object. - confdir may specify + confdir may specify a configuration directory. The usual defaults - apply. - extra_dbs is a list of - additional indexes (Xapian directories). - writable decides if + apply. + extra_dbs is a list of + additional indexes (Xapian directories). + writable decides if we can index new data through this - connection. + connection. - This call initializes the recoll module, and it should - always be performed before any other call or object creation. + This call initializes the recoll module, and it should + always be performed before any other call or object + creation. 
- - + Classes - + The Db class A Db object is created by @@ -4592,38 +4630,38 @@ or Db.close() - Closes the connection. You can't do anything + Closes the connection. You can't do anything with the Db object after - this. + this. - Db.query(), Db.cursor() These + Db.query(), Db.cursor() These aliases return a blank Query object - for this index. + for this index. Db.setAbstractParams(maxchars, - contextwords) Set the parameters used + contextwords) Set the parameters used to build snippets (sets of keywords in context text fragments). maxchars defines the maximum total size of the abstract. contextwords defines how many - terms are shown around the keyword. + terms are shown around the keyword. Db.termMatch(match_type, expr, field='', maxlen=-1, casesens=False, diacsens=False, lang='english') - Expand an expression against the + Expand an expression against the index term list. Performs the basic function from the GUI term explorer tool. match_type can be either of wildcard, regexp or stem. Returns a list of terms expanded from the input expression. - + @@ -4631,7 +4669,7 @@ or - + The Query class A Query object (equivalent to a @@ -4643,76 +4681,77 @@ or Query.sortby(fieldname, ascending=True) - Sort results + Sort results by fieldname, in ascending or descending order. Must be called before executing - the search. + the search. Query.execute(query_string, stemming=1, stemlang="english") - Starts a search + Starts a search for query_string, a &RCL; - search language string. + search language string. Query.executesd(SearchData) - Starts a search for the query defined by the - SearchData object. + Starts a search for the query defined by the + SearchData object. Query.fetchmany(size=query.arraysize) - Fetches + Fetches the next Doc objects in the current search results, and returns them as an array of the required size, which is by default the value of - the arraysize data member. + the arraysize data member. 
Query.fetchone() - Fetches the next Doc object - from the current search results. + Fetches the next Doc object + from the current search results. Query.close() - Closes the query. The object is unusable - after the call. + Closes the query. The object is unusable + after the call. Query.scroll(value, mode='relative') - Adjusts the position in the current result + Adjusts the position in the current result set. mode can be relative - or absolute. + or absolute. Query.getgroups() - Retrieves the expanded query terms as a list + Retrieves the expanded query terms as a list of pairs. Meaningful only after executexx In each pair, the first entry is a list of user terms (of size one for simple terms, or more for group and phrase clauses), the second a list of query terms as derived from the user terms and used in the Xapian - Query. + Query. Query.getxquery() - Return the Xapian query description as a Unicode string. - Meaningful only after executexx. + Return the Xapian query description as a + Unicode string. + Meaningful only after executexx. Query.highlight(text, ishtml = 0, methods = object) - Will insert <span "class=rclmatch">, + Will insert <span "class=rclmatch">, </span> tags around the match areas in the input text and return the modified text. ishtml can be set to indicate that the input text is HTML and @@ -4720,39 +4759,41 @@ or methods if set should be an object with methods startMatch(i) and endMatch() which will be called for each match and should return a begin and end - tag + tag Query.makedocabstract(doc, methods = object)) - Create a snippets abstract + Create a snippets abstract for doc (a Doc object) by selecting text around the match terms. If methods is set, will also perform highlighting. See the highlight method. - + Query.__iter__() and Query.next() - So that things like for doc in - query: will work. + So that things like for doc in + query: will work. - Query.arraysize Default - number of records processed by fetchmany (r/w). 
+ Query.arraysize + Default number of records processed by fetchmany + (r/w). - Query.rowcountNumber of - records returned by the last execute. - Query.rownumberNext index - to be fetched from results. Normally increments after - each fetchone() call, but can be set/reset before the - call to effect seeking (equivalent to - using scroll()). Starts at - 0. + Query.rowcountNumber + of records returned by the last + execute. + Query.rownumberNext index + to be fetched from results. Normally increments after + each fetchone() call, but can be set/reset before the + call to effect seeking (equivalent to + using scroll()). Starts at + 0. @@ -4760,7 +4801,7 @@ or - + The Doc class A Doc object contains index data @@ -4789,27 +4830,52 @@ or get(key), [] operator - Retrieve the named doc attribute + + Retrieve the named doc + attribute. You can also use + getattr(doc, key) or + doc.key. - getbinurl()Retrieve - the URL in byte array format (no transcoding), for use as - parameter to a system call. + + + doc.key = value + + Set the the named doc + attribute. You can also use + setattr(doc, key, value). + + + getbinurl() + + Retrieve the URL in byte array format (no + transcoding), for use as parameter to a system + call. + + + + setbinurl(url) + + Set the URL in byte array format (no + transcoding). + + items() - Return a dictionary of doc object - keys/values + Return a dictionary of doc object + keys/values + keys() - list of doc object keys (attribute - names). + list of doc object keys (attribute + names). - + The SearchData class A SearchData object allows building @@ -4825,7 +4891,7 @@ or addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', qstring=string, slack=0, field='', stemming=1, subSearch=SearchData) - + @@ -4834,7 +4900,7 @@ or - + The rclextract module Index queries do not provide document content (only a @@ -4847,23 +4913,23 @@ or provides a single class which can be used to access the data content for result documents. 
- + Classes - + The Extractor class Extractor(doc) - An Extractor object is + An Extractor object is built from a Doc object, output - from a query. + from a query. Extractor.textextract(ipath) - Extract document defined + Extract document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or @@ -4875,11 +4941,11 @@ extractor = recoll.Extractor(qdoc) doc = extractor.textextract(qdoc.ipath) # use doc.text, e.g. for previewing - + Extractor.idoctofile(ipath, targetmtype, outfile='') - Extracts document into an output file, + Extracts document into an output file, which can be given explicitly or will be created as a temporary file to be deleted by the caller. Typical use: @@ -4887,7 +4953,7 @@ qdoc = query.fetchone() extractor = recoll.Extractor(qdoc) filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype) - + @@ -4896,10 +4962,8 @@ filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype) - - - - Example code + + Search API usage example The following sample would query the index with a user language string. See the python/samples @@ -4934,17 +4998,189 @@ for i in range(nres): + - - Compatibility with the previous version - The following code fragments can be used to ensure that - code can run with both the old and the new API (as long as it - does not use the new abilities of the new API of - course). + + Creating Python external indexers - Adapting to the new package structure: - + The update API can be used to create an index from data which + is not accessible to the regular &RCL; indexer, or structured to + present difficulties to the &RCL; input handlers. + + An indexer created using this API will be have equivalent work + to do as the the Recoll file system indexer: look for modified + documents, extract their text, call the API for indexing it, take + care of purging the index out of data from documents which do not + exist in the document store any more. 
      The data for such an external indexer should be stored in an
      index separate from any used by the &RCL; internal file system
      indexer. The reason is that the main document indexer purge pass
      (removal of deleted documents) would also remove all the documents
      belonging to the external indexer, as they were not seen during the
      filesystem walk. The main indexer documents would also probably be a
      problem for the external indexer's own purge operation.

      While there would be ways to enable multiple foreign indexers
      to cooperate on a single index, it is just simpler to use separate
      ones, and use the multiple index access capabilities of the query
      interface, if needed.

      There are two parts in the update interface:

      Methods inside the recoll
      module allow inserting data into the index, to make it accessible by
      the normal query interface.
      An interface based on script execution is defined
      to allow either the GUI or the rclextract
      module to access original document data for previewing or
      editing.

      Python update interface

      The update methods are part of the
      recoll module described above. The connect()
      method is used with a writable=true parameter to
      obtain a writable Db object. The following
      Db object methods are then available.

      addOrUpdate(udi, doc, parent_udi=None)
      Add or update index data for a given document. The
      udi string must define a unique id for
      the document. It is an opaque interface element and not
      interpreted inside Recoll. doc is a
      Doc object, created from the data to be
      indexed (the main text should be in
      doc.text). If
      parent_udi is set, this is a unique
      identifier for the top-level container (e.g. for the
      filesystem indexer, this would be the one which is an actual
      file).

      delete(udi)
      Purge index from all data for
      udi, and all documents (if any) which have a
      matching parent_udi.
      needUpdate(udi, sig)
      Test if the index needs to be updated for the
      document identified by udi. If this call is
      to be used, the doc.sig field should contain
      a signature value when calling
      addOrUpdate(). The
      needUpdate() call then compares its
      parameter value with the stored sig for
      udi. sig is an opaque
      value, compared as a string.
      The filesystem indexer uses a
      concatenation of the decimal string values for file size and
      update time, but a hash of the contents could also be
      used.
      As a side effect, if the return value is false (the index
      is up to date), the call will set the existence flag for the
      document (and any subdocument defined by its
      parent_udi), so that a later
      purge() call will preserve them.
      The use of needUpdate() and
      purge() is optional, and the indexer may use
      another method for checking the need to reindex or to delete
      stale entries.

      purge()
      Delete all documents that were not touched
      during the just finished indexing pass (since
      open-for-write). These are the documents for which the
      needUpdate() call was not performed, indicating that they no
      longer exist in the primary storage system.

      Query data access for external indexers

      &RCL; has internal methods to access document data for its
      internal (filesystem) indexer. An external indexer needs to provide
      data access methods if it needs integration with the GUI
      (e.g. preview function), or support for the
      rclextract module.

      The index data and the access method are linked by the
      rclbes (recoll backend storage)
      Doc field. You should set this to a short string
      value identifying your indexer (e.g. the filesystem indexer uses either
      "FS" or an empty value; the Web history indexer uses "BGL").

      The link is actually performed inside a
      backends configuration file (stored in the
      configuration directory). This defines commands to execute to
      access data from the specified indexer.
Example, for the mbox + indexing sample found in the Recoll source (which sets + rclbes="MBOX"): + [MBOX] +fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch +makesig = path/to/recoll/src/python/samples/rclmbox.py makesig + + fetch and makesig + define two commands to execute to respectively retrieve the + document text and compute the document signature (the example + implementation uses the same script with different first parameters + to perform both operations). + + The scripts are called with three additional arguments: + udi, url, + ipath, stored with the document when it was + indexed, and may use any or all to perform the requested + operation. The caller expects the result data on + stdout. + + + + + External indexer samples + + The Recoll source tree has two samples of external indexers + in the src/python/samples directory. The more + interesting one is rclmbox.py which indexes a + directory containing mbox folder files. It + exercises most features in the update interface, and has a data + access interface. + + See the comments inside the file for more information. + + + + + Package compatibility with the previous version + + The following code fragments can be used to ensure that + code can run with both the old and the new API (as long as it + does not use the new abilities of the new API of + course). + + Adapting to the new package structure: + - + - Adapting to the change of nature of - the next Query - member. The same test can be used to choose to use - the scroll() method (new) or set - the next value (old). + Adapting to the change of nature of + the next Query + member. The same test can be used to choose to use + the scroll() method (new) or set + the next value (old). - + - + - - + + +