document the python index update interface
parent
5eba4eebcf
commit
cd11886f6c
3 changed files with 1180 additions and 542 deletions
|
@ -19,7 +19,7 @@ commonoptions=--stringparam section.autolabel 1 \
|
||||||
|
|
||||||
# index.html chunk format target replaced by nicer webhelp (needs separate
|
# index.html chunk format target replaced by nicer webhelp (needs separate
|
||||||
# make) in webhelp/ subdir
|
# make) in webhelp/ subdir
|
||||||
all: usermanual.html usermanual.pdf webh
|
all: usermanual.html webh usermanual.pdf
|
||||||
|
|
||||||
webh:
|
webh:
|
||||||
make -C webhelp
|
make -C webhelp
|
||||||
|
|
File diff suppressed because it is too large
|
@ -262,7 +262,7 @@
|
||||||
are other ways to perform &RCL; searches: mostly a <link
|
are other ways to perform &RCL; searches: mostly a <link
|
||||||
linkend="RCL.SEARCH.COMMANDLINE">
|
linkend="RCL.SEARCH.COMMANDLINE">
|
||||||
command line interface</link>, a
|
command line interface</link>, a
|
||||||
<link linkend="RCL.PROGRAM.API.PYTHON">
|
<link linkend="RCL.PROGRAM.PYTHONAPI">
|
||||||
<application>Python</application>
|
<application>Python</application>
|
||||||
programming interface</link>, a <link linkend="RCL.SEARCH.KIO">
|
programming interface</link>, a <link linkend="RCL.SEARCH.KIO">
|
||||||
<application>KDE</application> KIO slave module</link>, and
|
<application>KDE</application> KIO slave module</link>, and
|
||||||
|
@ -3094,7 +3094,7 @@ MimeType=*/*
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem><para>By writing a custom
|
<listitem><para>By writing a custom
|
||||||
<application>Python</application> program, using the
|
<application>Python</application> program, using the
|
||||||
<link linkend="RCL.PROGRAM.API.PYTHON">Recoll Python API</link>.</para>
|
<link linkend="RCL.PROGRAM.PYTHONAPI">Recoll Python API</link>.</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
|
|
||||||
|
@ -3950,7 +3950,7 @@ dir:recoll dir:src -dir:utils -dir:common
|
||||||
<sect1 id="RCL.PROGRAM.FILTERS">
|
<sect1 id="RCL.PROGRAM.FILTERS">
|
||||||
<title>Writing a document input handler</title>
|
<title>Writing a document input handler</title>
|
||||||
|
|
||||||
<note><title>Terminology</title>The small programs or pieces
|
<note><title>Terminology</title><para>The small programs or pieces
|
||||||
of code which handle the processing of the different document
|
of code which handle the processing of the different document
|
||||||
types for &RCL; used to be called <literal>filters</literal>,
|
types for &RCL; used to be called <literal>filters</literal>,
|
||||||
which is still reflected in the name of the directory which
|
which is still reflected in the name of the directory which
|
||||||
|
@ -3960,7 +3960,7 @@ dir:recoll dir:src -dir:utils -dir:common
|
||||||
content. However these modules may have other behaviours, and
|
content. However these modules may have other behaviours, and
|
||||||
the term <literal>input handler</literal> is now progressively
|
the term <literal>input handler</literal> is now progressively
|
||||||
substituted in the documentation. <literal>filter</literal> is
|
substituted in the documentation. <literal>filter</literal> is
|
||||||
still used in many places though.</note>
|
still used in many places though.</para></note>
|
||||||
|
|
||||||
<para>&RCL; input handlers cooperate to translate from the multitude
|
<para>&RCL; input handlers cooperate to translate from the multitude
|
||||||
of input document formats, simple ones
|
of input document formats, simple ones
|
||||||
|
@ -4392,82 +4392,25 @@ or
|
||||||
</sect1>
|
</sect1>
|
||||||
|
|
||||||
|
|
||||||
<sect1 id="RCL.PROGRAM.API">
|
<sect1 id="RCL.PROGRAM.PYTHONAPI">
|
||||||
<title>API</title>
|
<title>Python API</title>
|
||||||
|
|
||||||
<sect2 id="RCL.PROGRAM.API.ELEMENTS">
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.INTRO">
|
||||||
<title>Interface elements</title>
|
|
||||||
|
|
||||||
<para>A few elements in the interface are specific and and need
|
|
||||||
an explanation.</para>
|
|
||||||
|
|
||||||
<variablelist>
|
|
||||||
|
|
||||||
<varlistentry>
|
|
||||||
<term>udi</term> <listitem><para>An udi (unique document
|
|
||||||
identifier) identifies a document. Because of limitations
|
|
||||||
inside the index engine, it is restricted in length (to
|
|
||||||
200 bytes), which is why a regular URI cannot be used. The
|
|
||||||
structure and contents of the udi is defined by the
|
|
||||||
application and opaque to the index engine. For example,
|
|
||||||
the internal file system indexer uses the complete
|
|
||||||
document path (file path + internal path), truncated to
|
|
||||||
length, the suppressed part being replaced by a hash
|
|
||||||
value.</para> </listitem>
|
|
||||||
</varlistentry>
|
|
||||||
|
|
||||||
<varlistentry>
|
|
||||||
<term>ipath</term>
|
|
||||||
|
|
||||||
<listitem><para>This data value (set as a field in the Doc
|
|
||||||
object) is stored, along with the URL, but not indexed by
|
|
||||||
&RCL;. Its contents are not interpreted, and its use is up
|
|
||||||
to the application. For example, the &RCL; internal file
|
|
||||||
system indexer stores the part of the document access path
|
|
||||||
internal to the container file (<literal>ipath</literal> in
|
|
||||||
this case is a list of subdocument sequential numbers). url
|
|
||||||
and ipath are returned in every search result and permit
|
|
||||||
access to the original document.</para>
|
|
||||||
</listitem>
|
|
||||||
</varlistentry>
|
|
||||||
|
|
||||||
<varlistentry>
|
|
||||||
<term>Stored and indexed fields</term>
|
|
||||||
|
|
||||||
<listitem><para>The <filename>fields</filename> file inside
|
|
||||||
the &RCL; configuration defines which document fields are
|
|
||||||
either "indexed" (searchable), "stored" (retrievable with
|
|
||||||
search results), or both.</para>
|
|
||||||
</listitem>
|
|
||||||
</varlistentry>
|
|
||||||
|
|
||||||
</variablelist>
|
|
||||||
|
|
||||||
<para>Data for an external indexer, should be stored in a
|
|
||||||
separate index, not the one for the &RCL; internal file system
|
|
||||||
indexer, except if the latter is not used at all). The reason
|
|
||||||
is that the main document indexer purge pass would remove all
|
|
||||||
the other indexer's documents, as they were not seen during
|
|
||||||
indexing. The main indexer documents would also probably be a
|
|
||||||
problem for the external indexer purge operation.</para>
|
|
||||||
|
|
||||||
</sect2>
|
|
||||||
|
|
||||||
<sect2 id="RCL.PROGRAM.API.PYTHON">
|
|
||||||
<title>Python interface</title>
|
|
||||||
|
|
||||||
<sect3 id="RCL.PROGRAM.PYTHON.INTRO">
|
|
||||||
<title>Introduction</title>
|
<title>Introduction</title>
|
||||||
|
|
||||||
<para>&RCL; versions after 1.11 define a Python programming
|
<para>&RCL; versions after 1.11 define a Python programming
|
||||||
interface, both for searching and indexing.</para>
|
interface, both for searching and creating/updating an
|
||||||
|
index.</para>
|
||||||
|
|
||||||
<para>The search interface is used in the Recoll Ubuntu Unity Lens
|
<para>The search interface is used in the &RCL; Ubuntu Unity Lens
|
||||||
and Recoll WebUI.</para>
|
and the &RCL; Web UI. It can run queries on any &RCL;
|
||||||
|
configuration.</para>
|
||||||
|
|
||||||
<para>The indexing section of the API has seen little use, and is
|
<para>The index update section of the API may be used to create and
|
||||||
more a proof of concept. In truth it is waiting for its killer
|
update &RCL; indexes on specific configurations (separate from the
|
||||||
app...</para>
|
ones created by <command>recollindex</command>). The resulting
|
||||||
|
databases can be queried alone, or in conjunction with regular
|
||||||
|
ones, through the GUI or any of the query interfaces.</para>
|
||||||
|
|
||||||
<para>The search API is modeled along the Python database API
|
<para>The search API is modeled along the Python database API
|
||||||
specification. There were two major changes along &RCL; versions:
|
specification. There were two major changes along &RCL; versions:
|
||||||
|
@ -4483,10 +4426,9 @@ or
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>We will mostly describe the new API and package
|
<para>We will describe the new API and package structure here. A
|
||||||
structure here. A paragraph at the end of this section will
|
paragraph at the end of this section will explain a few differences
|
||||||
explain a few differences and ways to write code
|
and ways to write code compatible with both versions.</para>
|
||||||
compatible with both versions.</para>
|
|
||||||
|
|
||||||
<para>The Python interface can be found in the source package,
|
<para>The Python interface can be found in the source package,
|
||||||
under <filename>python/recoll</filename>.</para>
|
under <filename>python/recoll</filename>.</para>
|
||||||
|
@ -4513,44 +4455,140 @@ or
|
||||||
distribution, the Python API can sometimes be found in a
|
distribution, the Python API can sometimes be found in a
|
||||||
separate package.</para>
|
separate package.</para>
|
||||||
|
|
||||||
<para>The following small sample will run a query and list
|
<para>As an introduction, the following small sample will run a
|
||||||
the title and url for each of the results. It would work with &RCL;
|
query and list the title and url for each of the results. It would
|
||||||
1.19 and later. The <filename>python/samples</filename> source directory
|
work with &RCL; 1.19 and later. The
|
||||||
contains several examples of Python programming with &RCL;,
|
<filename>python/samples</filename> source directory contains
|
||||||
exercising the extension more completely, and especially its data
|
several examples of Python programming with &RCL;, exercising the
|
||||||
extraction features.</para>
|
extension more completely, and especially its data extraction
|
||||||
<programlisting>
|
features.</para>
|
||||||
from recoll import recoll
|
|
||||||
|
|
||||||
db = recoll.connect()
|
<programlisting><![CDATA[
|
||||||
query = db.query()
|
#!/usr/bin/env python
|
||||||
nres = query.execute("some query")
|
|
||||||
results = query.fetchmany(20)
|
|
||||||
for doc in results:
|
|
||||||
print(doc.url, doc.title)
|
|
||||||
</programlisting>
|
|
||||||
</sect3>
|
|
||||||
|
|
||||||
<sect3 id="RCL.PROGRAM.PYTHON.PACKAGE">
|
from recoll import recoll
|
||||||
|
|
||||||
|
db = recoll.connect()
|
||||||
|
query = db.query()
|
||||||
|
nres = query.execute("some query")
|
||||||
|
results = query.fetchmany(20)
|
||||||
|
for doc in results:
|
||||||
|
print(doc.url, doc.title)
|
||||||
|
]]></programlisting>
|
||||||
|
|
||||||
|
</sect2>
|
||||||
|
|
||||||
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.ELEMENTS">
|
||||||
|
<title>Interface elements</title>
|
||||||
|
|
||||||
|
<para>A few elements in the interface are specific and need
|
||||||
|
an explanation.</para>
|
||||||
|
|
||||||
|
<variablelist>
|
||||||
|
|
||||||
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH">
|
||||||
|
<term>ipath</term>
|
||||||
|
|
||||||
|
<listitem><para>This data value (set as a field in the Doc
|
||||||
|
object) is stored, along with the URL, but not indexed by
|
||||||
|
&RCL;. Its contents are not interpreted by the index layer, and
|
||||||
|
its use is up to the application. For example, the &RCL; file
|
||||||
|
system indexer uses the <literal>ipath</literal> to store the
|
||||||
|
part of the document access path internal to (possibly
|
||||||
|
nested) container documents. <literal>ipath</literal> in
|
||||||
|
this case is a vector of access elements (e.g., the first part
|
||||||
|
could be a path inside a zip file to an archive member which
|
||||||
|
happens to be an mbox file, the second element would be the
|
||||||
|
message sequential number inside the mbox
|
||||||
|
etc.). <literal>url</literal> and <literal>ipath</literal> are
|
||||||
|
returned in every search result and define the access to the
|
||||||
|
original document. <literal>ipath</literal> is empty for
|
||||||
|
top-level document/files (e.g. a PDF document which is a
|
||||||
|
filesystem file). The &RCL; GUI knows about the structure of the
|
||||||
|
<literal>ipath</literal> values used by the filesystem indexer,
|
||||||
|
and uses it for such functions as opening the parent of a given
|
||||||
|
document.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">
|
||||||
|
<term>udi</term>
|
||||||
|
|
||||||
|
<listitem><para>A <literal>udi</literal> (unique document
|
||||||
|
identifier) identifies a document. Because of limitations inside
|
||||||
|
the index engine, it is restricted in length (to 200 bytes),
|
||||||
|
which is why a regular URI cannot be used. The structure and
|
||||||
|
contents of the <literal>udi</literal> are defined by the
|
||||||
|
application and opaque to the index engine. For example, the
|
||||||
|
internal file system indexer uses the complete document path
|
||||||
|
(file path + internal path), truncated to length, the suppressed
|
||||||
|
part being replaced by a hash value. The <literal>udi</literal>
|
||||||
|
is not explicit in the query interface (it is used "under the
|
||||||
|
hood" by the <filename>rclextract</filename> module), but it is
|
||||||
|
an explicit element of the update interface.</para> </listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">
|
||||||
|
<term>parent_udi</term>
|
||||||
|
|
||||||
|
<listitem><para>If this attribute is set on a document when
|
||||||
|
entering it in the index, it designates its physical container
|
||||||
|
document. In a multilevel hierarchy, this may not be the
|
||||||
|
immediate parent. <literal>parent_udi</literal> is optional, but
|
||||||
|
its use by an indexer may simplify index maintenance, as &RCL;
|
||||||
|
will automatically delete all children defined by
|
||||||
|
<literal>parent_udi == udi</literal> when the document designated
|
||||||
|
by <literal>udi</literal> is destroyed. For example, if a
|
||||||
|
<literal>Zip</literal> archive contains entries which are
|
||||||
|
themselves containers, like <literal>mbox</literal> files, all
|
||||||
|
the subdocuments inside the <literal>Zip</literal> file (mbox,
|
||||||
|
messages, message attachments, etc.) would have the same
|
||||||
|
<literal>parent_udi</literal>, matching the
|
||||||
|
<literal>udi</literal> for the <literal>Zip</literal> file, and
|
||||||
|
all would be destroyed when the <literal>Zip</literal> file
|
||||||
|
(identified by its <literal>udi</literal>) is removed from the
|
||||||
|
index. The standard filesystem indexer uses
|
||||||
|
<literal>parent_udi</literal>.</para></listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>Stored and indexed fields</term>
|
||||||
|
|
||||||
|
<listitem><para>The <filename>fields</filename> file inside
|
||||||
|
the &RCL; configuration defines which document fields are
|
||||||
|
either "indexed" (searchable), "stored" (retrievable with
|
||||||
|
search results), or both.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
</variablelist>
|
||||||
|
|
||||||
|
</sect2>
|
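Review note on the udi and parent_udi entries above: the following is a minimal sketch of how an external indexer might apply the two conventions. It is not Recoll's own algorithm. The `make_udi` helper, the `|` separator, the SHA-256 choice and the `ToyIndex` class are all hypothetical names introduced for illustration; only the 200-byte limit, the "truncate and replace the suppressed part by a hash" idea and the "delete all children whose parent_udi matches" behaviour come from the text.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit quoted in the udi entry


def make_udi(path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Return a udi (bytes) no longer than max_bytes.

    Hypothetical scheme: join the file path and the internal path; if the
    result is too long, replace the cut-off tail with a SHA-256 hex digest
    of the full value so that distinct documents keep distinct udis.
    """
    full = (path + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    digest = hashlib.sha256(full).hexdigest().encode("ascii")  # 64 bytes
    return full[: max_bytes - len(digest)] + digest


class ToyIndex:
    """In-memory stand-in for an index, mapping udi -> parent_udi."""

    def __init__(self):
        self.docs = {}

    def add(self, udi, parent_udi=None):
        self.docs[udi] = parent_udi

    def delete(self, udi):
        # Drop the document itself and every document naming it as parent_udi.
        self.docs = {
            u: p for u, p in self.docs.items() if u != udi and p != udi
        }
```

Because every subdocument of a container carries the container's udi as its `parent_udi`, a single `delete()` on the container is enough to clear the whole subtree in this toy model, which mirrors the maintenance benefit described in the parent_udi entry.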
||||||
|
|
||||||
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.SEARCH">
|
||||||
|
<title>Python search interface</title>
|
||||||
|
|
||||||
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.PACKAGE">
|
||||||
<title>Recoll package</title>
|
<title>Recoll package</title>
|
||||||
|
|
||||||
<para>The <literal>recoll</literal> package contains two
|
<para>The <literal>recoll</literal> package contains two
|
||||||
modules:
|
modules:
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem><para>The <literal>recoll</literal> module contains
|
<listitem><para>The <literal>recoll</literal> module contains
|
||||||
functions and classes used to query (or update) the
|
functions and classes used to query (or update) the
|
||||||
index.</para></listitem>
|
index. This section only describes the query part; see
|
||||||
|
below for the update part.</para></listitem>
|
||||||
<listitem><para>The <literal>rclextract</literal> module contains
|
<listitem><para>The <literal>rclextract</literal> module contains
|
||||||
functions and classes used to access document
|
functions and classes used to access document
|
||||||
data.</para></listitem>
|
data.</para></listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
</para>
|
</para>
|
||||||
</sect3>
|
</sect3>
|
||||||
|
|
||||||
<sect3 id="RCL.PROGRAM.PYTHON.RECOLL">
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.RECOLL">
|
||||||
<title>The recoll module</title>
|
<title>The recoll module</title>
|
||||||
|
|
||||||
<sect4 id="RCL.PROGRAM.PYTHON.RECOLL.FUNCTIONS">
|
<sect4 id="RCL.PROGRAM.PYTHONAPI.RECOLL.FUNCTIONS">
|
||||||
<title>Functions</title>
|
<title>Functions</title>
|
||||||
|
|
||||||
<variablelist>
|
<variablelist>
|
||||||
|
@ -4558,32 +4596,32 @@ or
|
||||||
<term>connect(confdir=None, extra_dbs=None,
|
<term>connect(confdir=None, extra_dbs=None,
|
||||||
writable = False)</term>
|
writable = False)</term>
|
||||||
<listitem>
|
<listitem>
|
||||||
The <literal>connect()</literal> function connects to
|
<para>The <literal>connect()</literal> function connects to
|
||||||
one or several &RCL; index(es) and returns
|
one or several &RCL; index(es) and returns
|
||||||
a <literal>Db</literal> object.
|
a <literal>Db</literal> object.</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem><literal>confdir</literal> may specify
|
<listitem><para><literal>confdir</literal> may specify
|
||||||
a configuration directory. The usual defaults
|
a configuration directory. The usual defaults
|
||||||
apply.</listitem>
|
apply.</para></listitem>
|
||||||
<listitem><literal>extra_dbs</literal> is a list of
|
<listitem><para><literal>extra_dbs</literal> is a list of
|
||||||
additional indexes (Xapian directories). </listitem>
|
additional indexes (Xapian directories).</para></listitem>
|
||||||
<listitem><literal>writable</literal> decides if
|
<listitem><para><literal>writable</literal> decides if
|
||||||
we can index new data through this
|
we can index new data through this
|
||||||
connection.</listitem>
|
connection.</para></listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
This call initializes the recoll module, and it should
|
<para>This call initializes the recoll module, and it should
|
||||||
always be performed before any other call or object creation.
|
always be performed before any other call or object
|
||||||
|
creation.</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
</variablelist>
|
</variablelist>
|
||||||
</sect4>
|
</sect4>
|
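Review note on `connect()`: a small wrapper sketch showing how the three documented parameters might be passed. The `open_index` name is hypothetical, and the module is passed in as an argument (normally the result of `from recoll import recoll`) so that the sketch can be exercised without an installed index. It only forwards arguments that were explicitly set, on the assumption (not confirmed by the text) that the binding may not accept `None` for every parameter.

```python
def open_index(recoll_mod, confdir=None, extra_dbs=None, writable=False):
    """Call recoll_mod.connect() with only the arguments that were set.

    recoll_mod is expected to expose connect(confdir, extra_dbs, writable)
    as documented above; anything else about it is an assumption.
    """
    kwargs = {}
    if confdir is not None:
        kwargs["confdir"] = confdir          # configuration directory
    if extra_dbs:
        kwargs["extra_dbs"] = list(extra_dbs)  # additional Xapian directories
    if writable:
        kwargs["writable"] = True            # needed to index new data
    return recoll_mod.connect(**kwargs)
```

With a real installation this would typically be used as `db = open_index(recoll, confdir="/path/to/conf")`, where the path is a placeholder.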
||||||
|
|
||||||
|
|
||||||
<sect4 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES">
|
<sect4 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES">
|
||||||
<title>Classes</title>
|
<title>Classes</title>
|
||||||
|
|
||||||
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DB">
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DB">
|
||||||
<title>The Db class</title>
|
<title>The Db class</title>
|
||||||
|
|
||||||
<para>A Db object is created by
|
<para>A Db object is created by
|
||||||
|
@ -4592,38 +4630,38 @@ or
|
||||||
<variablelist>
|
<variablelist>
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Db.close()</term>
|
<term>Db.close()</term>
|
||||||
<listitem>Closes the connection. You can't do anything
|
<listitem><para>Closes the connection. You can't do anything
|
||||||
with the <literal>Db</literal> object after
|
with the <literal>Db</literal> object after
|
||||||
this.</listitem>
|
this.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Db.query(), Db.cursor()</term> <listitem>These
|
<term>Db.query(), Db.cursor()</term> <listitem><para>These
|
||||||
aliases return a blank <literal>Query</literal> object
|
aliases return a blank <literal>Query</literal> object
|
||||||
for this index.</listitem>
|
for this index.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Db.setAbstractParams(maxchars,
|
<term>Db.setAbstractParams(maxchars,
|
||||||
contextwords)</term> <listitem>Set the parameters used
|
contextwords)</term> <listitem><para>Set the parameters used
|
||||||
to build snippets (sets of keywords in context text
|
to build snippets (sets of keywords in context text
|
||||||
fragments). <literal>maxchars</literal> defines the
|
fragments). <literal>maxchars</literal> defines the
|
||||||
maximum total size of the abstract.
|
maximum total size of the abstract.
|
||||||
<literal>contextwords</literal> defines how many
|
<literal>contextwords</literal> defines how many
|
||||||
terms are shown around the keyword.</listitem>
|
terms are shown around the keyword.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Db.termMatch(match_type, expr, field='',
|
<term>Db.termMatch(match_type, expr, field='',
|
||||||
maxlen=-1, casesens=False, diacsens=False, lang='english')
|
maxlen=-1, casesens=False, diacsens=False, lang='english')
|
||||||
</term>
|
</term>
|
||||||
<listitem>Expand an expression against the
|
<listitem><para>Expand an expression against the
|
||||||
index term list. Performs the basic function from the
|
index term list. Performs the basic function from the
|
||||||
GUI term explorer tool. <literal>match_type</literal>
|
GUI term explorer tool. <literal>match_type</literal>
|
||||||
can be either
|
can be either
|
||||||
of <literal>wildcard</literal>, <literal>regexp</literal>
|
of <literal>wildcard</literal>, <literal>regexp</literal>
|
||||||
or <literal>stem</literal>. Returns a list of terms
|
or <literal>stem</literal>. Returns a list of terms
|
||||||
expanded from the input expression.
|
expanded from the input expression.
|
||||||
</listitem>
|
</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
</variablelist>
|
</variablelist>
|
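Review note on the `Db` methods: a hedged sketch of two thin helpers around `setAbstractParams()` and `termMatch()`. The helper names and the default values are hypothetical; the calls use only the positional arguments listed in the method signatures above, and `db` is assumed to be an object returned by `connect()`.

```python
MATCH_TYPES = ("wildcard", "regexp", "stem")  # values listed for match_type


def configure_snippets(db, maxchars=300, contextwords=6):
    """Set the snippet size limits (default values here are arbitrary)."""
    db.setAbstractParams(maxchars, contextwords)


def expand_terms(db, expr, match_type="wildcard"):
    """Expand expr against the index term list and return a Python list."""
    if match_type not in MATCH_TYPES:
        raise ValueError("match_type must be one of %s" % (MATCH_TYPES,))
    return list(db.termMatch(match_type, expr))
```

Validating `match_type` before the call is a design choice of the sketch: it turns a typo into an immediate `ValueError` instead of relying on whatever error the extension would report.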
||||||
|
@ -4631,7 +4669,7 @@ or
|
||||||
</sect5>
|
</sect5>
|
||||||
|
|
||||||
|
|
||||||
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.QUERY">
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY">
|
||||||
<title>The Query class</title>
|
<title>The Query class</title>
|
||||||
|
|
||||||
<para>A <literal>Query</literal> object (equivalent to a
|
<para>A <literal>Query</literal> object (equivalent to a
|
||||||
|
@ -4643,76 +4681,77 @@ or
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.sortby(fieldname, ascending=True)</term>
|
<term>Query.sortby(fieldname, ascending=True)</term>
|
||||||
<listitem>Sort results
|
<listitem><para>Sort results
|
||||||
by <replaceable>fieldname</replaceable>, in ascending
|
by <replaceable>fieldname</replaceable>, in ascending
|
||||||
or descending order. Must be called before executing
|
or descending order. Must be called before executing
|
||||||
the search.</listitem>
|
the search.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.execute(query_string, stemming=1,
|
<term>Query.execute(query_string, stemming=1,
|
||||||
stemlang="english")</term>
|
stemlang="english")</term>
|
||||||
<listitem>Starts a search
|
<listitem><para>Starts a search
|
||||||
for <replaceable>query_string</replaceable>, a &RCL;
|
for <replaceable>query_string</replaceable>, a &RCL;
|
||||||
search language string.</listitem>
|
search language string.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.executesd(SearchData)</term>
|
<term>Query.executesd(SearchData)</term>
|
||||||
<listitem>Starts a search for the query defined by the
|
<listitem><para>Starts a search for the query defined by the
|
||||||
SearchData object.</listitem>
|
SearchData object.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.fetchmany(size=query.arraysize)</term>
|
<term>Query.fetchmany(size=query.arraysize)</term>
|
||||||
|
|
||||||
<listitem>Fetches
|
<listitem><para>Fetches
|
||||||
the next <literal>Doc</literal> objects in the current
|
the next <literal>Doc</literal> objects in the current
|
||||||
search results, and returns them as an array of the
|
search results, and returns them as an array of the
|
||||||
required size, which is by default the value of
|
required size, which is by default the value of
|
||||||
the <literal>arraysize</literal> data member.</listitem>
|
the <literal>arraysize</literal> data member.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.fetchone()</term>
|
<term>Query.fetchone()</term>
|
||||||
<listitem>Fetches the next <literal>Doc</literal> object
|
<listitem><para>Fetches the next <literal>Doc</literal> object
|
||||||
from the current search results.</listitem>
|
from the current search results.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.close()</term>
|
<term>Query.close()</term>
|
||||||
<listitem>Closes the query. The object is unusable
|
<listitem><para>Closes the query. The object is unusable
|
||||||
after the call.</listitem>
|
after the call.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.scroll(value, mode='relative')</term>
|
<term>Query.scroll(value, mode='relative')</term>
|
||||||
<listitem>Adjusts the position in the current result
|
<listitem><para>Adjusts the position in the current result
|
||||||
set. <literal>mode</literal> can
|
set. <literal>mode</literal> can
|
||||||
be <literal>relative</literal>
|
be <literal>relative</literal>
|
||||||
or <literal>absolute</literal>. </listitem>
|
or <literal>absolute</literal>. </para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.getgroups()</term>
|
<term>Query.getgroups()</term>
|
||||||
<listitem>Retrieves the expanded query terms as a list
|
<listitem><para>Retrieves the expanded query terms as a list
|
||||||
of pairs. Meaningful only after executexx. In each
|
of pairs. Meaningful only after executexx. In each
|
||||||
pair, the first entry is a list of user terms (of size
|
pair, the first entry is a list of user terms (of size
|
||||||
one for simple terms, or more for group and phrase
|
one for simple terms, or more for group and phrase
|
||||||
clauses), the second a list of query terms as derived
|
clauses), the second a list of query terms as derived
|
||||||
from the user terms and used in the Xapian
|
from the user terms and used in the Xapian
|
||||||
Query.</listitem>
|
Query.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.getxquery()</term>
|
<term>Query.getxquery()</term>
|
||||||
<listitem>Return the Xapian query description as a Unicode string.
|
<listitem><para>Return the Xapian query description as a
|
||||||
Meaningful only after executexx.</listitem>
|
Unicode string.
|
||||||
|
Meaningful only after executexx.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.highlight(text, ishtml = 0, methods = object)</term>
|
<term>Query.highlight(text, ishtml = 0, methods = object)</term>
|
||||||
<listitem>Will insert <span class="rclmatch">,
|
<listitem><para>Will insert <span class="rclmatch">,
|
||||||
</span> tags around the match areas in the input text
|
</span> tags around the match areas in the input text
|
||||||
and return the modified text. <literal>ishtml</literal>
|
and return the modified text. <literal>ishtml</literal>
|
||||||
can be set to indicate that the input text is HTML and
|
can be set to indicate that the input text is HTML and
|
||||||
|
@ -4720,39 +4759,41 @@ or
|
||||||
<literal>methods</literal> if set should be an object
|
<literal>methods</literal> if set should be an object
|
||||||
with methods startMatch(i) and endMatch() which will be
|
with methods startMatch(i) and endMatch() which will be
|
||||||
called for each match and should return a begin and end
|
called for each match and should return a begin and end
|
||||||
tag</listitem>
|
tag</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.makedocabstract(doc, methods = object)</term>
|
<term>Query.makedocabstract(doc, methods = object)</term>
|
||||||
<listitem>Create a snippets abstract
|
<listitem><para>Create a snippets abstract
|
||||||
for <literal>doc</literal> (a <literal>Doc</literal>
|
for <literal>doc</literal> (a <literal>Doc</literal>
|
||||||
object) by selecting text around the match terms.
|
object) by selecting text around the match terms.
|
||||||
If methods is set, will also perform highlighting. See
|
If methods is set, will also perform highlighting. See
|
||||||
the highlight method.
|
the highlight method.
|
||||||
</listitem>
|
</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Query.__iter__() and Query.next()</term>
|
<term>Query.__iter__() and Query.next()</term>
|
||||||
<listitem>So that things like <literal>for doc in
|
<listitem><para>So that things like <literal>for doc in
|
||||||
query:</literal> will work.</listitem>
|
query:</literal> will work.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
</variablelist>
|
</variablelist>
|
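Review note on `highlight()` and `makedocabstract()`: the `methods` argument is described as an object with `startMatch(i)` and `endMatch()` methods that return the opening and closing tags. A minimal sketch of such an object follows. The `MarkTagger` class, the `<mark>` tag and the `highlight_plain` helper are hypothetical, and the sketch assumes that the argument passed to `startMatch` is an integer index of the match.

```python
class MarkTagger:
    """Supplies custom tags, following the startMatch/endMatch contract."""

    def startMatch(self, idx):
        # idx is assumed to be an integer identifying the match.
        return '<mark class="hit hit-%d">' % idx

    def endMatch(self):
        return "</mark>"


def highlight_plain(query, text):
    """Highlight plain (non-HTML) text with the custom tags."""
    return query.highlight(text, ishtml=0, methods=MarkTagger())
```

The same `MarkTagger` instance could presumably be passed as `methods` to `makedocabstract()`, since the text states that it performs highlighting in the same way.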
||||||
|
|
||||||
<variablelist>
|
<variablelist>
|
||||||
|
|
||||||
<varlistentry><term>Query.arraysize</term> <listitem>Default
|
<varlistentry><term>Query.arraysize</term>
|
||||||
number of records processed by fetchmany (r/w).</listitem>
|
<listitem><para>Default number of records processed by fetchmany
|
||||||
|
(r/w).</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
<varlistentry><term>Query.rowcount</term><listitem>Number of
|
<varlistentry><term>Query.rowcount</term><listitem><para>Number
|
||||||
records returned by the last execute.</listitem></varlistentry>
|
of records returned by the last
|
||||||
<varlistentry><term>Query.rownumber</term><listitem>Next index
|
execute.</para></listitem></varlistentry>
|
||||||
to be fetched from results. Normally increments after
|
<varlistentry><term>Query.rownumber</term><listitem><para>Next index
|
||||||
each fetchone() call, but can be set/reset before the
|
to be fetched from results. Normally increments after
|
||||||
call to effect seeking (equivalent to
|
each fetchone() call, but can be set/reset before the
|
||||||
using <literal>scroll()</literal>). Starts at
|
call to effect seeking (equivalent to
|
||||||
0.</listitem>
|
using <literal>scroll()</literal>). Starts at
|
||||||
|
0.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
</variablelist>
|
</variablelist>
|
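Review note on the `Query` methods and attributes: a sketch that strings together `sortby()`, `execute()`, `rowcount`, `fetchmany()` and `close()` in the order the text requires (sorting before executing). The `top_results` name and its parameters are hypothetical. It caps the fetch size with `rowcount` because the behaviour of `fetchmany()` past the end of the results is not specified here.

```python
def top_results(db, query_string, limit=20, sort_field=None, ascending=True):
    """Return up to `limit` (url, title) pairs for a query string."""
    query = db.query()
    try:
        if sort_field:
            # Must be called before executing the search.
            query.sortby(sort_field, ascending)
        query.execute(query_string)
        wanted = min(limit, query.rowcount)
        docs = query.fetchmany(wanted) if wanted > 0 else []
        # Copy the values out before the query is closed.
        return [(doc.url, doc.title) for doc in docs]
    finally:
        query.close()
```

Returning plain tuples keeps the caller independent of the `Query` object, which is unusable after `close()`.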
||||||
|
@ -4760,7 +4801,7 @@ or
|
||||||
</sect5>
|
</sect5>
|
||||||
|
|
||||||
|
|
||||||
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DOC">
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC">
|
||||||
<title>The Doc class</title>
|
<title>The Doc class</title>
|
||||||
|
|
||||||
<para>A <literal>Doc</literal> object contains index data
|
<para>A <literal>Doc</literal> object contains index data
|
||||||
|
@ -4789,27 +4830,52 @@ or
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>get(key), [] operator</term>
|
<term>get(key), [] operator</term>
|
||||||
<listitem>Retrieve the named doc attribute</listitem>
|
|
||||||
|
<listitem><para>Retrieve the named doc
|
||||||
|
attribute. You can also use
|
||||||
|
<literal>getattr(doc, key)</literal> or
|
||||||
|
<literal>doc.key</literal>.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
<varlistentry><term>getbinurl()</term><listitem>Retrieve
|
|
||||||
the URL in byte array format (no transcoding), for use as
|
<varlistentry>
|
||||||
parameter to a system call.</listitem>
|
<term>doc.key = value</term>
|
||||||
|
|
||||||
|
<listitem><para>Set the named doc
|
||||||
|
attribute. You can also use
|
||||||
|
<literal>setattr(doc, key, value)</literal>.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>getbinurl()</term>
|
||||||
|
|
||||||
|
<listitem><para>Retrieve the URL in byte array format (no
|
||||||
|
transcoding), for use as parameter to a system
|
||||||
|
call.</para></listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>setbinurl(url)</term>
|
||||||
|
|
||||||
|
<listitem><para>Set the URL in byte array format (no
|
||||||
|
transcoding).</para></listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>items()</term>
|
<term>items()</term>
|
||||||
<listitem>Return a dictionary of doc object
|
<listitem><para>Return a dictionary of doc object
|
||||||
keys/values</listitem>
|
keys/values</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>keys()</term>
|
<term>keys()</term>
|
||||||
<listitem>list of doc object keys (attribute
|
<listitem><para>list of doc object keys (attribute
|
||||||
names).</listitem>
|
names).</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
</variablelist>
|
</variablelist>
|
||||||
|
|
||||||
</sect5> <!-- Doc -->
|
</sect5> <!-- Doc -->
|
||||||
|
|
||||||
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.SEARCHDATA">
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA">
|
||||||
<title>The SearchData class</title>
|
<title>The SearchData class</title>
|
||||||
|
|
||||||
<para>A <literal>SearchData</literal> object allows building
|
<para>A <literal>SearchData</literal> object allows building
|
||||||
|
@ -4825,7 +4891,7 @@ or
|
||||||
<term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
<term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
||||||
qstring=string, slack=0, field='', stemming=1,
|
qstring=string, slack=0, field='', stemming=1,
|
||||||
subSearch=SearchData)</term>
|
subSearch=SearchData)</term>
|
||||||
<listitem></listitem>
|
<listitem><para></para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
</variablelist>
|
</variablelist>
|
||||||
|
|
||||||
|
@ -4834,7 +4900,7 @@ or
|
||||||
</sect4> <!-- recoll.classes -->
|
</sect4> <!-- recoll.classes -->
|
||||||
</sect3> <!-- Recoll module -->
|
</sect3> <!-- Recoll module -->
|
||||||
|
|
||||||
<sect3 id="RCL.PROGRAM.PYTHON.RCLEXTRACT">
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
|
||||||
<title>The rclextract module</title>
|
<title>The rclextract module</title>
|
||||||
|
|
||||||
<para>Index queries do not provide document content (only a
|
<para>Index queries do not provide document content (only a
|
||||||
|
@ -4847,23 +4913,23 @@ or
|
||||||
provides a single class which can be used to access the data
|
provides a single class which can be used to access the data
|
||||||
content for result documents.</para>
|
content for result documents.</para>
|
||||||
|
|
||||||
<sect4 id="RCL.PROGRAM.PYTHON.RCLEXTRACT.CLASSES">
|
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
|
||||||
<title>Classes</title>
|
<title>Classes</title>
|
||||||
|
|
||||||
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.EXTRACTOR">
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
|
||||||
<title>The Extractor class</title>
|
<title>The Extractor class</title>
|
||||||
|
|
||||||
<variablelist>
|
<variablelist>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Extractor(doc)</term>
|
<term>Extractor(doc)</term>
|
||||||
<listitem>An <literal>Extractor</literal> object is
|
<listitem><para>An <literal>Extractor</literal> object is
|
||||||
built from a <literal>Doc</literal> object, output
|
built from a <literal>Doc</literal> object, output
|
||||||
from a query.</listitem>
|
from a query.</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Extractor.textextract(ipath)</term>
|
<term>Extractor.textextract(ipath)</term>
|
||||||
<listitem>Extract document defined
|
<listitem><para>Extract document defined
|
||||||
by <replaceable>ipath</replaceable> and return
|
by <replaceable>ipath</replaceable> and return
|
||||||
a <literal>Doc</literal> object. The doc.text field
|
a <literal>Doc</literal> object. The doc.text field
|
||||||
has the document text converted to either text/plain or
|
has the document text converted to either text/plain or
|
||||||
|
@ -4875,11 +4941,11 @@ extractor = recoll.Extractor(qdoc)
|
||||||
doc = extractor.textextract(qdoc.ipath)
|
doc = extractor.textextract(qdoc.ipath)
|
||||||
# use doc.text, e.g. for previewing
|
# use doc.text, e.g. for previewing
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</listitem>
|
</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
|
<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
|
||||||
<listitem>Extracts document into an output file,
|
<listitem><para>Extracts document into an output file,
|
||||||
which can be given explicitly or will be created as a
|
which can be given explicitly or will be created as a
|
||||||
temporary file to be deleted by the caller. Typical use:
|
temporary file to be deleted by the caller. Typical use:
|
||||||
<programlisting>
|
<programlisting>
|
||||||
|
@ -4887,7 +4953,7 @@ qdoc = query.fetchone()
|
||||||
extractor = recoll.Extractor(qdoc)
|
extractor = recoll.Extractor(qdoc)
|
||||||
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
||||||
|
|
||||||
</listitem>
|
</para></listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
</variablelist>
|
</variablelist>
|
||||||
|
@ -4896,10 +4962,8 @@ filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
||||||
</sect4> <!-- rclextract classes -->
|
</sect4> <!-- rclextract classes -->
|
||||||
</sect3> <!-- rclextract module -->
|
</sect3> <!-- rclextract module -->
|
||||||
|
|
||||||
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
|
||||||
|
<title>Search API usage example</title>
|
||||||
<sect3 id="RCL.PROGRAM.PYTHON.EXAMPLES">
|
|
||||||
<title>Example code</title>
|
|
||||||
|
|
||||||
<para>The following sample would query the index with a user
|
<para>The following sample would query the index with a user
|
||||||
language string. See the <filename>python/samples</filename>
|
language string. See the <filename>python/samples</filename>
|
||||||
|
@ -4934,17 +4998,189 @@ for i in range(nres):
|
||||||
</programlisting>
|
</programlisting>
|
||||||
|
|
||||||
</sect3>
|
</sect3>
|
||||||
|
</sect2>
|
||||||
|
|
||||||
<sect3 id="RCL.PROGRAM.PYTHON.COMPAT">
|
|
||||||
<title>Compatibility with the previous version</title>
|
|
||||||
|
|
||||||
<para>The following code fragments can be used to ensure that
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.UPDATE">
|
||||||
code can run with both the old and the new API (as long as it
|
<title>Creating Python external indexers</title>
|
||||||
does not use the new abilities of the new API of
|
|
||||||
course).</para>
|
|
||||||
|
|
||||||
<para>Adapting to the new package structure:</para>
|
<para>The update API can be used to create an index from data which
|
||||||
<programlisting>
|
is not accessible to the regular &RCL; indexer, or is structured in a
|
||||||
|
way that presents difficulties to the &RCL; input handlers.</para>
|
||||||
|
|
||||||
|
<para>An indexer created using this API has the same work to do as
|
||||||
|
the Recoll file system indexer: look for modified documents, extract
|
||||||
|
their text, call the API to index it, and purge from the index the
|
||||||
|
data for documents which no longer exist in the document
|
||||||
|
store.</para>
|
||||||
|
|
||||||
|
<para>The data for such an external indexer should be stored in an
|
||||||
|
index separate from any used by the &RCL; internal file system
|
||||||
|
indexer. The reason is that the main document indexer purge pass
|
||||||
|
(removal of deleted documents) would also remove all the documents
|
||||||
|
belonging to the external indexer, as they were not seen during the
|
||||||
|
filesystem walk. The main indexer documents would also probably be a
|
||||||
|
problem for the external indexer's own purge operation.</para>
|
||||||
|
|
||||||
|
<para>While there would be ways to enable multiple foreign indexers
|
||||||
|
to cooperate on a single index, it is just simpler to use separate
|
||||||
|
ones, and use the multiple index access capabilities of the query
|
||||||
|
interface, if needed.</para>
|
||||||
|
|
||||||
|
<para>There are two parts in the update interface:</para>
|
||||||
|
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para>Methods inside the <filename>recoll</filename>
|
||||||
|
module allow inserting data into the index, to make it accessible by
|
||||||
|
the normal query interface.</para></listitem>
|
||||||
|
<listitem><para>An interface based on script execution is defined
|
||||||
|
to allow either the GUI or the <filename>rclextract</filename>
|
||||||
|
module to access original document data for previewing or
|
||||||
|
editing.</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
|
||||||
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE">
|
||||||
|
<title>Python update interface</title>
|
||||||
|
|
||||||
|
<para>The update methods are part of the
|
||||||
|
<filename>recoll</filename> module described above. The connect()
|
||||||
|
method is used with a <literal>writable=True</literal> parameter to
|
||||||
|
obtain a writable <literal>Db</literal> object. The following
|
||||||
|
<literal>Db</literal> object methods are then available.</para>
|
||||||
|
|
||||||
|
<variablelist>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>addOrUpdate(udi, doc, parent_udi=None)</term>
|
||||||
|
<listitem><para>Add or update index data for a given document.
|
||||||
|
The <literal>
|
||||||
|
<link linkend="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">
|
||||||
|
udi</link></literal> string must define a unique id for
|
||||||
|
the document. It is an opaque interface element and not
|
||||||
|
interpreted inside Recoll. <literal>doc</literal> is a
|
||||||
|
<literal>
|
||||||
|
<link linkend="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC">
|
||||||
|
Doc</link></literal> object, created from the data to be
|
||||||
|
indexed (the main text should be in
|
||||||
|
<literal>doc.text</literal>). If <literal>
|
||||||
|
<link linkend="RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">
|
||||||
|
parent_udi</link></literal> is set, this is a unique
|
||||||
|
identifier for the top-level container (e.g. for the
|
||||||
|
filesystem indexer, this would be the one which is an actual
|
||||||
|
file).</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>delete(udi)</term>
|
||||||
|
<listitem><para>Purge index from all data for
|
||||||
|
<literal>udi</literal>, and all documents (if any) which have a
|
||||||
|
matching <literal>parent_udi</literal>. </para> </listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>needUpdate(udi, sig)</term>
|
||||||
|
<listitem><para>Test if the index needs to be updated for the
|
||||||
|
document identified by <literal>udi</literal>. If this call is
|
||||||
|
to be used, the <literal>doc.sig</literal> field should contain
|
||||||
|
a signature value when calling
|
||||||
|
<literal>addOrUpdate()</literal>. The
|
||||||
|
<literal>needUpdate()</literal> call then compares its
|
||||||
|
parameter value with the stored <literal>sig</literal> for
|
||||||
|
<literal>udi</literal>. <literal>sig</literal> is an opaque
|
||||||
|
value, compared as a string.</para>
|
||||||
|
<para>The filesystem indexer uses a
|
||||||
|
concatenation of the decimal string values for file size and
|
||||||
|
update time, but a hash of the contents could also be
|
||||||
|
used.</para>
|
||||||
|
<para>As a side effect, if the return value is false (the index
|
||||||
|
is up to date), the call will set the existence flag for the
|
||||||
|
document (and any subdocument defined by its
|
||||||
|
<literal>parent_udi</literal>), so that a later
|
||||||
|
<literal>purge()</literal> call will preserve them.</para>
|
||||||
|
<para>The use of <literal>needUpdate()</literal> and
|
||||||
|
<literal>purge()</literal> is optional, and the indexer may use
|
||||||
|
another method for checking the need to reindex or to delete
|
||||||
|
stale entries.</para></listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>purge()</term>
|
||||||
|
<listitem><para>Delete all documents that were not touched
|
||||||
|
during the just finished indexing pass (since
|
||||||
|
open-for-write). These are the documents for which the needUpdate()
|
||||||
|
call was not performed, indicating that they no longer exist in
|
||||||
|
the primary storage system.</para></listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
</variablelist>
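The two signature strategies mentioned for needUpdate() can be sketched as follows. This is a minimal illustration, assuming the documents are plain files; the function names stat_signature and content_signature are hypothetical and are not part of the Recoll API. Since the signature is an opaque value compared as a string, either form could be stored in doc.sig.

```python
import hashlib
import os


def stat_signature(path):
    """Concatenate the decimal file size and modification time (hypothetical helper)."""
    st = os.stat(path)
    return str(st.st_size) + str(int(st.st_mtime))


def content_signature(path, chunk_size=65536):
    """Return a hex SHA-256 digest of the file contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The stat-based form is cheap but can miss changes that preserve both size and timestamp; the hash-based form needs a full read of every document on every pass.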
|
||||||
|
|
||||||
|
</sect3>
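The way needUpdate(), addOrUpdate() and purge() combine during an indexing pass can be sketched with a small in-memory stand-in. The InMemoryDb class below is hypothetical: it only mimics the behaviour described above so that the control flow can be run without an index, and it is not the Recoll implementation. The index_pass function, the store dictionary, and the assumption that addOrUpdate() also marks the document as existing are likewise illustrative.

```python
from types import SimpleNamespace


class InMemoryDb:
    """Hypothetical stand-in mimicking the update calls; NOT the Recoll Db."""

    def __init__(self):
        self.docs = {}     # udi -> (sig, text, parent_udi)
        self.seen = set()  # udis flagged as existing during the current pass

    def needUpdate(self, udi, sig):
        entry = self.docs.get(udi)
        if entry is not None and entry[0] == sig:
            # Up to date: flag the document and its subdocuments as existing.
            self.seen.add(udi)
            self.seen.update(u for u, e in self.docs.items() if e[2] == udi)
            return False
        return True

    def addOrUpdate(self, udi, doc, parent_udi=None):
        self.docs[udi] = (doc.sig, doc.text, parent_udi)
        self.seen.add(udi)  # assumption: a fresh update counts as "existing"

    def purge(self):
        # Drop every document that was not flagged during this pass.
        for udi in [u for u in self.docs if u not in self.seen]:
            del self.docs[udi]
        self.seen.clear()


def index_pass(db, store):
    """Run one pass over store (udi -> (sig, text)); return the udis reindexed."""
    updated = []
    for udi, (sig, text) in store.items():
        if db.needUpdate(udi, sig):
            db.addOrUpdate(udi, SimpleNamespace(sig=sig, text=text))
            updated.append(udi)
    db.purge()
    return updated
```

With the real module, db would be the writable Db object returned by connect(), and the SimpleNamespace would be replaced by a Doc object with its text and sig fields set.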
|
||||||
|
|
||||||
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.ACCESS">
|
||||||
|
<title>Query data access for external indexers</title>
|
||||||
|
|
||||||
|
<para>&RCL; has internal methods to access document data for its
|
||||||
|
internal (filesystem) indexer. An external indexer needs to provide
|
||||||
|
data access methods if it needs integration with the GUI
|
||||||
|
(e.g. preview function), or support for the
|
||||||
|
<filename>rclextract</filename> module.</para>
|
||||||
|
|
||||||
|
<para>The index data and the access method are linked by the
|
||||||
|
<literal>rclbes</literal> (recoll backend storage)
|
||||||
|
<literal>Doc</literal> field. You should set this to a short string
|
||||||
|
value identifying your indexer (e.g. the filesystem indexer uses either
|
||||||
|
"FS" or an empty value, the Web history indexer uses "BGL").</para>
|
||||||
|
|
||||||
|
<para>The link is actually performed inside a
|
||||||
|
<filename>backends</filename> configuration file (stored in the
|
||||||
|
configuration directory). This defines commands to execute to
|
||||||
|
access data from the specified indexer. For example, for the mbox
|
||||||
|
indexing sample found in the Recoll source (which sets
|
||||||
|
<literal>rclbes="MBOX"</literal>):</para>
|
||||||
|
<programlisting>[MBOX]
|
||||||
|
fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
|
||||||
|
makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
|
||||||
|
</programlisting>
|
||||||
|
<para><literal>fetch</literal> and <literal>makesig</literal>
|
||||||
|
define two commands to execute to respectively retrieve the
|
||||||
|
document text and compute the document signature (the example
|
||||||
|
implementation uses the same script with different first parameters
|
||||||
|
to perform both operations).</para>
|
||||||
|
|
||||||
|
<para>The scripts are called with three additional arguments:
|
||||||
|
<literal>udi</literal>, <literal>url</literal> and
|
||||||
|
<literal>ipath</literal>. These values were stored with the document
|
||||||
|
when it was indexed, and a script may use any or all of them to perform
|
||||||
|
the requested operation. The caller expects the result data on
|
||||||
|
<literal>stdout</literal>.</para>
|
||||||
|
|
||||||
|
</sect3>
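The calling convention for the fetch and makesig commands can be sketched as a single script body that dispatches on its first argument. This is an illustrative sketch only: the STORE dictionary, the run function and the exit status values are assumptions made for the example, and the rclmbox.py sample in the source tree remains the reference.

```python
# Hypothetical document store: udi -> (signature, text).
STORE = {"msg-1": ("120:1700000000", "Hello from the sample store\n")}


def run(argv, out):
    """Handle '<script> fetch|makesig udi url ipath'; return an exit status."""
    if len(argv) != 5 or argv[1] not in ("fetch", "makesig"):
        return 1
    op, udi, url, ipath = argv[1:5]
    # This sketch only needs the udi; url and ipath are accepted but unused.
    entry = STORE.get(udi)
    if entry is None:
        return 1
    sig, text = entry
    # The caller reads the result from standard output.
    out.write(text if op == "fetch" else sig)
    return 0

# A real script would finish with something like:
#   import sys
#   sys.exit(run(sys.argv, sys.stdout))
```

Keeping both operations in one script, selected by the first parameter, mirrors the layout of the backends example above.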
|
||||||
|
|
||||||
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.SAMPLES">
|
||||||
|
<title>External indexer samples</title>
|
||||||
|
|
||||||
|
<para>The Recoll source tree has two samples of external indexers
|
||||||
|
in the <filename>src/python/samples</filename> directory. The more
|
||||||
|
interesting one is <filename>rclmbox.py</filename> which indexes a
|
||||||
|
directory containing <literal>mbox</literal> folder files. It
|
||||||
|
exercises most features in the update interface, and has a data
|
||||||
|
access interface.</para>
|
||||||
|
|
||||||
|
<para>See the comments inside the file for more information.</para>
|
||||||
|
</sect3>
|
||||||
|
</sect2>
|
||||||
|
|
||||||
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.COMPAT">
|
||||||
|
<title>Package compatibility with the previous version</title>
|
||||||
|
|
||||||
|
<para>The following code fragments can be used to ensure that
|
||||||
|
code can run with both the old and the new API (as long as it
|
||||||
|
does not use the new abilities of the new API of
|
||||||
|
course).</para>
|
||||||
|
|
||||||
|
<para>Adapting to the new package structure:</para>
|
||||||
|
<programlisting>
|
||||||
<![CDATA[
|
<![CDATA[
|
||||||
try:
|
try:
|
||||||
from recoll import recoll
|
from recoll import recoll
|
||||||
|
@ -4954,23 +5190,24 @@ except:
|
||||||
import recoll
|
import recoll
|
||||||
hasextract = False
|
hasextract = False
|
||||||
]]>
|
]]>
|
||||||
</programlisting>
|
</programlisting>
|
||||||
|
|
||||||
<para>Adapting to the change of nature of
|
<para>Adapting to the change of nature of
|
||||||
the <literal>next</literal> <literal>Query</literal>
|
the <literal>next</literal> <literal>Query</literal>
|
||||||
member. The same test can be used to choose to use
|
member. The same test can be used to choose to use
|
||||||
the <literal>scroll()</literal> method (new) or set
|
the <literal>scroll()</literal> method (new) or set
|
||||||
the <literal>next</literal> value (old).</para>
|
the <literal>next</literal> value (old).</para>
|
||||||
|
|
||||||
<programlisting>
|
<programlisting>
|
||||||
<![CDATA[
|
<![CDATA[
|
||||||
rownum = query.next if type(query.next) == int else \
|
rownum = query.next if type(query.next) == int else \
|
||||||
query.rownumber
|
query.rownumber
|
||||||
]]>
|
]]>
|
||||||
</programlisting>
|
</programlisting>
|
||||||
|
|
||||||
|
</sect2> <!-- compat with previous version -->
|
||||||
|
|
||||||
|
|
||||||
</sect3> <!-- compat with previous version -->
|
|
||||||
</sect2>
|
|
||||||
</sect1>
|
</sect1>
|
||||||
</chapter>
|
</chapter>
|
||||||
|
|
||||||
|
|