doc touchups

Jean-Francois Dockes 2016-05-03 11:30:13 +02:00
parent 72ab23080d
commit ed0fca3dbf
7 changed files with 330 additions and 78 deletions


@ -92,5 +92,6 @@ website/usermanual/*
website/idxthreads/forkingRecoll.html website/idxthreads/forkingRecoll.html
website/idxthreads/xapDocCopyCrash.html website/idxthreads/xapDocCopyCrash.html
website/pages/recoll-mingw.html website/pages/recoll-mingw.html
website/pages/recoll-webui-install-wsgi.html
website/pages/recoll-windows.html website/pages/recoll-windows.html
src/doc/user/RCL.SEARCH.SYNONYMS.html src/doc/user/RCL.SEARCH.SYNONYMS.html


@ -20,8 +20,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h1 class="title"><a name="idp59627200" id= <h1 class="title"><a name="idp57237872" id=
"idp59627200"></a>Recoll user manual</h1> "idp57237872"></a>Recoll user manual</h1>
</div> </div>
<div> <div>
@ -109,13 +109,13 @@ alink="#0000FF">
multiple indexes</a></span></dt> multiple indexes</a></span></dt>
<dt><span class="sect2">2.1.3. <a href= <dt><span class="sect2">2.1.3. <a href=
"#idp65068656">Document types</a></span></dt> "#idp63233312">Document types</a></span></dt>
<dt><span class="sect2">2.1.4. <a href= <dt><span class="sect2">2.1.4. <a href=
"#idp65088336">Indexing failures</a></span></dt> "#idp63252992">Indexing failures</a></span></dt>
<dt><span class="sect2">2.1.5. <a href= <dt><span class="sect2">2.1.5. <a href=
"#idp65095792">Recovery</a></span></dt> "#idp63260448">Recovery</a></span></dt>
</dl> </dl>
</dd> </dd>
@ -981,8 +981,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idp65068656" id= <h3 class="title"><a name="idp63233312" id=
"idp65068656"></a>2.1.3.&nbsp;Document types</h3> "idp63233312"></a>2.1.3.&nbsp;Document types</h3>
</div> </div>
</div> </div>
</div> </div>
@ -1075,8 +1075,8 @@ indexedmimetypes = application/pdf
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idp65088336" id= <h3 class="title"><a name="idp63252992" id=
"idp65088336"></a>2.1.4.&nbsp;Indexing "idp63252992"></a>2.1.4.&nbsp;Indexing
failures</h3> failures</h3>
</div> </div>
</div> </div>
@ -1116,8 +1116,8 @@ indexedmimetypes = application/pdf
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idp65095792" id= <h3 class="title"><a name="idp63260448" id=
"idp65095792"></a>2.1.5.&nbsp;Recovery</h3> "idp63260448"></a>2.1.5.&nbsp;Recovery</h3>
</div> </div>
</div> </div>
</div> </div>
@ -6379,32 +6379,41 @@ or
<p><span class="application">Recoll</span> versions <p><span class="application">Recoll</span> versions
after 1.11 define a Python programming interface, both after 1.11 define a Python programming interface, both
for searching and indexing. The indexing portion has for searching and indexing.</p>
seen little use, but the searching one is used in the
Recoll Ubuntu Unity Lens and Recoll Web UI.</p>
<p>The API is inspired by the Python database API <p>The search interface is used in the Recoll Ubuntu
specification. There were two major changes in recent Unity Lens and Recoll WebUI.</p>
<p>The indexing section of the API has seen little use,
and is more a proof of concept. In truth it is waiting
for its killer app...</p>
<p>The search API is modeled along the Python database
API specification. There were two major changes along
<span class="application">Recoll</span> versions:</p> <span class="application">Recoll</span> versions:</p>
<div class="itemizedlist"> <div class="itemizedlist">
<ul class="itemizedlist" style= <ul class="itemizedlist" style=
"list-style-type: disc;"> "list-style-type: disc;">
<li class="listitem">The basis for the <span class= <li class="listitem">
"application">Recoll</span> API changed from Python <p>The basis for the <span class=
database API version 1.0 (<span class= "application">Recoll</span> API changed from
"application">Recoll</span> versions up to 1.18.1), Python database API version 1.0 (<span class=
to version 2.0 (<span class= "application">Recoll</span> versions up to
"application">Recoll</span> 1.18.2 and later).</li> 1.18.1), to version 2.0 (<span class=
"application">Recoll</span> 1.18.2 and
later).</p>
</li>
<li class="listitem">The <code class= <li class="listitem">
"literal">recoll</code> module became a package <p>The <code class="literal">recoll</code> module
(with an internal <code class= became a package (with an internal <code class=
"literal">recoll</code> module) as of <span class= "literal">recoll</code> module) as of
"application">Recoll</span> version 1.19, in order <span class="application">Recoll</span> version
to add more functions. For existing code, this only 1.19, in order to add more functions. For
changes the way the interface must be existing code, this only changes the way the
imported.</li> interface must be imported.</p>
</li>
</ul> </ul>
</div> </div>
@ -6433,13 +6442,38 @@ or
</pre> </pre>
<p>As of <span class="application">Recoll</span> 1.19,
the module can be compiled for Python3.</p>
<p>The normal <span class="application">Recoll</span> <p>The normal <span class="application">Recoll</span>
installer installs the Python API along with the main installer installs the Python2 API along with the main
code.</p> code. The Python3 version must be explicitly built and
installed.</p>
<p>When installing from a repository, and depending on <p>When installing from a repository, and depending on
the distribution, the Python API can sometimes be found the distribution, the Python API can sometimes be found
in a separate package.</p> in a separate package.</p>
<p>The following small sample will run a query and list
the title and url for each of the results. It would
work with <span class="application">Recoll</span> 1.19
and later. The <code class=
"filename">python/samples</code> source directory
contains several examples of Python programming with
<span class="application">Recoll</span>, exercising the
extension more completely, and especially its data
extraction features.</p>
<pre class="programlisting">
from recoll import recoll
db = recoll.connect()
query = db.query()
nres = query.execute("some query")
results = query.fetchmany(20)
for doc in results:
print(doc.url, doc.title)
</pre>
</div> </div>
<div class="sect3"> <div class="sect3">
@ -6564,8 +6598,6 @@ or
connection to a Recoll index.</p> connection to a Recoll index.</p>
<div class="variablelist"> <div class="variablelist">
<p class="title"><b>Methods</b></p>
<dl class="variablelist"> <dl class="variablelist">
<dt><span class="term">Db.close()</span></dt> <dt><span class="term">Db.close()</span></dt>
@ -6628,8 +6660,6 @@ or
execute index searches.</p> execute index searches.</p>
<div class="variablelist"> <div class="variablelist">
<p class="title"><b>Methods</b></p>
<dl class="variablelist"> <dl class="variablelist">
<dt><span class="term">Query.sortby(fieldname, <dt><span class="term">Query.sortby(fieldname,
ascending=True)</span></dt> ascending=True)</span></dt>
@ -6805,8 +6835,6 @@ or
document contents.</p> document contents.</p>
<div class="variablelist"> <div class="variablelist">
<p class="title"><b>Methods</b></p>
<dl class="variablelist"> <dl class="variablelist">
<dt><span class="term">get(key), [] <dt><span class="term">get(key), []
operator</span></dt> operator</span></dt>
@ -6854,8 +6882,6 @@ or
detailed doc for now...</p> detailed doc for now...</p>
<div class="variablelist"> <div class="variablelist">
<p class="title"><b>Methods</b></p>
<dl class="variablelist"> <dl class="variablelist">
<dt><span class= <dt><span class=
"term">addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', "term">addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
@ -6914,8 +6940,6 @@ or
</div> </div>
<div class="variablelist"> <div class="variablelist">
<p class="title"><b>Methods</b></p>
<dl class="variablelist"> <dl class="variablelist">
<dt><span class= <dt><span class=
"term">Extractor(doc)</span></dt> "term">Extractor(doc)</span></dt>


@ -4354,22 +4354,26 @@ or
<title>Introduction</title> <title>Introduction</title>
<para>&RCL; versions after 1.11 define a Python programming <para>&RCL; versions after 1.11 define a Python programming
interface, both for searching and indexing. The indexing interface, both for searching and indexing.</para>
portion has seen little use, but the searching one is used
in the Recoll Ubuntu Unity Lens and Recoll Web UI.</para>
<para>The API is inspired by the Python database API <para>The search interface is used in the Recoll Ubuntu Unity Lens
specification. There were two major changes in recent &RCL; and Recoll WebUI.</para>
versions:
<para>The indexing section of the API has seen little use, and is
more a proof of concept. In truth it is waiting for its killer
app...</para>
<para>The search API is modeled along the Python database API
specification. There were two major changes along &RCL; versions:
<itemizedlist> <itemizedlist>
<listitem>The basis for the &RCL; API changed from Python <listitem><para>The basis for the &RCL; API changed from Python
database API version 1.0 (&RCL; versions up to 1.18.1), database API version 1.0 (&RCL; versions up to 1.18.1),
to version 2.0 (&RCL; 1.18.2 and later).</listitem> to version 2.0 (&RCL; 1.18.2 and later).</para></listitem>
<listitem>The <literal>recoll</literal> module became a <listitem><para>The <literal>recoll</literal> module became a
package (with an internal <literal>recoll</literal> package (with an internal <literal>recoll</literal>
module) as of &RCL; version 1.19, in order to add more module) as of &RCL; version 1.19, in order to add more
functions. For existing code, this only changes the way functions. For existing code, this only changes the way
the interface must be imported.</listitem> the interface must be imported.</para></listitem>
</itemizedlist> </itemizedlist>
</para> </para>
@ -4392,13 +4396,33 @@ or
</screen> </screen>
</para> </para>
<para>The normal &RCL; installer installs the Python <para>As of &RCL; 1.19, the module can be compiled for
API along with the main code.</para> Python3.</para>
<para>The normal &RCL; installer installs the Python2
API along with the main code. The Python3 version must be
explicitly built and installed.</para>
<para>When installing from a repository, and depending on the <para>When installing from a repository, and depending on the
distribution, the Python API can sometimes be found in a distribution, the Python API can sometimes be found in a
separate package.</para> separate package.</para>
<para>The following small sample will run a query and list
the title and url for each of the results. It would work with &RCL;
1.19 and later. The <filename>python/samples</filename> source directory
contains several examples of Python programming with &RCL;,
exercising the extension more completely, and especially its data
extraction features.</para>
<programlisting>
from recoll import recoll
db = recoll.connect()
query = db.query()
nres = query.execute("some query")
results = query.fetchmany(20)
for doc in results:
print(doc.url, doc.title)
</programlisting>
</sect3> </sect3>
<sect3 id="RCL.PROGRAM.PYTHON.PACKAGE"> <sect3 id="RCL.PROGRAM.PYTHON.PACKAGE">
@ -4460,7 +4484,6 @@ or
a <literal>connect()</literal> call and holds a a <literal>connect()</literal> call and holds a
connection to a Recoll index.</para> connection to a Recoll index.</para>
<variablelist> <variablelist>
<title>Methods</title>
<varlistentry> <varlistentry>
<term>Db.close()</term> <term>Db.close()</term>
<listitem>Closes the connection. You can't do anything <listitem>Closes the connection. You can't do anything
@ -4511,7 +4534,6 @@ or
execute index searches.</para> execute index searches.</para>
<variablelist> <variablelist>
<title>Methods</title>
<varlistentry> <varlistentry>
<term>Query.sortby(fieldname, ascending=True)</term> <term>Query.sortby(fieldname, ascending=True)</term>
@ -4659,7 +4681,6 @@ or
module for accessing document contents.</para> module for accessing document contents.</para>
<variablelist> <variablelist>
<title>Methods</title>
<varlistentry> <varlistentry>
<term>get(key), [] operator</term> <term>get(key), [] operator</term>
@ -4694,7 +4715,6 @@ or
for now...</para> for now...</para>
<variablelist> <variablelist>
<title>Methods</title>
<varlistentry> <varlistentry>
<term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', <term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
@ -4729,7 +4749,6 @@ or
<title>The Extractor class</title> <title>The Extractor class</title>
<variablelist> <variablelist>
<title>Methods</title>
<varlistentry> <varlistentry>
<term>Extractor(doc)</term> <term>Extractor(doc)</term>


@ -97,6 +97,14 @@
<div class="news"> <div class="news">
<dl> <dl>
<dt>2016-04-21</dt><dd>I experimented with installing
the <a href="https://github.com/koniu/recoll-webui">Recoll
Web UI</a> with Apache, and found out
that <a href="pages/recoll-webui-install-wsgi.html">this
is really easy</a>, actually both easier to set up and
more useful than running it standalone.</dd>
<dt>2016-04-18</dt><dd>Found a <a href="BUGS.html#GUIADV">GUI <dt>2016-04-18</dt><dd>Found a <a href="BUGS.html#GUIADV">GUI
crash bug</a> with a reasonably easy workaround.</dd> crash bug</a> with a reasonably easy workaround.</dd>
@ -122,7 +130,6 @@
the <a href="BUGS.html">known bugs page</a></dd> the <a href="BUGS.html">known bugs page</a></dd>
<dt>2015-11-09</dt> <dt>2015-11-09</dt>
<dd>
<dd><a href="pics/windows-recoll.html"> <dd><a href="pics/windows-recoll.html">
<img align="left" width="100" alt="Recoll on MS-Windows" <img align="left" width="100" alt="Recoll on MS-Windows"
src="pics/windows-recoll-thumb.png"></a> src="pics/windows-recoll-thumb.png"></a>


@ -3,7 +3,8 @@
.txt.html: .txt.html:
asciidoc $< asciidoc $<
all: recoll-mingw.html recoll-windows.html all: recoll-mingw.html recoll-windows.html \
recoll-webui-install-wsgi.html
clean: clean:
rm -f *.html rm -f *.html


@ -0,0 +1,200 @@
= Recoll WebUI Apache installation from scratch
The https://github.com/koniu/recoll-webui[Recoll WebUI] offers an
alternative, web-based interface for querying a Recoll index.
It can be quite useful to extend the use of a shared index to multiple
workstations, without the need for a local Recoll installation and shared
data storage.
The Recoll WebUI is based on the
http://bottlepy.org/docs/dev/index.html[Bottle Python framework], which has
a built-in web server, and the simplest deployment approach is to run it
standalone. However, the built-in server is restricted to handling one
request at a time, which is problematic in multi-user situations,
especially because some requests, like extracting a result list into a CSV
file, can take a significant amount of time.
The Bottle framework can work with several multi-threading Python HTTP
server libraries but, given the limitations of the Recoll Python module
and of the Python interpreter itself, this will not yield optimal
performance and, in particular, cannot make efficient use of the now
ubiquitous multiprocessor machines.
In multi-user situations, you can get better performance and ease of use
from the Recoll WebUI by running it under Apache rather than as a
standalone process. With this approach, a few requests per second can
easily be handled even in the presence of long-running ones.
Neither Recoll nor the WebUI is optimized for high multi-user load, and it
would be very unwise to use them as the search interface to a busy web
site.
The instructions about using the WebUI under Apache as given in the
repository README are a bit terse, and are missing a few details,
especially ones which impact performance.
Here follows the synopsis of two WebUI installations on initially
Apache-less Ubuntu (14.04) and DragonFly BSD systems. The first should
extend easily to other Debian-based systems, the second at least to
FreeBSD. rpm-based systems are left as an exercise to the reader, at least
for now...
CAUTION: THE CONFIGURATIONS DESCRIBED HAVE NO ACCESS CONTROL. ANYONE WITH
ACCESS TO THE NETWORK WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY
DOCUMENT.
== On a Debian/Ubuntu system
=== Install recoll
sudo apt-get install recoll python-recoll
Configure the indexing and check that the normal search works (I spent
quite a lot of time trying to understand why the WebUI did not work, when
in fact it was the normal recoll configuration which was broken and the
regular search did not work either).
Take care to be logged in as the user you want to run the web search as
while you do this.
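Since the WebUI reaches the index through the Python module, it can also be
worth checking the search from Python, as the same user. The sketch below
reuses the calls from the sample in the Recoll manual ('recoll.connect()',
'db.query()', 'query.execute()'). The 'count_hits' helper name is made up for
illustration, and I am assuming, as that sample suggests, that 'execute()'
returns the number of results.

```python
def count_hits(db, query_string):
    """Run query_string against an open Recoll Db object.

    Returns whatever query.execute() returns; the manual sample stores it
    in a variable named nres, so it is assumed to be the result count.
    """
    query = db.query()
    return query.execute(query_string)


# Typical use, as the user who owns the index (needs python-recoll):
#   from recoll import recoll
#   print(count_hits(recoll.connect(), "some query"))
```

If this fails or reports nothing for a query which should match, the problem
is probably in the Recoll configuration rather than in the WebUI.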
=== Install the WebUI
Clone the GitHub repository, or extract the master tar archive, and
move it to '/var/www/recoll-webui-master/'. Make sure that it is readable
and executable by your user.
=== Install Apache and mod-wsgi
sudo apt-get install apache2 libapache2-mod-wsgi
I then got the following message:
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
To clear it, I added a ServerName directive to the Apache configuration. You
may not need it: things work without this fix anyway, it only suppresses the
message. Edit '/etc/apache2/sites-available/000-default.conf' and add
the following at the top (globally). You will probably need to adjust the
address or use a real host name:
ServerName 192.168.4.6
Edit '/etc/apache2/mods-enabled/wsgi.conf' and add the following at the end
of the "IfModule" section.
Change the user ('dockes' in the example), making sure that it is the one who
owns the index ('.recoll' is in that user's home directory).
WSGIDaemonProcess recoll user=dockes group=dockes \
threads=1 processes=5 display-name=%{GROUP} \
python-path=/var/www/recoll-webui-master
WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
<Directory /var/www/recoll-webui-master>
WSGIProcessGroup recoll
Order allow,deny
allow from all
</Directory>
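A wrong 'user=' value is an easy mistake to make here, so a quick ownership
check may save some head-scratching. The following is only a sketch: the
function names are made up, and it assumes that the index configuration
directory is '.recoll' in the user's home directory, as in the example above.

```python
import os
import pwd


def owned_by(path, uid):
    """Return True if path exists and belongs to the numeric user id uid."""
    try:
        return os.stat(path).st_uid == uid
    except OSError:
        return False


def index_owned_by(username, confdir=None):
    """Check that confdir (default: ~username/.recoll) belongs to username."""
    account = pwd.getpwnam(username)  # raises KeyError for an unknown user
    if confdir is None:
        confdir = os.path.join(account.pw_dir, ".recoll")
    return owned_by(confdir, account.pw_uid)
```

For the configuration above, 'index_owned_by("dockes")' should return True
before Apache is restarted.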
NOTE: the Recoll WebUI application is mostly single-threaded, so it is of
little use (and may actually be counter-productive in some cases) to
specify multiple threads on the WSGIDaemonProcess line. Specify multiple
processes instead to put multiple CPUs to work on simultaneous requests.
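To see the effect of the 'processes=' setting, a throwaway WSGI script like
the one below could be aliased temporarily in place of the WebUI one. This is
a hypothetical test script, not part of the WebUI. It relies only on the fact
that mod_wsgi looks for a callable named 'application' in the script file.

```python
import os


def application(environ, start_response):
    """Tiny WSGI app reporting which process served the request."""
    body = ("served by pid %d\n" % os.getpid()).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

With several daemon processes configured, repeated requests should show
several different pids over time.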
Then run the following to restart apache:
sudo apachectl restart
The Recoll WebUI should now be accessible at 'http://my.server.com/recoll/'.
NOTE: You need a '/' at the end of the URL used to access
the search (use: 'http://my.server.com/recoll/', not
'http://my.server.com/recoll'), else files other than the script itself are
not found (the page looks weird and the search does not work).
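If the WebUI address is built by a script (for a bookmark page or a
monitoring check, for example), the trailing slash is easy to enforce. A
minimal sketch, with a made-up helper name, assuming the URL carries no query
string or fragment:

```python
def with_trailing_slash(url):
    """Return url unchanged if it already ends with '/', else append one."""
    return url if url.endswith("/") else url + "/"
```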
CAUTION: THERE IS NO ACCESS CONTROL. ANYONE WITH ACCESS TO THE NETWORK
WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY DOCUMENT.
== Variant for BSD/ports
=== Packages
As root:
pkg install recoll
Do what you need to do to configure the indexing and check that the normal
search works.
Take care to be logged in as the user you want to run the web search as
while you do this.
pkg install apache24
Add apache24_enable="YES" in /etc/rc.conf
pkg install ap24-mod_wsgi4
pkg install git
=== Clone the webui repository
cd /usr/local/www/apache24/
git clone https://github.com/koniu/recoll-webui.git recoll-webui-master
Important: most input handler helper applications (e.g. 'pdftotext') are
installed in '/usr/local/bin' which is not in the PATH as seen by Apache
(at least on DragonFly). The simplest way to fix this is to modify the
launcher module for the webui app so that it fixes the PATH.
Edit 'recoll-webui-master/webui-wsgi.py' and add the following line after
the 'import os' line:
os.environ['PATH'] = os.environ['PATH'] + ':' + '/usr/local/bin'
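The one-liner above is what I used. A slightly more defensive variant is
sketched below: it does not fail if PATH happens to be unset, and it does not
add the directory twice. The 'add_to_path' name is made up for this example.

```python
import os


def add_to_path(directory, environ=os.environ):
    """Append directory to PATH unless it is already listed.

    Works on any mapping shaped like os.environ and returns the new PATH.
    """
    current = environ.get("PATH", "")
    parts = [p for p in current.split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
    environ["PATH"] = os.pathsep.join(parts)
    return environ["PATH"]
```

In 'webui-wsgi.py', this would be called as 'add_to_path("/usr/local/bin")'
after the 'import os' line.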
=== Configure apache
Edit /usr/local/etc/apache24/modules.d/270_mod_wsgi.conf
Uncomment the LoadModule line, and add the directives to alias /recoll/ to
the webui script.
Change the user (dockes in the example), making sure that it is the one who
owns the index (.recoll is in that user's home directory).
Contents of the file:
## $FreeBSD$
## vim: set filetype=apache:
##
## module file for mod_wsgi
##
## PROVIDE: mod_wsgi
## REQUIRE:
LoadModule wsgi_module libexec/apache24/mod_wsgi.so
WSGIDaemonProcess recoll user=dockes group=dockes \
threads=1 processes=5 display-name=%{GROUP} \
python-path=/usr/local/www/apache24/recoll-webui-master/
WSGIScriptAlias /recoll /usr/local/www/apache24/recoll-webui-master/webui-wsgi.py
<Directory /usr/local/www/apache24/recoll-webui-master>
WSGIProcessGroup recoll
Require all granted
</Directory>
=== Restart apache
As root:
apachectl restart


@ -91,7 +91,7 @@ Changes in 20160414
- Fixed a bug which had the whole indexing stop if a script would time out - Fixed a bug which had the whole indexing stop if a script would time out
on a specific file (it will very rarely happen that a pathologically bad on a specific file (it will very rarely happen that a pathologically bad
file can throw an input handler in a loop). file can throw an input handler in a loop).
-
Changes in 20160317 Changes in 20160317