From ed0fca3dbfa4857ea08bbde3b5be761ec00c50bd Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes <jfd@recoll.org>
Date: Tue, 3 May 2016 11:30:13 +0200
Subject: [PATCH] doc touchups

---
 .hgignore                                   |   1 +
 src/doc/user/usermanual.html                | 108 +++++++----
 src/doc/user/usermanual.xml                 |  69 ++++---
 website/index.html.en                       |  25 ++-
 website/pages/Makefile                      |   3 +-
 website/pages/recoll-webui-install-wsgi.txt | 200 ++++++++++++++++++++
 website/pages/recoll-windows.txt            |   2 +-
 7 files changed, 330 insertions(+), 78 deletions(-)
 create mode 100644 website/pages/recoll-webui-install-wsgi.txt
diff --git a/.hgignore b/.hgignore
index afc23adc..1d76ecb3 100644
--- a/.hgignore
+++ b/.hgignore
@@ -92,5 +92,6 @@ website/usermanual/*
 website/idxthreads/forkingRecoll.html
 website/idxthreads/xapDocCopyCrash.html
 website/pages/recoll-mingw.html
+website/pages/recoll-webui-install-wsgi.html
 website/pages/recoll-windows.html
 src/doc/user/RCL.SEARCH.SYNONYMS.html
diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index 8899520a..49fb82a8 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -20,8 +20,8 @@ alink="#0000FF">
     <div class="titlepage">
       <div>
         <div>
-          <h1 class="title"><a name="idp59627200" id=
-          "idp59627200"></a>Recoll user manual</h1>
+          <h1 class="title"><a name="idp57237872" id=
+          "idp57237872"></a>Recoll user manual</h1>
         </div>
 
         <div>
@@ -109,13 +109,13 @@ alink="#0000FF">
                 multiple indexes</a></span></dt>
 
                 <dt><span class="sect2">2.1.3. <a href=
-                "#idp65068656">Document types</a></span></dt>
+                "#idp63233312">Document types</a></span></dt>
 
                 <dt><span class="sect2">2.1.4. <a href=
-                "#idp65088336">Indexing failures</a></span></dt>
+                "#idp63252992">Indexing failures</a></span></dt>
 
                 <dt><span class="sect2">2.1.5. <a href=
-                "#idp65095792">Recovery</a></span></dt>
+                "#idp63260448">Recovery</a></span></dt>
               </dl>
             </dd>
 
@@ -981,8 +981,8 @@ alink="#0000FF">
           <div class="titlepage">
             <div>
               <div>
-                <h3 class="title"><a name="idp65068656" id=
-                "idp65068656"></a>2.1.3.&nbsp;Document types</h3>
+                <h3 class="title"><a name="idp63233312" id=
+                "idp63233312"></a>2.1.3.&nbsp;Document types</h3>
               </div>
             </div>
           </div>
@@ -1075,8 +1075,8 @@ indexedmimetypes = application/pdf
           <div class="titlepage">
             <div>
               <div>
-                <h3 class="title"><a name="idp65088336" id=
-                "idp65088336"></a>2.1.4.&nbsp;Indexing
+                <h3 class="title"><a name="idp63252992" id=
+                "idp63252992"></a>2.1.4.&nbsp;Indexing
                 failures</h3>
               </div>
             </div>
@@ -1116,8 +1116,8 @@ indexedmimetypes = application/pdf
           <div class="titlepage">
             <div>
               <div>
-                <h3 class="title"><a name="idp65095792" id=
-                "idp65095792"></a>2.1.5.&nbsp;Recovery</h3>
+                <h3 class="title"><a name="idp63260448" id=
+                "idp63260448"></a>2.1.5.&nbsp;Recovery</h3>
               </div>
             </div>
           </div>
@@ -6379,32 +6379,41 @@ or
 
             <p><span class="application">Recoll</span> versions
             after 1.11 define a Python programming interface, both
-            for searching and indexing. The indexing portion has
-            seen little use, but the searching one is used in the
-            Recoll Ubuntu Unity Lens and Recoll Web UI.</p>
+            for searching and indexing.</p>
 
-            <p>The API is inspired by the Python database API
-            specification. There were two major changes in recent
+            <p>The search interface is used in the Recoll Ubuntu
+            Unity Lens and Recoll WebUI.</p>
+
+            <p>The indexing section of the API has seen little use,
+            and is more a proof of concept. In truth it is waiting
+            for its killer app...</p>
+
+            <p>The search API is modeled on the Python database
+            API specification. There were two major changes across
             <span class="application">Recoll</span> versions:</p>
 
             <div class="itemizedlist">
               <ul class="itemizedlist" style=
               "list-style-type: disc;">
-                <li class="listitem">The basis for the <span class=
-                "application">Recoll</span> API changed from Python
-                database API version 1.0 (<span class=
-                "application">Recoll</span> versions up to 1.18.1),
-                to version 2.0 (<span class=
-                "application">Recoll</span> 1.18.2 and later).</li>
+                <li class="listitem">
+                  <p>The basis for the <span class=
+                  "application">Recoll</span> API changed from
+                  Python database API version 1.0 (<span class=
+                  "application">Recoll</span> versions up to
+                  1.18.1), to version 2.0 (<span class=
+                  "application">Recoll</span> 1.18.2 and
+                  later).</p>
+                </li>
 
-                <li class="listitem">The <code class=
-                "literal">recoll</code> module became a package
-                (with an internal <code class=
-                "literal">recoll</code> module) as of <span class=
-                "application">Recoll</span> version 1.19, in order
-                to add more functions. For existing code, this only
-                changes the way the interface must be
-                imported.</li>
+                <li class="listitem">
+                  <p>The <code class="literal">recoll</code> module
+                  became a package (with an internal <code class=
+                  "literal">recoll</code> module) as of
+                  <span class="application">Recoll</span> version
+                  1.19, in order to add more functions. For
+                  existing code, this only changes the way the
+                  interface must be imported.</p>
+                </li>
               </ul>
             </div>
 
@@ -6433,13 +6442,38 @@ or
           
 </pre>
 
+            <p>As of <span class="application">Recoll</span> 1.19,
+            the module can be compiled for Python3.</p>
+
             <p>The normal <span class="application">Recoll</span>
-            installer installs the Python API along with the main
-            code.</p>
+            installer installs the Python2 API along with the main
+            code. The Python3 version must be explicitly built and
+            installed.</p>
 
             <p>When installing from a repository, and depending on
             the distribution, the Python API can sometimes be found
             in a separate package.</p>
+
+            <p>The following small sample will run a query and list
+            the title and URL for each of the results. It should
+            work with <span class="application">Recoll</span> 1.19
+            and later. The <code class=
+            "filename">python/samples</code> source directory
+            contains several examples of Python programming with
+            <span class="application">Recoll</span>, exercising the
+            extension more completely, and especially its data
+            extraction features.</p>
+            <pre class="programlisting">
+          from recoll import recoll
+
+          db = recoll.connect()
+          query = db.query()
+          nres = query.execute("some query")
+          results = query.fetchmany(20)
+          for doc in results:
+              print(doc.url, doc.title)
+        
+</pre>
           </div>
 
           <div class="sect3">
@@ -6564,8 +6598,6 @@ or
                 connection to a Recoll index.</p>
 
                 <div class="variablelist">
-                  <p class="title"><b>Methods</b></p>
-
                   <dl class="variablelist">
                     <dt><span class="term">Db.close()</span></dt>
 
@@ -6628,8 +6660,6 @@ or
                 execute index searches.</p>
 
                 <div class="variablelist">
-                  <p class="title"><b>Methods</b></p>
-
                   <dl class="variablelist">
                     <dt><span class="term">Query.sortby(fieldname,
                     ascending=True)</span></dt>
@@ -6805,8 +6835,6 @@ or
                 document contents.</p>
 
                 <div class="variablelist">
-                  <p class="title"><b>Methods</b></p>
-
                   <dl class="variablelist">
                     <dt><span class="term">get(key), []
                     operator</span></dt>
@@ -6854,8 +6882,6 @@ or
                 detailed doc for now...</p>
 
                 <div class="variablelist">
-                  <p class="title"><b>Methods</b></p>
-
                   <dl class="variablelist">
                     <dt><span class=
                     "term">addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
@@ -6914,8 +6940,6 @@ or
                 </div>
 
                 <div class="variablelist">
-                  <p class="title"><b>Methods</b></p>
-
                   <dl class="variablelist">
                     <dt><span class=
                     "term">Extractor(doc)</span></dt>
diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
index f9f05ed5..a0e4754a 100644
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -4354,23 +4354,27 @@ or
         <title>Introduction</title>
 
         <para>&RCL; versions after 1.11 define a Python programming
-          interface, both for searching and indexing. The indexing
-          portion has seen little use, but the searching one is used
-          in the Recoll Ubuntu Unity Lens and Recoll Web UI.</para> 
+        interface, both for searching and indexing.</para>
 
-        <para>The API is inspired by the Python database API
-          specification. There were two major changes in recent &RCL;
-          versions:
-          <itemizedlist>
-            <listitem>The basis for the &RCL; API changed from Python
-              database API version 1.0 (&RCL; versions up to 1.18.1),
-              to version 2.0 (&RCL; 1.18.2 and later).</listitem>
-            <listitem>The <literal>recoll</literal> module became a
-              package (with an internal <literal>recoll</literal>
-              module) as of &RCL; version 1.19, in order to add more
-              functions. For existing code, this only changes the way
-              the interface must be imported.</listitem>
-          </itemizedlist>
+        <para>The search interface is used in the Recoll Ubuntu Unity Lens
+        and Recoll WebUI.</para>
+
+        <para>The indexing section of the API has seen little use, and is
+        more a proof of concept. In truth it is waiting for its killer
+        app...</para>
+        
+        <para>The search API is modeled on the Python database API
+        specification. There were two major changes across &RCL; versions:
+        <itemizedlist>
+          <listitem><para>The basis for the &RCL; API changed from Python
+          database API version 1.0 (&RCL; versions up to 1.18.1),
+          to version 2.0 (&RCL; 1.18.2 and later).</para></listitem>
+          <listitem><para>The <literal>recoll</literal> module became a
+          package (with an internal <literal>recoll</literal>
+          module) as of &RCL; version 1.19, in order to add more
+          functions. For existing code, this only changes the way
+          the interface must be imported.</para></listitem>
+        </itemizedlist>
         </para>
 
         <para>We will mostly describe the new API and package
@@ -4392,13 +4396,33 @@ or
           </screen>
         </para> 
 
-        <para>The normal &RCL; installer installs the Python
-          API along with the main code.</para>
+        <para>As of &RCL; 1.19, the module can be compiled for
+        Python3.</para>
+
+        <para>The normal &RCL; installer installs the Python2
+          API along with the main code. The Python3 version must be
+        explicitly built and installed.</para>
 
         <para>When installing from a repository, and depending on the
-          distribution, the Python API can sometimes be found in a
-          separate package.</para>
+        distribution, the Python API can sometimes be found in a
+        separate package.</para>
 
+        <para>The following small sample will run a query and list
+        the title and URL for each of the results. It should work with &RCL;
+        1.19 and later. The <filename>python/samples</filename> source directory
+        contains several examples of Python programming with &RCL;,
+        exercising the extension more completely, and especially its data
+        extraction features.</para>
+        <programlisting>
+          from recoll import recoll
+
+          db = recoll.connect()
+          query = db.query()
+          nres = query.execute("some query")
+          results = query.fetchmany(20)
+          for doc in results:
+              print(doc.url, doc.title)
+        </programlisting>
       </sect3>
 
       <sect3 id="RCL.PROGRAM.PYTHON.PACKAGE">
@@ -4460,7 +4484,6 @@ or
             a <literal>connect()</literal> call and holds a 
             connection to a Recoll index.</para>
           <variablelist>
-            <title>Methods</title>
             <varlistentry>
               <term>Db.close()</term>
               <listitem>Closes the connection. You can't do anything
@@ -4511,7 +4534,6 @@ or
             execute index searches.</para>
 
           <variablelist>
-            <title>Methods</title>
 
             <varlistentry>
               <term>Query.sortby(fieldname, ascending=True)</term>
@@ -4659,7 +4681,6 @@ or
             module for accessing document contents.</para> 
 
           <variablelist>
-            <title>Methods</title>
 
             <varlistentry>
               <term>get(key), [] operator</term>
@@ -4694,7 +4715,6 @@ or
             for now...</para>
 
           <variablelist>
-            <title>Methods</title>
 
             <varlistentry>
               <term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
@@ -4729,7 +4749,6 @@ or
             <title>The Extractor class</title>
 
             <variablelist>
-              <title>Methods</title>
 
               <varlistentry>
                 <term>Extractor(doc)</term>
diff --git a/website/index.html.en b/website/index.html.en
index 06adf4d0..08489dba 100644
--- a/website/index.html.en
+++ b/website/index.html.en
@@ -71,12 +71,12 @@
 	interface.</p>
 
       <p class="remark">Recoll will index an <b>MS-Word</b> document
-      stored as an <b>attachment</b> to an <b>e-mail message</b> inside
-	  a <b>Thunderbird folder</b> archived in a <b>Zip file</b> (and
-	  more...). It will also help you search for it with a friendly and
-	  powerful interface, and let you open a copy of a PDF at the right
-	  page with two clicks. There is little that will remain
-	  hidden on your disk.</p>
+        stored as an <b>attachment</b> to an <b>e-mail message</b> inside
+	a <b>Thunderbird folder</b> archived in a <b>Zip file</b> (and
+	more...). It will also help you search for it with a friendly and
+	powerful interface, and let you open a copy of a PDF at the right
+	page with two clicks. There is little that will remain
+	hidden on your disk.</p>
 
       <p>Recoll has extensive <a href="doc.html">
           documentation</a>. If you run into a problem, or want to
@@ -96,8 +96,16 @@
       <h2>News</h2>
       <div class="news">
       
-      <dl>
-        <dt>2016-04-18</dt><dd>Found a <a href="BUGS.html#GUIADV">GUI
+        <dl>
+
+          <dt>2016-04-21</dt><dd>I experimented with installing
+            the <a href="https://github.com/koniu/recoll-webui">Recoll
+              Web UI</a> with Apache, and found out
+            that <a href="pages/recoll-webui-install-wsgi.html">this
+            is really easy</a>, actually both easier to set up and
+            more useful than running it standalone.</dd>
+          
+          <dt>2016-04-18</dt><dd>Found a <a href="BUGS.html#GUIADV">GUI
             crash bug</a> with a reasonably easy workaround.</dd>
 
         <dt>2016-04-14</dt><dd>Release 1.22.0 is now available from
@@ -122,7 +130,6 @@
           the <a href="BUGS.html">known bugs page</a></dd>
 
         <dt>2015-11-09</dt>
-        <dd>
         <dd><a href="pics/windows-recoll.html">
             <img align="left" width="100" alt="Recoll on MS-Windows" 
                  src="pics/windows-recoll-thumb.png"></a>
diff --git a/website/pages/Makefile b/website/pages/Makefile
index c41a0cac..420b997d 100644
--- a/website/pages/Makefile
+++ b/website/pages/Makefile
@@ -3,7 +3,8 @@
 .txt.html:
 	asciidoc $<
 
-all: recoll-mingw.html recoll-windows.html
+all: recoll-mingw.html recoll-windows.html \
+     recoll-webui-install-wsgi.html
 
 clean:
 	rm -f *.html
diff --git a/website/pages/recoll-webui-install-wsgi.txt b/website/pages/recoll-webui-install-wsgi.txt
new file mode 100644
index 00000000..237923eb
--- /dev/null
+++ b/website/pages/recoll-webui-install-wsgi.txt
@@ -0,0 +1,200 @@
+= Recoll WebUI Apache installation from scratch 
+
+The https://github.com/koniu/recoll-webui[Recoll WebUI] offers an
+alternative, web-based interface for querying a Recoll index.
+
+It can be quite useful to extend the use of a shared index to multiple
+workstations, without the need for a local Recoll installation and shared
+data storage.
+
+The Recoll WebUI is based on the
+http://bottlepy.org/docs/dev/index.html[Bottle Python framework], which has
+a built-in web server, and the simplest deployment approach is to run it
+standalone. However, the built-in server is restricted to handling one
+request at a time, which is problematic in multi-user situations,
+especially because some requests, like extracting a result list into a CSV
+file, can take a significant amount of time.
+
+The Bottle framework can work with several multi-threading Python HTTP
+server libraries, but, given the limitations of the Recoll Python module
+and the Python interpreter itself, this will not yield optimal performance,
+and, in particular, cannot efficiently use the now ubiquitous multiprocessor
+machines.
+
+In multi-user situations, you can get better performance and ease of use
+from the Recoll WebUI by running it under Apache rather than as a
+standalone process. With this approach, a few requests per second can
+easily be handled even in the presence of long-running ones.
+
+Neither Recoll nor the WebUI is optimized for high multi-user load, and it
+would be very unwise to use them as the search interface to a busy web
+site.
+
+The instructions about using the WebUI under Apache as given in the
+repository README are a bit terse, and are missing a few details,
+especially ones which impact performance.
+
+What follows is a summary of two WebUI installations on initially
+Apache-less Ubuntu (14.04) and DragonFly BSD systems. The first should
+extend easily to other Debian-based systems, the second at least to
+FreeBSD. RPM-based systems are left as an exercise to the reader, at least
+for now...
+
+
+CAUTION: THE CONFIGURATIONS DESCRIBED HAVE NO ACCESS CONTROL. ANYONE WITH
+ACCESS TO THE NETWORK WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY
+DOCUMENT.
+
+== On a Debian/Ubuntu system
+
+=== Install recoll 
+
+    sudo apt-get install recoll python-recoll
+
+Configure the indexing and check that the normal search works (I spent
+quite a lot of time trying to understand why the WebUI did not work, when
+in fact it was the normal recoll configuration which was broken and the
+regular search did not work either).
+
+Take care to be logged in as the user you want to run the web search as
+while you do this.
+
+
+=== Install the WebUI
+
+Clone the GitHub repository, or extract the master tar archive, and
+move it to '/var/www/recoll-webui-master/'. Make sure that it is readable and
+executable by your user.
+
+=== Install Apache and mod-wsgi
+
+
+    sudo apt-get install apache2 libapache2-mod-wsgi
+
+I then got the following message:
+
+    AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
+
+To clear it, I added a ServerName directive to the Apache configuration (you
+may not need it). Edit '/etc/apache2/sites-available/000-default.conf' and add
+the following at the top (globally). Things work without this fix anyway;
+it only suppresses the message. You will probably need to adjust the
+address or use a real host name:
+
+    ServerName 192.168.4.6
+
+
+Edit '/etc/apache2/mods-enabled/wsgi.conf', add the following at the end of
+the "IfModule" section.
+
+Change the user ('dockes' in the example), making sure that it is the one which
+owns the index ('.recoll' is in that user's home directory).
+
+    WSGIDaemonProcess recoll user=dockes group=dockes \
+        threads=1 processes=5 display-name=%{GROUP} \
+        python-path=/var/www/recoll-webui-master
+    WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
+    <Directory /var/www/recoll-webui-master>
+            WSGIProcessGroup recoll
+            Order allow,deny
+            allow from all
+    </Directory>
+
+NOTE: the Recoll WebUI application is mostly single-threaded, so it is of
+little use (and may actually be counter-productive in some cases) to
+specify multiple threads on the WSGIDaemonProcess line. Specify multiple
+processes instead to put multiple CPUs to work on simultaneous requests.
+
+
+Then run the following to restart apache:
+
+    sudo apachectl restart
+
+The Recoll WebUI should now be accessible at 'http://my.server.com/recoll/'.
+
+NOTE: You need a '/' at the end of the URL used to access
+the search (use: 'http://my.server.com/recoll/', not
+'http://my.server.com/recoll'), else files other than the script itself are
+not found (the page looks weird and the search does not work).
+
+CAUTION: THERE IS NO ACCESS CONTROL. ANYONE WITH ACCESS TO THE NETWORK
+WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY DOCUMENT.
+
+== Variant for BSD/ports
+
+=== Packages
+
+As root:
+
+    pkg install recoll
+
+
+Do what you need to do to configure the indexing and check that the normal
+search works.
+
+Take care to be logged in as the user you want to run the web search as
+while you do this.
+
+    pkg install apache24
+
+Add apache24_enable="YES" to /etc/rc.conf
+
+    pkg install ap24-mod_wsgi4
+    pkg install git
+
+=== Clone the webui repository
+
+    cd /usr/local/www/apache24/
+    git clone https://github.com/koniu/recoll-webui.git recoll-webui-master
+
+Important: most input handler helper applications (e.g. 'pdftotext') are
+installed in '/usr/local/bin' which is not in the PATH as seen by Apache
+(at least on DragonFly). The simplest way to fix this is to modify the
+launcher module for the webui app so that it fixes the PATH.
+
+Edit 'recoll-webui-master/webui-wsgi.py' and add the following line after
+the 'import os' line:
+
+    os.environ['PATH'] = os.environ['PATH'] + ':' + '/usr/local/bin'
+
+
+
+=== Configure apache
+
+Edit /usr/local/etc/apache24/modules.d/270_mod_wsgi.conf
+
+Uncomment the LoadModule line, and add the directives to alias /recoll/ to
+the webui script.
+
+Change the user (dockes in the example), making sure that it is the one which
+owns the index (.recoll is in that user's home directory).
+
+Contents of the file:
+
+    ## $FreeBSD$
+    ## vim: set filetype=apache:
+    ##
+    ## module file for mod_wsgi
+    ##
+    ## PROVIDE: mod_wsgi
+    ## REQUIRE:
+    
+    LoadModule wsgi_module        libexec/apache24/mod_wsgi.so
+    
+    WSGIDaemonProcess recoll user=dockes group=dockes \
+        threads=1 processes=5 display-name=%{GROUP} \
+        python-path=/usr/local/www/apache24/recoll-webui-master/
+    WSGIScriptAlias /recoll /usr/local/www/apache24/recoll-webui-master/webui-wsgi.py
+    
+    <Directory /usr/local/www/apache24/recoll-webui-master>
+            WSGIProcessGroup recoll
+            Require all granted
+    </Directory>
+
+=== Restart apache
+
+As root:
+
+    apachectl restart
+
+
diff --git a/website/pages/recoll-windows.txt b/website/pages/recoll-windows.txt
index a96238df..d4907848 100644
--- a/website/pages/recoll-windows.txt
+++ b/website/pages/recoll-windows.txt
@@ -91,7 +91,7 @@ Changes in 20160414
 - Fixed a bug which had the whole indexing stop if a script would time out
   on a specific file (it will very rarely happen that a pathologically bad
   file can throw an input handler in a loop).
--
+
 
 Changes in 20160317