doc touchups

2016-05-03 11:30:13 +02:00 · ed0fca3dbf
commit ed0fca3dbf
parent 72ab23080d
7 changed files with 330 additions and 78 deletions
--- a/.hgignore
+++ b/.hgignore
@ -92,5 +92,6 @@ website/usermanual/*
 website/idxthreads/forkingRecoll.html
 website/idxthreads/xapDocCopyCrash.html
 website/pages/recoll-mingw.html
 website/pages/recoll-webui-install-wsgi.html
 website/pages/recoll-windows.html
 src/doc/user/RCL.SEARCH.SYNONYMS.html
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@ -20,8 +20,8 @@ alink="#0000FF">
    <div class="titlepage">
      <div>
        <div>
-          <h1 class="title"><a name="idp59627200" id=
+          <h1 class="title"><a name="idp57237872" id=
-          "idp59627200"></a>Recoll user manual</h1>
+          "idp57237872"></a>Recoll user manual</h1>
        </div>
        <div>
@ -109,13 +109,13 @@ alink="#0000FF">
                multiple indexes</a></span></dt>
                <dt><span class="sect2">2.1.3. <a href=
-                "#idp65068656">Document types</a></span></dt>
+                "#idp63233312">Document types</a></span></dt>
                <dt><span class="sect2">2.1.4. <a href=
-                "#idp65088336">Indexing failures</a></span></dt>
+                "#idp63252992">Indexing failures</a></span></dt>
                <dt><span class="sect2">2.1.5. <a href=
-                "#idp65095792">Recovery</a></span></dt>
+                "#idp63260448">Recovery</a></span></dt>
              </dl>
            </dd>
@ -981,8 +981,8 @@ alink="#0000FF">
          <div class="titlepage">
            <div>
              <div>
-                <h3 class="title"><a name="idp65068656" id=
+                <h3 class="title"><a name="idp63233312" id=
-                "idp65068656"></a>2.1.3.&nbsp;Document types</h3>
+                "idp63233312"></a>2.1.3.&nbsp;Document types</h3>
              </div>
            </div>
          </div>
@ -1075,8 +1075,8 @@ indexedmimetypes = application/pdf
          <div class="titlepage">
            <div>
              <div>
-                <h3 class="title"><a name="idp65088336" id=
+                <h3 class="title"><a name="idp63252992" id=
-                "idp65088336"></a>2.1.4.&nbsp;Indexing
+                "idp63252992"></a>2.1.4.&nbsp;Indexing
                failures</h3>
              </div>
            </div>
@ -1116,8 +1116,8 @@ indexedmimetypes = application/pdf
          <div class="titlepage">
            <div>
              <div>
-                <h3 class="title"><a name="idp65095792" id=
+                <h3 class="title"><a name="idp63260448" id=
-                "idp65095792"></a>2.1.5.&nbsp;Recovery</h3>
+                "idp63260448"></a>2.1.5.&nbsp;Recovery</h3>
              </div>
            </div>
          </div>
@ -6379,32 +6379,41 @@ or
            <p><span class="application">Recoll</span> versions
            after 1.11 define a Python programming interface, both
-            for searching and indexing. The indexing portion has
+            for searching and indexing.</p>
            seen little use, but the searching one is used in the
            Recoll Ubuntu Unity Lens and Recoll Web UI.</p>
-            <p>The API is inspired by the Python database API
+            <p>The search interface is used in the Recoll Ubuntu
-            specification. There were two major changes in recent
+            Unity Lens and Recoll WebUI.</p>
            <p>The indexing section of the API has seen little use,
            and is more a proof of concept. In truth it is waiting
            for its killer app...</p>
            <p>The search API is modeled along the Python database
            API specification. There were two major changes along
            <span class="application">Recoll</span> versions:</p>
            <div class="itemizedlist">
              <ul class="itemizedlist" style=
              "list-style-type: disc;">
-                <li class="listitem">The basis for the <span class=
+                <li class="listitem">
-                "application">Recoll</span> API changed from Python
+                  <p>The basis for the <span class=
-                database API version 1.0 (<span class=
+                  "application">Recoll</span> API changed from
-                "application">Recoll</span> versions up to 1.18.1),
+                  Python database API version 1.0 (<span class=
-                to version 2.0 (<span class=
+                  "application">Recoll</span> versions up to
-                "application">Recoll</span> 1.18.2 and later).</li>
+                  1.18.1), to version 2.0 (<span class=
                  "application">Recoll</span> 1.18.2 and
                  later).</p>
                </li>
-                <li class="listitem">The <code class=
+                <li class="listitem">
-                "literal">recoll</code> module became a package
+                  <p>The <code class="literal">recoll</code> module
-                (with an internal <code class=
+                  became a package (with an internal <code class=
-                "literal">recoll</code> module) as of <span class=
+                  "literal">recoll</code> module) as of
-                "application">Recoll</span> version 1.19, in order
+                  <span class="application">Recoll</span> version
-                to add more functions. For existing code, this only
+                  1.19, in order to add more functions. For
-                changes the way the interface must be
+                  existing code, this only changes the way the
-                imported.</li>
+                  interface must be imported.</p>
                </li>
              </ul>
            </div>
@ -6433,13 +6442,38 @@ or
 </pre>
            <p>As of <span class="application">Recoll</span> 1.19,
            the module can be compiled for Python3.</p>
            <p>The normal <span class="application">Recoll</span>
-            installer installs the Python API along with the main
+            installer installs the Python2 API along with the main
-            code.</p>
+            code. The Python3 version must be explicitly built and
            installed.</p>
            <p>When installing from a repository, and depending on
            the distribution, the Python API can sometimes be found
            in a separate package.</p>
            <p>The following small sample will run a query and list
            the title and url for each of the results. It would
            work with <span class="application">Recoll</span> 1.19
            and later. The <code class=
            "filename">python/samples</code> source directory
            contains several examples of Python programming with
            <span class="application">Recoll</span>, exercising the
            extension more completely, and especially its data
            extraction features.</p>
            <pre class="programlisting">
          from recoll import recoll
          db = recoll.connect()
          query = db.query()
          nres = query.execute("some query")
          results = query.fetchmany(20)
          for doc in results:
              print(doc.url, doc.title)
 </pre>
          </div>
          <div class="sect3">
@ -6564,8 +6598,6 @@ or
                connection to a Recoll index.</p>
                <div class="variablelist">
                  <p class="title"><b>Methods</b></p>
                  <dl class="variablelist">
                    <dt><span class="term">Db.close()</span></dt>
@ -6628,8 +6660,6 @@ or
                execute index searches.</p>
                <div class="variablelist">
                  <p class="title"><b>Methods</b></p>
                  <dl class="variablelist">
                    <dt><span class="term">Query.sortby(fieldname,
                    ascending=True)</span></dt>
@ -6805,8 +6835,6 @@ or
                document contents.</p>
                <div class="variablelist">
                  <p class="title"><b>Methods</b></p>
                  <dl class="variablelist">
                    <dt><span class="term">get(key), []
                    operator</span></dt>
@ -6854,8 +6882,6 @@ or
                detailed doc for now...</p>
                <div class="variablelist">
                  <p class="title"><b>Methods</b></p>
                  <dl class="variablelist">
                    <dt><span class=
                    "term">addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
@ -6914,8 +6940,6 @@ or
                </div>
                <div class="variablelist">
                  <p class="title"><b>Methods</b></p>
                  <dl class="variablelist">
                    <dt><span class=
                    "term">Extractor(doc)</span></dt>
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@ -4354,22 +4354,26 @@ or
        <title>Introduction</title>
        <para>&RCL; versions after 1.11 define a Python programming
-          interface, both for searching and indexing. The indexing
+        interface, both for searching and indexing.</para>
          portion has seen little use, but the searching one is used
          in the Recoll Ubuntu Unity Lens and Recoll Web UI.</para> 
-        <para>The API is inspired by the Python database API
+        <para>The search interface is used in the Recoll Ubuntu Unity Lens
-          specification. There were two major changes in recent &RCL;
+        and Recoll WebUI.</para>
-          versions:
+
        <para>The indexing section of the API has seen little use, and is
        more a proof of concept. In truth it is waiting for its killer
        app...</para>
        <para>The search API is modeled along the Python database API
        specification. There were two major changes along &RCL; versions:
        <itemizedlist>
-            <listitem>The basis for the &RCL; API changed from Python
+          <listitem><para>The basis for the &RCL; API changed from Python
          database API version 1.0 (&RCL; versions up to 1.18.1),
-              to version 2.0 (&RCL; 1.18.2 and later).</listitem>
+          to version 2.0 (&RCL; 1.18.2 and later).</para></listitem>
-            <listitem>The <literal>recoll</literal> module became a
+          <listitem><para>The <literal>recoll</literal> module became a
          package (with an internal <literal>recoll</literal>
          module) as of &RCL; version 1.19, in order to add more
          functions. For existing code, this only changes the way
-              the interface must be imported.</listitem>
+          the interface must be imported.</para></listitem>
        </itemizedlist>
        </para>
@ -4392,13 +4396,33 @@ or
          </screen>
        </para> 
-        <para>The normal &RCL; installer installs the Python
+        <para>As of &RCL; 1.19, the module can be compiled for
-          API along with the main code.</para>
+        Python3.</para>
        <para>The normal &RCL; installer installs the Python2
          API along with the main code. The Python3 version must be
        explicitly built and installed.</para>
        <para>When installing from a repository, and depending on the
        distribution, the Python API can sometimes be found in a
        separate package.</para>
        <para>The following small sample will run a query and list
        the title and url for each of the results. It would work with &RCL;
        1.19 and later. The <filename>python/samples</filename> source directory
        contains several examples of Python programming with &RCL;,
        exercising the extension more completely, and especially its data
        extraction features.</para>
        <programlisting>
          from recoll import recoll
          db = recoll.connect()
          query = db.query()
          nres = query.execute("some query")
          results = query.fetchmany(20)
          for doc in results:
              print(doc.url, doc.title)
        </programlisting>
      </sect3>
      <sect3 id="RCL.PROGRAM.PYTHON.PACKAGE">
@ -4460,7 +4484,6 @@ or
            a <literal>connect()</literal> call and holds a 
            connection to a Recoll index.</para>
          <variablelist>
            <title>Methods</title>
            <varlistentry>
              <term>Db.close()</term>
              <listitem>Closes the connection. You can't do anything
@ -4511,7 +4534,6 @@ or
            execute index searches.</para>
          <variablelist>
            <title>Methods</title>
            <varlistentry>
              <term>Query.sortby(fieldname, ascending=True)</term>
@ -4659,7 +4681,6 @@ or
            module for accessing document contents.</para> 
          <variablelist>
            <title>Methods</title>
            <varlistentry>
              <term>get(key), [] operator</term>
@ -4694,7 +4715,6 @@ or
            for now...</para>
          <variablelist>
            <title>Methods</title>
            <varlistentry>
              <term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
@ -4729,7 +4749,6 @@ or
            <title>The Extractor class</title>
            <variablelist>
              <title>Methods</title>
              <varlistentry>
                <term>Extractor(doc)</term>
--- a/website/index.html.en
+++ b/website/index.html.en
@ -97,6 +97,14 @@
      <div class="news">
        <dl>
          <dt>2016-04-21</dt><dd>I experimented with installing
            the <a href="https://github.com/koniu/recoll-webui">Recoll
              Web UI</a> with Apache, and found out
            that <a href="pages/recoll-webui-install-wsgi.html">this
            is really easy</a>, actually both easier to set up and
            more useful than running it standalone.</dd>
          <dt>2016-04-18</dt><dd>Found a <a href="BUGS.html#GUIADV">GUI
            crash bug</a> with a reasonably easy workaround.</dd>
@ -122,7 +130,6 @@
          the <a href="BUGS.html">known bugs page</a></dd>
        <dt>2015-11-09</dt>
        <dd>
        <dd><a href="pics/windows-recoll.html">
            <img align="left" width="100" alt="Recoll on MS-Windows" 
                 src="pics/windows-recoll-thumb.png"></a>
--- a/website/pages/Makefile
+++ b/website/pages/Makefile
@ -3,7 +3,8 @@
 .txt.html:
 	asciidoc $<
-all: recoll-mingw.html recoll-windows.html
+all: recoll-mingw.html recoll-windows.html \
     recoll-webui-install-wsgi.html
 clean:
 	rm -f *.html
--- a/website/pages/recoll-webui-install-wsgi.txt
+++ b/website/pages/recoll-webui-install-wsgi.txt
@ -0,0 +1,200 @@
 = Recoll WebUI Apache installation from scratch 
 The https://github.com/koniu/recoll-webui[Recoll WebUI] offers an
 alternative, WEB-based, interface for querying a Recoll index.
 It can be quite useful to extend the use of a shared index to multiple
 workstations, without the need for a local Recoll installation and shared
 data storage.
 The Recoll WebUI is based on the
 http://bottlepy.org/docs/dev/index.html[Bottle Python framework], which has
 a built-in WEB server, and the simplest deployment approach is to run it
 standalone. However the built-in server is restricted to handling one
 request at a time, which is problematic in multi-user situations,
 especially because some requests, like extracting a result list into a CSV
 file, can take a significant amount of time.
 The Bottle framework can work with several multi-threading Python HTTP
 server libraries, but, given the limitations of the Recoll Python module
 and the Python interpreter itself, this will not yield optimal performance
 and, in particular, cannot efficiently use the now ubiquitous
 multiprocessors.
 In multi-user situations, you can get better performance and ease of use
 from the Recoll WebUI by running it under Apache rather than as a
 standalone process. With this approach, a few requests per second can
 easily be handled even in the presence of long-running ones.
 Neither Recoll nor the WebUI is optimized for high multi-user load, and it
 would be very unwise to use them as the search interface to a busy WEB
 site.
 The instructions about using the WebUI under Apache as given in the
 repository README are a bit terse, and are missing a few details,
 especially ones which impact performance.
 Here follows the synopsis of two WebUI installations on initially
 Apache-less Ubuntu (14.04) and DragonFly BSD systems. The first should
 extend easily to other Debian-based systems, the second at least to
 FreeBSD. rpm-based systems are left as an exercise to the reader, at least
 for now...
 CAUTION: THE CONFIGURATIONS DESCRIBED HAVE NO ACCESS CONTROL. ANYONE WITH
 ACCESS TO THE NETWORK WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY
 DOCUMENT.
 == On a Debian/Ubuntu system
 === Install recoll 
    sudo apt-get install recoll python-recoll
 Configure the indexing and check that the normal search works (I spent
 quite a lot of time trying to understand why the WebUI did not work, when
 in fact it was the normal recoll configuration which was broken and the
 regular search did not work either).
 Take care to be logged in as the user you want to run the web search as
 while you do this.
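 Since the 'python-recoll' package is installed alongside Recoll here, it may
 also be worth confirming that a query works from Python for the same user.
 The following is only a minimal sketch, built on the calls shown in the
 Recoll user manual sample ('recoll.connect()', 'db.query()',
 'query.execute()', 'query.fetchmany()'). The 'first_results' helper and its
 'connect' parameter are hypothetical names introduced for illustration.

```python
def first_results(query_string, limit=5, connect=None):
    """Run a query and return (result count, [(url, title), ...]).

    `connect` is a hypothetical hook, only there so that the function can be
    exercised without a real index; by default the Recoll Python module
    (Recoll 1.19 or later package layout) is used.
    """
    if connect is None:
        # Imported lazily: needs the Recoll Python module to be installed.
        from recoll import recoll
        connect = recoll.connect

    db = connect()
    query = db.query()
    nres = query.execute(query_string)
    rows = [(doc.url, doc.title) for doc in query.fetchmany(limit)]
    return nres, rows


# Example use, as the user who owns the index:
#     nres, rows = first_results("some query")
#     for url, title in rows:
#         print(url, title)
```

 If this fails while the regular search works, the problem is more likely in
 the Python module installation than in the index configuration.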
 === Install the WebUI
 Clone the GitHub repository, or extract the master tar archive, and
 move it to '/var/www/recoll-webui-master/'. Take care that it is read/execute
 accessible by your user.
 === Install Apache and mod-wsgi
    sudo apt-get install apache2 libapache2-mod-wsgi
 I then got the following message:
    AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
 To clear it, I added a ServerName directive to the Apache configuration (you
 may not need it). Edit '/etc/apache2/sites-available/000-default.conf' and add
 the following at the top (globally). Things work without this fix anyway;
 it only suppresses the message. You will probably need to adjust the
 address or use a real host name:
    ServerName 192.168.4.6
 Edit '/etc/apache2/mods-enabled/wsgi.conf', add the following at the end of
 the "IfModule" section.
 Change the user ('dockes' in the example), making sure it is the user who
 owns the index ('.recoll' is in that user's home directory).
    WSGIDaemonProcess recoll user=dockes group=dockes \
        threads=1 processes=5 display-name=%{GROUP} \
        python-path=/var/www/recoll-webui-master
    WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
    <Directory /var/www/recoll-webui-master>
            WSGIProcessGroup recoll
            Order allow,deny
            allow from all
    </Directory>
 NOTE: the Recoll WebUI application is mostly single-threaded, so it is of
 little use (and may actually be counter-productive in some cases) to
 specify multiple threads on the WSGIDaemonProcess line. Specify multiple
 processes instead to put multiple CPUs to work on simultaneous requests.
 Then run the following to restart apache:
    sudo apachectl restart
 The Recoll WebUI should now be accessible at 'http://my.server.com/recoll/'.
 NOTE: You need a '/' at the end of the URL used to access
 the search (use: 'http://my.server.com/recoll/', not
 'http://my.server.com/recoll'), otherwise files other than the script itself
 are not found (the page looks weird and the search does not work).
 CAUTION: THERE IS NO ACCESS CONTROL. ANYONE WITH ACCESS TO THE NETWORK
 WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY DOCUMENT.
 == Variant for BSD/ports
 === Packages
 As root:
    pkg install recoll
 Do what you need to do to configure the indexing and check that the normal
 search works.
 Take care to be logged in as the user you want to run the web search as
 while you do this.
    pkg install apache24
 Add apache24_enable="YES" to /etc/rc.conf
    pkg install ap24-mod_wsgi4
    pkg install git
 === Clone the webui repository
    cd /usr/local/www/apache24/
    git clone https://github.com/koniu/recoll-webui.git recoll-webui-master
 Important: most input handler helper applications (e.g. 'pdftotext') are
 installed in '/usr/local/bin' which is not in the PATH as seen by Apache
 (at least on DragonFly). The simplest way to fix this is to modify the
 launcher module for the webui app so that it fixes the PATH.
 Edit 'recoll-webui-master/webui-wsgi.py' and add the following line after
 the 'import os' line:
    os.environ['PATH'] = os.environ['PATH'] + ':' + '/usr/local/bin'
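 The one-line fix above assumes that PATH is already set in the environment.
 A slightly more defensive variant could look like the sketch below. The
 'ensure_on_path' helper is a hypothetical name, not part of the WebUI; it
 copes with a missing PATH and does not add the directory twice.

```python
import os


def ensure_on_path(directory, environ=None):
    """Append `directory` to PATH in `environ` unless it is already listed.

    `environ` defaults to os.environ; passing a plain dict is only meant
    for trying the function out. The ':' separator is the Unix one, as in
    the one-line fix.
    """
    if environ is None:
        environ = os.environ
    # Tolerate a missing or empty PATH instead of raising KeyError.
    parts = [p for p in environ.get('PATH', '').split(':') if p]
    if directory not in parts:
        parts.append(directory)
    environ['PATH'] = ':'.join(parts)
    return environ['PATH']


# In the launcher module, after the 'import os' line:
#     ensure_on_path('/usr/local/bin')
```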
 === Configure apache
 Edit /usr/local/etc/apache24/modules.d/270_mod_wsgi.conf
 Uncomment the LoadModule line, and add the directives to alias /recoll/ to
 the webui script.
 Change the user (dockes in the example), making sure it is the user who
 owns the index (.recoll is in that user's home directory).
 Contents of the file:
    ## $FreeBSD$
    ## vim: set filetype=apache:
    ##
    ## module file for mod_wsgi
    ##
    ## PROVIDE: mod_wsgi
    ## REQUIRE:
    LoadModule wsgi_module        libexec/apache24/mod_wsgi.so
    WSGIDaemonProcess recoll user=dockes group=dockes \
        threads=1 processes=5 display-name=%{GROUP} \
        python-path=/usr/local/www/apache24/recoll-webui-master/
    WSGIScriptAlias /recoll /usr/local/www/apache24/recoll-webui-master/webui-wsgi.py
    <Directory /usr/local/www/apache24/recoll-webui-master>
            WSGIProcessGroup recoll
            Require all granted
    </Directory>
 === Restart apache
 As root:
    apachectl restart
--- a/website/pages/recoll-windows.txt
+++ b/website/pages/recoll-windows.txt
@ -91,7 +91,7 @@ Changes in 20160414
 - Fixed a bug which had the whole indexing stop if a script would time out
  on a specific file (it will very rarely happen that a pathologically bad
  file can throw an input handler in a loop).
-
+
 Changes in 20160317