From ed0fca3dbfa4857ea08bbde3b5be761ec00c50bd Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes
Date: Tue, 3 May 2016 11:30:13 +0200
Subject: [PATCH] doc touchups

---
 .hgignore                                   |   1 +
 src/doc/user/usermanual.html                | 108 +++++++----
 src/doc/user/usermanual.xml                 |  69 ++++---
 website/index.html.en                       |  25 ++-
 website/pages/Makefile                      |   3 +-
 website/pages/recoll-webui-install-wsgi.txt | 200 ++++++++++++++++++++
 website/pages/recoll-windows.txt            |   2 +-
 7 files changed, 330 insertions(+), 78 deletions(-)
 create mode 100644 website/pages/recoll-webui-install-wsgi.txt

diff --git a/.hgignore b/.hgignore
index afc23adc..1d76ecb3 100644
--- a/.hgignore
+++ b/.hgignore
@@ -92,5 +92,6 @@ website/usermanual/*
 website/idxthreads/forkingRecoll.html
 website/idxthreads/xapDocCopyCrash.html
 website/pages/recoll-mingw.html
+website/pages/recoll-webui-install-wsgi.html
 website/pages/recoll-windows.html
 src/doc/user/RCL.SEARCH.SYNONYMS.html
diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index 8899520a..49fb82a8 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -20,8 +20,8 @@ alink="#0000FF">
-

Recoll user manual

+

Recoll user manual

@@ -109,13 +109,13 @@ alink="#0000FF"> multiple indexes
2.1.3. Document types
+ "#idp63233312">Document types
2.1.4. Indexing failures
+ "#idp63252992">Indexing failures
2.1.5. Recovery
+ "#idp63260448">Recovery @@ -981,8 +981,8 @@ alink="#0000FF">
-

2.1.3. Document types

+

2.1.3. Document types

@@ -1075,8 +1075,8 @@ indexedmimetypes = application/pdf
-

2.1.4. Indexing +

2.1.4. Indexing failures

@@ -1116,8 +1116,8 @@ indexedmimetypes = application/pdf
-

2.1.5. Recovery

+

2.1.5. Recovery

@@ -6379,32 +6379,41 @@ or

Recoll versions after 1.11 define a Python programming interface, both - for searching and indexing. The indexing portion has - seen little use, but the searching one is used in the - Recoll Ubuntu Unity Lens and Recoll Web UI.

+ for searching and indexing.

-

The API is inspired by the Python database API - specification. There were two major changes in recent +

The search interface is used in the Recoll Ubuntu + Unity Lens and Recoll WebUI.

+ +

The indexing section of the API has seen little use, + and is more a proof of concept. In truth it is waiting + for its killer app...

+ +

The search API is modeled along the Python database + API specification. There were two major changes along Recoll versions:

    -
  • The basis for the Recoll API changed from Python - database API version 1.0 (Recoll versions up to 1.18.1), - to version 2.0 (Recoll 1.18.2 and later).
  • +
  • +

    The basis for the Recoll API changed from + Python database API version 1.0 (Recoll versions up to + 1.18.1), to version 2.0 (Recoll 1.18.2 and + later).

    +
  • -
  • The recoll module became a package - (with an internal recoll module) as of Recoll version 1.19, in order - to add more functions. For existing code, this only - changes the way the interface must be - imported.
  • +
  • +

    The recoll module + became a package (with an internal recoll module) as of + Recoll version + 1.19, in order to add more functions. For + existing code, this only changes the way the + interface must be imported.

    +
@@ -6433,13 +6442,38 @@ or +

As of Recoll 1.19, + the module can be compiled for Python3.

+

The normal Recoll - installer installs the Python API along with the main - code.

+ installer installs the Python2 API along with the main + code. The Python3 version must be explicitly built and + installed.

When installing from a repository, and depending on the distribution, the Python API can sometimes be found in a separate package.

+ +

The following small sample will run a query and list + the title and url for each of the results. It would + work with Recoll 1.19 + and later. The python/samples source directory + contains several examples of Python programming with + Recoll, exercising the + extension more completely, and especially its data + extraction features.

+
+          from recoll import recoll
+
+          db = recoll.connect()
+          query = db.query()
+          nres = query.execute("some query")
+          results = query.fetchmany(20)
+          for doc in results:
+              print(doc.url, doc.title)
+        
+
@@ -6564,8 +6598,6 @@ or connection to a Recoll index.

-

Methods

-
Db.close()
@@ -6628,8 +6660,6 @@ or execute index searches.

-

Methods

-
Query.sortby(fieldname, ascending=True)
@@ -6805,8 +6835,6 @@ or document contents.

-

Methods

-
get(key), [] operator
@@ -6854,8 +6882,6 @@ or detailed doc for now...

-

Methods

-
addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', @@ -6914,8 +6940,6 @@ or
-

Methods

-
Extractor(doc)
diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index f9f05ed5..a0e4754a 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -4354,23 +4354,27 @@ or Introduction &RCL; versions after 1.11 define a Python programming - interface, both for searching and indexing. The indexing - portion has seen little use, but the searching one is used - in the Recoll Ubuntu Unity Lens and Recoll Web UI. + interface, both for searching and indexing. - The API is inspired by the Python database API - specification. There were two major changes in recent &RCL; - versions: - - The basis for the &RCL; API changed from Python - database API version 1.0 (&RCL; versions up to 1.18.1), - to version 2.0 (&RCL; 1.18.2 and later). - The recoll module became a - package (with an internal recoll - module) as of &RCL; version 1.19, in order to add more - functions. For existing code, this only changes the way - the interface must be imported. - + The search interface is used in the Recoll Ubuntu Unity Lens + and Recoll WebUI. + + The indexing section of the API has seen little use, and is + more a proof of concept. In truth it is waiting for its killer + app... + + The search API is modeled along the Python database API + specification. There were two major changes along &RCL; versions: + + The basis for the &RCL; API changed from Python + database API version 1.0 (&RCL; versions up to 1.18.1), + to version 2.0 (&RCL; 1.18.2 and later). + The recoll module became a + package (with an internal recoll + module) as of &RCL; version 1.19, in order to add more + functions. For existing code, this only changes the way + the interface must be imported. + We will mostly describe the new API and package @@ -4392,13 +4396,33 @@ or - The normal &RCL; installer installs the Python - API along with the main code. + As of &RCL; 1.19, the module can be compiled for + Python3. + + The normal &RCL; installer installs the Python2 + API along with the main code. 
The Python3 version must be + explicitly built and installed. When installing from a repository, and depending on the - distribution, the Python API can sometimes be found in a - separate package. + distribution, the Python API can sometimes be found in a + separate package. + The following small sample will run a query and list + the title and url for each of the results. It would work with &RCL; + 1.19 and later. The python/samples source directory + contains several examples of Python programming with &RCL;, + exercising the extension more completely, and especially its data + extraction features. + + from recoll import recoll + + db = recoll.connect() + query = db.query() + nres = query.execute("some query") + results = query.fetchmany(20) + for doc in results: + print(doc.url, doc.title) + @@ -4460,7 +4484,6 @@ or a connect() call and holds a connection to a Recoll index. - Methods Db.close() Closes the connection. You can't do anything @@ -4511,7 +4534,6 @@ or execute index searches. - Methods Query.sortby(fieldname, ascending=True) @@ -4659,7 +4681,6 @@ or module for accessing document contents. - Methods get(key), [] operator @@ -4694,7 +4715,6 @@ or for now... - Methods addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', @@ -4729,7 +4749,6 @@ or The Extractor class - Methods Extractor(doc) diff --git a/website/index.html.en b/website/index.html.en index 06adf4d0..08489dba 100644 --- a/website/index.html.en +++ b/website/index.html.en @@ -71,12 +71,12 @@ interface.

Recoll will index an MS-Word document - stored as an attachment to an e-mail message inside - a Thunderbird folder archived in a Zip file (and - more...). It will also help you search for it with a friendly and - powerful interface, and let you open a copy of a PDF at the right - page with two clicks. There is little that will remain - hidden on your disk.

+ stored as an attachment to an e-mail message inside + a Thunderbird folder archived in a Zip file (and + more...). It will also help you search for it with a friendly and + powerful interface, and let you open a copy of a PDF at the right + page with two clicks. There is little that will remain + hidden on your disk.

Recoll has extensive documentation. If you run into a problem, or want to @@ -96,8 +96,16 @@

News

-
-
2016-04-18
Found a GUI +
+ +
2016-04-21
I experimented with installing + the Recoll + Web UI with Apache, and found out + that this + is really easy, actually both easier to set up and + more useful than running it standalone.
+ +
2016-04-18
Found a GUI crash bug with a reasonably easy workaround.
2016-04-14
Release 1.22.0 is now available from @@ -122,7 +130,6 @@ the known bugs page
2015-11-09
-
Recoll on MS-Windows
diff --git a/website/pages/Makefile b/website/pages/Makefile
index c41a0cac..420b997d 100644
--- a/website/pages/Makefile
+++ b/website/pages/Makefile
@@ -3,7 +3,8 @@
 .txt.html:
 	asciidoc $<
 
-all: recoll-mingw.html recoll-windows.html
+all: recoll-mingw.html recoll-windows.html \
+	recoll-webui-install-wsgi.html
 
 clean:
 	rm -f *.html
diff --git a/website/pages/recoll-webui-install-wsgi.txt b/website/pages/recoll-webui-install-wsgi.txt
new file mode 100644
index 00000000..237923eb
--- /dev/null
+++ b/website/pages/recoll-webui-install-wsgi.txt
@@ -0,0 +1,200 @@
+= Recoll WebUI Apache installation from scratch
+
+The https://github.com/koniu/recoll-webui[Recoll WebUI] offers an
+alternative, WEB-based, interface for querying a Recoll index.
+
+It can be quite useful to extend the use of a shared index to multiple
+workstations, without the need for a local Recoll installation and shared
+data storage.
+
+The Recoll WebUI is based on the
+http://bottlepy.org/docs/dev/index.html[Bottle Python framework], which has
+a built-in WEB server, and the simplest deployment approach is to run it
+standalone. However the built-in server is restricted to handling one
+request at a time, which is problematic in multi-user situations,
+especially because some requests, like extracting a result list into a CSV
+file, can take a significant amount of time.
+
+The Bottle framework can work with several multi-threading Python HTTP
+server libraries, but, given the limitations of the Recoll Python module
+and the Python interpreter itself, this will not yield optimal performance,
+and, in particular, can't efficiently leverage the now ubiquitous
+multiprocessors.
+
+In multi-user situations, you can get better performance and ease of use
+from the Recoll WebUI by running it under Apache rather than as a
+standalone process. With this approach, a few requests per second can
+easily be handled even in the presence of long-running ones.
+
+Neither Recoll nor the WebUI is optimized for high multi-user load, and it
+would be very unwise to use them as the search interface to a busy WEB
+site.
+
+The instructions about using the WebUI under Apache as given in the
+repository README are a bit terse, and are missing a few details,
+especially ones which impact performance.
+
+Here follows a synopsis of two WebUI installations on initially
+Apache-less Ubuntu (14.04) and DragonFly BSD systems. The first should
+extend easily to other Debian-based systems, the second at least to
+FreeBSD. rpm-based systems are left as an exercise to the reader, at least
+for now...
+
+
+CAUTION: THE CONFIGURATIONS DESCRIBED HAVE NO ACCESS CONTROL. ANYONE WITH
+ACCESS TO THE NETWORK WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY
+DOCUMENT.
+
+== On a Debian/Ubuntu system
+
+=== Install recoll
+
+    sudo apt-get install recoll python-recoll
+
+Configure the indexing and check that the normal search works (I spent
+quite a lot of time trying to understand why the WebUI did not work, when
+in fact it was the normal recoll configuration which was broken and the
+regular search did not work either).
+
+Take care to be logged in as the user you want to run the web search as
+while you do this.
+
+
+=== Install the WebUI
+
+Clone the github repository, or extract the master tar installation, and
+move it to '/var/www/recoll-webui-master/'. Take care that it is read/execute
+accessible by your user.
+
+=== Install Apache and mod-wsgi
+
+
+    sudo apt-get install apache2 libapache2-mod-wsgi
+
+I then got the following message:
+
+    AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
+
+To clear it, I added a ServerName directive to the apache config, maybe you
+won't need it. Edit '/etc/apache2/sites-available/000-default.conf' and add
+the following at the top (globally).
+Things work without this fix anyway;
+this is just to suppress the error message. You probably need to adjust the
+address or use a real host name:
+
+    ServerName 192.168.4.6
+
+
+Edit '/etc/apache2/mods-enabled/wsgi.conf', add the following at the end of
+the "IfModule" section.
+
+Change the user ('dockes' in the example) taking care that he is the one who
+owns the index ('.recoll' is in his home directory).
+
+    WSGIDaemonProcess recoll user=dockes group=dockes \
+        threads=1 processes=5 display-name=%{GROUP} \
+        python-path=/var/www/recoll-webui-master
+    WSGIScriptAlias /recoll /var/www/recoll-webui-master/webui-wsgi.py
+
+    WSGIProcessGroup recoll
+    Order allow,deny
+    allow from all
+
+
+NOTE: the Recoll WebUI application is mostly single-threaded, so it is of
+little use (and may actually be counter-productive in some cases) to
+specify multiple threads on the WSGIDaemonProcess line. Specify multiple
+processes instead to put multiple CPUs to work on simultaneous requests.
+
+
+Then run the following to restart apache:
+
+    sudo apachectl restart
+
+The Recoll WebUI should now be accessible on 'http://my.server.com/recoll/'
+
+NOTE: Take care that you need a '/' at the end of the URL used to access
+the search (use: 'http://my.server.com/recoll/', not
+'http://my.server.com/recoll'), else files other than the script itself are
+not found (the page looks weird and the search does not work).
+
+CAUTION: THERE IS NO ACCESS CONTROL. ANYONE WITH ACCESS TO THE NETWORK
+WHERE THE SERVER IS LOCATED CAN RETRIEVE ANY DOCUMENT.
+
+== Variant for BSD/ports
+
+=== Packages
+
+As root:
+
+    pkg install recoll
+
+
+Do what you need to do to configure the indexing and check that the normal
+search works.
+
+Take care to be logged in as the user you want to run the web search as
+while you do this.
+
+    pkg install apache24
+
+Add apache24_enable="YES" in /etc/rc.conf
+
+    pkg install ap24-mod_wsgi4
+    pkg install git
+
+=== Clone the webui repository
+
+    cd /usr/local/www/apache24/
+    git clone https://github.com/koniu/recoll-webui.git recoll-webui-master
+
+Important: most input handler helper applications (e.g. 'pdftotext') are
+installed in '/usr/local/bin' which is not in the PATH as seen by Apache
+(at least on DragonFly). The simplest way to fix this is to modify the
+launcher module for the webui app so that it fixes the PATH.
+
+Edit 'recoll-webui-master/webui-wsgi.py' and add the following line after
+the 'import os' line:
+
+    os.environ['PATH'] = os.environ['PATH'] + ':' + '/usr/local/bin'
+
+
+
+=== Configure apache
+
+Edit /usr/local/etc/apache24/modules.d/270_mod_wsgi.conf
+
+Uncomment the LoadModule line, and add the directives to alias /recoll/ to
+the webui script.
+
+Change the user (dockes in the example) taking care that he is the one who
+owns the index (.recoll is in his home directory).
+
+Contents of the file:
+
+    ## $FreeBSD$
+    ## vim: set filetype=apache:
+    ##
+    ## module file for mod_wsgi
+    ##
+    ## PROVIDE: mod_wsgi
+    ## REQUIRE:
+
+    LoadModule wsgi_module libexec/apache24/mod_wsgi.so
+
+    WSGIDaemonProcess recoll user=dockes group=dockes \
+        threads=1 processes=5 display-name=%{GROUP} \
+        python-path=/usr/local/www/apache24/recoll-webui-master/
+    WSGIScriptAlias /recoll /usr/local/www/apache24/recoll-webui-master/webui-wsgi.py
+
+
+    WSGIProcessGroup recoll
+    Require all granted
+
+
+=== Restart apache
+
+As root:
+
+    apachectl restart
+
+
diff --git a/website/pages/recoll-windows.txt b/website/pages/recoll-windows.txt
index a96238df..d4907848 100644
--- a/website/pages/recoll-windows.txt
+++ b/website/pages/recoll-windows.txt
@@ -91,7 +91,7 @@ Changes in 20160414
 - Fixed a bug which had the whole indexing stop if a script would time out on a specific file (it will very rarely happen that a pathologically bad file can throw an input handler in a loop).
--
+
 Changes in 20160317