<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
|
|
<html>
|
|
<head>
|
|
<meta name="generator" content=
|
|
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
|
|
<meta http-equiv="Content-Type" content=
|
|
"text/html; charset=us-ascii">
|
|
|
|
<title>Recoll user manual</title>
|
|
<link rel="stylesheet" type="text/css" href="docbook-xsl.css">
|
|
<meta name="generator" content="DocBook XSL Stylesheets V1.78.1">
|
|
<meta name="description" content=
|
|
"Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license can be found at the following location: GNU web site. This document introduces full text search notions and describes the installation and use of the Recoll application. This version describes Recoll 1.22.">
|
|
</head>
|
|
|
|
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084"
|
|
alink="#0000FF">
|
|
<div lang="en" class="book">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="idp57237872" id=
|
|
"idp57237872"></a>Recoll user manual</h1>
|
|
</div>
|
|
|
|
<div>
|
|
<div class="author">
|
|
<h3 class="author"><span class=
|
|
"firstname">Jean-Francois</span> <span class=
|
|
"surname">Dockes</span></h3>
|
|
|
|
<div class="affiliation">
|
|
<div class="address">
|
|
<p><code class="email">&lt;<a class="email" href=
"mailto:jfd@recoll.org">jfd@recoll.org</a>&gt;</code></p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div>
|
|
<p class="copyright">Copyright © 2005-2015
|
|
Jean-Francois Dockes</p>
|
|
</div>
|
|
|
|
<div>
|
|
<div class="abstract">
|
|
<p><code class="literal">Permission is granted to copy,
|
|
distribute and/or modify this document under the terms
|
|
of the GNU Free Documentation License, Version 1.3 or
|
|
any later version published by the Free Software
|
|
Foundation; with no Invariant Sections, no Front-Cover
|
|
Texts, and no Back-Cover Texts. A copy of the license
|
|
can be found at the following location: <a class=
|
|
"ulink" href="http://www.gnu.org/licenses/fdl.html"
|
|
target="_top">GNU web site</a>.</code></p>
|
|
|
|
<p>This document introduces full text search notions
|
|
and describes the installation and use of the
|
|
<span class="application">Recoll</span> application.
|
|
This version describes <span class=
|
|
"application">Recoll</span> 1.22.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<hr>
|
|
</div>
|
|
|
|
<div class="toc">
|
|
<p><b>Table of Contents</b></p>
|
|
|
|
<dl class="toc">
|
|
<dt><span class="chapter">1. <a href=
|
|
"#RCL.INTRODUCTION">Introduction</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">1.1. <a href=
|
|
"#RCL.INTRODUCTION.TRYIT">Giving it a
|
|
try</a></span></dt>
|
|
|
|
<dt><span class="sect1">1.2. <a href=
|
|
"#RCL.INTRODUCTION.SEARCH">Full text
|
|
search</a></span></dt>
|
|
|
|
<dt><span class="sect1">1.3. <a href=
|
|
"#RCL.INTRODUCTION.RECOLL">Recoll
|
|
overview</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="chapter">2. <a href=
|
|
"#RCL.INDEXING">Indexing</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">2.1. <a href=
|
|
"#RCL.INDEXING.INTRODUCTION">Introduction</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.1.1. <a href=
|
|
"#RCL.INDEXING.INTRODUCTION.MODES">Indexing
|
|
modes</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.1.2. <a href=
|
|
"#RCL.INDEXING.INTRODUCTION.CONFIG">Configurations,
|
|
multiple indexes</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.1.3. <a href=
|
|
"#idp63233312">Document types</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.1.4. <a href=
|
|
"#idp63252992">Indexing failures</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.1.5. <a href=
|
|
"#idp63260448">Recovery</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">2.2. <a href=
|
|
"#RCL.INDEXING.STORAGE">Index storage</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.2.1. <a href=
|
|
"#RCL.INDEXING.STORAGE.FORMAT"><span class=
|
|
"application">Xapian</span> index
|
|
formats</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.2.2. <a href=
|
|
"#RCL.INDEXING.STORAGE.SECURITY">Security
|
|
aspects</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">2.3. <a href=
|
|
"#RCL.INDEXING.CONFIG">Index
|
|
configuration</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.3.1. <a href=
|
|
"#RCL.INDEXING.CONFIG.MULTIPLE">Multiple
|
|
indexes</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.3.2. <a href=
|
|
"#RCL.INDEXING.CONFIG.SENS">Index case and
|
|
diacritics sensitivity</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.3.3. <a href=
|
|
"#RCL.INDEXING.CONFIG.GUI">The index configuration
|
|
GUI</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">2.4. <a href=
|
|
"#RCL.INDEXING.WEBQUEUE">Indexing WEB pages you
|
|
visit</a></span></dt>
|
|
|
|
<dt><span class="sect1">2.5. <a href=
|
|
"#RCL.INDEXING.EXTATTR">Extended attributes
|
|
data</a></span></dt>
|
|
|
|
<dt><span class="sect1">2.6. <a href=
|
|
"#RCL.INDEXING.EXTTAGS">Importing external
|
|
tags</a></span></dt>
|
|
|
|
<dt><span class="sect1">2.7. <a href=
|
|
"#RCL.INDEXING.PERIODIC">Periodic
|
|
indexing</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.7.1. <a href=
|
|
"#RCL.INDEXING.PERIODIC.EXEC">Running
|
|
indexing</a></span></dt>
|
|
|
|
<dt><span class="sect2">2.7.2. <a href=
|
|
"#RCL.INDEXING.PERIODIC.AUTOMAT">Using <span class=
|
|
"command"><strong>cron</strong></span> to automate
|
|
indexing</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">2.8. <a href=
|
|
"#RCL.INDEXING.MONITOR">Real time
|
|
indexing</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.8.1. <a href=
|
|
"#RCL.INDEXING.MONITOR.FASTFILES">Slowing down the
|
|
reindexing rate for fast changing
|
|
files</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="chapter">3. <a href=
|
|
"#RCL.SEARCH">Searching</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">3.1. <a href=
|
|
"#RCL.SEARCH.GUI">Searching with the Qt graphical user
|
|
interface</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.1.1. <a href=
|
|
"#RCL.SEARCH.GUI.SIMPLE">Simple
|
|
search</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.2. <a href=
|
|
"#RCL.SEARCH.GUI.RESLIST">The default result
|
|
list</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.3. <a href=
|
|
"#RCL.SEARCH.GUI.RESTABLE">The result
|
|
table</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.4. <a href=
|
|
"#RCL.SEARCH.GUI.RUNSCRIPT">Running arbitrary
|
|
commands on result files (1.20 and
|
|
later)</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.5. <a href=
|
|
"#RCL.SEARCH.GUI.THUMBNAILS">Displaying
|
|
thumbnails</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.6. <a href=
|
|
"#RCL.SEARCH.GUI.PREVIEW">The preview
|
|
window</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.7. <a href=
|
|
"#RCL.SEARCH.GUI.FRAGBUTS">The Query Fragments
|
|
window</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.8. <a href=
|
|
"#RCL.SEARCH.GUI.COMPLEX">Complex/advanced
|
|
search</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.9. <a href=
|
|
"#RCL.SEARCH.GUI.TERMEXPLORER">The term explorer
|
|
tool</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.10. <a href=
|
|
"#RCL.SEARCH.GUI.MULTIDB">Multiple
|
|
indexes</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.11. <a href=
|
|
"#RCL.SEARCH.GUI.HISTORY">Document
|
|
history</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.12. <a href=
|
|
"#RCL.SEARCH.GUI.SORT">Sorting search results and
|
|
collapsing duplicates</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.13. <a href=
|
|
"#RCL.SEARCH.GUI.TIPS">Search tips,
|
|
shortcuts</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.14. <a href=
|
|
"#RCL.SEARCH.SAVING">Saving and restoring queries
|
|
(1.21 and later)</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.1.15. <a href=
|
|
"#RCL.SEARCH.GUI.CUSTOM">Customizing the search
|
|
interface</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">3.2. <a href=
|
|
"#RCL.SEARCH.KIO">Searching with the KDE KIO
|
|
slave</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.2.1. <a href=
|
|
"#RCL.SEARCH.KIO.INTRO">What's this</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.2.2. <a href=
|
|
"#RCL.SEARCH.KIO.SEARCHABLEDOCS">Searchable
|
|
documents</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">3.3. <a href=
|
|
"#RCL.SEARCH.COMMANDLINE">Searching on the command
|
|
line</a></span></dt>
|
|
|
|
<dt><span class="sect1">3.4. <a href=
|
|
"#RCL.SEARCH.SYNONYMS">Using Synonyms
|
|
(1.22)</a></span></dt>
|
|
|
|
<dt><span class="sect1">3.5. <a href=
|
|
"#RCL.SEARCH.PTRANS">Path translations</a></span></dt>
|
|
|
|
<dt><span class="sect1">3.6. <a href=
|
|
"#RCL.SEARCH.LANG">The query language</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.6.1. <a href=
|
|
"#RCL.SEARCH.LANG.MODIFIERS">Modifiers</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">3.7. <a href=
|
|
"#RCL.SEARCH.CASEDIAC">Search case and diacritics
|
|
sensitivity</a></span></dt>
|
|
|
|
<dt><span class="sect1">3.8. <a href=
|
|
"#RCL.SEARCH.ANCHORWILD">Anchored searches and
|
|
wildcards</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.8.1. <a href=
|
|
"#RCL.SEARCH.WILDCARDS">More about
|
|
wildcards</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.8.2. <a href=
|
|
"#RCL.SEARCH.ANCHOR">Anchored
|
|
searches</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">3.9. <a href=
|
|
"#RCL.SEARCH.DESKTOP">Desktop
|
|
integration</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.9.1. <a href=
|
|
"#RCL.SEARCH.SHORTCUT">Hotkeying
|
|
recoll</a></span></dt>
|
|
|
|
<dt><span class="sect2">3.9.2. <a href=
|
|
"#RCL.KICKER-APPLET">The KDE Kicker Recoll
|
|
applet</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="chapter">4. <a href=
|
|
"#RCL.PROGRAM">Programming interface</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">4.1. <a href=
|
|
"#RCL.PROGRAM.FILTERS">Writing a document input
|
|
handler</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">4.1.1. <a href=
|
|
"#RCL.PROGRAM.FILTERS.SIMPLE">Simple input
|
|
handlers</a></span></dt>
|
|
|
|
<dt><span class="sect2">4.1.2. <a href=
|
|
"#RCL.PROGRAM.FILTERS.MULTIPLE">"Multiple"
|
|
handlers</a></span></dt>
|
|
|
|
<dt><span class="sect2">4.1.3. <a href=
|
|
"#RCL.PROGRAM.FILTERS.ASSOCIATION">Telling
|
|
<span class="application">Recoll</span> about the
|
|
handler</a></span></dt>
|
|
|
|
<dt><span class="sect2">4.1.4. <a href=
|
|
"#RCL.PROGRAM.FILTERS.HTML">Input handler HTML
|
|
output</a></span></dt>
|
|
|
|
<dt><span class="sect2">4.1.5. <a href=
|
|
"#RCL.PROGRAM.FILTERS.PAGES">Page
|
|
numbers</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">4.2. <a href=
|
|
"#RCL.PROGRAM.FIELDS">Field data
|
|
processing</a></span></dt>
|
|
|
|
<dt><span class="sect1">4.3. <a href=
|
|
"#RCL.PROGRAM.API">API</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">4.3.1. <a href=
|
|
"#RCL.PROGRAM.API.ELEMENTS">Interface
|
|
elements</a></span></dt>
|
|
|
|
<dt><span class="sect2">4.3.2. <a href=
|
|
"#RCL.PROGRAM.API.PYTHON">Python
|
|
interface</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="chapter">5. <a href=
|
|
"#RCL.INSTALL">Installation and
|
|
configuration</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">5.1. <a href=
|
|
"#RCL.INSTALL.BINARY">Installing a binary
|
|
copy</a></span></dt>
|
|
|
|
<dt><span class="sect1">5.2. <a href=
|
|
"#RCL.INSTALL.EXTERNAL">Supporting
|
|
packages</a></span></dt>
|
|
|
|
<dt><span class="sect1">5.3. <a href=
|
|
"#RCL.INSTALL.BUILDING">Building from
|
|
source</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">5.3.1. <a href=
|
|
"#RCL.INSTALL.BUILDING.PREREQS">Prerequisites</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.3.2. <a href=
|
|
"#RCL.INSTALL.BUILDING.BUILD">Building</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.3.3. <a href=
|
|
"#RCL.INSTALL.BUILDING.INSTALL">Installation</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
|
|
<dt><span class="sect1">5.4. <a href=
|
|
"#RCL.INSTALL.CONFIG">Configuration
|
|
overview</a></span></dt>
|
|
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">5.4.1. <a href=
|
|
"#RCL.INSTALL.CONFIG.ENVIR">Environment
|
|
variables</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.2. <a href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF">The main
|
|
configuration file, recoll.conf</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.3. <a href=
|
|
"#RCL.INSTALL.CONFIG.FIELDS">The fields
|
|
file</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.4. <a href=
|
|
"#RCL.INSTALL.CONFIG.MIMEMAP">The mimemap
|
|
file</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.5. <a href=
|
|
"#RCL.INSTALL.CONFIG.MIMECONF">The mimeconf
|
|
file</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.6. <a href=
|
|
"#RCL.INSTALL.CONFIG.MIMEVIEW">The mimeview
|
|
file</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.7. <a href=
|
|
"#RCL.INSTALL.CONFIG.PTRANS">The <code class=
|
|
"filename">ptrans</code> file</a></span></dt>
|
|
|
|
<dt><span class="sect2">5.4.8. <a href=
|
|
"#RCL.INSTALL.CONFIG.EXAMPLES">Examples of
|
|
configuration adjustments</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.INTRODUCTION" id=
|
|
"RCL.INTRODUCTION"></a>Chapter 1. Introduction</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>This document introduces full text search notions and
|
|
describes the installation and use of the <span class=
|
|
"application">Recoll</span> application. This version
|
|
describes <span class="application">Recoll</span> 1.22.</p>
|
|
|
|
<p><span class="application">Recoll</span> was for a long
time dedicated to Unix-like systems. It was ported to
<span class="application">MS-Windows</span> only recently (2015).
Many references in this manual, especially file locations,
are specific to Unix, and not valid on <span class=
"application">Windows</span>. Some of the features described are
also not available on <span class=
"application">Windows</span>. The manual will be
progressively updated. Until this happens, most references to
shared files can be translated by looking under the Recoll
installation directory (especially the <code class=
"filename">Share</code> subdirectory). The user configuration
is stored by default under <code class=
"filename">AppData/Local/Recoll</code> inside the user
directory, along with the index itself.</p>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INTRODUCTION.TRYIT" id=
|
|
"RCL.INTRODUCTION.TRYIT"></a>1.1. Giving it a
|
|
try</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>If you do not like reading manuals (who does?) but wish
|
|
to give <span class="application">Recoll</span> a try, just
|
|
<a class="link" href="#RCL.INSTALL.BINARY" title=
|
|
"5.1. Installing a binary copy">install</a> the
|
|
application and start the <span class=
|
|
"command"><strong>recoll</strong></span> graphical user
|
|
interface (GUI), which will ask permission to index your
|
|
home directory by default, allowing you to search
|
|
immediately after indexing completes.</p>
|
|
|
|
<p>Do not do this if your home directory contains a huge
|
|
number of documents and you do not want to wait or are very
|
|
short on disk space. In this case, you may first want to
|
|
customize the <a class="link" href="#RCL.INDEXING.CONFIG"
|
|
title="2.3. Index configuration">configuration</a> to
|
|
restrict the indexed area (for the very impatient with a
|
|
completed package install, from the <span class=
|
|
"command"><strong>recoll</strong></span> GUI: <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Indexing configuration</span>, then adjust
|
|
the <span class="guilabel">Top directories</span>
|
|
section).</p>
|
|
|
|
<p>Also be aware that, on Unix/Linux, you may need to
|
|
install the appropriate <a class="link" href=
|
|
"#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">supporting applications</a>
|
|
for document types that need them (for example <span class=
|
|
"application">antiword</span> for <span class=
|
|
"application">Microsoft Word</span> files).</p>
|
|
|
|
<p>The <span class="application">Recoll</span> installation
|
|
for <span class="application">Windows</span> is
|
|
self-contained and includes most useful auxiliary programs.
|
|
You will just need to install Python 2.7.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INTRODUCTION.SEARCH" id=
|
|
"RCL.INTRODUCTION.SEARCH"></a>1.2. Full text
|
|
search</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> is a full text
|
|
search application, which means that it finds your data by
|
|
content rather than by external attributes (like the file
|
|
name). You specify words (terms) which should or should not
|
|
appear in the text you are looking for, and receive in
|
|
return a list of matching documents, ordered so that the
|
|
most <span class="emphasis"><em>relevant</em></span>
|
|
documents will appear first.</p>
|
|
|
|
<p>You do not need to remember in what file or email
|
|
message you stored a given piece of information. You just
|
|
ask for related terms, and the tool will return a list of
|
|
documents where these terms are prominent, in a similar way
|
|
to Internet search engines.</p>
|
|
|
|
<p>Full text search applications try to determine which
|
|
documents are most relevant to the search terms you
|
|
provide. Computer algorithms for determining relevance can
|
|
be very complex, and in general are inferior to the power
|
|
of the human mind to rapidly determine relevance. The
|
|
quality of relevance guessing is probably the most
|
|
important aspect when evaluating a search application.</p>
|
|
|
|
<p>In many cases, you are looking for all the forms of a
|
|
word, including plurals, different tenses for a verb, or
|
|
terms derived from the same root or <span class=
|
|
"emphasis"><em>stem</em></span> (example: <em class=
|
|
"replaceable"><code>floor, floors, floored,
|
|
flooring...</code></em>). Queries are usually automatically
|
|
expanded to all such related terms (words that reduce to
|
|
the same stem). This can be prevented for searching for a
|
|
specific form.</p>
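<p>As an illustration of the query expansion described above, the
following Python sketch groups index terms by stem and expands a
query term to every term sharing its stem. The <code>toy_stem</code>
function and its suffix list are hypothetical simplifications written
for this example; they are not the stemming algorithm actually used,
which depends on the rules of each language.</p>
<pre class="programlisting">
```python
from collections import defaultdict

# Toy suffix list, for illustration only; real stemmers apply
# language-specific rules that are far more careful than this.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Lower-case the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_expansion_table(index_terms):
    """Map each stem to the set of index terms that reduce to it."""
    table = defaultdict(set)
    for term in index_terms:
        table[toy_stem(term)].add(term.lower())
    return table


def expand(query_term, table, use_stemming=True):
    """Return the sorted list of terms to search for."""
    if not use_stemming:
        return [query_term.lower()]
    related = table.get(toy_stem(query_term), set())
    return sorted(related | {query_term.lower()})


terms = ["floor", "floors", "floored", "flooring", "flood"]
table = build_expansion_table(terms)
print(expand("floors", table))
print(expand("floors", table, use_stemming=False))
```
</pre>
<p>Passing <code>use_stemming=False</code> mirrors the case where
expansion is prevented in order to search for one specific form.</p>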
|
|
|
|
<p>Stemming, by itself, does not accommodate
misspellings or phonetic searches. A full text search
|
|
application may also support this form of approximation.
|
|
For example, a search for <em class=
|
|
"replaceable"><code>aliterattion</code></em> returning no
|
|
result may propose, depending on index contents, <em class=
|
|
"replaceable"><code>alliteration alteration alterations
|
|
altercation</code></em> as possible replacement terms.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INTRODUCTION.RECOLL" id=
|
|
"RCL.INTRODUCTION.RECOLL"></a>1.3. Recoll
|
|
overview</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> uses the
|
|
<a class="ulink" href="http://www.xapian.org" target=
|
|
"_top"><span class="application">Xapian</span></a>
|
|
information retrieval library as its storage and retrieval
|
|
engine. <span class="application">Xapian</span> is a very
|
|
mature package using <a class="ulink" href=
|
|
"http://www.xapian.org/docs/intro_ir.html" target="_top">a
|
|
sophisticated probabilistic ranking model</a>.</p>
|
|
|
|
<p>The <span class="application">Xapian</span> library
|
|
manages an index database which describes where terms
|
|
appear in your document files. It efficiently processes the
|
|
complex queries which are produced by the <span class=
|
|
"application">Recoll</span> query expansion mechanism, and
|
|
is in charge of the all-important relevance computation
|
|
task.</p>
|
|
|
|
<p><span class="application">Recoll</span> provides the
|
|
mechanisms and interface to get data into and out of the
|
|
index. This includes translating the many possible document
|
|
formats into pure text, handling term variations (using
|
|
<span class="application">Xapian</span> stemmers), and
|
|
spelling approximations (using the <span class=
|
|
"application">aspell</span> speller), interpreting user
|
|
queries and presenting results.</p>
|
|
|
|
<p>In short, <span class=
"application">Recoll</span> does the dirty footwork, while
<span class="application">Xapian</span> deals with the
intelligent parts of the process.</p>
|
|
|
|
<p>The <span class="application">Xapian</span> index can be
|
|
big (roughly the size of the original document set), but it
|
|
is not a document archive. <span class=
|
|
"application">Recoll</span> can only display documents that
|
|
still exist at the place from which they were indexed.
|
|
(Actually, there is a way to reconstruct a document from
|
|
the information in the index, but the result is not nice,
|
|
as all formatting, punctuation and capitalization are
|
|
lost).</p>
|
|
|
|
<p><span class="application">Recoll</span> stores all
|
|
internal data in <span class="application">Unicode
|
|
UTF-8</span> format, and it can index files of many types
|
|
with different character sets, encodings, and languages
|
|
into the same index. It can process documents embedded
|
|
inside other documents (for example a pdf document stored
|
|
inside a Zip archive sent as an email attachment...), down
|
|
to an arbitrary depth.</p>
|
|
|
|
<p>Stemming is the process by which <span class=
|
|
"application">Recoll</span> reduces words to their radicals
|
|
so that searching does not depend, for example, on a word
|
|
being singular or plural (floor, floors), or on a verb
|
|
tense (flooring, floored). Because the mechanisms used for
|
|
stemming depend on the specific grammatical rules for each
|
|
language, there is a separate <span class=
|
|
"application">Xapian</span> stemmer module for most common
|
|
languages where stemming makes sense.</p>
|
|
|
|
<p><span class="application">Recoll</span> stores the
|
|
unstemmed versions of terms in the main index and uses
|
|
auxiliary databases for term expansion (one for each
|
|
stemming language), which means that you can switch
|
|
stemming languages between searches, or add a language
|
|
without needing a full reindex.</p>
|
|
|
|
<p>Storing documents written in different languages in the
|
|
same index is possible, and commonly done. In this
|
|
situation, you can specify several stemming languages for
|
|
the index.</p>
|
|
|
|
<p><span class="application">Recoll</span> currently makes
|
|
no attempt at automatic language recognition, which means
|
|
that the stemmer will sometimes be applied to terms from
|
|
other languages with potentially strange results. In
practice, even if this introduces possibilities of
confusion, this approach has proven quite useful, and
|
|
it is much less cumbersome than separating your documents
|
|
according to what language they are written in.</p>
|
|
|
|
<p>By default, <span class="application">Recoll</span>
|
|
strips most accents and diacritics from terms, and converts
|
|
them to lower case before either storing them in the index
|
|
or searching for them. As a consequence, it is impossible
|
|
to search for a particular capitalization of a term
|
|
(<code class="literal">US</code> / <code class=
|
|
"literal">us</code>), or to discriminate two terms based on
|
|
diacritics (<code class="literal">sake</code> /
|
|
<code class="literal">saké</code>, <code class=
|
|
"literal">mate</code> / <code class=
|
|
"literal">maté</code>).</p>
|
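<p>The effect of this default processing can be approximated in a
few lines of Python. This is only a sketch, under the assumption that
folding amounts to removing Unicode combining marks and converting to
lower case; the <code>fold_term</code> name is made up for the
example, and the real processing strips most, not necessarily all,
diacritics.</p>
<pre class="programlisting">
```python
import unicodedata


def fold_term(term):
    """Strip combining marks, then lower-case the result."""
    # NFD decomposition splits an accented letter into its base
    # letter followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


# "\u00e9" is the letter e with an acute accent.
for raw in ("US", "us", "sake", "sak\u00e9", "mate", "mat\u00e9"):
    print(raw, "folds to", fold_term(raw))
```
</pre>
<p>Because both members of each pair fold to the same string, an
index holding only folded terms cannot tell them apart.</p>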
|
|
|
<p><span class="application">Recoll</span> versions 1.18
|
|
and newer can optionally store the raw terms, without
|
|
accent stripping or case conversion. In this configuration,
|
|
default searches will behave as before, but it is possible
|
|
to perform searches sensitive to case and diacritics. This
|
|
is described in more detail in the <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.SENS" title=
|
|
"2.3.2. Index case and diacritics sensitivity">section
|
|
about index case and diacritics sensitivity</a>.</p>
|
|
|
|
<p><span class="application">Recoll</span> has many
|
|
parameters which define exactly what to index, and how to
|
|
classify and decode the source documents. These are kept in
|
|
<a class="link" href="#RCL.INDEXING.CONFIG" title=
|
|
"2.3. Index configuration">configuration files</a>. A
|
|
default configuration is copied into a standard location
|
|
(usually something like <code class=
|
|
"filename">/usr/share/recoll/examples</code>) during
|
|
installation. The default values set by the configuration
|
|
files in this directory may be overridden by values set
|
|
inside your personal configuration, found by default in the
|
|
<code class="filename">.recoll</code> sub-directory of your
|
|
home directory. The default configuration will index your
|
|
home directory with default parameters and should be
|
|
sufficient for giving <span class=
|
|
"application">Recoll</span> a try, but you may want to
|
|
adjust it later, which can be done either by editing the
|
|
text files or by using configuration menus in the
|
|
<span class="command"><strong>recoll</strong></span> GUI.
|
|
Some other parameters affecting only the <span class=
|
|
"command"><strong>recoll</strong></span> GUI are stored in
|
|
the standard location defined by <span class=
|
|
"application">Qt</span>.</p>
|
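<p>The way personal values take precedence over the installed
defaults can be pictured with the following Python sketch. It assumes
a simplified file format made only of <code>name = value</code>
lines, and the parameter names and values shown are examples chosen
for illustration, not a description of the shipped files.</p>
<pre class="programlisting">
```python
from collections import ChainMap


def parse_conf(text):
    """Parse simple 'name = value' lines, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip()
    return values


# Hypothetical file contents, for illustration only.
system_defaults = parse_conf("""
# installed defaults
topdirs = ~
indexstemminglanguages = english
""")
personal = parse_conf("""
topdirs = ~/Documents ~/Mail
""")

# Lookups try the personal values first, then fall back to the defaults.
effective = ChainMap(personal, system_defaults)
print(effective["topdirs"])
print(effective["indexstemminglanguages"])
```
</pre>
<p>Only the values present in the personal configuration hide the
defaults; every other parameter keeps its installed value.</p>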
|
|
|
<p>The <a class="link" href="#RCL.INDEXING.PERIODIC.EXEC"
|
|
title="2.7.1. Running indexing">indexing process</a>
|
|
is started automatically the first time you execute the
|
|
<span class="command"><strong>recoll</strong></span> GUI.
|
|
Indexing can also be performed by executing the
|
|
<span class="command"><strong>recollindex</strong></span>
|
|
command. <span class="application">Recoll</span> indexing
|
|
is multithreaded by default when appropriate hardware
|
|
resources are available, and can run several tasks in
parallel, among text extraction, segmentation and
index updates.</p>
|
|
|
|
<p><a class="link" href="#RCL.SEARCH" title=
|
|
"Chapter 3. Searching">Searches</a> are usually
|
|
performed inside the <span class=
|
|
"command"><strong>recoll</strong></span> GUI, which has
|
|
many options to help you find what you are looking for.
|
|
However, there are other ways to perform <span class=
|
|
"application">Recoll</span> searches: mostly a <a class=
|
|
"link" href="#RCL.SEARCH.COMMANDLINE" title=
|
|
"3.3. Searching on the command line">command line
|
|
interface</a>, a <a class="link" href=
|
|
"#RCL.PROGRAM.API.PYTHON" title=
|
|
"4.3.2. Python interface"><span class=
|
|
"application">Python</span> programming interface</a>, a
|
|
<a class="link" href="#RCL.SEARCH.KIO" title=
|
|
"3.2. Searching with the KDE KIO slave"><span class=
|
|
"application">KDE</span> KIO slave module</a>, and Ubuntu
|
|
Unity <a class="ulink" href=
|
|
"https://bitbucket.org/medoc/unity-lens-recoll" target=
|
|
"_top">Lens</a> (for older versions) or <a class="ulink"
|
|
href="https://bitbucket.org/medoc/unity-scope-recoll"
|
|
target="_top">Scope</a> (for current versions) modules.</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.INDEXING" id=
|
|
"RCL.INDEXING"></a>Chapter 2. Indexing</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.INTRODUCTION" id=
|
|
"RCL.INDEXING.INTRODUCTION"></a>2.1. Introduction</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Indexing is the process by which the set of documents is
|
|
analyzed and the data entered into the database.
|
|
<span class="application">Recoll</span> indexing is
|
|
normally incremental: documents will only be processed if
|
|
they have been modified since the last run. On the first
|
|
execution, all documents will need processing. A full index
|
|
build can be forced later by specifying an option to the
|
|
indexing command (<span class=
|
|
"command"><strong>recollindex</strong></span> <code class=
|
|
"option">-z</code> or <code class="option">-Z</code>).</p>
|
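<p>The incremental behaviour can be sketched as follows in Python.
The example assumes that a file counts as modified when its
modification time or size differs from what was recorded on the
previous pass; this criterion, and the <code>needs_indexing</code>
and <code>index_pass</code> names, are assumptions made for the
illustration rather than a description of the actual implementation.
The <code>force</code> argument plays the role of a forced full
build.</p>
<pre class="programlisting">
```python
import os
import tempfile


def file_signature(path):
    """Return the (modification time, size) pair used to detect changes."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def needs_indexing(path, recorded):
    """Return True if the file is new or changed since it was recorded."""
    return recorded.get(path) != file_signature(path)


def index_pass(paths, recorded, force=False):
    """Process new or changed files; return the list of those processed."""
    processed = []
    for path in paths:
        if force or needs_indexing(path, recorded):
            # A real indexer would extract and store the text here.
            recorded[path] = file_signature(path)
            processed.append(path)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    doc = os.path.join(tmp, "note.txt")
    with open(doc, "w") as f:
        f.write("first version")
    recorded = {}
    first = index_pass([doc], recorded)
    second = index_pass([doc], recorded)
    with open(doc, "w") as f:
        f.write("second, longer version")
    third = index_pass([doc], recorded)
    forced = index_pass([doc], recorded, force=True)
    print(len(first), len(second), len(third), len(forced))
```
</pre>
<p>With the sequence above, the file is processed on the first pass,
skipped on the second, processed again after it changes, and
processed once more when the pass is forced.</p>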
|
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span> skips files
|
|
which caused an error during a previous pass. This is a
|
|
performance optimization, and a new behaviour in version
|
|
1.21 (failed files were always retried by previous
|
|
versions). The command line option <code class=
|
|
"option">-k</code> can be set to retry failed files, for
|
|
example after updating a filter.</p>
|
|
|
|
<p>The following sections give an overview of different
|
|
aspects of the indexing processes and configuration, with
|
|
links to detailed sections.</p>
|
|
|
|
<p>Depending on your data, temporary files may be needed
|
|
during indexing, some of them possibly quite big. You can
|
|
use the <code class="envar">RECOLL_TMPDIR</code> or
|
|
<code class="envar">TMPDIR</code> environment variables to
|
|
determine where they are created (the default is to use
|
|
<code class="filename">/tmp</code>). Using <code class=
|
|
"envar">TMPDIR</code> has the nice property that it may
|
|
also be taken into account by auxiliary commands executed
|
|
by <span class=
|
|
"command"><strong>recollindex</strong></span>.</p>
|
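<p>When indexing is launched from a script, the options and
environment variables mentioned in this section can be assembled as
in the Python sketch below. The <code>recollindex_command</code>
helper and the <code>/var/tmp</code> directory are hypothetical; the
function only builds the argument list and environment, and does not
run anything.</p>
<pre class="programlisting">
```python
import os


def recollindex_command(rebuild=False, retry_failed=False, tmpdir=None):
    """Return (argv, env) for a recollindex run, without executing it."""
    argv = ["recollindex"]
    if rebuild:
        argv.append("-z")  # force a full index build
    if retry_failed:
        argv.append("-k")  # retry files that failed in a previous pass
    env = dict(os.environ)
    if tmpdir is not None:
        # TMPDIR may also be taken into account by auxiliary commands.
        env["TMPDIR"] = tmpdir
    return argv, env


argv, env = recollindex_command(retry_failed=True, tmpdir="/var/tmp")
print(" ".join(argv))
# To actually start indexing, one could then call:
#   subprocess.run(argv, env=env, check=True)
```
</pre>
<p>Keeping the construction separate from the execution makes it easy
to check the command before starting a long indexing run.</p>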
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.INTRODUCTION.MODES" id=
|
|
"RCL.INDEXING.INTRODUCTION.MODES"></a>2.1.1. Indexing
|
|
modes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> indexing can
be performed in two different modes:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
|
|
title="2.7. Periodic indexing">Periodic (or
|
|
batch) indexing:</a> </b>indexing takes place
|
|
at discrete times, by executing the <span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
command. The typical usage is to have a nightly
|
|
indexing run <a class="link" href=
|
|
"#RCL.INDEXING.PERIODIC.AUTOMAT" title=
|
|
"2.7.2. Using cron to automate indexing">programmed</a>
|
|
into your <span class=
|
|
"command"><strong>cron</strong></span> file.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b><a class="link" href="#RCL.INDEXING.MONITOR"
|
|
title="2.8. Real time indexing">Real time
|
|
indexing:</a> </b>indexing takes place as soon
|
|
as a file is created or changed. <span class=
|
|
"command"><strong>recollindex</strong></span> runs
|
|
as a daemon and uses a file system alteration
|
|
monitor such as <span class=
|
|
"application">inotify</span>, <span class=
|
|
"application">Fam</span> or <span class=
|
|
"application">Gamin</span> to detect file
|
|
changes.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>The choice between the two methods is mostly a matter
|
|
of preference, and they can be combined by setting up
|
|
multiple indexes (for example, use periodic indexing on a big
|
|
documentation directory, and real time indexing on a
|
|
small home directory). Monitoring a big file system tree
|
|
can consume significant system resources.</p>
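<p>As a minimal sketch of the two modes (the 03:00 schedule is an arbitrary choice for illustration, and the default configuration directory is assumed), a nightly batch run could be a <span class="command"><strong>cron</strong></span> table entry, while real time indexing is started by running <span class="command"><strong>recollindex</strong></span> with the <code class="option">-m</code> option:</p>
<pre class="programlisting">
# crontab entry: batch indexing pass every night at 03:00
0 3 * * * recollindex

# command line: start the real time (monitoring) indexer
recollindex -m
</pre>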
|
|
|
|
<p>The choice of method and the parameters used can be
|
|
configured from the <span class=
|
|
"command"><strong>recoll</strong></span> GUI:
|
|
<span class="guimenu">Preferences</span> →
|
|
<span class="guimenuitem">Indexing schedule</span></p>
|
|
|
|
<p>The <span class="guimenu">File</span> menu also has
|
|
entries to start or stop the current indexing operation.
|
|
Stopping indexing is performed by killing the
|
|
<span class="command"><strong>recollindex</strong></span>
|
|
process, which will checkpoint its state and exit. A
|
|
later restart of indexing will mostly resume from where
|
|
things stopped (the file tree walk has to be restarted
|
|
from the beginning).</p>
|
|
|
|
<p>When the real time indexer is running, only a stop
|
|
operation is available from the menu. When no indexing is
|
|
running, you have a choice of updating the index or
|
|
rebuilding it (the first choice only processes changed
|
|
files, the second one zeroes the index before starting so
|
|
that all files are processed).</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.INTRODUCTION.CONFIG" id=
|
|
"RCL.INDEXING.INTRODUCTION.CONFIG"></a>2.1.2. Configurations,
|
|
multiple indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The parameters describing what is to be indexed and
|
|
local preferences are defined in text files contained in
|
|
a <a class="link" href="#RCL.INDEXING.CONFIG" title=
|
|
"2.3. Index configuration">configuration
|
|
directory</a>.</p>
|
|
|
|
<p>All parameters have defaults, defined in system-wide
|
|
files.</p>
|
|
|
|
<p>Without further configuration, <span class=
|
|
"application">Recoll</span> will index all appropriate
|
|
files from your home directory, with a reasonable set of
|
|
defaults.</p>
|
|
|
|
<p>A default personal configuration directory
|
|
(<code class="filename">$HOME/.recoll/</code>) is created
|
|
when a <span class="application">Recoll</span> program is
|
|
first executed. It is possible to create other
|
|
configuration directories, and use them by setting the
|
|
<code class="envar">RECOLL_CONFDIR</code> environment
|
|
variable, or giving the <code class="option">-c</code>
|
|
option to any of the <span class=
|
|
"application">Recoll</span> commands.</p>
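<p>For illustration, the two ways of selecting a configuration directory could look as follows. The <code class="filename">~/.recoll-docs</code> directory name is hypothetical; the directory has to exist, and its <code class="filename">recoll.conf</code> would normally be edited to choose what gets indexed:</p>
<pre class="programlisting">
# create the (hypothetical) additional configuration directory
mkdir ~/.recoll-docs

# select it with the -c option
recollindex -c ~/.recoll-docs

# or select it with the environment variable
RECOLL_CONFDIR=$HOME/.recoll-docs recollindex
</pre>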
|
|
|
|
<p>In some cases, it may be useful to index
|
|
different areas of the file system to separate databases.
|
|
You can do this by using multiple configuration
|
|
directories, each indexing a file system area to a
|
|
specific database. Typically, this would be done to
|
|
separate personal and shared indexes, or to take
|
|
advantage of the organization of your data to improve
|
|
search precision.</p>
|
|
|
|
<p>The generated indexes can be queried concurrently in a
|
|
transparent manner.</p>
|
|
|
|
<p>For index generation, multiple configurations are
|
|
totally independent of each other. When multiple
|
|
indexes need to be used for a single search, <a class=
|
|
"link" href="#RCL.INDEXING.CONFIG.MULTIPLE" title=
|
|
"2.3.1. Multiple indexes">some parameters should be
|
|
consistent among the configurations</a>.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="idp63233312" id=
|
|
"idp63233312"></a>2.1.3. Document types</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> knows about
|
|
quite a few different document types. The parameters for
|
|
document types recognition and processing are set in
|
|
<a class="link" href="#RCL.INDEXING.CONFIG" title=
|
|
"2.3. Index configuration">configuration
|
|
files</a>.</p>
|
|
|
|
<p>Most file types, like HTML or word processing files,
|
|
only hold one document. Some file types, like email
|
|
folders or zip archives, can hold many individually
|
|
indexed documents, which may themselves be compound ones.
|
|
Such hierarchies can go quite deep, and <span class=
|
|
"application">Recoll</span> can process, for example, a
|
|
<span class="application">LibreOffice</span> document
|
|
stored as an attachment to an email message inside an
|
|
email folder archived in a zip file...</p>
|
|
|
|
<p><span class="application">Recoll</span> indexing
|
|
processes plain text, HTML, OpenDocument
|
|
(Open/LibreOffice), email formats, and a few others
|
|
internally.</p>
|
|
|
|
<p>Other file types (e.g. PostScript, PDF, MS Word, RTF
|
|
...) need external applications for preprocessing. The
|
|
list is in the <a class="link" href=
|
|
"#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">installation</a> section.
|
|
After every indexing operation, <span class=
|
|
"application">Recoll</span> updates a list of commands
|
|
that would be needed for indexing existing file types.
|
|
This list can be displayed by selecting the menu option
|
|
<span class="guimenu">File</span> → <span class=
|
|
"guimenuitem">Show Missing Helpers</span> in the
|
|
<span class="command"><strong>recoll</strong></span> GUI.
|
|
It is stored in the <code class="filename">missing</code>
|
|
text file inside the configuration directory.</p>
|
|
|
|
<p>By default, <span class="application">Recoll</span>
|
|
will try to index any file type that it has a way to
|
|
read. This is sometimes not desirable, and there are ways
|
|
to either exclude some types, or on the contrary to
|
|
define a positive list of types to be indexed. In the
|
|
latter case, any type not in the list will be
|
|
ignored.</p>
|
|
|
|
<p>Excluding types can be done by adding wildcard name
|
|
patterns to the <code class="literal">skippedNames</code>
|
|
list, which can be done from the GUI Index configuration
|
|
menu. For versions 1.20 and later, you can alternatively
|
|
set the <code class="literal">excludedmimetypes</code>
|
|
list in the configuration file. This can be redefined for
|
|
subdirectories.</p>
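<p>As a sketch, an exclusion list and its redefinition for a subdirectory might look like this in the configuration file. The MIME types and the path are chosen only for illustration:</p>
<pre class="programlisting">
# skip these MIME types everywhere (illustrative choice)
excludedmimetypes = image/jpeg audio/mpeg

[/path/to/my/dir]
# in this subtree, skip zip archives instead
excludedmimetypes = application/zip
</pre>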
|
|
|
|
<p>You can also define an exclusive list of MIME types to
|
|
be indexed (no others will be indexed), by setting the
|
|
<code class="literal">indexedmimetypes</code>
|
|
configuration variable. Example:</p>
|
|
<pre class="programlisting">
|
|
indexedmimetypes = text/html application/pdf
|
|
|
|
</pre>
|
|
|
|
<p>It is possible to redefine this parameter for
|
|
subdirectories. Example:</p>
|
|
<pre class="programlisting">
|
|
[/path/to/my/dir]
|
|
indexedmimetypes = application/pdf
|
|
|
|
</pre>
|
|
|
|
<p>(When using sections like this, don't forget that they
|
|
remain in effect until the end of the file or another
|
|
section indicator).</p>
|
|
|
|
<p><code class="literal">excludedmimetypes</code> or
|
|
<code class="literal">indexedmimetypes</code> can be set
|
|
either by editing the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF" title=
|
|
"5.4.2. The main configuration file, recoll.conf">main
|
|
configuration file (<code class=
|
|
"filename">recoll.conf</code>)</a>, or from the GUI index
|
|
configuration tool.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="idp63252992" id=
|
|
"idp63252992"></a>2.1.4. Indexing
|
|
failures</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Indexing may fail for some documents, for a number of
|
|
reasons: a helper program may be missing, the document
|
|
may be corrupt, we may fail to uncompress a file because
|
|
no file system space is available, etc.</p>
|
|
|
|
<p><span class="application">Recoll</span> versions prior
|
|
to 1.21 always retried indexing files which had
|
|
previously caused an error. This guaranteed that anything
|
|
that may have become indexable (for example because a
|
|
helper had been installed) would be indexed. However this
|
|
was bad for performance because some indexing failures
|
|
may be quite costly (for example failing to uncompress a
|
|
big file because of insufficient disk space).</p>
|
|
|
|
<p>The indexer in <span class="application">Recoll</span>
|
|
versions 1.21 and later does not retry failed files by
|
|
default. Retrying will only occur if an explicit option
|
|
(<code class="option">-k</code>) is set on the
|
|
<span class="command"><strong>recollindex</strong></span>
|
|
command line, or if a script executed when <span class=
|
|
"command"><strong>recollindex</strong></span> starts up
|
|
says so. The script is defined by a configuration
|
|
variable (<code class=
|
|
"literal">checkneedretryindexscript</code>), and makes a
|
|
rather lame attempt at deciding if a helper command may
|
|
have been installed, by checking if any of the common
|
|
<code class="filename">bin</code> directories have
|
|
changed.</p>
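<p>For instance, after installing a previously missing helper, a retry of the failed files could be requested explicitly. This sketch assumes the default configuration directory:</p>
<pre class="programlisting">
# update the index and also retry files which failed before
recollindex -k
</pre>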
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="idp63260448" id=
|
|
"idp63260448"></a>2.1.5. Recovery</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>In the rare case where the index becomes corrupted
|
|
(which can manifest itself as weird search results or
|
|
crashes), the index files need to be erased before
|
|
restarting a clean indexing pass. Just delete the
|
|
<code class="filename">xapiandb</code> directory (see
|
|
<a class="link" href="#RCL.INDEXING.STORAGE" title=
|
|
"2.2. Index storage">next section</a>), or,
|
|
alternatively, start the next <span class=
|
|
"command"><strong>recollindex</strong></span> with the
|
|
<code class="option">-z</code> option, which will reset
|
|
the database before indexing.</p>
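<p>The two alternatives can be sketched as follows, assuming the default configuration directory and index location (adjust the path if <code class="varname">dbdir</code> was changed):</p>
<pre class="programlisting">
# alternative 1: remove the index data, then run a normal indexing pass
rm -rf ~/.recoll/xapiandb
recollindex

# alternative 2: let recollindex reset the database before indexing
recollindex -z
</pre>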
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.STORAGE" id=
|
|
"RCL.INDEXING.STORAGE"></a>2.2. Index
|
|
storage</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The default location for the index data is the
|
|
<code class="filename">xapiandb</code> subdirectory of the
|
|
<span class="application">Recoll</span> configuration
|
|
directory, typically <code class=
|
|
"filename">$HOME/.recoll/xapiandb/</code>. This can be
|
|
changed via two different methods (with different
|
|
purposes):</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>You can specify a different configuration
|
|
directory by setting the <code class=
|
|
"envar">RECOLL_CONFDIR</code> environment variable,
|
|
or using the <code class="option">-c</code> option to
|
|
the <span class="application">Recoll</span> commands.
|
|
This method would typically be used to index
|
|
different areas of the file system to different
|
|
indexes. For example, if you were to issue the
|
|
following command:</p>
|
|
<pre class="programlisting">
|
|
recoll -c ~/.indexes-email
|
|
</pre>
|
|
|
|
<p>Then <span class="application">Recoll</span> would
|
|
use configuration files stored in <code class=
|
|
"filename">~/.indexes-email/</code> and (unless
|
|
specified otherwise in <code class=
|
|
"filename">recoll.conf</code>) would look for the
|
|
index in <code class=
|
|
"filename">~/.indexes-email/xapiandb/</code>.</p>
|
|
|
|
<p>Using multiple configuration directories and
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
|
|
title=
|
|
"5.4.2. The main configuration file, recoll.conf">
|
|
configuration options</a> allows you to tailor
|
|
multiple configurations and indexes to handle
|
|
whatever subset of the available data you wish to
|
|
make searchable.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>For a given configuration directory, you can
|
|
specify a non-default storage location for the index
|
|
by setting the <code class="varname">dbdir</code>
|
|
parameter in the configuration file (see the
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
|
|
title=
|
|
"5.4.2. The main configuration file, recoll.conf">
|
|
configuration section</a>). This method would mainly
|
|
be of use if you wanted to keep the configuration
|
|
directory in its default location, but desired
|
|
another location for the index, typically because of disk
|
|
space concerns.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
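<p>As an illustration of the second method, the index could be moved to another disk with a single line in <code class="filename">recoll.conf</code>. The path below is hypothetical:</p>
<pre class="programlisting">
# store the index data outside the configuration directory
dbdir = /bigdisk/recoll/xapiandb
</pre>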
|
|
|
|
<p>The size of the index is determined by the size of the
|
|
set of documents, but the ratio can vary a lot. For a
|
|
typical mixed set of documents, the index size will often
|
|
be close to the data set size. In specific cases (a set of
|
|
compressed mbox files for example), the index can become
|
|
much bigger than the documents. It may also be much smaller
|
|
if the documents contain a lot of images or other
|
|
non-indexed data (an extreme example being a set of mp3
|
|
files where only the tags would be indexed).</p>
|
|
|
|
<p>Of course, images, sound and video do not increase the
|
|
index size, which means that nowadays (2012), typically,
|
|
even a big index will be negligible against the total
|
|
amount of data on the computer.</p>
|
|
|
|
<p>The index data directory (<code class=
|
|
"filename">xapiandb</code>) only contains data that can be
|
|
completely rebuilt by an index run (as long as the original
|
|
documents exist), and it can always be destroyed
|
|
safely.</p>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.STORAGE.FORMAT" id=
|
|
"RCL.INDEXING.STORAGE.FORMAT"></a>2.2.1. <span class="application">Xapian</span>
|
|
index formats</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Xapian</span> versions
|
|
usually support several formats for index storage. A
|
|
given major <span class="application">Xapian</span>
|
|
version will have a current format, used to create new
|
|
indexes, and will also support the format from the
|
|
previous major version.</p>
|
|
|
|
<p><span class="application">Xapian</span> will not
|
|
automatically convert an existing index from the older
|
|
format to the newer one. If you want to upgrade to the
|
|
new format, or if a very old index needs to be converted
|
|
because its format is not supported any more, you will
|
|
have to explicitly delete the old index, then run a
|
|
normal indexing process.</p>
|
|
|
|
<p>Using the <code class="option">-z</code> option to
|
|
<span class="command"><strong>recollindex</strong></span>
|
|
is not sufficient to change the format: you will have to
|
|
delete all files inside the index directory (typically
|
|
<code class="filename">~/.recoll/xapiandb</code>) before
|
|
starting the indexing.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.STORAGE.SECURITY" id=
|
|
"RCL.INDEXING.STORAGE.SECURITY"></a>2.2.2. Security
|
|
aspects</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The <span class="application">Recoll</span> index does
|
|
not hold copies of the indexed documents. But it does
|
|
hold enough data to allow for an almost complete
|
|
reconstruction. If confidential data is indexed, access
|
|
to the database directory should be restricted.</p>
|
|
|
|
<p><span class="application">Recoll</span> will create
|
|
the configuration directory with a mode of 0700 (access
|
|
by owner only). As the index data directory is by default
|
|
a sub-directory of the configuration directory, this
|
|
should result in appropriate protection.</p>
|
|
|
|
<p>If you use another setup, you should think of the kind
|
|
of protection you need for your index, set the directory
|
|
and files access modes appropriately, and also maybe
|
|
adjust the <code class="literal">umask</code> used during
|
|
index updates.</p>
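<p>A possible way to do this from the shell is sketched below. The index location is hypothetical, and only standard <span class="command"><strong>chmod</strong></span> and <code class="literal">umask</code> usage is assumed:</p>
<pre class="programlisting">
# remove group and other access from an existing index directory
chmod -R go-rwx /bigdisk/recoll/xapiandb

# run the update in a subshell with a restrictive umask,
# so that newly created index files are accessible by the owner only
(umask 077; recollindex)
</pre>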
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.CONFIG" id=
|
|
"RCL.INDEXING.CONFIG"></a>2.3. Index
|
|
configuration</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Variables set inside the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview"><span class=
|
|
"application">Recoll</span> configuration files</a> control
|
|
which areas of the file system are indexed, and how files
|
|
are processed. These variables can be set either by editing
|
|
the text files or by using the <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.GUI" title=
|
|
"2.3.3. The index configuration GUI">dialogs in the
|
|
<span class="command"><strong>recoll</strong></span>
|
|
GUI</a>.</p>
|
|
|
|
<p>The first time you start <span class=
|
|
"command"><strong>recoll</strong></span>, you will be asked
|
|
whether or not you would like it to build the index. If you
|
|
want to adjust the configuration before indexing, just
|
|
click <span class="guilabel">Cancel</span> at this point,
|
|
which will get you into the configuration interface. If you
|
|
exit at this point, <code class="filename">recoll</code>
|
|
will have created a <code class="filename">~/.recoll</code>
|
|
directory containing empty configuration files, which you
|
|
can edit by hand.</p>
|
|
|
|
<p>The configuration is documented inside the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">installation chapter</a>
|
|
of this document, or in the <span class=
|
|
"citerefentry"><span class=
|
|
"refentrytitle">recoll.conf</span>(5)</span> man page, but
|
|
the most current information will most likely be the
|
|
comments inside the sample file. The most immediately
|
|
useful variable you may be interested in is probably <a class=
|
|
"link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><code class=
|
|
"varname">topdirs</code></a>, which determines what
|
|
subtrees get indexed.</p>
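<p>For example, restricting indexing to two subtrees of the home directory might look like this in <code class="filename">recoll.conf</code>. The directory names are made up for illustration:</p>
<pre class="programlisting">
# only index these two trees instead of the whole home directory
topdirs = ~/Documents ~/Mail
</pre>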
|
|
|
|
<p>The applications needed to index file types other than
|
|
text, HTML or email (e.g. PDF, PostScript, MS Word...) are
|
|
described in the <a class="link" href=
|
|
"#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">external packages
|
|
section.</a></p>
|
|
|
|
<p>As of Recoll 1.18 there are two incompatible types of
|
|
Recoll indexes, depending on the treatment of character
|
|
case and diacritics. A later section describes the two
|
|
types in more detail.</p>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.CONFIG.MULTIPLE" id=
|
|
"RCL.INDEXING.CONFIG.MULTIPLE"></a>2.3.1. Multiple
|
|
indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Multiple <span class="application">Recoll</span>
|
|
indexes can be created by using several configuration
|
|
directories which are usually set to index different
|
|
areas of the file system. A specific index can be
|
|
selected for updating or searching, using the
|
|
<code class="envar">RECOLL_CONFDIR</code> environment
|
|
variable or the <code class="option">-c</code> option to
|
|
<span class="command"><strong>recoll</strong></span> and
|
|
<span class=
|
|
"command"><strong>recollindex</strong></span>.</p>
|
|
|
|
<p>When working with the <span class=
|
|
"command"><strong>recoll</strong></span> index
|
|
configuration GUI, the configuration directory for which
|
|
parameters are modified is the one which was selected by
|
|
<code class="envar">RECOLL_CONFDIR</code> or the
|
|
<code class="option">-c</code> parameter, and there is no
|
|
way to switch configurations within the GUI.</p>
|
|
|
|
<p>Additional configuration directories (beyond
|
|
<code class="filename">~/.recoll</code>) must be created
|
|
by hand (<span class=
|
|
"command"><strong>mkdir</strong></span> or such); the GUI
|
|
will not do it. This is to avoid mistakenly creating
|
|
additional directories when an argument is mistyped.</p>
|
|
|
|
<p>A typical usage scenario for the multiple index
|
|
feature would be for a system administrator to set up a
|
|
central index for shared data, which you can choose to search
|
|
or not in addition to your personal data. Of course,
|
|
there are other possibilities. There are many cases where
|
|
you know the subset of files that should be searched, and
|
|
where narrowing the search can improve the results. You
|
|
can achieve approximately the same effect with the
|
|
directory filter in advanced search, but multiple indexes
|
|
will have much better performance and may be worth the
|
|
trouble.</p>
|
|
|
|
<p>A <span class=
|
|
"command"><strong>recollindex</strong></span> program
|
|
instance can only update one specific index.</p>
|
|
|
|
<p>The main index (defined by <code class=
|
|
"envar">RECOLL_CONFDIR</code> or <code class=
|
|
"option">-c</code>) is always active. If this is
|
|
undesirable, you can set up your base configuration to
|
|
index an empty directory.</p>
|
|
|
|
<p>The different search interfaces (GUI, command line,
|
|
...) have different methods to define the set of indexes
|
|
to be used; see the appropriate section.</p>
|
|
|
|
<p>If a set of multiple indexes is to be used together
|
|
for searches, some configuration parameters must be
|
|
consistent among the set. These are parameters which need
|
|
to be the same when indexing and searching. As the
|
|
parameters come from the main configuration when
|
|
searching, they need to be compatible with what was set
|
|
when creating the other indexes (which came from their
|
|
respective configuration directories).</p>
|
|
|
|
<p>Most importantly, all indexes to be queried
|
|
concurrently must have the same option concerning
|
|
character case and diacritics stripping, but there are
|
|
other constraints. Most of the relevant parameters are
|
|
described in the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.TERMS" title=
|
|
"5.4.2.2. Parameters affecting how we generate terms:">
|
|
linked section</a>.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.CONFIG.SENS" id=
|
|
"RCL.INDEXING.CONFIG.SENS"></a>2.3.2. Index
|
|
case and diacritics sensitivity</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>As of <span class="application">Recoll</span> version
|
|
1.18 you have a choice of building an index with terms
|
|
stripped of character case and diacritics, or one with
|
|
raw terms. For a source term of <code class=
|
|
"literal">Résumé</code>, the former will
|
|
store <code class="literal">resume</code>, the latter
|
|
<code class="literal">Résumé</code>.</p>
|
|
|
|
<p>Each type of index allows performing searches
|
|
insensitive to case and diacritics: with a raw index, the
|
|
user entry will be expanded to match all case and
|
|
diacritics variations present in the index. With a
|
|
stripped index, the search term will be stripped before
|
|
searching.</p>
|
|
|
|
<p>A raw index allows for another possibility which a
|
|
stripped index cannot offer: using case and diacritics to
|
|
discriminate between terms, returning different results
|
|
when searching for <code class="literal">US</code> and
|
|
<code class="literal">us</code> or <code class=
|
|
"literal">resume</code> and <code class=
|
|
"literal">résumé</code>. Read the <a class=
|
|
"link" href="#RCL.SEARCH.CASEDIAC" title=
|
|
"3.7. Search case and diacritics sensitivity">section
|
|
about search case and diacritics sensitivity</a> for more
|
|
details.</p>
|
|
|
|
<p>The type of index to be created is controlled by the
|
|
<code class="literal">indexStripChars</code>
|
|
configuration variable which can only be changed by
|
|
editing the configuration file. Any change implies an
|
|
index reset (not automated by <span class=
|
|
"application">Recoll</span>), and all indexes in a search
|
|
must be set in the same way (again, not checked by
|
|
<span class="application">Recoll</span>).</p>
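<p>As a sketch, selecting a raw index would be done with the following line in <code class="filename">recoll.conf</code>. The index then has to be reset by hand, for example with <code class="literal">recollindex -z</code>:</p>
<pre class="programlisting">
# 1: strip character case and diacritics from terms
# 0: keep raw terms (case and diacritics sensitive index)
indexStripChars = 0
</pre>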
|
|
|
|
<p>If <code class="literal">indexStripChars</code> is
|
|
not set, <span class="application">Recoll</span> 1.18
|
|
creates a stripped index by default, for compatibility
|
|
with previous versions.</p>
|
|
|
|
<p>As a cost for added capability, a raw index will be
|
|
slightly bigger than a stripped one (around 10%). Also,
|
|
searches will be more complex, so probably slightly
|
|
slower, and the feature is still young, so that a certain
|
|
amount of weirdness cannot be excluded.</p>
|
|
|
|
<p>One of the most adverse consequences of using a raw
|
|
index is that some phrase and proximity searches may
|
|
become impossible: because each term needs to be
|
|
expanded, and all combinations searched for, the
|
|
multiplicative expansion may become unmanageable.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.INDEXING.CONFIG.GUI"
|
|
id="RCL.INDEXING.CONFIG.GUI"></a>2.3.3. The
|
|
index configuration GUI</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Most parameters for a given index configuration can be
|
|
set from a <span class=
|
|
"command"><strong>recoll</strong></span> GUI running on
|
|
this configuration (either as default, or by setting
|
|
<code class="envar">RECOLL_CONFDIR</code> or the
|
|
<code class="option">-c</code> option).</p>
|
|
|
|
<p>The interface is started from the <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Index Configuration</span> menu entry. It
|
|
is divided into four tabs, <span class="guilabel">Global
|
|
parameters</span>, <span class="guilabel">Local
|
|
parameters</span>, <span class="guilabel">Web
|
|
history</span> (which is explained in the next section)
|
|
and <span class="guilabel">Search parameters</span>.</p>
|
|
|
|
<p>The <span class="guilabel">Global parameters</span>
|
|
tab allows setting global variables, like the lists of
|
|
top directories, skipped paths, or stemming
|
|
languages.</p>
|
|
|
|
<p>The <span class="guilabel">Local parameters</span> tab
|
|
allows setting variables that can be redefined for
|
|
subdirectories. This second tab has an initially empty
|
|
list of customisation directories, to which you can add.
|
|
The variables are then set for the currently selected
|
|
directory (or at the top level if the empty line is
|
|
selected).</p>
|
|
|
|
<p>The <span class="guilabel">Search parameters</span>
|
|
section defines parameters which are used at query time,
|
|
but are global to an index and affect all search tools,
|
|
not only the GUI.</p>
|
|
|
|
<p>The meaning for most entries in the interface is
|
|
self-evident and documented by a <code class=
|
|
"literal">ToolTip</code> popup on the text label. For
|
|
more detail, you will need to refer to the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">configuration
|
|
section</a> of this guide.</p>
|
|
|
|
<p>The configuration tool normally respects the comments
|
|
and most of the formatting inside the configuration file,
|
|
so that it is quite possible to use it on hand-edited
|
|
files, which you might nevertheless want to back up
|
|
first...</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.WEBQUEUE" id=
|
|
"RCL.INDEXING.WEBQUEUE"></a>2.4. Indexing WEB
|
|
pages you visit</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>With the help of a <span class=
|
|
"application">Firefox</span> extension, <span class=
|
|
"application">Recoll</span> can index the Internet pages
|
|
that you visit. The extension was initially designed for
|
|
the <span class="application">Beagle</span> indexer, but it
|
|
has recently been renamed and better adapted to <span class=
|
|
"application">Recoll</span>.</p>
|
|
|
|
<p>The extension works by copying visited WEB pages to an
|
|
indexing queue directory, which <span class=
|
|
"application">Recoll</span> then processes, indexing the
|
|
data, storing it into a local cache, then removing the file
|
|
from the queue.</p>
|
|
|
|
<p>This feature can be enabled in the GUI <span class=
|
|
"guilabel">Index configuration</span> panel, or by editing
|
|
the configuration file (set <code class=
|
|
"varname">processwebqueue</code> to 1).</p>
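<p>When editing the file by hand, the corresponding <code class="filename">recoll.conf</code> line would be:</p>
<pre class="programlisting">
# enable processing of the Web page indexing queue
processwebqueue = 1
</pre>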
|
|
|
|
<p>A current pointer to the extension can be found, along
|
|
with up-to-date instructions, on the <a class="ulink" href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/IndexWebHistory"
|
|
target="_top">Recoll wiki</a>.</p>
|
|
|
|
<p>A copy of the indexed WEB pages is retained by Recoll in
|
|
a local cache (from which previews can be fetched). The
|
|
cache size can be adjusted from the <span class=
|
|
"guilabel">Index configuration</span> / <span class=
|
|
"guilabel">Web history</span> panel. Once the maximum size
|
|
is reached, old pages are purged - both from the cache and
|
|
the index - to make room for new ones, so you need to
|
|
explicitly archive in some other place the pages that you
|
|
want to keep indefinitely.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.EXTATTR" id=
|
|
"RCL.INDEXING.EXTATTR"></a>2.5. Extended
|
|
attributes data</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>User extended attributes are named pieces of information
|
|
that most modern file systems can attach to any file.</p>
|
|
|
|
<p><span class="application">Recoll</span> versions 1.19
|
|
and later process extended attributes as document fields by
|
|
default. For older versions, this has to be activated at
|
|
build time.</p>
|
|
|
|
<p>A <a class="ulink" href=
|
|
"http://www.freedesktop.org/wiki/CommonExtendedAttributes"
|
|
target="_top">freedesktop standard</a> defines a few
|
|
special attributes, which are handled as such by
|
|
<span class="application">Recoll</span>:</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">mime_type</span></dt>
|
|
|
|
<dd>
|
|
<p>If set, this overrides any other determination of
|
|
the file MIME type.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">charset</span></dt>
|
|
|
|
<dd>If set, this defines the file character set (mostly
|
|
useful for plain text files).</dd>
|
|
</dl>
|
|
</div>
|
|
|
|
<p>By default, other attributes are handled as <span class=
|
|
"application">Recoll</span> fields. On Linux, the
|
|
<code class="literal">user</code> prefix is removed from
|
|
the name. This can be configured more precisely inside the
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file"><code class=
|
|
"filename">fields</code> configuration file</a>.</p>
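<p>As an illustration, attributes could be set from the shell as sketched below. This assumes a Linux system where the <span class="command"><strong>setfattr</strong></span> command is available; the file name, the values and the <code class="literal">myfield</code> attribute name are made up, and how <code class="literal">myfield</code> is then processed depends on the fields configuration:</p>
<pre class="programlisting">
# declare the character set of a plain text file
setfattr -n user.charset -v iso-8859-1 notes.txt

# arbitrary attribute: seen by Recoll as a field named "myfield"
setfattr -n user.myfield -v "some value" notes.txt
</pre>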
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.EXTTAGS" id=
|
|
"RCL.INDEXING.EXTTAGS"></a>2.6. Importing
|
|
external tags</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>During indexing, it is possible to import metadata for
|
|
each file by executing commands. For example, this could
|
|
extract user tag data for the file and store it in a field
|
|
for indexing.</p>
|
|
|
|
<p>See the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS">section about
|
|
the <code class="literal">metadatacmds</code> field</a> in
|
|
the main configuration chapter for a description of the
|
|
configuration syntax.</p>
|
|
|
|
<p>As an example, if you wanted <span class=
|
|
"application">Recoll</span> to use tags managed by
|
|
<span class="application">tmsu</span>, you would add the
|
|
following to the configuration file:</p>
|
|
<pre class="programlisting">
|
|
[/some/area/of/the/fs]
|
|
metadatacmds = ; tags = tmsu tags %f
|
|
|
|
</pre>
|
|
|
|
<p>You may want to restrict this processing to a subset of
|
|
the directory tree, because it may slow down indexing a bit
|
|
(<code class="literal">[/some/area/of/the/fs]</code>).</p>
|
|
|
|
<p>Note the initial semi-colon after the equal sign.</p>
|
|
|
|
<p>In the example above, the output of <span class=
|
|
"command"><strong>tmsu</strong></span> is used to set a
|
|
field named <code class="literal">tags</code>. The field
|
|
name is arbitrary and could be <code class=
|
|
"literal">tmsu</code> or <code class=
|
|
"literal">myfield</code> just the same, but <code class=
|
|
"literal">tags</code> is an alias for the standard
|
|
<span class="application">Recoll</span> <code class=
|
|
"literal">keywords</code> field, and the <span class=
|
|
"command"><strong>tmsu</strong></span> output will just
|
|
augment its contents. This will avoid the need to extend
|
|
the <a class="link" href="#RCL.PROGRAM.FIELDS" title=
|
|
"4.2. Field data processing">field
|
|
configuration</a>.</p>
|
|
|
|
<p>Once re-indexing is performed (you'll need to force the
|
|
file reindexing, <span class="application">Recoll</span>
|
|
will not detect the need by itself), you will be able to
|
|
search from the query language, through any of its aliases:
|
|
<code class="literal">tags:some/alternate/values</code> or
|
|
<code class="literal">tags:all,these,values</code> (the
|
|
compact field search syntax is supported for recoll 1.20
|
|
and later. For older versions, you would need to repeat the
|
|
<code class="literal">tags:</code> specifier for each term,
|
|
e.g. <code class="literal">tags:some OR
|
|
tags:alternate</code>).</p>
|
|
|
|
<p>You should be aware that tag changes will not be
|
|
detected by the indexer if the file itself did not change.
|
|
One possible workaround would be to update the file
|
|
<code class="literal">ctime</code> when you modify the
|
|
tags, which would be consistent with how extended
|
|
attributes function. A pair of <span class=
|
|
"command"><strong>chmod</strong></span> commands could
|
|
accomplish this, or a <code class="literal">touch -a</code>.
Alternatively, just couple the tag update with a
|
|
<code class="literal">recollindex -e -i
|
|
filename</code>.</p>
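<p>As a minimal sketch of that last approach, the shell function below
couples a tag update with a reindex of the same file. The function name
<code class="literal">tag_and_reindex</code> is hypothetical, and the
<code class="literal">tmsu tag FILE TAG...</code> form is an assumption
about the <span class="application">tmsu</span> command line; adapt it
to your tagging tool.</p>
<pre class="programlisting">
# Hypothetical helper: tag a file, then ask recollindex to reindex it.
# Usage: tag_and_reindex FILE TAG...
tag_and_reindex() {
    file=$1
    shift
    # Apply the tags first; stop here if tagging fails.
    # (Assumes the "tmsu tag FILE TAG..." command form.)
    tmsu tag "$file" "$@" || return 1
    # Same option combination as in the text above, applied to one file.
    recollindex -e -i "$file"
}
</pre>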
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.PERIODIC" id=
|
|
"RCL.INDEXING.PERIODIC"></a>2.7. Periodic
|
|
indexing</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.PERIODIC.EXEC" id=
|
|
"RCL.INDEXING.PERIODIC.EXEC"></a>2.7.1. Running
|
|
indexing</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Indexing is always performed by the <span class=
|
|
"command"><strong>recollindex</strong></span> program,
|
|
which can be started either from the command line or from
|
|
the <span class="guimenu">File</span> menu in the
|
|
<span class="command"><strong>recoll</strong></span> GUI
|
|
program. When started from the GUI, the indexing will run
|
|
on the same configuration <span class=
|
|
"command"><strong>recoll</strong></span> was started on.
|
|
When started from the command line, <span class=
|
|
"command"><strong>recollindex</strong></span> will use
|
|
the <code class="envar">RECOLL_CONFDIR</code> variable or
|
|
accept a <code class="option">-c</code> <em class=
|
|
"replaceable"><code>confdir</code></em> option to specify
|
|
a non-default configuration directory.</p>
|
|
|
|
<p>If the <span class=
|
|
"command"><strong>recoll</strong></span> program finds no
|
|
index when it starts, it will automatically start
|
|
indexing (except if canceled).</p>
|
|
|
|
<p>The <span class=
|
|
"command"><strong>recollindex</strong></span> indexing
|
|
process can be interrupted by sending an interrupt
|
|
(<span class="keysym">Ctrl-C</span>, SIGINT) or terminate
|
|
(SIGTERM) signal. Some time may elapse before the process
|
|
exits, because it needs to properly flush and close the
|
|
index. This can also be done from the <span class=
|
|
"command"><strong>recoll</strong></span> GUI <span class=
|
|
"guimenu">File</span> → <span class=
|
|
"guimenuitem">Stop Indexing</span> menu entry.</p>
|
|
|
|
<p>After such an interruption, the index will be somewhat
|
|
inconsistent because some operations which are normally
|
|
performed at the end of the indexing pass will have been
|
|
skipped (for example, the stemming and spelling databases
|
|
will be missing or out of date). You just need to
|
|
restart indexing at a later time to restore consistency.
|
|
The indexing will restart at the interruption point (the
|
|
full file tree will be traversed, but files that were
|
|
indexed up to the interruption and for which the index is
|
|
still up to date will not need to be reindexed).</p>
|
|
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span> has a
|
|
number of other options which are described in its man
|
|
page. Only a few will be described here.</p>
|
|
|
|
<p>Option <code class="option">-z</code> will reset the
|
|
index when starting. This is almost the same as
|
|
destroying the index files (the nuance is that the
|
|
<span class="application">Xapian</span> format version
|
|
will not be changed).</p>
|
|
|
|
<p>Option <code class="option">-Z</code> will force the
|
|
update of all documents without resetting the index
|
|
first. This will not have the "clean start" aspect of
|
|
<code class="option">-z</code>, but the advantage is that
|
|
the index will remain available for querying while it is
|
|
rebuilt, which can be a significant advantage if it is
|
|
very big (some installations need days for a full index
|
|
rebuild).</p>
|
|
|
|
<p>Option <code class="option">-k</code> will force
|
|
retrying files which previously failed to be indexed, for
|
|
example because of a missing helper program.</p>
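<p>The options above can be summarized in a small wrapper, given here
only as a sketch. The function name <code class=
"literal">reindex_with</code> and its mode names are hypothetical; the
<code class="option">-c</code>, <code class="option">-z</code>,
<code class="option">-Z</code> and <code class="option">-k</code>
options are the ones described in this section.</p>
<pre class="programlisting">
# Hypothetical wrapper. Usage: reindex_with MODE CONFDIR
# MODE is one of: update, reset, force, retry
reindex_with() {
    mode=$1
    confdir=$2
    case "$mode" in
        update) recollindex -c "$confdir" ;;     # normal incremental pass
        reset)  recollindex -c "$confdir" -z ;;  # reset the index, then rebuild
        force)  recollindex -c "$confdir" -Z ;;  # update all documents, index stays queryable
        retry)  recollindex -c "$confdir" -k ;;  # retry files which previously failed
        *)      return 2 ;;                      # unknown mode
    esac
}
</pre>
<p>For a very big index, <code class="literal">reindex_with force
confdir</code> would usually be preferable to <code class=
"literal">reset</code>, for the availability reason given above.</p>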
|
|
|
|
<p>Of special interest also, maybe, are the <code class=
|
|
"option">-i</code> and <code class="option">-f</code>
|
|
options. <code class="option">-i</code> allows indexing
|
|
an explicit list of files (given as command line
|
|
parameters or read on <code class=
|
|
"literal">stdin</code>). <code class="option">-f</code>
|
|
tells <span class=
|
|
"command"><strong>recollindex</strong></span> to ignore
|
|
file selection parameters from the configuration.
|
|
Together, these options allow building a custom file
|
|
selection process for some area of the file system, by
|
|
adding the top directory to the <code class=
|
|
"varname">skippedPaths</code> list and using an
|
|
appropriate file selection method to build the file list
|
|
to be fed to <span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
<code class="option">-if</code>. Trivial example:</p>
|
|
<pre class="programlisting">
|
|
find . -name indexable.txt -print | recollindex -if
|
|
|
|
</pre>
|
|
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
<code class="option">-i</code> will not descend into
|
|
subdirectories specified as parameters, but just add them
|
|
as index entries. It is up to the external file selection
|
|
method to build the complete file list.</p>
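<p>Because directories are not traversed, the list has to name every
file. The following is a minimal sketch, assuming <span class=
"command"><strong>find</strong></span> is the selection method; the
function name <code class="literal">list_files_to_index</code> and the
exclusion of names ending in <code class="literal">~</code> are
illustrative choices, not part of <span class=
"application">Recoll</span>.</p>
<pre class="programlisting">
# Hypothetical helper: print every regular file below DIR, one per line,
# skipping editor backup files (names ending in "~").
list_files_to_index() {
    find "$1" -type f ! -name '*~' -print
}

# Usage (not run here): feed the complete list to recollindex on stdin.
#   list_files_to_index /some/area/of/the/fs | recollindex -if
</pre>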
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.PERIODIC.AUTOMAT" id=
|
|
"RCL.INDEXING.PERIODIC.AUTOMAT"></a>2.7.2. Using
|
|
<span class="command"><strong>cron</strong></span>
|
|
to automate indexing</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The most common way to set up indexing is to have a
|
|
cron task execute it every night. For example the
|
|
following <code class="filename">crontab</code> entry
|
|
would do it every day at 3:30AM (supposing <span class=
|
|
"command"><strong>recollindex</strong></span> is in your
|
|
PATH):</p>
|
|
<pre class="screen">
|
|
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
|
|
</pre>
|
|
|
|
<p>Or, using <span class=
|
|
"command"><strong>anacron</strong></span>:</p>
|
|
<pre class="screen">
|
|
1 15 recollindex su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
|
|
</pre>
|
|
|
|
<p>As of version 1.17 the <span class=
|
|
"application">Recoll</span> GUI has dialogs to manage
|
|
<code class="filename">crontab</code> entries for
|
|
<span class=
|
|
"command"><strong>recollindex</strong></span>. You can
|
|
reach them from the <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Indexing Schedule</span> menu. They only
|
|
work with the good old <span class=
|
|
"command"><strong>cron</strong></span>, and do not give
|
|
access to all features of <span class=
|
|
"command"><strong>cron</strong></span> scheduling.</p>
|
|
|
|
<p>The usual command to edit your <code class=
|
|
"filename">crontab</code> is <span class=
|
|
"command"><strong>crontab</strong></span> <code class=
|
|
"option">-e</code> (which will usually start the
|
|
<span class="command"><strong>vi</strong></span> editor
|
|
to edit the file). You may have more sophisticated tools
|
|
available on your system.</p>
|
|
|
|
<p>Please be aware that there may be differences between
|
|
your usual interactive command line environment and the
|
|
one seen by crontab commands. Especially the PATH
|
|
variable may be of concern. Please check the crontab
|
|
manual pages about possible issues.</p>
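<p>One way to sidestep the PATH issue, sketched below, is to set the
variable at the top of the <code class="filename">crontab</code>
(common cron implementations support this, but check your crontab
manual page), or to give the full path of the command. The directory
list is an assumption; use the directory where <span class=
"command"><strong>recollindex</strong></span> is actually
installed.</p>
<pre class="screen">
# Assumed install locations; adjust to your system
PATH=/usr/local/bin:/usr/bin:/bin
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
</pre>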
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.MONITOR" id=
|
|
"RCL.INDEXING.MONITOR"></a>2.8. Real time
|
|
indexing</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Real time monitoring/indexing is performed by starting
|
|
the <span class=
|
|
"command"><strong>recollindex</strong></span> <code class=
|
|
"option">-m</code> command. With this option, <span class=
|
|
"command"><strong>recollindex</strong></span> will detach
|
|
from the terminal and become a daemon, permanently
|
|
monitoring file changes and updating the index.</p>
|
|
|
|
<p>Under <span class="application">KDE</span>, <span class=
|
|
"application">Gnome</span> and some other desktop
|
|
environments, the daemon can be automatically started when you
|
|
log in, by creating a desktop file inside the <code class=
|
|
"filename">~/.config/autostart</code> directory. This can
|
|
be done for you by the <span class=
|
|
"application">Recoll</span> GUI. Use the <span class=
|
|
"guimenu">Preferences</span> → <span class="guimenuitem">Indexing Schedule</span>
|
|
menu.</p>
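<p>If you prefer to create the file by hand, a desktop entry along the
following lines should be enough. This is only a sketch: the file name
(for example <code class=
"filename">~/.config/autostart/recollindex.desktop</code>) and the
<code class="literal">Name</code> value are assumptions, and the file
written by the GUI may differ.</p>
<pre class="programlisting">
[Desktop Entry]
Type=Application
Name=Recoll real time indexer
Exec=recollindex -m
</pre>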
|
|
|
|
<p>With older <span class="application">X11</span> setups,
|
|
starting the daemon is normally performed as part of the
|
|
user session script.</p>
|
|
|
|
<p>The <code class="filename">rclmon.sh</code> script can
|
|
be used to easily start and stop the daemon. It can be
|
|
found in the <code class="filename">examples</code>
|
|
directory (typically <code class=
|
|
"filename">/usr/local/[share/]recoll/examples</code>).</p>
|
|
|
|
<p>For example, my out of fashion <span class=
|
|
"application">xdm</span>-based session has a <code class=
|
|
"filename">.xsession</code> script with the following lines
|
|
at the end:</p>
|
|
<pre class="programlisting">
|
|
recollconf=$HOME/.recoll-home
|
|
recolldata=/usr/local/share/recoll
|
|
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
|
|
|
|
fvwm
|
|
|
|
</pre>
|
|
|
|
<p>The indexing daemon gets started, then the window
|
|
manager, for which the session waits.</p>
|
|
|
|
<p>By default the indexing daemon will monitor the state of
|
|
the X11 session, and exit when it finishes, so it is not
|
|
necessary to kill it explicitly. (The <span class=
|
|
"application">X11</span> server monitoring can be disabled
|
|
with option <code class="option">-x</code> to <span class=
|
|
"command"><strong>recollindex</strong></span>).</p>
|
|
|
|
<p>If you use the daemon entirely outside of an <span class=
|
|
"application">X11</span> session, you need to add option
|
|
<code class="option">-x</code> to disable <span class=
|
|
"application">X11</span> session monitoring (else the
|
|
daemon will not start).</p>
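<p>A start-up script can choose between the two forms. The sketch below
assumes that a non-empty <code class="envar">DISPLAY</code> variable
means the script runs inside an <span class="application">X11</span>
session; the function name <code class="literal">start_monitor</code>
is hypothetical.</p>
<pre class="programlisting">
# Hypothetical helper: start the real time indexing daemon,
# disabling X11 session monitoring when there is no X11 session.
start_monitor() {
    if [ -n "${DISPLAY:-}" ]; then
        recollindex -m       # inside X11: the daemon exits with the session
    else
        recollindex -m -x    # outside X11: -x is needed for the daemon to start
    fi
}
</pre>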
|
|
|
|
<p>By default, the messages from the indexing daemon will
|
|
be sent to the same file as those from the interactive
|
|
commands (<code class="literal">logfilename</code>). You
|
|
may want to change this by setting the <code class=
|
|
"varname">daemlogfilename</code> and <code class=
|
|
"varname">daemloglevel</code> configuration parameters.
|
|
Also the log file will only be truncated when the daemon
|
|
starts. If the daemon runs permanently, the log file may
|
|
grow quite big, depending on the log level.</p>
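<p>As an illustration, the following lines in the main configuration
file would send the daemon messages to a separate file. Both values are
assumptions chosen for the example (a path of your choice, and a
numeric level in the same style as the interactive log level).</p>
<pre class="programlisting">
# Example values only
daemlogfilename = /tmp/recollindex-daemon.log
daemloglevel = 3
</pre>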
|
|
|
|
<p>When building <span class="application">Recoll</span>,
|
|
the real time indexing support can be customised during
|
|
package <a class="link" href="#RCL.INSTALL.BUILDING.BUILD"
|
|
title="5.3.2. Building">configuration</a> with the
|
|
<code class="option">--with[out]-fam</code> or <code class=
|
|
"option">--with[out]-inotify</code> options. The default is
|
|
currently to include <span class=
|
|
"application">inotify</span> monitoring on systems that
|
|
support it, and, as of <span class=
|
|
"application">Recoll</span> 1.17, <span class=
|
|
"application">gamin</span> support on <span class=
|
|
"application">FreeBSD</span>.</p>
|
|
|
|
<p>While it is convenient that data is indexed in real
|
|
time, repeated indexing can generate a significant load on
|
|
the system when files such as email folders change. Also,
|
|
monitoring large file trees by itself significantly taxes
|
|
system resources. You probably do not want to enable it if
|
|
your system is short on resources. Periodic indexing is
|
|
adequate in most cases.</p>
|
|
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Increasing resources for inotify</h3>
|
|
|
|
<p>On Linux systems, monitoring a big tree may require
|
|
increasing the resources available to inotify, which are
|
|
normally defined in <code class=
|
|
"filename">/etc/sysctl.conf</code>.</p>
|
|
<pre class="programlisting">
|
|
### inotify
|
|
#
|
|
# cat /proc/sys/fs/inotify/max_queued_events - 16384
|
|
# cat /proc/sys/fs/inotify/max_user_instances - 128
|
|
# cat /proc/sys/fs/inotify/max_user_watches - 16384
|
|
#
|
|
# -- Change to:
|
|
#
|
|
fs.inotify.max_queued_events=32768
|
|
fs.inotify.max_user_instances=256
|
|
fs.inotify.max_user_watches=32768
|
|
|
|
</pre>
|
|
|
|
<p>In particular, you will need to trim your tree or adjust
|
|
the <code class="literal">max_user_watches</code> value
|
|
if indexing exits with a message about errno <code class=
|
|
"literal">ENOSPC</code> (28) from <code class=
|
|
"function">inotify_add_watch</code>.</p>
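<p>To estimate a suitable value, you can compare the number of
directories in the monitored tree with the current limit. This is a
rough sketch: it assumes about one watch per monitored directory
(inotify watches are not recursive), and the function name <code class=
"literal">count_dirs</code> is hypothetical.</p>
<pre class="programlisting">
# Hypothetical helper: count the directories below a tree,
# including the top directory itself.
count_dirs() {
    find "$1" -type d | wc -l
}

# Usage (not run here): compare the count with the current limit.
#   count_dirs "$HOME"
#   cat /proc/sys/fs/inotify/max_user_watches
# After editing /etc/sysctl.conf, "sysctl -p" (as root) reloads the values.
</pre>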
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.MONITOR.FASTFILES" id=
|
|
"RCL.INDEXING.MONITOR.FASTFILES"></a>2.8.1. Slowing
|
|
down the reindexing rate for fast changing
|
|
files</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>When using the real time monitor, it may happen that
|
|
some files need to be indexed, but change so often that
|
|
they impose an excessive load for the system.</p>
|
|
|
|
<p><span class="application">Recoll</span> provides a
|
|
configuration option to specify the minimum time before
|
|
which a file, specified by a wildcard pattern, cannot be
|
|
reindexed. See the <code class=
|
|
"varname">mondelaypatterns</code> parameter in the
|
|
<a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.MISC" title=
|
|
"5.4.2.5. Miscellaneous parameters:">configuration
|
|
section</a>.</p>
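<p>As an illustration only, an entry of the following form would delay
reindexing of log files. The pattern and the delay are made-up values,
and the <code class="literal">pattern:seconds</code> layout is given
here from memory; the configuration section is the reference for the
exact syntax.</p>
<pre class="programlisting">
# Assumed syntax: space-separated list of wildcardPattern:seconds pairs
mondelaypatterns = *.log:20
</pre>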
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.SEARCH" id=
|
|
"RCL.SEARCH"></a>Chapter 3. Searching</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.GUI" id=
|
|
"RCL.SEARCH.GUI"></a>3.1. Searching with the Qt
|
|
graphical user interface</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The <span class="command"><strong>recoll</strong></span>
|
|
program provides the main user interface for searching. It
|
|
is based on the <span class="application">Qt</span>
|
|
library.</p>
|
|
|
|
<p><span class="command"><strong>recoll</strong></span> has
|
|
two search modes:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Simple search (the default, on the main screen)
|
|
has a single entry field where you can enter multiple
|
|
words.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Advanced search (a panel accessed through the
|
|
<span class="guilabel">Tools</span> menu or the
|
|
toolbox bar icon) has multiple entry fields, which
|
|
you may use to build a logical condition, with
|
|
additional filtering on file type, location in the
|
|
file system, modification date, and size.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>In most cases, you can enter the terms as you think of
|
|
them, even if they contain embedded punctuation or other
|
|
non-textual characters. For example, <span class=
|
|
"application">Recoll</span> can handle things like email
|
|
addresses, or arbitrary cut and paste from another text
|
|
window, punctuation and all.</p>
|
|
|
|
<p>The main case where you should enter text differently
|
|
from how it is printed is for East Asian languages
|
|
(Chinese, Japanese, Korean). Words composed of single or
|
|
multiple characters should be entered separated by white
|
|
space in this case (they would typically be printed without
|
|
white space).</p>
|
|
|
|
<p>Some searches can be quite complex, and you may want to
|
|
re-use them later, perhaps with some tweaking. <span class=
|
|
"application">Recoll</span> versions 1.21 and later can
|
|
save and restore searches, using XML files. See <a class=
|
|
"link" href="#RCL.SEARCH.SAVING" title=
|
|
"3.1.14. Saving and restoring queries (1.21 and later)">
|
|
Saving and restoring queries</a>.</p>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.SIMPLE"
|
|
id="RCL.SEARCH.GUI.SIMPLE"></a>3.1.1. Simple
|
|
search</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="procedure">
|
|
<ol class="procedure" type="1">
|
|
<li class="step">
|
|
<p>Start the <span class=
|
|
"command"><strong>recoll</strong></span>
|
|
program.</p>
|
|
</li>
|
|
|
|
<li class="step">
|
|
<p>Possibly choose a search mode: <span class=
|
|
"guilabel">Any term</span>, <span class=
|
|
"guilabel">All terms</span>, <span class=
|
|
"guilabel">File name</span> or <span class=
|
|
"guilabel">Query language</span>.</p>
|
|
</li>
|
|
|
|
<li class="step">
|
|
<p>Enter search term(s) in the text field at the
|
|
top of the window.</p>
|
|
</li>
|
|
|
|
<li class="step">
|
|
<p>Click the <span class="guilabel">Search</span>
|
|
button or hit the <span class=
|
|
"keycap"><strong>Enter</strong></span> key to start
|
|
the search.</p>
|
|
</li>
|
|
</ol>
|
|
</div>
|
|
|
|
<p>The initial default search mode is <span class=
|
|
"guilabel">Query language</span>. Without special
|
|
directives, this will look for documents containing all
|
|
of the search terms (the ones with more terms will get
|
|
better scores), just like the <span class="guilabel">All
|
|
terms</span> mode which will ignore such directives.
|
|
<span class="guilabel">Any term</span> will search for
|
|
documents where at least one of the terms appears.</p>
|
|
|
|
<p>The <span class="guilabel">Query Language</span>
|
|
features are described in <a class="link" href=
|
|
"#RCL.SEARCH.LANG" title="3.6. The query language">a
|
|
separate section</a>.</p>
|
|
|
|
<p>All search modes allow wildcards inside terms
|
|
(<code class="literal">*</code>, <code class=
|
|
"literal">?</code>, <code class="literal">[]</code>). You
|
|
may want to have a look at the <a class="link" href=
|
|
"#RCL.SEARCH.WILDCARDS" title=
|
|
"3.8.1. More about wildcards">section about
|
|
wildcards</a> for more information about this.</p>
|
|
|
|
<p><span class="guilabel">File name</span> will
|
|
specifically look for file names. The point of having a
|
|
separate file name search is that wild card expansion can
|
|
be performed more efficiently on a small subset of the
|
|
index (allowing wild cards on the left of terms without
|
|
excessive penalty). Things to know:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>White space in the entry should match white
|
|
space in the file name, and is not treated
|
|
specially.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The search is insensitive to character case and
|
|
accents, independently of the type of index.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>An entry without any wild card character and not
|
|
capitalized will be prepended and appended with '*'
|
|
(ie: <em class="replaceable"><code>etc</code></em>
|
|
-> <em class=
|
|
"replaceable"><code>*etc*</code></em>, but
|
|
<em class="replaceable"><code>Etc</code></em> ->
|
|
<em class="replaceable"><code>etc</code></em>).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>If you have a big index (many files),
|
|
excessively generic fragments may result in
|
|
inefficient searches.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>You can search for exact phrases (adjacent words in a
|
|
given order) by enclosing the input inside double quotes.
|
|
Ex: <code class="literal">"virtual reality"</code>.</p>
|
|
|
|
<p>When using a stripped index, character case has no
|
|
influence on search, except that you can disable stem
|
|
expansion for any term by capitalizing it. Ie: a search
|
|
for <code class="literal">floor</code> will also normally
|
|
look for <code class="literal">flooring</code>,
|
|
<code class="literal">floored</code>, etc., but a search
|
|
for <code class="literal">Floor</code> will only look for
|
|
<code class="literal">floor</code>, in any character
|
|
case. Stemming can also be disabled globally in the
|
|
preferences. When using a raw index, <a class="link"
|
|
href="#RCL.SEARCH.CASEDIAC" title=
|
|
"3.7. Search case and diacritics sensitivity">the
|
|
rules are a bit more complicated</a>.</p>
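<p>To summarize the rules above for a stripped index, here are a few
sample entries and what they would match, using only the examples
already given in this section:</p>
<pre class="screen">
floor               floor, flooring, floored, etc. (stem expansion)
Floor               floor only, in any character case
"virtual reality"   the two words, adjacent and in this order
</pre>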
|
|
|
|
<p><span class="application">Recoll</span> remembers the
|
|
last few searches that you performed. You can use the
|
|
simple search text entry widget (a combobox) to recall
|
|
them (click on the thing at the right of the text field).
|
|
Please note, however, that only the search texts are
|
|
remembered, not the mode (all/any/file name).</p>
|
|
|
|
<p>Typing <span class=
|
|
"keycap"><strong>Esc</strong></span> <span class=
|
|
"keycap"><strong>Space</strong></span> while entering a
|
|
word in the simple search entry will open a window with
|
|
possible completions for the word. The completions are
|
|
extracted from the database.</p>
|
|
|
|
<p>Double-clicking on a word in the result list or a
|
|
preview window will insert it into the simple search
|
|
entry field.</p>
|
|
|
|
<p>You can cut and paste any text into an <span class=
|
|
"guilabel">All terms</span> or <span class="guilabel">Any
|
|
term</span> search field, punctuation, newlines and all -
|
|
except for wildcard characters (single <code class=
|
|
"literal">?</code> characters are ok). <span class=
|
|
"application">Recoll</span> will process it and produce a
|
|
meaningful search. This is what most differentiates this
|
|
mode from the <span class="guilabel">Query
|
|
Language</span> mode, where you have to care about the
|
|
syntax.</p>
|
|
|
|
<p>You can use the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.COMPLEX" title=
|
|
"3.1.8. Complex/advanced search"><span class=
|
|
"guimenu">Tools</span> → <span class=
|
|
"guimenuitem">Advanced search</span></a> dialog for more
|
|
complex searches.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.RESLIST"
|
|
id="RCL.SEARCH.GUI.RESLIST"></a>3.1.2. The
|
|
default result list</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>After starting a search, a list of results will
|
|
instantly be displayed in the main list window.</p>
|
|
|
|
<p>By default, the document list is presented in order of
|
|
relevance (how well the system estimates that the
|
|
document matches the query). You can sort the result by
|
|
ascending or descending date by using the vertical arrows
|
|
in the toolbar.</p>
|
|
|
|
<p>Clicking on the <code class="literal">Preview</code>
|
|
link for an entry will open an internal preview window
|
|
for the document. Further <code class=
|
|
"literal">Preview</code> clicks for the same search will
|
|
open tabs in the existing preview window. You can use
|
|
<span class="keycap"><strong>Shift</strong></span>+Click
|
|
to force the creation of another preview window, which
|
|
may be useful to view the documents side by side. (You
|
|
can also browse successive results in a single preview
|
|
window by typing <span class=
|
|
"keycap"><strong>Shift</strong></span>+<span class=
|
|
"keycap"><strong>ArrowUp/Down</strong></span> in the
|
|
window).</p>
|
|
|
|
<p>Clicking the <code class="literal">Open</code> link
|
|
will start an external viewer for the document. By
|
|
default, <span class="application">Recoll</span> lets the
|
|
desktop choose the appropriate application for most
|
|
document types (there is a short list of exceptions, see
|
|
further). If you prefer to completely customize the
|
|
choice of applications, you can uncheck the <span class=
|
|
"guilabel">Use desktop preferences</span> option in the
|
|
GUI preferences dialog, and click the <span class=
|
|
"guilabel">Choose editor applications</span> button to
|
|
adjust the predefined <span class=
|
|
"application">Recoll</span> choices. The tool accepts
|
|
multiple selections of MIME types (e.g. to set up the
|
|
editor for the dozens of office file types).</p>
|
|
|
|
<p>Even when <span class="guilabel">Use desktop
|
|
preferences</span> is checked, there is a small list of
|
|
exceptions, for MIME types where the <span class=
|
|
"application">Recoll</span> choice should override the
|
|
desktop one. These are applications which are well
|
|
integrated with <span class="application">Recoll</span>,
|
|
especially <span class="application">evince</span> for
|
|
viewing PDF and Postscript files because of its support
|
|
for opening the document at a specific page and passing a
|
|
search string as an argument. Of course, you can edit the
|
|
list (in the GUI preferences) if you would prefer to lose
|
|
the functionality and use the standard desktop tool.</p>
|
|
|
|
<p>You may also change the choice of applications by
|
|
editing the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.MIMEVIEW" title=
|
|
"5.4.6. The mimeview file"><code class=
|
|
"filename">mimeview</code></a> configuration file if you
|
|
find this more convenient.</p>
|
|
|
|
<p>Each result entry also has a right-click menu with an
|
|
<span class="guilabel">Open With</span> entry. This lets
|
|
you choose an application from the list of those which
|
|
registered with the desktop for the document MIME
|
|
type.</p>
|
|
|
|
<p>The <code class="literal">Preview</code> and
|
|
<code class="literal">Open</code> edit links may not be
|
|
present for all entries, meaning that <span class=
|
|
"application">Recoll</span> has no configured way to
|
|
preview a given file type (which was indexed by name
|
|
only), or no configured external editor for the file
|
|
type. This can sometimes be adjusted simply by tweaking
|
|
the <a class="link" href="#RCL.INSTALL.CONFIG.MIMEMAP"
|
|
title="5.4.4. The mimemap file"><code class=
|
|
"filename">mimemap</code></a> and <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.MIMEVIEW" title=
|
|
"5.4.6. The mimeview file"><code class=
|
|
"filename">mimeview</code></a> configuration files (the
|
|
latter can be modified with the user preferences
|
|
dialog).</p>
|
|
|
|
<p>The format of the result list entries is entirely
|
|
configurable by using the preference dialog to <a class=
|
|
"link" href="#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"3.1.15.1. The result list format">edit an HTML
|
|
fragment</a>.</p>
|
|
|
|
<p>You can click on the <code class="literal">Query
|
|
details</code> link at the top of the results page to see
|
|
the query actually performed, after stem expansion and
|
|
other processing.</p>
|
|
|
|
<p>Double-clicking on any word inside the result list or
|
|
a preview window will insert it into the simple search
|
|
text.</p>
|
|
|
|
<p>The result list is divided into pages (the size of
|
|
which you can change in the preferences). Use the arrow
|
|
buttons in the toolbar or the links at the bottom of the
|
|
page to browse the results.</p>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RESLIST.SUGGS" id=
|
|
"RCL.SEARCH.GUI.RESLIST.SUGGS"></a>3.1.2.1. No
|
|
results: the spelling suggestions</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>When a search yields no result, and if the
|
|
<span class="application">aspell</span> dictionary is
|
|
configured, <span class="application">Recoll</span>
|
|
will try to check for misspellings among the query
|
|
terms, and will propose lists of replacements. Clicking
|
|
on one of the suggestions will replace the word and
|
|
restart the search. You can hold any of the modifier
|
|
keys (Ctrl, Shift, etc.) while clicking if you would
|
|
rather stay on the suggestion screen because several
|
|
terms need replacement.</p>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RESULTLIST.MENU" id=
|
|
"RCL.SEARCH.GUI.RESULTLIST.MENU"></a>3.1.2.2. The
|
|
result list right-click menu</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Apart from the preview and edit links, you can
|
|
display a pop-up menu by right-clicking over a
|
|
paragraph in the result list. This menu has the
|
|
following entries:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Preview</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open With</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Run Script</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Copy File
|
|
Name</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Copy Url</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Save to File</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Find similar</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Preview Parent
|
|
document</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open Parent
|
|
document</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open Snippets
|
|
Window</span></p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>The <span class="guilabel">Preview</span> and
|
|
<span class="guilabel">Open</span> entries do the same
|
|
thing as the corresponding links.</p>
|
|
|
|
<p><span class="guilabel">Open With</span> lets you
|
|
open the document with one of the applications claiming
|
|
to be able to handle its MIME type (the information
|
|
comes from the <code class="literal">.desktop</code>
|
|
files in <code class=
|
|
"filename">/usr/share/applications</code>).</p>
|
|
|
|
<p><span class="guilabel">Run Script</span> allows
|
|
starting an arbitrary command on the result file. It
|
|
will only appear for results which are top-level files.
|
|
See <a class="link" href="#RCL.SEARCH.GUI.RUNSCRIPT"
|
|
title=
|
|
"3.1.4. Running arbitrary commands on result files (1.20 and later)">
|
|
further</a> for a more detailed description.</p>
|
|
|
|
<p>The <span class="guilabel">Copy File Name</span> and
|
|
<span class="guilabel">Copy Url</span> copy the
|
|
relevant data to the clipboard, for later pasting.</p>
|
|
|
|
<p><span class="guilabel">Save to File</span> allows
|
|
saving the contents of a result document to a chosen
|
|
file. This entry will only appear if the document does
|
|
not correspond to an existing file, but is a
|
|
subdocument inside such a file (e.g. an email
|
|
attachment). It is especially useful to extract
|
|
attachments with no associated editor.</p>
|
|
|
|
<p>The <span class="guilabel">Open/Preview Parent
|
|
document</span> entries allow working with the higher
|
|
level document (e.g. the email message an attachment
|
|
comes from). <span class="application">Recoll</span> is
|
|
sometimes not totally accurate as to what it can or
|
|
can't do in this area. For example the <span class=
|
|
"guilabel">Parent</span> entry will also appear for an
|
|
email which is part of an mbox folder file, but you
|
|
can't actually visualize the mbox (there will be an
|
|
error dialog if you try).</p>
|
|
|
|
<p>If the document is a top-level file, <span class=
|
|
"guilabel">Open Parent</span> will start the default
|
|
file manager on the enclosing filesystem directory.</p>
|
|
|
|
<p>The <span class="guilabel">Find similar</span> entry
|
|
will select a number of relevant terms from the current
|
|
document and enter them into the simple search field.
|
|
You can then start a simple search, with a good chance
|
|
of finding documents related to the current result. I
|
|
can't remember a single instance where this function
|
|
was actually useful to me...</p>
|
|
|
|
<p><a name="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS"
|
|
id="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS"></a>The
|
|
<span class="guilabel">Open Snippets Window</span>
|
|
entry will only appear for documents which support page
|
|
breaks (typically PDF, Postscript, DVI). The snippets
|
|
window lists extracts from the document, taken around
|
|
search term occurrences, along with the corresponding
|
|
page number, as links which can be used to start the
|
|
native viewer on the appropriate page. If the viewer
|
|
supports it, its search function will also be primed
|
|
with one of the search terms.</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.RESTABLE"
|
|
id="RCL.SEARCH.GUI.RESTABLE"></a>3.1.3. The
|
|
result table</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>In <span class="application">Recoll</span> 1.15 and
|
|
newer, the results can be displayed in spreadsheet-like
|
|
fashion. You can switch to this presentation by clicking
|
|
the table-like icon in the toolbar (this is a toggle,
|
|
click again to restore the list).</p>
|
|
|
|
<p>Clicking on the column headers will allow sorting by
|
|
the values in the column. You can click again to invert
|
|
the order, and use the header right-click menu to reset
|
|
sorting to the default relevance order (you can also use
|
|
the sort-by-date arrows to do this).</p>
|
|
|
|
<p>Both the list and the table display the same
|
|
underlying results. The sort order set from the table is
|
|
still active if you switch back to the list mode. You can
|
|
click twice on a date sort arrow to reset it from
|
|
there.</p>
|
|
|
|
<p>The header right-click menu allows adding or deleting
|
|
columns. The columns can be resized, and their order can
|
|
be changed (by dragging). All the changes are recorded
|
|
when you quit <span class=
"command"><strong>recoll</strong></span>.</p>
|
|
|
|
<p>Hovering over a table row will update the detail area
|
|
at the bottom of the window with the corresponding
|
|
values. You can click the row to freeze the display. The
|
|
bottom area is equivalent to a result list paragraph,
|
|
with links for starting a preview or a native
|
|
application, and an equivalent right-click menu. Typing
|
|
<span class="keycap"><strong>Esc</strong></span> (the
|
|
Escape key) will unfreeze the display.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RUNSCRIPT" id=
|
|
"RCL.SEARCH.GUI.RUNSCRIPT"></a>3.1.4. Running
|
|
arbitrary commands on result files (1.20 and
|
|
later)</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Apart from the <span class="guilabel">Open</span> and
|
|
<span class="guilabel">Open With</span> operations, which
|
|
allow starting an application on a result document (or a
|
|
temporary copy), based on its MIME type, it is also
|
|
possible to run arbitrary commands on results which are
|
|
top-level files, using the <span class="guilabel">Run
|
|
Script</span> entry in the results pop-up menu.</p>
|
|
|
|
<p>The commands which will appear in the <span class=
|
|
"guilabel">Run Script</span> submenu must be defined by
|
|
<code class="literal">.desktop</code> files inside the
|
|
<code class="filename">scripts</code> subdirectory of the
|
|
current configuration directory.</p>
|
|
|
|
<p>Here follows an example of a <code class=
"literal">.desktop</code> file, which could be named, for
|
|
example, <code class=
|
|
"filename">~/.recoll/scripts/myscript.desktop</code> (the
|
|
exact file name inside the directory is irrelevant):</p>
|
|
<pre class="programlisting">
|
|
[Desktop Entry]
|
|
Type=Application
|
|
Name=MyFirstScript
|
|
Exec=/home/me/bin/tryscript %F
|
|
MimeType=*/*
|
|
|
|
</pre>
|
|
|
|
<p>The <code class="literal">Name</code> attribute
|
|
defines the label which will appear inside the
|
|
<span class="guilabel">Run Script</span> menu. The
|
|
<code class="literal">Exec</code> attribute defines the
|
|
program to be run, which does not need to actually be a
|
|
script, of course. The <code class=
|
|
"literal">MimeType</code> attribute is not used, but
|
|
needs to exist.</p>
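<p>As a rough illustration, the following Python sketch writes such a
<code class="literal">.desktop</code> file into the
<code class="filename">scripts</code> subdirectory of a configuration
directory. The function name
<code class="literal">write_run_script_entry</code> and its parameters are
hypothetical helpers, not part of
<span class="application">Recoll</span>; only the file layout comes from the
example above.</p>

```python
from pathlib import Path


def write_run_script_entry(config_dir, file_stem, label, command):
    """Write a .desktop file into the 'scripts' subdirectory of config_dir.

    Hypothetical helper: only the file layout mirrors the manual's example.
    """
    scripts_dir = Path(config_dir) / "scripts"
    scripts_dir.mkdir(parents=True, exist_ok=True)
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        f"Name={label}",            # label shown in the Run Script menu
        f"Exec={command} %F",       # program to run on the result file
        "MimeType=*/*",             # unused, but has to be present
    ]
    target = scripts_dir / f"{file_stem}.desktop"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

<p>Calling it with a configuration directory such as
<code class="filename">~/.recoll</code> (expanded to a full path), the stem
<code class="literal">myscript</code>, the label
<code class="literal">MyFirstScript</code> and the command
<code class="filename">/home/me/bin/tryscript</code> would reproduce the
example file shown above.</p>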
|
|
|
|
<p>The commands defined this way can also be used from
|
|
links inside the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA" title=
|
|
"The paragraph format">result paragraph</a>.</p>
|
|
|
|
<p>As an example, it might make sense to write a script
|
|
which would move the document to the trash and purge it
|
|
from the <span class="application">Recoll</span>
|
|
index.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.THUMBNAILS" id=
|
|
"RCL.SEARCH.GUI.THUMBNAILS"></a>3.1.5. Displaying
|
|
thumbnails</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The default format for the result list entries and the
|
|
detail area of the result table displays an icon for each
|
|
result document. The icon is either a generic one
|
|
determined from the MIME type, or a thumbnail of the
|
|
document appearance. Thumbnails are only displayed if
|
|
found in the standard <span class=
|
|
"application">freedesktop</span> location, where they
|
|
would typically have been created by a file manager.</p>
|
|
|
|
<p>Recoll has no capability to create thumbnails. A
|
|
relatively simple trick is to use the <span class=
|
|
"guilabel">Open parent document/folder</span> entry in
|
|
the result list popup menu. This should open a file
|
|
manager window on the containing directory, which should
|
|
in turn create the thumbnails (depending on your
|
|
settings). Restarting the search should then display the
|
|
thumbnails.</p>
|
|
|
|
<p>There are also <a class="ulink" href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails.wiki"
|
|
target="_top">some pointers about thumbnail
|
|
generation</a> on the <span class=
|
|
"application">Recoll</span> wiki.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.PREVIEW"
|
|
id="RCL.SEARCH.GUI.PREVIEW"></a>3.1.6. The
|
|
preview window</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The preview window opens when you first click a
|
|
<code class="literal">Preview</code> link inside the
|
|
result list.</p>
|
|
|
|
<p>Subsequent preview requests for a given search open
|
|
new tabs in the existing window (except if you hold the
|
|
<span class="keycap"><strong>Shift</strong></span> key
|
|
while clicking which will open a new window for side by
|
|
side viewing).</p>
|
|
|
|
<p>Starting another search and requesting a preview will
|
|
create a new preview window. The old one stays open until
|
|
you close it.</p>
|
|
|
|
<p>You can close a preview tab by typing <span class=
|
|
"keycap"><strong>Ctrl-W</strong></span> (<span class=
|
|
"keycap"><strong>Ctrl</strong></span> + <span class=
|
|
"keycap"><strong>W</strong></span>) in the window.
|
|
Closing the last tab for a window will also close the
|
|
window.</p>
|
|
|
|
<p>Of course you can also close a preview window by using
|
|
the window manager button at the top of the frame.</p>
|
|
|
|
<p>You can display successive or previous documents from
|
|
the result list inside a preview tab by typing
|
|
<span class=
|
|
"keycap"><strong>Shift</strong></span>+<span class=
|
|
"keycap"><strong>Down</strong></span> or <span class=
|
|
"keycap"><strong>Shift</strong></span>+<span class=
|
|
"keycap"><strong>Up</strong></span> (<span class=
|
|
"keycap"><strong>Down</strong></span> and <span class=
|
|
"keycap"><strong>Up</strong></span> are the arrow
|
|
keys).</p>
|
|
|
|
<p>A right-click menu in the text area allows switching
|
|
between displaying the main text or the contents of
|
|
fields associated with the document (e.g. author, abstract,
|
|
etc.). This is especially useful in cases where the term
|
|
match did not occur in the main text but in one of the
|
|
fields. In the case of images, you can switch between
|
|
three displays: the image itself, the image metadata as
|
|
extracted by <span class=
|
|
"command"><strong>exiftool</strong></span> and the
|
|
fields, which is the metadata stored in the index.</p>
|
|
|
|
<p>You can print the current preview window contents by
|
|
typing <span class=
|
|
"keycap"><strong>Ctrl-P</strong></span> (<span class=
|
|
"keycap"><strong>Ctrl</strong></span> + <span class=
|
|
"keycap"><strong>P</strong></span>) in the window
|
|
text.</p>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.PREVIEW.SEARCH" id=
|
|
"RCL.SEARCH.GUI.PREVIEW.SEARCH"></a>3.1.6.1. Searching
|
|
inside the preview</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The preview window has an internal search
|
|
capability, mostly controlled by the panel at the
|
|
bottom of the window, which works in two modes: as a
|
|
classical editor incremental search, where we look for
|
|
the text entered in the entry zone, or as a way to walk
|
|
the matches between the document and the <span class=
|
|
"application">Recoll</span> query that found it.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Incremental text
|
|
search</span></dt>
|
|
|
|
<dd>
|
|
<p>The preview tabs have an internal incremental
|
|
search function. You initiate the search either
|
|
by typing a <span class=
|
|
"keycap"><strong>/</strong></span> (slash) or
|
|
<span class=
"keycap"><strong>Ctrl-F</strong></span> inside the
|
|
text area or by clicking into the <span class=
|
|
"guilabel">Search for:</span> text field and
|
|
entering the search string. You can then use the
|
|
<span class="guilabel">Next</span> and
|
|
<span class="guilabel">Previous</span> buttons to
|
|
find the next/previous occurrence. You can also
|
|
type <span class=
|
|
"keycap"><strong>F3</strong></span> inside the
|
|
text area to get to the next occurrence.</p>
|
|
|
|
<p>If you have a search string entered and you
|
|
use Shift-Up/Shift-Down to browse the results, the
|
|
search is initiated for each successive document.
|
|
If the string is found, the cursor will be
|
|
positioned at the first occurrence of the search
|
|
string.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">Walking the match
|
|
lists</span></dt>
|
|
|
|
<dd>
|
|
<p>If the entry area is empty when you click the
|
|
<span class="guilabel">Next</span> or
|
|
<span class="guilabel">Previous</span> buttons,
|
|
the editor will be scrolled to show the next
|
|
match to any search term (the next highlighted
|
|
zone). If you select a search group from the
|
|
dropdown list and click <span class=
|
|
"guilabel">Next</span> or <span class=
|
|
"guilabel">Previous</span>, the match list for
|
|
this group will be walked. This is not the same
|
|
as a text search, because the occurrences will
|
|
include non-exact matches (as caused by stemming
|
|
or wildcards). The search will revert to the text
|
|
mode as soon as you edit the entry area.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.FRAGBUTS"
|
|
id="RCL.SEARCH.GUI.FRAGBUTS"></a>3.1.7. The
|
|
Query Fragments window</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Selecting the <span class="guimenu">Tools</span>
|
|
→ <span class="guimenuitem">Query Fragments</span>
|
|
menu entry will open a window with radio- and
|
|
check-buttons which can be used to activate query
|
|
language fragments for filtering the current query. This
|
|
can be useful if you have frequent reusable selectors,
|
|
for example, filtering on alternate directories, or
|
|
searching just one category of files, not covered by the
|
|
standard category selectors.</p>
|
|
|
|
<p>The contents of the window are entirely customizable,
|
|
and defined by the contents of the <code class=
|
|
"filename">fragbuts.xml</code> file inside the
|
|
configuration directory. The sample file distributed with
|
|
<span class="application">Recoll</span> (which you should
|
|
be able to find under <code class=
|
|
"filename">/usr/share/recoll/examples/fragbuts.xml</code>),
|
|
contains an example which filters the results from the
|
|
Web history.</p>
|
|
|
|
<p>Here follows an example:</p>
|
|
<pre class="programlisting">
|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
|
<fragbuts version="1.0">
|
|
|
|
<radiobuttons>
|
|
|
|
<fragbut>
|
|
<label>Include Web Results</label>
|
|
<frag></frag>
|
|
</fragbut>
|
|
|
|
<fragbut>
|
|
<label>Exclude Web Results</label>
|
|
<frag>-rclbes:BGL</frag>
|
|
</fragbut>
|
|
|
|
<fragbut>
|
|
<label>Only Web Results</label>
|
|
<frag>rclbes:BGL</frag>
|
|
</fragbut>
|
|
|
|
</radiobuttons>
|
|
|
|
<buttons>
|
|
|
|
<fragbut>
|
|
<label>Year 2010</label>
|
|
<frag>date:2010-01-01/2010-12-31</frag>
|
|
</fragbut>
|
|
|
|
<fragbut>
|
|
<label>My Great Directory Only</label>
|
|
<frag>dir:/my/great/directory</frag>
|
|
</fragbut>
|
|
|
|
</buttons>
|
|
</fragbuts>
|
|
</pre>
|
|
|
|
<p>Each <code class="literal">radiobuttons</code> or
|
|
<code class="literal">buttons</code> section defines a
|
|
line of radiobuttons or checkbuttons inside the window.
|
|
Any number of buttons can be selected, but the
|
|
radiobuttons in a line are exclusive.</p>
|
|
|
|
<p>Each <code class="literal">fragbut</code> section
|
|
defines the label for a button, and the Query Language
|
|
fragment which will be added (as an AND filter) before
|
|
performing the query if the button is active.</p>
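<p>If you maintain many fragments, the file can also be generated rather
than edited by hand. The sketch below builds the same element structure as
the example above with Python's standard
<code class="literal">xml.etree.ElementTree</code> module. The function
name <code class="literal">build_fragbuts</code> and its argument layout
(lists of groups, each group a list of label/fragment pairs) are
assumptions made for this illustration only.</p>

```python
import xml.etree.ElementTree as ET


def build_fragbuts(radio_groups, check_groups):
    """Return fragbuts XML text for the given button groups.

    Each group is a list of (label, fragment) pairs and becomes one
    'radiobuttons' or 'buttons' section, i.e. one line in the window.
    """
    root = ET.Element("fragbuts", {"version": "1.0"})
    sections = (("radiobuttons", radio_groups), ("buttons", check_groups))
    for tag, groups in sections:
        for group in groups:
            section = ET.SubElement(root, tag)
            for label, frag in group:
                fragbut = ET.SubElement(section, "fragbut")
                ET.SubElement(fragbut, "label").text = label
                # The query language fragment ANDed with the query
                ET.SubElement(fragbut, "frag").text = frag
    return ET.tostring(root, encoding="unicode")
```

<p>The returned text has no XML declaration; to write a complete file,
<code class="literal">ET.ElementTree(root).write(path, encoding="UTF-8",
xml_declaration=True)</code> could be used on the root element instead.</p>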
|
|
|
|
<p>This feature is new in <span class=
|
|
"application">Recoll</span> 1.20, and will probably be
|
|
refined depending on user feedback.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.COMPLEX"
|
|
id=
|
|
"RCL.SEARCH.GUI.COMPLEX"></a>3.1.8. Complex/advanced
|
|
search</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The advanced search dialog helps you build more
|
|
complex queries without memorizing the search language
|
|
constructs. It can be opened through the <span class=
|
|
"guilabel">Tools</span> menu or through the main
|
|
toolbar.</p>
|
|
|
|
<p><span class="application">Recoll</span> keeps a
|
|
history of searches. See <a class="link" href=
"#RCL.SEARCH.GUI.COMPLEX.HISTORY" title=
"3.1.8.3. Advanced search history">Advanced search
|
|
history</a>.</p>
|
|
|
|
<p>The dialog has two tabs:</p>
|
|
|
|
<div class="orderedlist">
|
|
<ol class="orderedlist" type="1">
|
|
<li class="listitem">
|
|
<p>The first tab lets you specify terms to search
|
|
for, and permits specifying multiple clauses which
|
|
are combined to build the search.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The second tab lets you filter the results according
|
|
to file size, date of modification, MIME type, or
|
|
location.</p>
|
|
</li>
|
|
</ol>
|
|
</div>
|
|
|
|
<p>Click on the <span class="guilabel">Start
|
|
Search</span> button in the advanced search dialog, or
|
|
type <span class="keycap"><strong>Enter</strong></span>
|
|
in any text field to start the search. The button in the
|
|
main window always performs a simple search.</p>
|
|
|
|
<p>Click on the <code class="literal">Show query
|
|
details</code> link at the top of the result page to see
|
|
the query expansion.</p>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
"RCL.SEARCH.GUI.COMPLEX.TERMS" id=
"RCL.SEARCH.GUI.COMPLEX.TERMS"></a>3.1.8.1. Advanced
|
|
search: the "find" tab</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>This part of the dialog lets you construct a query
|
|
by combining multiple clauses of different types. Each
|
|
entry field is configurable for the following
|
|
modes:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>All terms.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Any term.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>None of the terms.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Phrase (exact terms in order within an
|
|
adjustable window).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Proximity (terms in any order within an
|
|
adjustable window).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Filename search.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Additional entry fields can be created by clicking
|
|
the <span class="guilabel">Add clause</span>
|
|
button.</p>
|
|
|
|
<p>When searching, the non-empty clauses will be
|
|
combined either with an AND or an OR conjunction,
|
|
depending on the choice made on the left (<span class=
|
|
"guilabel">All clauses</span> or <span class=
|
|
"guilabel">Any clause</span>).</p>
|
|
|
|
<p>Entries of all types except "Phrase" and "Proximity"
|
|
accept a mix of single words and phrases enclosed in
|
|
double quotes. Stemming and wildcard expansion will be
|
|
performed as for simple search.</p>
|
|
|
|
<p><b>Phrases and Proximity searches. </b>These
|
|
two clauses work in similar ways, with the difference
|
|
that proximity searches do not impose an order on the
|
|
words. In both cases, an adjustable number (slack) of
|
|
non-matched words may be accepted between the searched
|
|
ones (use the counter on the left to adjust this
|
|
count). For phrases, the default count is zero (exact
|
|
match). For proximity it is ten (meaning that two
|
|
search terms would be matched if found within a window
|
|
of twelve words). Examples: a phrase search for
|
|
<code class="literal">quick fox</code> with a slack of
|
|
0 will match <code class="literal">quick fox</code> but
|
|
not <code class="literal">quick brown fox</code>. With
|
|
a slack of 1 it will match the latter, but not
|
|
<code class="literal">fox quick</code>. A proximity
|
|
search for <code class="literal">quick fox</code> with
|
|
the default slack will match the latter, and also
|
|
<code class="literal">a fox is a cunning and quick
|
|
animal</code>.</p>
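<p>The slack rule can be made concrete with a small Python sketch. It is
not how <span class="application">Recoll</span> or
<span class="application">Xapian</span> evaluate these clauses; it only
mirrors the description above, assuming case-insensitive,
whitespace-separated words and a window of (number of terms + slack)
words. The function name <code class="literal">matches_clause</code> is
hypothetical.</p>

```python
from itertools import product


def matches_clause(text, terms, slack, ordered):
    """Return True if all terms occur within a window of len(terms) + slack words.

    ordered=True models a phrase clause (terms must appear in order),
    ordered=False models a proximity clause (any order).
    """
    if not terms:
        return False
    words = text.lower().split()
    window = len(terms) + slack
    # Candidate positions for every term; an empty list means no match at all
    candidates = [
        [i for i, word in enumerate(words) if word == term.lower()]
        for term in terms
    ]
    for combo in product(*candidates):
        if len(set(combo)) != len(combo):
            continue  # one word position cannot serve two terms
        if ordered and not all(b > a for a, b in zip(combo, combo[1:])):
            continue  # phrase clauses keep the term order
        span = max(combo) - min(combo) + 1
        if window >= span:
            return True
    return False
```

<p>With this model, a phrase clause for
<code class="literal">quick fox</code> and a slack of 0 rejects
<code class="literal">quick brown fox</code>, while a slack of 1 accepts
it, and a proximity clause with a slack of 10 accepts
<code class="literal">a fox is a cunning and quick animal</code>.</p>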
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
"RCL.SEARCH.GUI.COMPLEX.FILTER" id=
"RCL.SEARCH.GUI.COMPLEX.FILTER"></a>3.1.8.2. Advanced
|
|
search: the "filter" tab</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>This part of the dialog has several sections which
|
|
allow filtering the results of a search according to a
|
|
number of criteria:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The first section allows filtering by dates of
|
|
last modification. You can specify both a minimum
|
|
and a maximum date. The initial values are set
|
|
according to the oldest and newest documents
|
|
found in the index.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The next section allows filtering the results
|
|
by file size. There are two entries for minimum
|
|
and maximum size. Enter decimal numbers. You can
|
|
use suffix multipliers: <code class=
|
|
"literal">k/K</code>, <code class=
|
|
"literal">m/M</code>, <code class=
|
|
"literal">g/G</code>, <code class=
|
|
"literal">t/T</code> for 1E3, 1E6, 1E9, 1E12
|
|
respectively.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The next section allows filtering the results
|
|
by their MIME types, or MIME categories (e.g.
|
|
media/text/message/etc.).</p>
|
|
|
|
<p>You can transfer the types between two boxes,
|
|
to define which will be included or excluded by
|
|
the search.</p>
|
|
|
|
<p>The state of the file type selection can be
|
|
saved as the default (the file type filter will
|
|
not be activated at program start-up, but the
|
|
lists will be in the restored state).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The bottom section allows restricting the
|
|
search results to a sub-tree of the indexed area.
|
|
You can use the <span class=
|
|
"guilabel">Invert</span> checkbox to search for
|
|
files not in the sub-tree instead. If you use
|
|
directory filtering often and on big subsets of
|
|
the file system, you may think of setting up
|
|
multiple indexes instead, as the performance may
|
|
be better.</p>
|
|
|
|
<p>You can use relative/partial paths for
|
|
filtering. For example, entering <code class=
|
|
"literal">dirA/dirB</code> would match either
|
|
<code class=
|
|
"filename">/dir1/dirA/dirB/myfile1</code> or
|
|
<code class=
|
|
"filename">/dir2/dirA/dirB/someother/myfile2</code>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
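<p>The size suffixes described in the list above can be read as plain
decimal multipliers. A minimal Python sketch of that conversion follows;
<code class="literal">parse_size</code> is a hypothetical name, and the
sketch assumes whole numbers with at most one trailing suffix letter.</p>

```python
# Decimal multipliers, as listed for the size filter: k/K, m/M, g/G, t/T
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a size entry such as '10k' or '2M' to a number of bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].lower()
    if suffix in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[suffix]
    return int(text)
```

<p>Under these assumptions, an entry of
<code class="literal">10k</code> stands for 10000 bytes and
<code class="literal">2M</code> for 2000000 bytes.</p>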
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
"RCL.SEARCH.GUI.COMPLEX.HISTORY" id=
"RCL.SEARCH.GUI.COMPLEX.HISTORY"></a>3.1.8.3. Advanced
|
|
search history</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The advanced search tool memorizes the last 100
|
|
searches performed. You can walk the saved searches by
|
|
using the up and down arrow keys while the keyboard
|
|
focus belongs to the advanced search dialog.</p>
|
|
|
|
<p>The complex search history can be erased, along with
|
|
the one for simple search, by selecting the
|
|
<span class="guimenu">File</span> → <span class=
|
|
"guimenuitem">Erase Search History</span> menu
|
|
entry.</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TERMEXPLORER" id=
|
|
"RCL.SEARCH.GUI.TERMEXPLORER"></a>3.1.9. The
|
|
term explorer tool</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> automatically
|
|
manages the expansion of search terms to their
|
|
derivatives (e.g. plural/singular, verb inflections). But
|
|
there are other cases where the exact search term is not
|
|
known. For example, you may not remember the exact
|
|
spelling, or only know the beginning of the name.</p>
|
|
|
|
<p>The search will only propose replacement terms with
|
|
spelling variations when no matching documents were found.
|
|
In some cases, both proper spellings and misspellings are
|
|
present in the index, and it may be interesting to look
|
|
for them explicitly.</p>
|
|
|
|
<p>The term explorer tool (started from the toolbar icon
|
|
or from the <span class="guilabel">Term explorer</span>
|
|
entry of the <span class="guilabel">Tools</span> menu)
|
|
can be used to search the full index terms list. It has
|
|
four modes of operation:</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Wildcard</span></dt>
|
|
|
|
<dd>
|
|
<p>In this mode of operation, you can enter a
|
|
search string with shell-like wildcards (*, ?, []).
|
|
For example, <em class="replaceable"><code>xapi*</code></em>
|
|
would display all index terms beginning with
|
|
<em class="replaceable"><code>xapi</code></em>.
|
|
(More about wildcards <a class="link" href=
|
|
"#RCL.SEARCH.WILDCARDS" title=
|
|
"3.8.1. More about wildcards">here</a>).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">Regular expression</span></dt>
|
|
|
|
<dd>
|
|
<p>This mode will accept a regular expression as
|
|
input. Example: <em class=
|
|
"replaceable"><code>word[0-9]+</code></em>. The
|
|
expression is implicitly anchored at the
|
|
beginning. For example, <em class=
|
|
"replaceable"><code>press</code></em> will match
|
|
<em class="replaceable"><code>pression</code></em>
|
|
but not <em class=
|
|
"replaceable"><code>expression</code></em>. You can
|
|
use <em class=
|
|
"replaceable"><code>.*press</code></em> to match
|
|
the latter, but be aware that this will cause a
|
|
full index term list scan, which can be quite
|
|
long.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">Stem expansion</span></dt>
|
|
|
|
<dd>
|
|
<p>This mode will perform the usual stem expansion
|
|
normally done as part of user input processing. As
|
|
such it is probably mostly useful to demonstrate
|
|
the process.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">Spelling/Phonetic</span></dt>
|
|
|
|
<dd>
|
|
<p>In this mode, you enter the term as you think it
|
|
is spelled, and <span class=
|
|
"application">Recoll</span> will do its best to
|
|
find index terms that sound like your entry. This
|
|
mode uses the <span class=
|
|
"application">Aspell</span> spelling application,
|
|
which must be installed on your system for things
|
|
to work (if your documents contain non-ascii
|
|
characters, <span class="application">Recoll</span>
|
|
needs an aspell version newer than 0.60 for UTF-8
|
|
support). The language which is used to build the
|
|
dictionary out of the index terms (which is done at
|
|
the end of an indexing pass) is the one defined by
|
|
your NLS environment. Weird things will probably
|
|
happen if languages are mixed up.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
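<p>The difference between the wildcard and regular expression modes
described above can be tried out on a small term list. The following
Python sketch uses the standard <code class="literal">fnmatch</code> and
<code class="literal">re</code> modules as stand-ins; the regular
expression dialect used by <span class="application">Recoll</span> may
differ from Python's, and the function names and the
<code class="literal">limit</code> parameter are assumptions made for
this illustration.</p>

```python
import fnmatch
import re


def expand_wildcard(pattern, terms, limit=10000):
    """Return the terms matching a shell-like wildcard pattern (*, ?, [])."""
    matches = [term for term in terms if fnmatch.fnmatchcase(term, pattern)]
    return matches[:limit]


def expand_regexp(pattern, terms, limit=10000):
    """Return the terms matching a regular expression anchored at the start."""
    compiled = re.compile(pattern)
    # re.match only anchors at the beginning, so 'press' matches 'pression'
    matches = [term for term in terms if compiled.match(term)]
    return matches[:limit]
```

<p>Both functions scan the whole list, which is the costly case for a
real index: an expansion can only avoid the full scan when the beginning
of the term is known.</p>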
|
|
|
|
<p>Note that in cases where <span class=
|
|
"application">Recoll</span> does not know the beginning
|
|
of the string to search for (e.g. a wildcard expression
|
|
like <em class="replaceable"><code>*coll</code></em>),
|
|
the expansion can take quite a long time because the full
|
|
index term list will have to be processed. The expansion
|
|
is currently limited to 10000 results for wildcards and
|
|
regular expressions. It is possible to change the limit
|
|
in the configuration file.</p>
|
|
|
|
<p>Double-clicking on a term in the result list will
|
|
insert it into the simple search entry field. You can
|
|
also cut/paste between the result list and any entry
|
|
field (the end of lines will be taken care of).</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.MULTIDB"
|
|
id=
|
|
"RCL.SEARCH.GUI.MULTIDB"></a>3.1.10. Multiple
|
|
indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>See the <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.MULTIPLE" title=
|
|
"2.3.1. Multiple indexes">section describing the use
|
|
of multiple indexes</a> for generalities. Only the
|
|
aspects concerning the <span class=
|
|
"command"><strong>recoll</strong></span> GUI are
|
|
described here.</p>
|
|
|
|
<p>A <span class="command"><strong>recoll</strong></span>
|
|
program instance is always associated with a specific
|
|
index, which is the one to be updated when requested from
|
|
the <span class="guimenu">File</span> menu, but it can
|
|
use any number of <span class="application">Recoll</span>
|
|
indexes for searching. The external indexes can be
|
|
selected through the <span class="guilabel">external
|
|
indexes</span> tab in the preferences dialog.</p>
|
|
|
|
<p>Index selection is performed in two phases. A set of
|
|
all usable indexes must first be defined, and then the
|
|
subset of indexes to be used for searching. These
|
|
parameters are retained across program executions (they
|
|
are kept separately for each <span class=
|
|
"application">Recoll</span> configuration). The set of
|
|
all indexes is usually quite stable, while the active
|
|
ones might typically be adjusted quite frequently.</p>
|
|
|
|
<p>The main index (defined by <code class=
|
|
"envar">RECOLL_CONFDIR</code>) is always active. If this
|
|
is undesirable, you can set up your base configuration to
|
|
index an empty directory.</p>
|
|
|
|
<p>When adding a new index to the set, you can select
|
|
either a <span class="application">Recoll</span>
|
|
configuration directory, or directly a <span class=
|
|
"application">Xapian</span> index directory. In the first
|
|
case, the <span class="application">Xapian</span> index
|
|
directory will be obtained from the selected
|
|
configuration.</p>
|
|
|
|
<p>As building the set of all indexes can be a little
|
|
tedious when done through the user interface, you can use
|
|
the <code class="envar">RECOLL_EXTRA_DBS</code>
|
|
environment variable to provide an initial set. This
|
|
might typically be set up by a system administrator so
|
|
that every user does not have to do it. The variable
|
|
should define a colon-separated list of index
|
|
directories, for example:</p>
|
|
<pre class="screen">
|
|
export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
|
|
</pre>
|
|
|
|
<p>Another environment variable, <code class=
|
|
"envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to
|
|
the active list of indexes. This variable was suggested
|
|
and implemented by a <span class=
|
|
"application">Recoll</span> user. It is mostly useful if
|
|
you use scripts to mount external volumes with
|
|
<span class="application">Recoll</span> indexes. By using
|
|
<code class="envar">RECOLL_EXTRA_DBS</code> and
|
|
<code class="envar">RECOLL_ACTIVE_EXTRA_DBS</code>, you
|
|
can add and activate the index for the mounted volume
|
|
when starting <span class=
|
|
"command"><strong>recoll</strong></span>.</p>
|
|
|
|
<p><code class="envar">RECOLL_ACTIVE_EXTRA_DBS</code> is
|
|
available for <span class="application">Recoll</span>
|
|
versions 1.17.2 and later. A change was made in the same
|
|
update so that <span class=
|
|
"command"><strong>recoll</strong></span> will
|
|
automatically deactivate unreachable indexes when
|
|
starting up.</p>
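<p>A mount script could prepare both variables before starting
<span class="command"><strong>recoll</strong></span>. The Python sketch
below is one possible way to do it, not a tool shipped with
<span class="application">Recoll</span>: the function name
<code class="literal">extra_db_env</code> is hypothetical, and it assumes
that <code class="envar">RECOLL_ACTIVE_EXTRA_DBS</code> uses the same
colon-separated format as
<code class="envar">RECOLL_EXTRA_DBS</code>.</p>

```python
import os


def extra_db_env(all_dbs, base_env=None):
    """Return an environment mapping with both extra-index variables set.

    all_dbs lists every usable index directory; only those that currently
    exist (e.g. on a mounted volume) are put in the active list.
    """
    env = dict(os.environ if base_env is None else base_env)
    reachable = [db for db in all_dbs if os.path.isdir(db)]
    env["RECOLL_EXTRA_DBS"] = ":".join(all_dbs)
    # Assumption: the active list is colon-separated like RECOLL_EXTRA_DBS
    env["RECOLL_ACTIVE_EXTRA_DBS"] = ":".join(reachable)
    return env
```

<p>The resulting mapping could then be passed to the GUI, for example
with <code class="literal">subprocess.Popen(["recoll"],
env=extra_db_env(paths))</code>.</p>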
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.HISTORY"
|
|
id=
|
|
"RCL.SEARCH.GUI.HISTORY"></a>3.1.11. Document
|
|
history</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Documents that you actually view (with the internal
|
|
preview or an external tool) are entered into the
|
|
document history, which is remembered.</p>
|
|
|
|
<p>You can display the history list by using the
|
|
<span class="guilabel">Tools/</span><span class=
|
|
"guilabel">Doc History</span> menu entry.</p>
|
|
|
|
<p>You can erase the document history by using the
|
|
<span class="guilabel">Erase document history</span>
|
|
entry in the <span class="guimenu">File</span> menu.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.SORT" id=
|
|
"RCL.SEARCH.GUI.SORT"></a>3.1.12. Sorting
|
|
search results and collapsing duplicates</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The documents in a result list are normally sorted in
|
|
order of relevance. It is possible to specify a different
|
|
sort order, either by using the vertical arrows in the
|
|
GUI toolbox to sort by date, or switching to the result
|
|
table display and clicking on any header. The sort order
|
|
chosen inside the result table remains active if you
|
|
switch back to the result list. It stays in effect until
you click one of the vertical arrows; when both arrows are
unchecked, results are again sorted by relevance.</p>
|
|
|
|
<p>Sort parameters are remembered between program
|
|
invocations, but result sorting is normally always
|
|
inactive when the program starts. It is possible to keep
|
|
the sorting activation state between program invocations
|
|
by checking the <span class="guilabel">Remember sort
|
|
activation state</span> option in the preferences.</p>
|
|
|
|
<p>It is also possible to hide duplicate entries inside
|
|
the result list (documents with the exact same contents
|
|
as the displayed one). The test of identity is based on
|
|
an MD5 hash of the document container, not only of the
|
|
text contents (so that, for example, a text document with an image
|
|
added will not be a duplicate of the text only).
|
|
Duplicate hiding is controlled by an entry in the
|
|
<span class="guilabel">GUI configuration</span> dialog,
|
|
and is off by default.</p>
|
|
|
|
<p>As of release 1.19, when a result document does have
|
|
undisplayed duplicates, a <code class=
|
|
"literal">Dups</code> link will be shown with the result
|
|
list entry. Clicking the link will display the paths
|
|
(URLs + ipaths) for the duplicate entries.</p>
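<p>The identity test described above can be pictured with a short Python sketch. It is not the <span class="application">Recoll</span> implementation: the function names are hypothetical, and the containers are passed as in-memory byte strings to keep the example self-contained. It only shows why a text document and the same text with an image added do not count as duplicates when the whole container is hashed.</p>
<pre class="programlisting">
import hashlib
from collections import defaultdict


def md5_of_container(data):
    """MD5 digest of the whole container, not only of its text."""
    return hashlib.md5(data).hexdigest()


def group_duplicates(containers):
    """Group paths whose containers have the same MD5 digest.

    containers maps a path to the raw bytes of the file.
    Only groups with more than one member are returned.
    """
    by_digest = defaultdict(list)
    for path, data in containers.items():
        by_digest[md5_of_container(data)].append(path)
    # Groups are never empty, so "not exactly one" means two or more.
    return [sorted(paths) for paths in by_digest.values() if len(paths) != 1]


if __name__ == "__main__":
    sample = {
        "a.txt": b"hello",
        "copy-of-a.txt": b"hello",
        "a-with-image.odt": b"hello plus image data",
    }
    print(group_duplicates(sample))
</pre>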
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.TIPS" id=
|
|
"RCL.SEARCH.GUI.TIPS"></a>3.1.13. Search tips,
|
|
shortcuts</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TIPS.TERMS" id=
|
|
"RCL.SEARCH.GUI.TIPS.TERMS"></a>3.1.13.1. Terms
|
|
and search expansion</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><b>Term completion. </b>Typing <span class=
|
|
"keycap"><strong>Esc</strong></span> <span class=
|
|
"keycap"><strong>Space</strong></span> in the simple
|
|
search entry field while entering a word will either
|
|
complete the current word if its beginning matches a
|
|
unique term in the index, or open a window to propose a
|
|
list of completions.</p>
|
|
|
|
<p><b>Picking up new terms from result or preview
|
|
text. </b>Double-clicking on a word in the result
|
|
list or in a preview window will copy it to the simple
|
|
search entry field.</p>
|
|
|
|
<p><b>Wildcards. </b>Wildcards can be used inside
|
|
search terms in all forms of searches. <a class="link"
|
|
href="#RCL.SEARCH.WILDCARDS" title=
|
|
"3.8.1. More about wildcards">More about
|
|
wildcards</a>.</p>
|
|
|
|
<p><b>Automatic suffixes. </b>Words like
|
|
<code class="literal">odt</code> or <code class=
|
|
"literal">ods</code> can be automatically turned into
|
|
query language <code class="literal">ext:xxx</code>
|
|
clauses. This can be enabled in the <span class=
|
|
"guilabel">Search preferences</span> panel in the
|
|
GUI.</p>
|
|
|
|
<p><b>Disabling stem expansion. </b>Entering a
|
|
capitalized word in any search field will prevent stem
|
|
expansion (no search for <code class=
|
|
"literal">gardening</code> if you enter <code class=
|
|
"literal">Garden</code> instead of <code class=
|
|
"literal">garden</code>). This is the only case where
|
|
character case should make a difference for a
|
|
<span class="application">Recoll</span> search. You can
|
|
also disable stem expansion or change the stemming
|
|
language in the preferences.</p>
|
|
|
|
<p><b>Finding related documents. </b>Selecting the
|
|
<span class="guilabel">Find similar documents</span>
|
|
entry in the result list paragraph right-click menu
|
|
will select a set of "interesting" terms from the
|
|
current result, and insert them into the simple search
|
|
entry field. You can then possibly edit the list and
|
|
start a search to find documents which may be
|
|
related to the current result.</p>
|
|
|
|
<p><b>File names. </b>File names are added as
|
|
terms during indexing, and you can specify them as
|
|
ordinary terms in normal search fields (<span class=
|
|
"application">Recoll</span> used to index all
|
|
directories in the file path as terms. This has been
|
|
abandoned as it did not seem really useful).
|
|
Alternatively, you can use the specific file name
|
|
search which will <span class=
|
|
"emphasis"><em>only</em></span> look for file names,
|
|
and may be faster than the generic search especially
|
|
when using wildcards.</p>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TIPS.PHRASES" id=
|
|
"RCL.SEARCH.GUI.TIPS.PHRASES"></a>3.1.13.2. Working
|
|
with phrases and proximity</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><b>Phrases and Proximity searches. </b>A phrase
|
|
can be looked for by enclosing it in double quotes.
|
|
Example: <code class="literal">"user manual"</code>
|
|
will look only for occurrences of <code class=
|
|
"literal">user</code> immediately followed by
|
|
<code class="literal">manual</code>. You can use the
|
|
<span class="guilabel">This phrase</span> field of the
|
|
advanced search dialog to the same effect. Phrases can
|
|
be entered along with simple terms in all simple or advanced
|
|
search entry fields (except <span class="guilabel">This
|
|
exact phrase</span>).</p>
|
|
|
|
<p><b>AutoPhrases. </b>This option can be set in
|
|
the preferences dialog. If it is set, a phrase will be
|
|
automatically built and added to simple searches when
|
|
looking for <code class="literal">Any terms</code>.
|
|
This will not radically change the results, but will
|
|
give a relevance boost to the results where the search
|
|
terms appear as a phrase. For example, searching for
|
|
<code class="literal">virtual reality</code> will still
|
|
find all documents where either <code class=
|
|
"literal">virtual</code> or <code class=
|
|
"literal">reality</code> or both appear, but those
|
|
which contain <code class="literal">virtual
|
|
reality</code> should appear sooner in the list.</p>
|
|
|
|
<p>Phrase searches can strongly slow down a query if
|
|
most of the terms in the phrase are common. This is why
|
|
the <code class="varname">autophrase</code> option is
|
|
off by default for <span class=
|
|
"application">Recoll</span> versions before 1.17. As of
|
|
version 1.17, <code class="varname">autophrase</code>
|
|
is on by default, but very common terms will be removed
|
|
from the constructed phrase. The removal threshold can
|
|
be adjusted from the search preferences.</p>
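<p>The effect of the removal threshold can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual code: the function name, the document counts, and the rule that no phrase is built when fewer than two terms remain are all hypothetical.</p>
<pre class="programlisting">
def build_autophrase(terms, doc_counts, total_docs, max_percent):
    """Build the automatic phrase, leaving out very common terms.

    doc_counts maps a term to the number of documents containing it.
    Returns the quoted phrase, or None if fewer than two terms remain
    (an assumption made for this sketch).
    """
    kept = []
    for term in terms:
        percent = 100.0 * doc_counts.get(term, 0) / total_docs
        if percent > max_percent:
            continue  # too common: dropped from the phrase
        kept.append(term)
    if not kept or len(kept) == 1:
        return None
    return '"' + " ".join(kept) + '"'


if __name__ == "__main__":
    counts = {"the": 950, "virtual": 20, "reality": 30}
    print(build_autophrase(["the", "virtual", "reality"], counts, 1000, 5.0))
</pre>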
|
|
|
|
<p><b>Phrases and abbreviations. </b>As of
|
|
<span class="application">Recoll</span> version 1.17,
|
|
dotted abbreviations like <code class=
|
|
"literal">I.B.M.</code> are also automatically indexed
|
|
as a word without the dots: <code class=
|
|
"literal">IBM</code>. Searching for the word inside a
|
|
phrase (e.g. <code class="literal">"the IBM
|
|
company"</code>) will only match the dotted
|
|
abbreviation if you increase the phrase slack (using
|
|
the advanced search panel control, or the <code class=
|
|
"literal">o</code> query language modifier). Literal
|
|
occurrences of the word will be matched normally.</p>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TIPS.MISC" id=
|
|
"RCL.SEARCH.GUI.TIPS.MISC"></a>3.1.13.3. Others</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><b>Using fields. </b>You can use the <a class=
|
|
"link" href="#RCL.SEARCH.LANG" title=
|
|
"3.6. The query language">query language</a> and
|
|
field specifications to only search certain parts of
|
|
documents. This can be especially helpful with email,
|
|
for example only searching emails from a specific
|
|
originator: <code class="literal">search tips
|
|
from:helpfulgui</code></p>
|
|
|
|
<p><b>Adjusting the result table columns. </b>When
|
|
displaying results in table mode, you can use a right
|
|
click on the table headers to activate a pop-up menu
|
|
which will let you adjust what columns are displayed.
|
|
You can drag the column headers to adjust their order.
|
|
You can click them to sort by the field displayed in
|
|
the column. You can also save the result list in CSV
|
|
format.</p>
|
|
|
|
<p><b>Changing the GUI geometry. </b>It is
|
|
possible to configure the GUI in wide form factor by
|
|
dragging the toolbars to one of the sides (their
|
|
location is remembered between sessions), and moving
|
|
the category filters to a menu (can be set in the
|
|
<span class="guimenu">Preferences</span> →
|
|
<span class="guimenuitem">GUI configuration</span>
|
|
→ <span class="guimenuitem">User interface</span>
|
|
panel).</p>
|
|
|
|
<p><b>Query explanation. </b>You can get an exact
|
|
description of what the query looked for, including
|
|
stem expansion, and Boolean operators used, by clicking
|
|
on the result list header.</p>
|
|
|
|
<p><b>Advanced search history. </b>As of
|
|
<span class="application">Recoll</span> 1.18, you can
|
|
display any of the last 100 complex searches performed
|
|
by using the up and down arrow keys while the advanced
|
|
search panel is active.</p>
|
|
|
|
<p><b>Browsing the result list inside a preview
|
|
window. </b>Entering <span class=
|
|
"keycap"><strong>Shift-Down</strong></span> or
|
|
<span class="keycap"><strong>Shift-Up</strong></span>
|
|
(<span class="keycap"><strong>Shift</strong></span> +
|
|
an arrow key) in a preview window will display the next
|
|
or the previous document from the result list. Any
|
|
secondary search currently active will be executed on
|
|
the new document.</p>
|
|
|
|
<p><b>Scrolling the result list from the
|
|
keyboard. </b>You can use <span class=
|
|
"keycap"><strong>PageUp</strong></span> and
|
|
<span class="keycap"><strong>PageDown</strong></span>
|
|
to scroll the result list, <span class=
|
|
"keycap"><strong>Shift+Home</strong></span> to go back
|
|
to the first page. These work even while the focus is
|
|
in the search entry.</p>
|
|
|
|
<p><b>Result table: moving the focus to the
|
|
table. </b>You can use <span class=
|
|
"keycap"><strong>Ctrl-r</strong></span> to move the
|
|
focus from the search entry to the table, and then use
|
|
the arrow keys to change the current row. <span class=
|
|
"keycap"><strong>Ctrl-Shift-s</strong></span> returns
|
|
to the search.</p>
|
|
|
|
<p><b>Result table: open / preview. </b>With the
|
|
focus in the result table, you can use <span class=
|
|
"keycap"><strong>Ctrl-o</strong></span> to open the
|
|
document from the current row, <span class=
|
|
"keycap"><strong>Ctrl-Shift-o</strong></span> to open
|
|
the document and close <span class=
|
|
"command"><strong>recoll</strong></span>, <span class=
|
|
"keycap"><strong>Ctrl-d</strong></span> to preview the
|
|
document.</p>
|
|
|
|
<p><b>Editing a new search while the focus is not in
|
|
the search entry. </b>You can use the <span class=
|
|
"keycap"><strong>Ctrl-Shift-S</strong></span> shortcut
|
|
to return the cursor to the search entry (and select
|
|
the current search text), while the focus is anywhere
|
|
in the main window.</p>
|
|
|
|
<p><b>Forced opening of a preview window. </b>You
|
|
can use <span class=
|
|
"keycap"><strong>Shift</strong></span>+Click on a
|
|
result list <code class="literal">Preview</code> link
|
|
to force the creation of a preview window instead of a
|
|
new tab in the existing one.</p>
|
|
|
|
<p><b>Closing previews. </b>Entering <span class=
|
|
"keycap"><strong>Ctrl-W</strong></span> in a tab will
|
|
close it (and, for the last tab, close the preview
|
|
window). Entering <span class=
|
|
"keycap"><strong>Esc</strong></span> will close the
|
|
preview window and all its tabs.</p>
|
|
|
|
<p><b>Printing previews. </b>Entering <span class=
|
|
"keycap"><strong>Ctrl-P</strong></span> in a preview
|
|
window will print the currently displayed text.</p>
|
|
|
|
<p><b>Quitting. </b>Entering <span class=
|
|
"keycap"><strong>Ctrl-Q</strong></span> almost anywhere
|
|
will close the application.</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.SAVING" id=
|
|
"RCL.SEARCH.SAVING"></a>3.1.14. Saving and
|
|
restoring queries (1.21 and later)</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Both simple and advanced query dialogs save recent
|
|
history, but the amount is limited: old queries will
|
|
eventually be forgotten. Also, important queries may be
|
|
difficult to find among others. This is why both types of
|
|
queries can also be explicitly saved to files, from the
|
|
GUI menus: <span class="guimenu">File</span> →
|
|
<span class="guimenuitem">Save last query / Load last
|
|
query</span></p>
|
|
|
|
<p>The default location for saved queries is a
|
|
subdirectory of the current configuration directory, but
|
|
saved queries are ordinary files and can be written or
|
|
moved anywhere.</p>
|
|
|
|
<p>Some of the saved query parameters are part of the
|
|
preferences (e.g. <code class="literal">autophrase</code>
|
|
or the active external indexes), and may differ when the
|
|
query is loaded from the time it was saved. In this case,
|
|
<span class="application">Recoll</span> will warn of the
|
|
differences, but will not change the user
|
|
preferences.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.CUSTOM"
|
|
id=
|
|
"RCL.SEARCH.GUI.CUSTOM"></a>3.1.15. Customizing
|
|
the search interface</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>You can customize some aspects of the search interface
|
|
by using the <span class="guimenu">GUI
|
|
configuration</span> entry in the <span class=
|
|
"guimenu">Preferences</span> menu.</p>
|
|
|
|
<p>There are several tabs in the dialog, dealing with the
|
|
interface itself, the parameters used for searching and
|
|
returning results, and what indexes are searched.</p>
|
|
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.UI" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.UI"></a><b>User interface
|
|
parameters: </b></p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Highlight color for query
|
|
terms</span>: Terms from the user query are
|
|
highlighted in the result list samples and the
|
|
preview window. The color can be chosen here. Any
|
|
Qt color string should work (e.g. <code class=
|
|
"literal">red</code>, <code class=
|
|
"literal">#ff0000</code>). The default is
|
|
<code class="literal">blue</code>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Style sheet</span>: The
|
|
name of a <span class="application">Qt</span> style
|
|
sheet text file which is applied to the whole
|
|
Recoll application on startup. The default value is
|
|
empty, but there is a skeleton style sheet
|
|
(<code class="filename">recoll.qss</code>) inside
|
|
the <code class=
|
|
"filename">/usr/share/recoll/examples</code>
|
|
directory. Using a style sheet, you can change most
|
|
<span class=
|
|
"command"><strong>recoll</strong></span> graphical
|
|
parameters: colors, fonts, etc. See the sample file
|
|
for a few simple examples.</p>
|
|
|
|
<p>You should be aware that parameters (e.g.: the
|
|
background color) set inside the <span class=
|
|
"application">Recoll</span> GUI style sheet will
|
|
override global system preferences, with possible
|
|
strange side effects: for example if you set the
|
|
foreground to a light color and the background to a
|
|
dark one in the desktop preferences, but only the
|
|
background is set inside the <span class=
|
|
"application">Recoll</span> style sheet, and it is
|
|
light too, then text will appear light-on-light
|
|
inside the <span class="application">Recoll</span>
|
|
GUI.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Maximum text size
|
|
highlighted for preview</span>: inserting highlights
|
|
on search terms inside the text before inserting it
|
|
in the preview window involves quite a lot of
|
|
processing, and can be disabled over the given text
|
|
size to speed up loading.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Prefer HTML to plain text
|
|
for preview</span>: if set, Recoll will display HTML
|
|
as such inside the preview window. If this causes
|
|
problems with the Qt HTML display, you can uncheck
|
|
it to display the plain text version instead.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Plain text to HTML line
|
|
style</span>: when displaying plain text inside the
|
|
preview window, <span class=
|
|
"application">Recoll</span> tries to preserve some
|
|
of the original text line breaks and indentation.
|
|
It can either use PRE HTML tags, which will faithfully
|
|
preserve the indentation but will force horizontal
|
|
scrolling for long lines, or use BR tags to break
|
|
at the original line breaks, which will let the
|
|
editor introduce other line breaks according to the
|
|
window width, but will lose some of the original
|
|
indentation. The third option has been available in
|
|
recent releases and is probably now the best one:
|
|
use PRE tags with line wrapping.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Choose editor
|
|
applications</span>: this opens a dialog which
|
|
allows you to select the application to be used to
|
|
open each MIME type. The default is normally to use
|
|
the <span class=
|
|
"command"><strong>xdg-open</strong></span> utility,
|
|
but you can override it.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Exceptions</span>: even
|
|
when <span class=
|
|
"command"><strong>xdg-open</strong></span> is used
|
|
by default for opening documents, you can set
|
|
exceptions for MIME types that will still be opened
|
|
according to <span class=
|
|
"application">Recoll</span> preferences. This is
|
|
useful for passing parameters like page numbers or
|
|
search strings to applications that support them
|
|
(e.g. <span class="application">evince</span>).
|
|
This cannot be done with <span class=
|
|
"command"><strong>xdg-open</strong></span> which
|
|
only supports passing one parameter.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Document filter choice
|
|
style</span>: this will let you choose if the
|
|
document categories are displayed as a list or a
|
|
set of buttons, or a menu.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Start with simple search
|
|
mode</span>: this lets you choose the value of the
|
|
simple search type on program startup. Either a
|
|
fixed value (e.g. <code class="literal">Query
|
|
Language</code>), or the value in use when the
|
|
program last exited.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Auto-start simple search
|
|
on white space entry</span>: if this is checked, a
|
|
search will be executed each time you enter a space
|
|
in the simple search input field. This lets you
|
|
look at the result list as you enter new terms.
|
|
This is off by default; you may like it or
|
|
not...</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Start with advanced
|
|
search dialog open</span>: if you use this dialog
|
|
frequently, checking this entry will get it to
|
|
open when recoll starts.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Remember sort activation
|
|
state</span>: if set, Recoll will remember the sort
|
|
tool state between invocations. It normally starts
|
|
with sorting disabled.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.RL" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RL"></a><b>Result list
|
|
parameters: </b></p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Number of results in a
|
|
result page</span></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Result list font</span>:
|
|
There is quite a lot of information shown in the
|
|
result list, and you may want to customize the font
|
|
and/or font size. The rest of the fonts used by
|
|
<span class="application">Recoll</span> are
|
|
determined by your generic Qt config (try the
|
|
<span class=
|
|
"command"><strong>qtconfig</strong></span>
|
|
command).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.RESULTPARA" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESULTPARA"></a><span class=
|
|
"guilabel">Edit result list paragraph format
|
|
string</span>: allows you to change the
|
|
presentation of each result list entry. See the
|
|
<a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"3.1.15.1. The result list format">result list
|
|
customisation section</a>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.RESULTHEAD" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESULTHEAD"></a><span class=
|
|
"guilabel">Edit result page HTML header
|
|
insert</span>: allows you to define text inserted
|
|
at the end of the result page HTML header. More
|
|
detail in the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"3.1.15.1. The result list format">result list
|
|
customisation section.</a></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Date format</span>:
|
|
allows specifying the format used for displaying
|
|
dates inside the result list. This should be
|
|
specified as an strftime() string (man
|
|
strftime).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.ABSSEP" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.ABSSEP"></a><span class=
|
|
"guilabel">Abstract snippet separator</span>: for
|
|
synthetic abstracts built from index data, which
|
|
are usually made of several snippets from different
|
|
parts of the document, this defines the snippet
|
|
separator, an ellipsis by default.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
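<p>To get a feeling for the <span class="guilabel">Date format</span> strings mentioned in the list above, you can try them out in Python, whose <code class="literal">time.strftime()</code> accepts the same conversion codes as the C function. The helper name below is hypothetical, and the sketch formats a fixed UTC timestamp only to make the output reproducible.</p>
<pre class="programlisting">
import time


def format_result_date(epoch_seconds, date_format):
    """Format a timestamp with an strftime() format string (UTC)."""
    return time.strftime(date_format, time.gmtime(epoch_seconds))


if __name__ == "__main__":
    # Try a few candidate formats on the same timestamp.
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y"):
        print(fmt, "gives", format_result_date(0, fmt))
</pre>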
|
|
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.SEARCH" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.SEARCH"></a><b>Search
|
|
parameters: </b></p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Hide duplicate
|
|
results</span>: decides if result list entries are
|
|
shown for identical documents found in different
|
|
places.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Stemming language</span>:
|
|
stemming obviously depends on the document's
|
|
language. This listbox will let you choose among the
|
|
stemming databases which were built during indexing
|
|
(this is set in the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF" title=
|
|
"5.4.2. The main configuration file, recoll.conf">
|
|
main configuration file</a>), or later added with
|
|
<span class="command"><strong>recollindex
|
|
-s</strong></span> (See the recollindex manual).
|
|
Stemming languages which are dynamically added will
|
|
be deleted at the next indexing pass unless they
|
|
are also added in the configuration file.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Automatically add phrase
|
|
to simple searches</span>: a phrase will be
|
|
automatically built and added to simple searches
|
|
when looking for <code class="literal">Any
|
|
terms</code>. This will give a relevance boost to
|
|
the results where the search terms appear as a
|
|
phrase (consecutive and in order).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Autophrase term frequency
|
|
threshold percentage</span>: very frequent terms
|
|
should not be included in automatic phrase searches
|
|
for performance reasons. The parameter defines the
|
|
cutoff percentage (percentage of the documents
|
|
where the term appears).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Replace abstracts from
|
|
documents</span>: this decides if we should
|
|
synthesize and display an abstract in place of an
|
|
explicit abstract found within the document
|
|
itself.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Dynamically build
|
|
abstracts</span>: this decides if <span class=
|
|
"application">Recoll</span> tries to build document
|
|
abstracts (lists of <span class=
|
|
"emphasis"><em>snippets</em></span>) when
|
|
displaying the result list. Abstracts are
|
|
constructed by taking context from the document
|
|
information, around the search terms.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Synthetic abstract
|
|
size</span>: adjust to taste...</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Synthetic abstract
|
|
context words</span>: how many words should be
|
|
displayed around each term occurrence.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Query language magic file
|
|
name suffixes</span>: a list of words which
|
|
automatically get turned into <code class=
|
|
"literal">ext:xxx</code> file name suffix clauses
|
|
when starting a query language query (e.g.
|
|
<code class="literal">doc xls xlsx...</code>). This
|
|
will save some typing for people who use file types
|
|
a lot when querying.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
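<p>The magic file name suffixes from the list above amount to a simple rewrite of the query words. The Python sketch below is only an illustration with a hypothetical function name. It assumes that a listed word is replaced wherever it appears in the query and that the comparison ignores case; the real behaviour may differ.</p>
<pre class="programlisting">
def apply_magic_suffixes(query, suffixes):
    """Turn listed words into ext:xxx clauses, leave the others alone."""
    wanted = {suffix.lower() for suffix in suffixes}
    rewritten = []
    for word in query.split():
        if word.lower() in wanted:
            rewritten.append("ext:" + word.lower())
        else:
            rewritten.append(word)
    return " ".join(rewritten)


if __name__ == "__main__":
    print(apply_magic_suffixes("budget xls 2014", ["doc", "xls", "xlsx"]))
</pre>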
|
|
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.EXTRADB" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.EXTRADB"></a><b>External
|
|
indexes: </b>This panel will let you browse for
|
|
additional indexes that you may want to search. External
|
|
indexes are designated by their database directory (e.g.
|
|
<code class=
|
|
"filename">/home/someothergui/.recoll/xapiandb</code>,
|
|
<code class=
|
|
"filename">/usr/local/recollglobal/xapiandb</code>).</p>
|
|
|
|
<p>Once entered, the indexes will appear in the
|
|
<span class="guilabel">External indexes</span> list, and
|
|
you can choose which ones you want to use at any moment by
|
|
checking or unchecking their entries.</p>
|
|
|
|
<p>Your main database (the one the current configuration
|
|
indexes to), is always implicitly active. If this is not
|
|
desirable, you can set up your configuration so that it
|
|
indexes, for example, an empty directory. An alternative
|
|
indexer may also need to implement a way of purging the
|
|
index from stale data.</p>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST"></a>3.1.15.1. The
|
|
result list format</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Newer versions of Recoll (from 1.17) normally use
|
|
WebKit HTML widgets for the result list and the
|
|
<a class="link" href=
|
|
"#RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">snippets
|
|
window</a> (this may be disabled at build time). Total
|
|
customisation is possible with full support for CSS and
|
|
JavaScript. Conversely, there are limits to what you
|
|
can do with the older Qt QTextBrowser, but still, it is
|
|
possible to decide what data each result will contain,
|
|
and how it will be displayed.</p>
|
|
|
|
<p>The result list presentation can be exhaustively
|
|
customized by adjusting two elements:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The paragraph format</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>HTML code inside the header section. For
|
|
versions 1.21 and later, this is also used for
|
|
the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">snippets
|
|
window</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>The paragraph format and the header fragment can be
|
|
edited from the <span class="guilabel">Result
|
|
list</span> tab of the <span class="guilabel">GUI
|
|
configuration</span>.</p>
|
|
|
|
<p>The header fragment is used both for the result list
|
|
and the snippets window. The snippets list is a table
|
|
and has a <code class="literal">snippets</code> class
|
|
attribute. Each paragraph in the result list is a
|
|
table, with class <code class="literal">respar</code>,
|
|
but this can be changed by editing the paragraph
|
|
format.</p>
|
|
|
|
<p>There are a few examples on the <a class="ulink"
|
|
href="http://www.recoll.org/custom.html" target=
|
|
"_top">page about customising the result list</a> on
|
|
the <span class="application">Recoll</span> web
|
|
site.</p>
|
|
|
|
<div class="sect4">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA"></a>The
|
|
paragraph format</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>This is an arbitrary HTML string where the
|
|
following printf-like <code class="literal">%</code>
|
|
substitutions will be performed:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><b>%A. </b>Abstract</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%D. </b>Date</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%I. </b>Icon image name. This is
|
|
normally determined from the MIME type. The
|
|
associations are defined inside the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG.MIMECONF"
|
|
title=
|
|
"5.4.5. The mimeconf file"><code class=
|
|
"filename">mimeconf</code> configuration
|
|
file</a>. If a thumbnail for the file is found
|
|
at the standard Freedesktop location, this will
|
|
be displayed instead.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%K. </b>Keywords (if any)</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%L. </b>Precooked Preview, Edit, and
|
|
possibly Snippets links</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%M. </b>MIME type</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%N. </b>result Number inside the
|
|
result page</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%P. </b>Parent folder Url. In the
|
|
case of an embedded document, this is the
|
|
parent folder for the top level container
|
|
file.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%R. </b>Relevance percentage</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%S. </b>Size information</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%T. </b>Title or Filename if not
|
|
set.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%t. </b>Title or Filename if not
|
|
set.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%U. </b>Url</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>The format of the Preview, Edit, and Snippets
|
|
links is <code class="literal"><a
|
|
href="P%N"></code>, <code class="literal"><a
|
|
href="E%N"></code> and <code class="literal"><a
|
|
href="A%N"></code> where <em class=
|
|
"replaceable"><code>docnum</code></em> (%N) expands
|
|
to the document number inside the result page.</p>
|
|
|
|
<p>A link target defined as <code class=
|
|
"literal">"F%N"</code> will open the document
|
|
corresponding to the <code class="literal">%P</code>
|
|
parent folder expansion, usually creating a file
|
|
manager window on the folder where the container file
|
|
resides. E.g.:</p>
|
|
<pre class="programlisting">
|
|
<a href="F%N">%P</a>
|
|
</pre>
|
|
|
|
<p>A link target defined as <code class=
|
|
"literal">R%N|<em class=
|
|
"replaceable"><code>scriptname</code></em></code>
|
|
will run the corresponding script on the result file
|
|
(if the document is embedded, the script will be
|
|
started on the top-level parent). See the <a class=
|
|
"link" href="#RCL.SEARCH.GUI.RUNSCRIPT" title=
|
|
"3.1.4. Running arbitrary commands on result files (1.20 and later)">
|
|
section about defining scripts</a>.</p>
|
|
|
|
<p>In addition to the predefined values above, all
|
|
strings like <code class=
|
|
"literal">%(fieldname)</code> will be replaced by the
|
|
value of the field named <code class=
|
|
"literal">fieldname</code> for this document. Only
|
|
stored fields can be accessed in this way, the value
|
|
of indexed but not stored fields is not known at this
|
|
point in the search process (see <a class="link"
|
|
href="#RCL.PROGRAM.FIELDS" title=
|
|
"4.2. Field data processing">field
|
|
configuration</a>). There are currently very few
|
|
fields stored by default, apart from the values above
|
|
(only <code class="literal">author</code> and
|
|
<code class="literal">filename</code>), so this
|
|
feature will need some custom local configuration to
|
|
be useful. An example candidate would be the
|
|
<code class="literal">recipient</code> field which is
|
|
generated by the message input handlers.</p>
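                <p>As a rough illustration of the substitution rules
                described above, the following Python sketch expands both
                the single-letter values and the <code class=
                "literal">%(fieldname)</code> form. It is not the Recoll
                implementation: <code class="literal">expand_format</code>
                and its two dictionaries are hypothetical names, and
                replacing a field which has no stored value by an empty
                string is an assumption.</p>
<pre class="programlisting">
import re

# Matches either %(fieldname) or a single-letter value such as %T.
PATTERN = re.compile(r"%\((\w+)\)|%([A-Za-z])")

def expand_format(fmt, builtin, fields):
    """Expand %X and %(fieldname) references in a paragraph format."""
    def replace(match):
        if match.group(1) is not None:
            # Stored field: assume an empty string when it has no value.
            return fields.get(match.group(1), "")
        # Predefined value: leave unknown letters untouched.
        return builtin.get(match.group(2), match.group(0))
    return PATTERN.sub(replace, fmt)

print(expand_format("%T (%(author)) %S",
                    {"T": "Eugenie Grandet", "S": "12 KB"},
                    {"author": "Balzac"}))
</pre>
                <p>With the sample dictionaries, the call prints
                <code class="literal">Eugenie Grandet (Balzac) 12
                KB</code>.</p>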
|
|
|
|
<p>The default value for the paragraph format string
|
|
is:</p>
|
|
<pre class="screen">
|
|
"<table class=\"respar\">\n"
|
|
"<tr>\n"
|
|
"<td><a href='%U'><img src='%I' width='64'></a></td>\n"
|
|
"<td>%L &nbsp;<i>%S</i> &nbsp;&nbsp;<b>%T</b><br>\n"
|
|
"<span style='white-space:nowrap'><i>%M</i>&nbsp;%D</span>&nbsp;&nbsp;&nbsp; <i>%U</i>&nbsp;%i<br>\n"
|
|
"%A %K</td>\n"
|
|
"</tr></table>\n"
|
|
</pre>
|
|
|
|
<p>You may, for example, try the following for a more
|
|
web-like experience:</p>
|
|
<pre class="screen">
|
|
<u><b><a href="P%N">%T</a></b></u><br>
|
|
%A<font color=#008000>%U - %S</font> - %L
|
|
</pre>
|
|
|
|
<p>Note that the P%N link in the above paragraph
|
|
makes the title a preview link. Or the clean
|
|
looking:</p>
|
|
<pre class="screen">
|
|
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
|
&amp;nbsp;&amp;nbsp;&lt;b&gt;%T&lt;/b&gt;&lt;br&gt;%S&amp;nbsp;
|
|
<font color="#808080"><i>%U</i></font>
|
|
<table bgcolor="#e0e0e0">
|
|
<tr><td><div>%A</div></td></tr>
|
|
</table>%K
|
|
</pre>
|
|
|
|
                <p>These samples, and some others, are <a class=
|
|
"ulink" href="http://www.recoll.org/custom.html"
|
|
target="_top">on the web site, with pictures to show
|
|
how they look.</a></p>
|
|
|
|
<p>It is also possible to <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.ABSSEP">define the value of
|
|
the snippet separator inside the abstract
|
|
section</a>.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.KIO" id=
|
|
"RCL.SEARCH.KIO"></a>3.2. Searching with the KDE
|
|
KIO slave</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.KIO.INTRO"
|
|
id="RCL.SEARCH.KIO.INTRO"></a>3.2.1. What's
|
|
this</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The <span class="application">Recoll</span> KIO slave
|
|
allows performing a <span class=
|
|
"application">Recoll</span> search by entering an
|
|
appropriate URL in a KDE open dialog, or with an
|
|
HTML-based interface displayed in <span class=
|
|
"command"><strong>Konqueror</strong></span>.</p>
|
|
|
|
<p>The HTML-based interface is similar to the Qt-based
|
|
interface, but slightly less powerful for now. Its
|
|
advantage is that you can perform your search while
|
|
staying fully within the KDE framework: drag and drop
|
|
from the result list works normally and you have your
|
|
normal choice of applications for opening files.</p>
|
|
|
|
<p>The alternative interface uses a directory view of
|
|
search results. Due to limitations in the current KIO
|
|
slave interface, it is currently not obviously useful (to
|
|
me).</p>
|
|
|
|
<p>The interface is described in more detail inside a
|
|
help file which you can access by entering <code class=
|
|
"filename">recoll:/</code> inside the <span class=
|
|
"command"><strong>konqueror</strong></span> URL line
|
|
(this works only if the recoll KIO slave has been
|
|
previously installed).</p>
|
|
|
|
<p>The instructions for building this module are located
|
|
in the source tree. See: <code class=
|
|
"filename">kde/kio/recoll/00README.txt</code>. Some Linux
|
|
distributions do package the kio-recoll module, so check
|
|
before diving into the build process, maybe it's already
|
|
out there ready for one-click installation.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.KIO.SEARCHABLEDOCS" id=
|
|
"RCL.SEARCH.KIO.SEARCHABLEDOCS"></a>3.2.2. Searchable
|
|
documents</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>As a sample application, the <span class=
|
|
"application">Recoll</span> KIO slave could allow
|
|
preparing a set of HTML documents (for example a manual)
|
|
so that they become their own search interface inside
|
|
<span class=
|
|
"command"><strong>konqueror</strong></span>.</p>
|
|
|
|
<p>This can be done by either explicitly inserting
|
|
<code class="literal"><a
|
|
href="recoll://..."></code> links around some document
|
|
areas, or automatically by adding a very small
|
|
<span class="application">javascript</span> program to
|
|
the documents, like the following example, which would
|
|
initiate a search by double-clicking any term:</p>
|
|
<pre class="programlisting">
|
|
<script language="JavaScript">
|
|
function recollsearch() {
|
|
var t = document.getSelection();
|
|
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
|
encodeURIComponent(t);
|
|
}
|
|
</script>
|
|
....
|
|
<body ondblclick="recollsearch()">
|
|
|
|
</pre>
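          <p>The same URL can also be generated outside the browser, for
          example when preparing the explicit links mentioned above. The
          Python sketch below mirrors the URL layout used by the
          JavaScript example. <code class=
          "literal">recoll_search_url</code> is a hypothetical helper
          name, and the parameter values are simply copied from that
          example.</p>
<pre class="programlisting">
from urllib.parse import quote, urlencode

def recoll_search_url(terms):
    """Build a recoll:// search URL like the one in the script above."""
    # quote() encodes spaces as %20, as encodeURIComponent does.
    params = urlencode({"qtp": "a", "p": "0", "q": terms}, quote_via=quote)
    return "recoll://search/query?" + params

print(recoll_search_url("eugenie grandet"))
</pre>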
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.COMMANDLINE" id=
|
|
"RCL.SEARCH.COMMANDLINE"></a>3.3. Searching on
|
|
the command line</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>There are several ways to obtain search results as a
|
|
text stream, without a graphical interface:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>By passing option <code class="option">-t</code>
|
|
to the <span class=
|
|
"command"><strong>recoll</strong></span> program.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>By using the <span class=
|
|
"command"><strong>recollq</strong></span>
|
|
program.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>By writing a custom <span class=
|
|
"application">Python</span> program, using the
|
|
<a class="link" href="#RCL.PROGRAM.API.PYTHON" title=
|
|
"4.3.2. Python interface">Recoll Python
|
|
API</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>The first two methods work in the same way and
|
|
accept/need the same arguments (except for the additional
|
|
<code class="option">-t</code> to <span class=
|
|
"command"><strong>recoll</strong></span>). The query to be
|
|
executed is specified as command line arguments.</p>
|
|
|
|
<p><span class="command"><strong>recollq</strong></span> is
|
|
not built by default. You can use the <code class=
|
|
"filename">Makefile</code> in the <code class=
|
|
"filename">query</code> directory to build it. This is a
|
|
        very simple program, and if you can program a little C++,
|
|
        you may find it useful to tailor its output format to your
|
|
        needs. Note that recollq is only really useful on systems
|
|
where the Qt libraries (or even the X11 ones) are not
|
|
available. Otherwise, just use <code class="literal">recoll
|
|
-t</code>, which takes the exact same parameters and
|
|
        options which are described for <span class=
|
|
        "command"><strong>recollq</strong></span>.</p>
|
|
|
|
<p><span class="command"><strong>recollq</strong></span>
|
|
has a man page (not installed by default, look in the
|
|
<code class="filename">doc/man</code> directory). The Usage
|
|
string is as follows:</p>
|
|
<pre class="programlisting">
|
|
recollq: usage:
|
|
-P: Show the date span for all the documents present in the index
|
|
[-o|-a|-f] [-q] <query string>
|
|
Runs a recoll query and displays result lines.
|
|
Default: will interpret the argument(s) as a xesam query string
|
|
query may be like:
|
|
implicit AND, Exclusion, field spec: t1 -t2 title:t3
|
|
OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
|
|
Phrase: "t1 t2" (needs additional quoting on cmd line)
|
|
-o Emulate the GUI simple search in ANY TERM mode
|
|
-a Emulate the GUI simple search in ALL TERMS mode
|
|
-f Emulate the GUI simple search in filename mode
|
|
-q is just ignored (compatibility with the recoll GUI command line)
|
|
Common options:
|
|
-c <configdir> : specify config directory, overriding $RECOLL_CONFDIR
|
|
-d also dump file contents
|
|
-n [first-]<cnt> define the result slice. The default value for [first]
|
|
is 0. Without the option, the default max count is 2000.
|
|
Use n=0 for no limit
|
|
-b : basic. Just output urls, no mime types or titles
|
|
-Q : no result lines, just the processed query and result count
|
|
-m : dump the whole document meta[] array for each result
|
|
-A : output the document abstracts
|
|
-S fld : sort by field <fld>
|
|
-s stemlang : set stemming language to use (must exist in index...)
|
|
Use -s "" to turn off stem expansion
|
|
-D : sort descending
|
|
-i <dbdir> : additional index, several can be given
|
|
-e use url encoding (%xx) for urls
|
|
-F <field name list> : output exactly these fields for each result.
|
|
The field values are encoded in base64, output in one line and
|
|
separated by one space character. This is the recommended format
|
|
for use by other programs. Use a normal query with option -m to
|
|
see the field names.
|
|
</pre>
|
|
|
|
<p>Sample execution:</p>
|
|
<pre class="programlisting">
|
|
recollq 'ilur -nautique mime:text/html'
|
|
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11)
|
|
OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
|
|
4 results
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
|
|
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
|
|
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
|
|
</pre>
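        <p>When the <code class="option">-F</code> option is used, each
        result line holds base64-encoded values separated by single
        spaces, as stated in the usage text. A calling program could
        decode such a line roughly as follows. This is a minimal Python
        sketch: <code class="literal">decode_result_line</code> is a
        hypothetical helper, the sample line is built by hand rather than
        taken from a real run, and the code assumes that the values are
        UTF-8 text listed in the order of the requested field names.</p>
<pre class="programlisting">
import base64

def decode_result_line(line, field_names):
    """Map the requested field names to the decoded values of one line."""
    # Split on single spaces so that an empty value keeps its position.
    tokens = line.rstrip("\n").split(" ")
    values = [base64.b64decode(tok).decode("utf-8", "replace") for tok in tokens]
    return dict(zip(field_names, values))

# Hand-built sample line for the fields "url" and "title".
sample = " ".join(base64.b64encode(text.encode("utf-8")).decode("ascii")
                  for text in ["file:///tmp/a.html", "A title"])
print(decode_result_line(sample + "\n", ["url", "title"]))
</pre>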
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.SYNONYMS" id=
|
|
"RCL.SEARCH.SYNONYMS"></a>3.4. Using Synonyms
|
|
(1.22)</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><b>Term synonyms: </b>there are a number of ways to
|
|
use term synonyms for searching text:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>At index creation time, they can be used to alter
|
|
the indexed terms, either increasing or decreasing
|
|
their number, by expanding the original terms to all
|
|
synonyms, or by reducing all synonym terms to a
|
|
canonical one.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>At query time, they can be used to match texts
|
|
containing terms which are synonyms of the ones
|
|
specified by the user, either by expanding the query
|
|
for all synonyms, or by reducing the user entry to
|
|
canonical terms (the latter only works if the
|
|
corresponding processing has been performed while
|
|
creating the index).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> only uses
|
|
        synonyms at query time. A user query term which is part of a
|
|
synonym group will be optionally expanded into an
|
|
<code class="literal">OR</code> query for all terms in the
|
|
group.</p>
|
|
|
|
<p>Synonym groups are defined inside ordinary text files.
|
|
Each line in the file defines a group.</p>
|
|
|
|
<p>Example:</p>
|
|
<pre class="programlisting">
|
|
hi hello "good morning"
|
|
|
|
# not sure about "au revoir" though. Is this english ?
|
|
bye goodbye "see you" \
|
|
"au revoir"
|
|
|
|
</pre>
|
|
|
|
<p>As usual, lines beginning with a <code class=
|
|
"literal">#</code> are comments, empty lines are ignored,
|
|
and lines can be continued by ending them with a
|
|
backslash.</p>
|
|
|
|
<p>Multi-word synonyms are supported, but be aware that
|
|
these will generate phrase queries, which may degrade
|
|
performance and will disable stemming expansion for the
|
|
phrase terms.</p>
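        <p>The file format and the query-time expansion described above
        can be sketched in a few lines of Python. This is only an
        illustration under stated assumptions, not the Recoll parser:
        <code class="literal">parse_synonyms</code> and <code class=
        "literal">expand_term</code> are hypothetical names, and quoting
        is delegated to the standard <code class="literal">shlex</code>
        module.</p>
<pre class="programlisting">
import shlex

def parse_synonyms(text):
    """Return one list of terms per synonym group."""
    groups = []
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        if line.endswith("\\"):
            # A trailing backslash continues the group on the next line.
            pending = line[:-1] + " "
            continue
        pending = ""
        if not line or line.startswith("#"):
            continue  # empty line or comment
        groups.append(shlex.split(line))
    return groups

def expand_term(term, groups):
    """Return the OR expansion of a term, or the term itself."""
    for group in groups:
        if term in group:
            # Multi-word synonyms become quoted phrases.
            return " OR ".join('"%s"' % t if " " in t else t for t in group)
    return term

SAMPLE = r'''hi hello "good morning"

# a comment line
bye goodbye "see you" \
"au revoir"
'''
print(expand_term("hi", parse_synonyms(SAMPLE)))
</pre>
        <p>For the sample data, the term <code class="literal">hi</code>
        expands to <code class="literal">hi OR hello "good
        morning"</code> with an <code class="literal">OR</code> before
        the phrase as well, which shows why multi-word entries end up as
        phrase queries.</p>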
|
|
|
|
<p>The synonyms file can be specified in the <span class=
|
|
"guilabel">Search parameters</span> tab of the <span class=
|
|
"guilabel">GUI configuration</span> <span class=
|
|
"guilabel">Preferences</span> menu entry, or as an option
|
|
for command-line searches.</p>
|
|
|
|
<p>Once the file is defined, the use of synonyms can be
|
|
enabled or disabled directly from the <span class=
|
|
"guilabel">Preferences</span> menu.</p>
|
|
|
|
<p>The synonyms are searched for matches with user terms
|
|
after the latter are stem-expanded, but the contents of the
|
|
        synonyms file itself are not subjected to stem expansion.
|
|
This means that a match will not be found if the form
|
|
present in the synonyms file is not present anywhere in the
|
|
document set.</p>
|
|
|
|
<p>The synonyms function is probably not going to help you
|
|
find your letters to Mr. Smith. It is best used for
|
|
domain-specific searches. For example, it was initially
|
|
suggested by a user performing searches among historical
|
|
        documents: the synonyms file would contain nicknames and
|
|
aliases for each of the persons of interest.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.PTRANS" id=
|
|
"RCL.SEARCH.PTRANS"></a>3.5. Path
|
|
translations</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>In some cases, the document paths stored inside the
|
|
index do not match the actual ones, so that document
|
|
previews and accesses will fail. This can occur in a number
|
|
of circumstances:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>When using multiple indexes it is a relatively
|
|
common occurrence that some will actually reside on a
|
|
            remote volume, for example mounted via NFS. In this
|
|
case, the paths used to access the documents on the
|
|
            local machine are not necessarily the same as the
|
|
ones used while indexing on the remote machine. For
|
|
example, <code class="filename">/home/me</code> may
|
|
            have been used as a <code class=
|
|
            "literal">topdirs</code> element while indexing, but
|
|
the directory might be mounted as <code class=
|
|
"filename">/net/server/home/me</code> on the local
|
|
machine.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The case may also occur with removable disks. It
|
|
is perfectly possible to configure an index to live
|
|
with the documents on the removable disk, but it may
|
|
happen that the disk is not mounted at the same place
|
|
            so that the document paths from the index are
|
|
invalid.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
            <p>As a last example, one could imagine that a big
|
|
directory has been moved, but that it is currently
|
|
inconvenient to run the indexer.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> has a facility
|
|
for rewriting access paths when extracting the data from
|
|
the index. The translations can be defined for the main
|
|
index and for any additional query index.</p>
|
|
|
|
<p>The path translation facility will be useful whenever
|
|
        the document paths seen by the indexer are not the same as
|
|
the ones which should be used at query time.</p>
|
|
|
|
<p>In the above NFS example, <span class=
|
|
"application">Recoll</span> could be instructed to rewrite
|
|
any <code class="filename">file:///home/me</code> URL from
|
|
the index to <code class=
|
|
"filename">file:///net/server/home/me</code>, allowing
|
|
accesses from the client.</p>
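        <p>The effect of such a translation can be pictured with the
        following Python sketch. It only illustrates prefix rewriting on
        <code class="literal">file://</code> URLs: <code class=
        "literal">translate_url</code> and the dictionary layout are
        hypothetical, and letting the longest matching prefix win is an
        assumption of the sketch, not a statement about how <span class=
        "application">Recoll</span> orders its rules.</p>
<pre class="programlisting">
def translate_url(url, ptrans):
    """Rewrite the path of a file:// URL using prefix translations."""
    scheme = "file://"
    if not url.startswith(scheme):
        return url
    path = url[len(scheme):]
    # Try the longest prefixes first so that the most specific rule wins.
    for old in sorted(ptrans, key=len, reverse=True):
        prefix = old.rstrip("/")
        # Only match whole path components, not partial directory names.
        if path == prefix or path.startswith(prefix + "/"):
            return scheme + ptrans[old].rstrip("/") + path[len(prefix):]
    return url

rules = {"/home/me": "/net/server/home/me"}
print(translate_url("file:///home/me/doc/notes.txt", rules))
</pre>
        <p>With the rule from the NFS example, the call prints
        <code class=
        "filename">file:///net/server/home/me/doc/notes.txt</code>.</p>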
|
|
|
|
<p>The translations are defined in the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG.PTRANS" title=
|
|
"5.4.7. The ptrans file"><code class=
|
|
"filename">ptrans</code></a> configuration file, which can
|
|
be edited by hand or from the GUI external indexes
|
|
configuration dialog: <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">External index dialog</span>, then click the
|
|
<span class="guilabel">Paths translations</span> button on
|
|
the right below the index list.</p>
|
|
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
|
|
<p>Due to a current bug, the GUI must be restarted after
|
|
changing the <code class="filename">ptrans</code> values
|
|
(even when they were changed from the GUI).</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.LANG" id=
|
|
"RCL.SEARCH.LANG"></a>3.6. The query
|
|
language</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The query language processor is activated in the GUI
|
|
simple search entry when the search mode selector is set to
|
|
<span class="guilabel">Query Language</span>. It can also
|
|
be used with the KIO slave or the command line search. It
|
|
broadly has the same capabilities as the complex search
|
|
interface in the GUI.</p>
|
|
|
|
<p>The language was based on the now defunct <a class=
|
|
"ulink" href=
|
|
"http://www.xesam.org/main/XesamUserSearchLanguage95"
|
|
target="_top">Xesam</a> user search language
|
|
specification.</p>
|
|
|
|
<p>If the results of a query language search puzzle you and
|
|
you doubt what has been actually searched for, you can use
|
|
the GUI <code class="literal">Show Query</code> link at the
|
|
top of the result list to check the exact query which was
|
|
finally executed by Xapian.</p>
|
|
|
|
<p>Here follows a sample request that we are going to
|
|
explain:</p>
|
|
<pre class="programlisting">
|
|
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
|
|
|
</pre>
|
|
|
|
<p>This would search for all documents with <em class=
|
|
"replaceable"><code>John Doe</code></em> appearing as a
|
|
phrase in the author field (exactly what this is would
|
|
        depend on the document type, e.g. the <code class=
|
|
"literal">From:</code> header, for an email message), and
|
|
containing either <em class=
|
|
"replaceable"><code>beatles</code></em> or <em class=
|
|
"replaceable"><code>lennon</code></em> and either
|
|
<em class="replaceable"><code>live</code></em> or
|
|
<em class="replaceable"><code>unplugged</code></em> but not
|
|
<em class="replaceable"><code>potatoes</code></em> (in any
|
|
part of the document).</p>
|
|
|
|
<p>An element is composed of an optional field
|
|
specification, and a value, separated by a colon (the field
|
|
separator is the last colon in the element). Examples:
|
|
<em class="replaceable"><code>Eugenie</code></em>,
|
|
<em class="replaceable"><code>author:balzac</code></em>,
|
|
        <em class="replaceable"><code>dc:title:grandet</code></em>,
|
|
<em class="replaceable"><code>dc:title:"eugenie
|
|
grandet"</code></em></p>
|
|
|
|
<p>The colon, if present, means "contains". Xesam defines
|
|
other relations, which are mostly unsupported for now
|
|
(except in special cases, described further down).</p>
|
|
|
|
<p>All elements in the search entry are normally combined
|
|
with an implicit AND. It is possible to specify that
|
|
elements be OR'ed instead, as in <em class=
|
|
"replaceable"><code>Beatles</code></em> <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>Lennon</code></em>. The <code class=
|
|
"literal">OR</code> must be entered literally (capitals),
|
|
and it has priority over the AND associations: <em class=
|
|
"replaceable"><code>word1</code></em> <em class=
|
|
"replaceable"><code>word2</code></em> <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>word3</code></em> means <em class=
|
|
"replaceable"><code>word1</code></em> AND (<em class=
|
|
"replaceable"><code>word2</code></em> <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>word3</code></em>) not (<em class=
|
|
"replaceable"><code>word1</code></em> AND <em class=
|
|
"replaceable"><code>word2</code></em>) <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>word3</code></em>.</p>
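        <p>The priority rule can be made concrete with a small Python
        sketch which groups the terms of a simple query the way the text
        describes. It is a deliberately naive illustration (hypothetical
        <code class="literal">group_terms</code> function, plain
        white-space splitting, no support for quotes, fields, negation
        or parentheses), not the actual query parser.</p>
<pre class="programlisting">
def group_terms(query):
    """Return the OR groups of a query; the groups are ANDed together."""
    groups = []
    join_next = False
    for token in query.split():
        if token == "OR":
            # The next term joins the group of the previous one.
            join_next = True
        elif join_next and groups:
            groups[-1].append(token)
            join_next = False
        else:
            groups.append([token])
            join_next = False
    return groups

print(group_terms("word1 word2 OR word3"))
print(group_terms("Beatles OR Lennon Live OR Unplugged"))
</pre>
        <p>The first call yields one group for <code class=
        "literal">word1</code> and one for <code class=
        "literal">word2</code> and <code class="literal">word3</code>
        together, that is, word1 AND (word2 OR word3).</p>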
|
|
|
|
<p><span class="application">Recoll</span> versions 1.21
|
|
        and later allow using parentheses to group elements, which
|
|
will sometimes make things clearer, and may allow
|
|
expressing combinations which would have been difficult
|
|
otherwise.</p>
|
|
|
|
<p>An element preceded by a <code class="literal">-</code>
|
|
specifies a term that should <span class=
|
|
"emphasis"><em>not</em></span> appear.</p>
|
|
|
|
<p>As usual, words inside quotes define a phrase (the order
|
|
of words is significant), so that <em class=
|
|
"replaceable"><code>title:"prejudice pride"</code></em> is
|
|
not the same as <em class=
|
|
"replaceable"><code>title:prejudice
|
|
title:pride</code></em>, and is unlikely to find a
|
|
result.</p>
|
|
|
|
<p>Words inside phrases and capitalized words are not
|
|
stem-expanded. Wildcards may be used anywhere inside a
|
|
term. Specifying a wild-card on the left of a term can
|
|
produce a very slow search (or even an incorrect one if the
|
|
expansion is truncated because of excessive size). Also see
|
|
<a class="link" href="#RCL.SEARCH.WILDCARDS" title=
|
|
"3.8.1. More about wildcards">More about
|
|
wildcards</a>.</p>
|
|
|
|
<p>To save you some typing, recent <span class=
|
|
"application">Recoll</span> versions (1.20 and later)
|
|
interpret a comma-separated list of terms as an AND list
|
|
inside the field. Use slash characters ('/') for an OR
|
|
list. No white space is allowed. So</p>
|
|
<pre class="programlisting">
|
|
author:john,lennon
|
|
</pre>
|
|
|
|
<p>will search for documents with <code class=
|
|
"literal">john</code> and <code class=
|
|
"literal">lennon</code> inside the <code class=
|
|
"literal">author</code> field (in any order), and</p>
|
|
<pre class="programlisting">
|
|
author:john/ringo
|
|
</pre>
|
|
|
|
<p>would search for <code class="literal">john</code> or
|
|
<code class="literal">ringo</code>.</p>
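        <p>To make the shorthand explicit, the Python sketch below
        rewrites such an element into the longer form it stands for. The
        <code class="literal">expand_list</code> function is
        hypothetical, and the set of fields left untouched (because
        their values normally contain slashes) is an assumption made for
        this illustration only.</p>
<pre class="programlisting">
# Fields whose values usually contain '/' (assumption of this sketch).
SPECIAL = {"dir", "mime", "format", "date"}

def expand_list(element):
    """Rewrite field:a,b as an AND list and field:a/b as an OR list."""
    # The field separator is the last colon in the element.
    field, colon, value = element.rpartition(":")
    if not colon or field in SPECIAL:
        return element
    if "," in value:
        return " ".join(field + ":" + term for term in value.split(","))
    if "/" in value:
        return " OR ".join(field + ":" + term for term in value.split("/"))
    return element

print(expand_list("author:john,lennon"))
print(expand_list("author:john/ringo"))
</pre>
        <p>The two calls print <code class="literal">author:john
        author:lennon</code> and <code class="literal">author:john OR
        author:ringo</code>.</p>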
|
|
|
|
<p>Modifiers can be set on a double-quote value, for
|
|
example to specify a proximity search (unordered). See
|
|
<a class="link" href="#RCL.SEARCH.LANG.MODIFIERS" title=
|
|
"3.6.1. Modifiers">the modifier section</a>. No space
|
|
must separate the final double-quote and the modifiers
|
|
value, e.g. <em class="replaceable"><code>"two
|
|
        one"po10</code></em>.</p>
|
|
|
|
<p><span class="application">Recoll</span> currently
|
|
manages the following default fields:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">title</code>, <code class=
|
|
"literal">subject</code> or <code class=
|
|
"literal">caption</code> are synonyms which specify
|
|
data to be searched for in the document title or
|
|
subject.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">author</code> or
|
|
<code class="literal">from</code> for searching the
|
|
            document originators.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">recipient</code> or
|
|
<code class="literal">to</code> for searching the
|
|
            document recipients.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">keyword</code> for searching
|
|
the document-specified keywords (few documents
|
|
actually have any).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">filename</code> for the
|
|
document's file name. This is not necessarily set for
|
|
all documents: internal documents contained inside a
|
|
compound one (for example an EPUB section) do not
|
|
inherit the container file name any more, this was
|
|
replaced by an explicit field (see next).
|
|
Sub-documents can still have a specific <code class=
|
|
"literal">filename</code>, if it is implied by the
|
|
document format, for example the attachment file name
|
|
for an email attachment.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">containerfilename</code>.
|
|
This is set for all documents, both top-level and
|
|
contained sub-documents, and is always the name of
|
|
the filesystem directory entry which contains the
|
|
data. The terms from this field can only be matched
|
|
by an explicit field specification (as opposed to
|
|
terms from <code class="literal">filename</code>
|
|
which are also indexed as general document content).
|
|
This avoids getting matches for all the sub-documents
|
|
when searching for the container file name.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">ext</code> specifies the
|
|
file name extension (Ex: <code class=
|
|
"literal">ext:html</code>)</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> 1.20 and later
|
|
have a way to specify aliases for the field names, which
|
|
will save typing, for example by aliasing <code class=
|
|
"literal">filename</code> to <em class=
|
|
"replaceable"><code>fn</code></em> or <code class=
|
|
"literal">containerfilename</code> to <em class=
|
|
"replaceable"><code>cfn</code></em>. See the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file">section about the
|
|
        <code class="filename">fields</code> file</a>.</p>
|
|
|
|
<p>The document input handlers used while indexing have the
|
|
possibility to create other fields with arbitrary names,
|
|
and aliases may be defined in the configuration, so that
|
|
the exact field search possibilities may be different for
|
|
you if someone took care of the customisation.</p>
|
|
|
|
<p>The field syntax also supports a few field-like, but
|
|
special, criteria:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">dir</code> for filtering the
|
|
results on file location (Ex: <code class=
|
|
"literal">dir:/home/me/somedir</code>). <code class=
|
|
"literal">-dir</code> also works to find results not
|
|
in the specified directory (release >= 1.15.8).
|
|
Tilde expansion will be performed as usual (except
|
|
for a bug in versions 1.19 to 1.19.11p1). Wildcards
|
|
will be expanded, but please <a class="link" href=
|
|
"#RCL.SEARCH.WILDCARDS.PATH" title=
|
|
"3.8.1.1. Wildcards and path filtering">have a
|
|
look</a> at an important limitation of wildcards in
|
|
path filters.</p>
|
|
|
|
<p>Relative paths also make sense, for example,
|
|
<code class="literal">dir:share/doc</code> would
|
|
match either <code class=
|
|
"filename">/usr/share/doc</code> or <code class=
|
|
"filename">/usr/local/share/doc</code></p>
|
|
|
|
<p>Several <code class="literal">dir</code> clauses
|
|
can be specified, both positive and negative. For
|
|
example the following makes sense:</p>
|
|
<pre class="programlisting">
|
|
dir:recoll dir:src -dir:utils -dir:common
|
|
|
|
</pre>
|
|
|
|
<p>This would select results which have both
|
|
<code class="filename">recoll</code> and <code class=
|
|
"filename">src</code> in the path (in any order), and
|
|
            which do not have either <code class=
|
|
"filename">utils</code> or <code class=
|
|
"filename">common</code>.</p>
|
|
|
|
<p>You can also use <code class="literal">OR</code>
|
|
conjunctions with <code class="literal">dir:</code>
|
|
clauses.</p>
|
|
|
|
<p>A special aspect of <code class=
|
|
"literal">dir</code> clauses is that the values in
|
|
the index are not transcoded to UTF-8, and never
|
|
lower-cased or unaccented, but stored as binary. This
|
|
means that you need to enter the values in the exact
|
|
lower or upper case, and that searches for names with
|
|
diacritics may sometimes be impossible because of
|
|
character set conversion issues. Non-ASCII UNIX file
|
|
paths are an unending source of trouble and are best
|
|
avoided.</p>
|
|
|
|
<p>You need to use double-quotes around the path
|
|
value if it contains space characters.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">size</code> for filtering
|
|
the results on file size. Example: <code class=
|
|
"literal">size<10000</code>. You can use
|
|
<code class="literal"><</code>, <code class=
|
|
"literal">></code> or <code class=
|
|
"literal">=</code> as operators. You can specify a
|
|
range like the following: <code class=
|
|
"literal">size>100 size<1000</code>. The usual
|
|
<code class="literal">k/K, m/M, g/G, t/T</code> can
|
|
be used as (decimal) multipliers. Ex: <code class=
|
|
"literal">size>1k</code> to search for files
|
|
bigger than 1000 bytes.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">date</code> for searching or
|
|
filtering on dates. The syntax for the argument is
|
|
based on the ISO8601 standard for dates and time
|
|
intervals. Only dates are supported, no times. The
|
|
general syntax is 2 elements separated by a
|
|
<code class="literal">/</code> character. Each
|
|
element can be a date or a period of time. Periods
|
|
are specified as <code class=
|
|
"literal">P</code><em class=
|
|
"replaceable"><code>n</code></em><code class=
|
|
"literal">Y</code><em class=
|
|
"replaceable"><code>n</code></em><code class=
|
|
"literal">M</code><em class=
|
|
"replaceable"><code>n</code></em><code class=
|
|
"literal">D</code>. The <em class=
|
|
"replaceable"><code>n</code></em> numbers are the
|
|
respective numbers of years, months or days, any of
|
|
which may be missing. Dates are specified as
|
|
<em class=
|
|
"replaceable"><code>YYYY</code></em>-<em class=
|
|
"replaceable"><code>MM</code></em>-<em class=
|
|
"replaceable"><code>DD</code></em>. The days and
|
|
months parts may be missing. If the <code class=
|
|
"literal">/</code> is present but an element is
|
|
missing, the missing element is interpreted as the
|
|
lowest or highest date in the index. Examples:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: circle;">
|
|
<li class="listitem">
|
|
<p><code class=
|
|
"literal">2001-03-01/2002-05-01</code> the
|
|
basic syntax for an interval of dates.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class=
|
|
"literal">2001-03-01/P1Y2M</code> the same
|
|
specified with a period.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">2001/</code> from the
|
|
beginning of 2001 to the latest date in the
|
|
index.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">2001</code> the whole
|
|
year of 2001</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">P2D/</code> means 2
|
|
days ago up to now if there are no documents
|
|
with dates in the future.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">/2003</code> all
|
|
documents from 2003 or older.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Periods can also be specified with small letters
|
|
            (e.g. p2y).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">mime</code> or <code class=
|
|
"literal">format</code> for specifying the MIME type.
|
|
            These clauses are processed outside the normal
|
|
Boolean logic of the search. Multiple values will be
|
|
OR'ed (instead of the normal AND). You can specify
|
|
types to be excluded, with the usual <code class=
|
|
"literal">-</code>, and use wildcards. Example:
|
|
<em class="replaceable"><code>mime:text/*
|
|
-mime:text/plain</code></em> Specifying an explicit
|
|
boolean operator before a <code class=
|
|
"literal">mime</code> specification is not supported
|
|
and will produce strange results.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">type</code> or <code class=
|
|
"literal">rclcat</code> for specifying the category
|
|
(as in text/media/presentation/etc.). The
|
|
classification of MIME types in categories is defined
|
|
in the <span class="application">Recoll</span>
|
|
configuration (<code class=
|
|
"filename">mimeconf</code>), and can be modified or
|
|
extended. The default category names are those which
|
|
permit filtering results in the main GUI screen.
|
|
Categories are OR'ed like MIME types above, and can
|
|
be negated with <code class="literal">-</code>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
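        <p>As an illustration of the date interval syntax described above,
        here is a minimal Python sketch showing why
        <code class="literal">2001-03-01/2002-05-01</code> and
        <code class="literal">2001-03-01/P1Y2M</code> denote the same
        interval. The function names are hypothetical and this is not
        the parser used by <span class="application">Recoll</span>. It
        only handles full dates with both ends given, because the
        open-ended forms depend on the dates present in the index.</p>

```python
import re
from datetime import date, timedelta

# Hypothetical helper names; this is not Recoll's own parser.
PERIOD_RE = re.compile(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", re.IGNORECASE)


def add_period(start, period):
    """Return start shifted forward by a period such as P1Y2M or p2d."""
    match = PERIOD_RE.match(period)
    if match is None or not any(match.groups()):
        raise ValueError("not a period: " + period)
    years, months, days = (int(g) if g else 0 for g in match.groups())
    # Add years and months on the calendar, then plain days.
    # Note: date() raises ValueError if the day does not exist in the
    # target month (for example January 31st plus one month).
    month_index = (start.month - 1) + months
    year = start.year + years + month_index // 12
    month = month_index % 12 + 1
    return date(year, month, start.day) + timedelta(days=days)


def parse_interval(spec):
    """Parse 'YYYY-MM-DD/YYYY-MM-DD' or 'YYYY-MM-DD/PnYnMnD' into two dates."""
    first, second = spec.split("/")
    begin = date.fromisoformat(first)
    if second[:1] in ("P", "p"):
        end = add_period(begin, second)
    else:
        end = date.fromisoformat(second)
    return begin, end


print(parse_interval("2001-03-01/2002-05-01"))
print(parse_interval("2001-03-01/P1Y2M"))
```

        <p>Both calls print the same pair of dates, from 1 March 2001 to
        1 May 2002.</p>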
|
|
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
|
|
<p><code class="literal">mime</code>, <code class=
|
|
"literal">rclcat</code>, <code class=
|
|
"literal">size</code> and <code class=
|
|
"literal">date</code> criteria always affect the whole
|
|
query (they are applied as a final filter), even if set
|
|
          with other terms inside parentheses.</p>
|
|
</div>
|
|
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
|
|
<p><code class="literal">mime</code> (or the equivalent
|
|
<code class="literal">rclcat</code>) is the <span class=
|
|
"emphasis"><em>only</em></span> field with an
|
|
<code class="literal">OR</code> default. You do need to
|
|
          use <code class="literal">OR</code> with <code class=
          "literal">ext</code> terms, for example.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.LANG.MODIFIERS" id=
|
|
"RCL.SEARCH.LANG.MODIFIERS"></a>3.6.1. Modifiers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Some characters are recognized as search modifiers
|
|
when found immediately after the closing double quote of
|
|
a phrase, as in <code class="literal">"some
|
|
term"modifierchars</code>. The actual "phrase" can be a
|
|
single term of course. Supported modifiers:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">l</code> can be used to
|
|
turn off stemming (mostly makes sense with
|
|
<code class="literal">p</code> because stemming is
|
|
off by default for phrases).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">s</code> can be used to
|
|
turn off synonym expansion, if a synonyms file is
|
|
in place (only for <span class=
|
|
"application">Recoll</span> 1.22 and later).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">o</code> can be used to
|
|
specify a "slack" for phrase and proximity
|
|
searches: the number of additional terms that may
|
|
be found between the specified ones. If
|
|
<code class="literal">o</code> is followed by an
|
|
integer number, this is the slack, else the default
|
|
is 10.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">p</code> can be used to
|
|
turn the default phrase search into a proximity one
|
|
(unordered). Example: <code class="literal">"order
|
|
any in"p</code></p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">C</code> will turn on case
|
|
sensitivity (if the index supports it).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">D</code> will turn on
|
|
diacritics sensitivity (if the index supports
|
|
it).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>A weight can be specified for a query element by
|
|
specifying a decimal value at the start of the
|
|
modifiers. Example: <code class=
|
|
"literal">"Important"2.5</code>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
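          <p>To make the modifier list above concrete, the following
          Python sketch splits a modifier string (the characters found
          after the closing double quote) into its parts. It is an
          assumption-laden illustration, not the
          <span class="application">Recoll</span> parser: the function
          name and the dictionary keys are invented here, and the only
          facts taken from the text are the letters, the optional
          leading decimal weight, and the default slack of 10.</p>

```python
import re

SLACK_DEFAULT = 10  # slack used when "o" is not followed by a number


def parse_modifiers(modifiers):
    """Split a modifier string such as '2.5pC' or 'o3p' into a dict."""
    result = {"weight": None, "slack": None, "stem_off": False,
              "synonyms_off": False, "proximity": False,
              "case_sensitive": False, "diacritics_sensitive": False}
    rest = modifiers
    # An optional decimal weight comes first.
    weight = re.match(r"\d+(?:\.\d+)?", modifiers)
    if weight:
        result["weight"] = float(weight.group())
        rest = modifiers[weight.end():]
    # The remaining characters are single letters; "o" may carry a number.
    for letter, number in re.findall(r"([a-zA-Z])(\d*)", rest):
        if number and letter != "o":
            raise ValueError("only 'o' takes a number: " + letter + number)
        if letter == "l":
            result["stem_off"] = True
        elif letter == "s":
            result["synonyms_off"] = True
        elif letter == "o":
            result["slack"] = int(number) if number else SLACK_DEFAULT
        elif letter == "p":
            result["proximity"] = True
        elif letter == "C":
            result["case_sensitive"] = True
        elif letter == "D":
            result["diacritics_sensitive"] = True
        else:
            raise ValueError("unknown modifier: " + letter)
    return result


print(parse_modifiers("2.5"))   # the modifiers of "Important"2.5
print(parse_modifiers("o3p"))   # proximity search with a slack of 3
```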
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.CASEDIAC" id=
|
|
"RCL.SEARCH.CASEDIAC"></a>3.7. Search case and
|
|
diacritics sensitivity</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>For <span class="application">Recoll</span> versions
|
|
1.18 and later, and <span class="emphasis"><em>when working
|
|
with a raw index</em></span> (not the default), searches
|
|
can be sensitive to character case and diacritics. How this
|
|
happens is controlled by configuration variables and what
|
|
search data is entered.</p>
|
|
|
|
<p>The general default is that searches entered without
|
|
upper-case or accented characters are insensitive to case
|
|
and diacritics. An entry of <code class=
|
|
"literal">resume</code> will match any of <code class=
|
|
"literal">Resume</code>, <code class=
|
|
"literal">RESUME</code>, <code class=
|
|
"literal">résumé</code>, <code class=
|
|
"literal">Résumé</code> etc.</p>
|
|
|
|
<p>Two configuration variables can automate switching on
|
|
sensitivity (they were documented but actually did nothing
|
|
until <span class="application">Recoll</span> 1.22):</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">autodiacsens</span></dt>
|
|
|
|
<dd>
|
|
<p>If this is set, search sensitivity to diacritics
|
|
will be turned on as soon as an accented character
|
|
exists in a search term. When the variable is set to
|
|
true, <code class="literal">resume</code> will start
|
|
            a diacritics-insensitive search, but <code class=
|
|
"literal">résumé</code> will be matched
|
|
exactly. The default value is <span class=
|
|
"emphasis"><em>false</em></span>.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">autocasesens</span></dt>
|
|
|
|
<dd>
|
|
<p>If this is set, search sensitivity to character
|
|
case will be turned on as soon as an upper-case
|
|
character exists in a search term <span class=
|
|
"emphasis"><em>except for the first one</em></span>.
|
|
When the variable is set to true, <code class=
|
|
"literal">us</code> or <code class=
|
|
"literal">Us</code> will start a
|
|
            case-insensitive search, but <code class=
|
|
"literal">US</code> will be matched exactly. The
|
|
default value is <span class=
|
|
"emphasis"><em>true</em></span> (contrary to
|
|
<code class="literal">autodiacsens</code>).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
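      <p>The effect of the two variables can be summarized with a short
      Python sketch. This is only a model of the rules stated above,
      under the assumption that "accented character" means a character
      that decomposes into a base letter plus a combining mark; the
      function names are hypothetical and the actual implementation
      in <span class="application">Recoll</span> may differ.</p>

```python
import unicodedata


def has_diacritics(term):
    """True if a character decomposes into a base letter plus combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)


def sensitivity(term, autodiacsens=False, autocasesens=True):
    """Return (case_sensitive, diacritics_sensitive) for one search term."""
    # Upper-case letters count only after the first character: an initial
    # capital is not taken into account by autocasesens.
    upper_after_first = any(ch.isupper() for ch in term[1:])
    case_sensitive = autocasesens and upper_after_first
    diac_sensitive = autodiacsens and has_diacritics(term)
    return case_sensitive, diac_sensitive


# "r\u00e9sum\u00e9" is the accented spelling of resume, written with escapes.
for word in ("us", "Us", "US", "resume", "r\u00e9sum\u00e9"):
    print(word, sensitivity(word, autodiacsens=True))
```

      <p>With both variables set, only <code class="literal">US</code>
      triggers case sensitivity and only the accented spelling triggers
      diacritics sensitivity.</p>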
|
|
|
|
<p>As in the past, capitalizing the first letter of a word
|
|
will turn off its stem expansion and have no effect on
|
|
case-sensitivity.</p>
|
|
|
|
      <p>You can also explicitly activate case and diacritics
|
|
sensitivity by using modifiers with the query language.
|
|
<code class="literal">C</code> will make the term
|
|
case-sensitive, and <code class="literal">D</code> will
|
|
make it diacritics-sensitive. Examples:</p>
|
|
      <pre class="programlisting">
        "us"C
</pre>
|
|
|
|
<p>will search for the term <code class="literal">us</code>
|
|
exactly (<code class="literal">Us</code> will not be a
|
|
match).</p>
|
|
      <pre class="programlisting">
        "resume"D
</pre>
|
|
|
|
<p>will search for the term <code class=
|
|
"literal">resume</code> exactly (<code class=
|
|
"literal">résumé</code> will not be a
|
|
match).</p>
|
|
|
|
<p>When either case or diacritics sensitivity is activated,
|
|
stem expansion is turned off. Having both does not make
|
|
much sense.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.ANCHORWILD" id=
|
|
"RCL.SEARCH.ANCHORWILD"></a>3.8. Anchored
|
|
searches and wildcards</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Some special characters are interpreted by <span class=
|
|
"application">Recoll</span> in search strings to expand or
|
|
specialize the search. Wildcards expand a root term in
|
|
controlled ways. Anchor characters can restrict a search to
|
|
succeed only if the match is found at or near the beginning
|
|
of the document or one of its fields.</p>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.WILDCARDS"
|
|
id="RCL.SEARCH.WILDCARDS"></a>3.8.1. More
|
|
about wildcards</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>All words entered in <span class=
|
|
"application">Recoll</span> search fields will be
|
|
processed for wildcard expansion before the request is
|
|
finally executed.</p>
|
|
|
|
<p>The wildcard characters are:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">*</code> which matches 0
|
|
or more characters.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">?</code> which matches a
|
|
single character.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
              <p><code class="literal">[]</code> which allows
              defining sets of characters to be matched (e.g.
              <code class="literal">[</code><strong class=
              "userinput"><code>abc</code></strong><code class=
              "literal">]</code> matches a single character which
              may be 'a' or 'b' or 'c', and <code class=
              "literal">[</code><strong class=
              "userinput"><code>0-9</code></strong><code class=
              "literal">]</code> matches any single digit).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
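        <p>The three wildcard forms behave like shell glob patterns.
        The sketch below uses the Python standard
        <code class="literal">fnmatch</code> module as a stand-in to
        show what each form matches against a made-up list of index
        terms. It is an analogy only: the term list is invented, and
        <span class="application">Recoll</span> performs its own
        expansion against the real index, which may differ in
        details.</p>

```python
from fnmatch import fnmatchcase

# Made-up term list standing in for the index term list.
INDEX_TERMS = ["doc", "docs", "document", "dot", "d0t", "dog", "undo"]


def expand(pattern, terms=INDEX_TERMS):
    """Return the terms matched by a pattern using *, ? and [] wildcards."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(expand("doc*"))      # *: zero or more characters
print(expand("do?"))       # ?: exactly one character
print(expand("d[o0]t"))    # []: one character taken from the set
print(expand("d[0-9]t"))   # a range inside []
```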
|
|
|
|
<p>You should be aware of a few things when using
|
|
wildcards.</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Using a wildcard character at the beginning of a
|
|
word can make for a slow search because
|
|
<span class="application">Recoll</span> will have
|
|
to scan the whole index term list to find the
|
|
matches. However, this is much less a problem for
|
|
field searches, and queries like <em class=
|
|
"replaceable"><code>author:*@domain.com</code></em>
|
|
can sometimes be very useful.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>For <span class="application">Recoll</span>
|
|
              version 1.18 only, when working with a raw index
              (preserving character case and diacritics), the
              literal part of a wildcard expression will be
              matched exactly for case and diacritics. This is
              no longer true for versions 1.19 and later.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Using a <code class="literal">*</code> at the
|
|
end of a word can produce more matches than you
|
|
would think, and strange search results. You can
|
|
use the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.TERMEXPLORER" title=
|
|
"3.1.9. The term explorer tool">term
|
|
explorer</a> tool to check what completions exist
|
|
for a given term. You can also see exactly what
|
|
search was performed by clicking on the link at the
|
|
top of the result list. In general, for natural
|
|
language terms, stem expansion will produce better
|
|
results than an ending <code class=
|
|
"literal">*</code> (stem expansion is turned off
|
|
when any wildcard character appears in the
|
|
term).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.WILDCARDS.PATH" id=
|
|
"RCL.SEARCH.WILDCARDS.PATH"></a>3.8.1.1. Wildcards
|
|
and path filtering</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Due to the way that <span class=
|
|
"application">Recoll</span> processes wildcards inside
|
|
<code class="literal">dir</code> path filtering
|
|
clauses, they will have a multiplicative effect on the
|
|
            query size. A clause containing wildcards in several
            path elements, for example <code class=
|
|
"literal">dir:</code><em class=
|
|
"replaceable"><code>/home/me/*/*/docdir</code></em>,
|
|
will almost certainly fail if your indexed tree is of
|
|
any realistic size.</p>
|
|
|
|
<p>Depending on the case, you may be able to work
|
|
            around the issue by specifying the path elements more
            narrowly, with a constant prefix, or by using two
|
|
separate <code class="literal">dir:</code> clauses
|
|
instead of multiple wildcards, as in <code class=
|
|
"literal">dir:</code><em class=
|
|
"replaceable"><code>/home/me</code></em> <code class=
|
|
"literal">dir:</code><em class=
|
|
"replaceable"><code>docdir</code></em>. The latter
|
|
query is not equivalent to the initial one because it
|
|
does not specify a number of directory levels, but
|
|
that's the best we can do (and it may be actually more
|
|
useful in some cases).</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.ANCHOR" id=
|
|
"RCL.SEARCH.ANCHOR"></a>3.8.2. Anchored
|
|
searches</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Two characters are used to specify that a search hit
|
|
should occur at the beginning or at the end of the text.
|
|
          <code class="literal">^</code> at the beginning of a term
          or phrase constrains the match to occur at the start of the text;
          <code class="literal">$</code> at the end forces it to
          occur at the end.</p>
|
|
|
|
          <p>As this function is implemented as a phrase search, it
|
|
is possible to specify a maximum distance at which the
|
|
hit should occur, either through the controls of the
|
|
advanced search panel, or using the query language, for
|
|
example, as in:</p>
|
|
          <pre class="programlisting">
"^someterm"o10
</pre>
|
|
|
|
<p>which would force <code class=
|
|
"literal">someterm</code> to be found within 10 terms of
|
|
the start of the text. This can be combined with a field
|
|
search as in <code class=
|
|
"literal">somefield:"^someterm"o10</code> or <code class=
|
|
"literal">somefield:someterm$</code>.</p>
|
|
|
|
<p>This feature can also be used with an actual phrase
|
|
search, but in this case, the distance applies to the
|
|
whole phrase and anchor, so that, for example,
|
|
<code class="literal">bla bla my unexpected term</code>
|
|
at the beginning of the text would be a match for
|
|
<code class="literal">"^my term"o5</code>.</p>
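          <p>The interaction between the anchor and the slack can be
          modelled with a few lines of Python. This sketch rests on an
          assumption: the slack is counted as the number of extra terms
          found between the start of the text and the last term of the
          phrase. The function name is hypothetical, only the
          <code class="literal">^</code> anchor is modelled, and the
          position arithmetic actually used by
          <span class="application">Recoll</span> may differ.</p>

```python
def anchored_match(text, phrase, slack=0):
    """True if the phrase terms occur in order, close to the start of the text."""
    terms = text.lower().split()
    position = -1  # position of the virtual start-of-text anchor
    for wanted in phrase.lower().split():
        try:
            # Earliest occurrence after the previous phrase term.
            position = terms.index(wanted, position + 1)
        except ValueError:
            return False
    # Terms skipped between the anchor and the last phrase term.
    extra = position + 1 - len(phrase.split())
    return extra <= slack


text = "bla bla my unexpected term and more"
print(anchored_match(text, "my term", slack=5))   # three extra terms: match
print(anchored_match(text, "my term", slack=2))   # slack too small: no match
```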
|
|
|
|
<p>Anchored searches can be very useful for searches
|
|
inside somewhat structured documents like scientific
|
|
articles, in case explicit metadata has not been supplied
|
|
(a most frequent case), for example for looking for
|
|
matches inside the abstract or the list of authors (which
|
|
occur at the top of the document).</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.DESKTOP" id=
|
|
"RCL.SEARCH.DESKTOP"></a>3.9. Desktop
|
|
integration</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
      <p>Being independent of the desktop type has its drawbacks:
      <span class="application">Recoll</span> desktop integration
      is minimal. However, there are a few tools available:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The <span class="application">KDE</span> KIO Slave
|
|
was described in a <a class="link" href=
|
|
"#RCL.SEARCH.KIO" title=
|
|
"3.2. Searching with the KDE KIO slave">previous
|
|
section</a>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>If you use a recent version of Ubuntu Linux, you
|
|
may find the <a class="ulink" href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/UnityLens"
|
|
target="_top">Ubuntu Unity Lens</a> module
|
|
useful.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
            <p>There is also an independently developed <a class=
|
|
"ulink" href=
|
|
"http://kde-apps.org/content/show.php/recollrunner?content=128203"
|
|
target="_top">Krunner plugin</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Here follow a few other things that may help.</p>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.SHORTCUT" id=
|
|
"RCL.SEARCH.SHORTCUT"></a>3.9.1. Hotkeying
|
|
recoll</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>It is surprisingly convenient to be able to show or
|
|
hide the <span class="application">Recoll</span> GUI with
|
|
a single keystroke. Recoll comes with a small Python
|
|
script, based on the <span class=
|
|
"application">libwnck</span> window manager interface
|
|
library, which will allow you to do just this. The
|
|
detailed instructions are on <a class="ulink" href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/HotRecoll"
|
|
target="_top">this wiki page</a>.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.KICKER-APPLET" id=
|
|
"RCL.KICKER-APPLET"></a>3.9.2. The KDE Kicker
|
|
Recoll applet</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>This is probably obsolete now. Anyway:</p>
|
|
|
|
<p>The <span class="application">Recoll</span> source
|
|
tree contains the source code to the <span class=
|
|
"application">recoll_applet</span>, a small application
|
|
derived from the <span class=
|
|
"application">find_applet</span>. This can be used to add
|
|
a small <span class="application">Recoll</span> launcher
|
|
to the KDE panel.</p>
|
|
|
|
<p>The applet is not automatically built with the main
|
|
<span class="application">Recoll</span> programs, nor is
|
|
it included with the main source distribution (because
|
|
the KDE build boilerplate makes it relatively big). You
|
|
can download its source from the recoll.org download
|
|
page. Use the omnipotent <strong class=
|
|
"userinput"><code>configure;make;make
|
|
install</code></strong> incantation to build and
|
|
install.</p>
|
|
|
|
<p>You can then add the applet to the panel by
|
|
right-clicking the panel and choosing the <span class=
|
|
"guilabel">Add applet</span> entry.</p>
|
|
|
|
<p>The <span class="application">recoll_applet</span> has
|
|
a small text window where you can type a <span class=
|
|
"application">Recoll</span> query (in query language
|
|
form), and an icon which can be used to restrict the
|
|
search to certain types of files. It is quite primitive,
|
|
and launches a new recoll GUI instance every time (even
|
|
if it is already running). You may find it useful
|
|
anyway.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.PROGRAM" id=
|
|
"RCL.PROGRAM"></a>Chapter 4. Programming
|
|
interface</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> has an Application
|
|
Programming Interface, usable both for indexing and
|
|
searching, currently accessible from the <span class=
|
|
"application">Python</span> language.</p>
|
|
|
|
<p>Another less radical way to extend the application is to
|
|
write input handlers for new types of documents.</p>
|
|
|
|
<p>The processing of metadata attributes for documents
|
|
(<code class="literal">fields</code>) is highly
|
|
configurable.</p>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.PROGRAM.FILTERS" id=
|
|
"RCL.PROGRAM.FILTERS"></a>4.1. Writing a
|
|
document input handler</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Terminology</h3>The small programs or
|
|
pieces of code which handle the processing of the
|
|
different document types for <span class=
|
|
"application">Recoll</span> used to be called
|
|
<code class="literal">filters</code>, which is still
|
|
reflected in the name of the directory which holds them
|
|
and many configuration variables. They were named this
|
|
way because one of their primary functions is to filter
|
|
out the formatting directives and keep the text content.
|
|
However these modules may have other behaviours, and the
|
|
term <code class="literal">input handler</code> is now
|
|
progressively substituted in the documentation.
|
|
<code class="literal">filter</code> is still used in many
|
|
places though.
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> input handlers
|
|
cooperate to translate from the multitude of input document
|
|
      formats, simple ones such as <span class=
      "application">opendocument</span> or <span class=
      "application">acrobat</span>, or compound ones such as
      <span class="application">Zip</span> or <span class=
      "application">Email</span>, into the final <span class=
|
|
"application">Recoll</span> indexing input format, which is
|
|
plain text. Most input handlers are executable programs or
|
|
scripts. A few handlers are coded in C++ and live inside
|
|
<span class="command"><strong>recollindex</strong></span>.
|
|
This latter kind will not be described here.</p>
|
|
|
|
      <p>Since version 1.13 (and still as of 1.18), there are two kinds of
|
|
external executable input handlers:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Simple <code class="literal">exec</code> handlers
|
|
run once and exit. They can be bare programs like
|
|
<span class=
|
|
"command"><strong>antiword</strong></span>, or
|
|
scripts using other programs. They are very simple to
|
|
write, because they just need to print the converted
|
|
document to the standard output. Their output can be
|
|
plain text or HTML. HTML is usually preferred because
|
|
it can store metadata fields and it allows preserving
|
|
some of the formatting for the GUI preview.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Multiple <code class="literal">execm</code>
|
|
handlers can process multiple files (sparing the
|
|
process startup time which can be very significant),
|
|
or multiple documents per file (e.g.: for
|
|
<span class="application">zip</span> or <span class=
|
|
"application">chm</span> files). They communicate
|
|
with the indexer through a simple protocol, but are
|
|
nevertheless a bit more complicated than the older
|
|
            kind. Most new handlers are written in
|
|
<span class="application">Python</span>, using a
|
|
common module to handle the protocol. There is an
|
|
exception, <span class=
|
|
"command"><strong>rclimg</strong></span> which is
|
|
written in Perl. The subdocuments output by these
|
|
handlers can be directly indexable (text or HTML), or
|
|
they can be other simple or compound documents that
|
|
will need to be processed by another handler.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>In both cases, handlers deal with regular file system
|
|
files, and can process either a single document, or a
|
|
linear list of documents in each file. <span class=
|
|
"application">Recoll</span> is responsible for performing
|
|
      up-to-date checks, dealing with more complex embedding and
      other higher-level issues.</p>
|
|
|
|
      <p>A simple handler returning a document in <code class=
      "literal">text/plain</code> format cannot transfer any
|
|
metadata to the indexer. Generic metadata, like document
|
|
size or modification date, will be gathered and stored by
|
|
the indexer.</p>
|
|
|
|
<p>Handlers that produce <code class=
|
|
"literal">text/html</code> format can return an arbitrary
|
|
amount of metadata inside HTML <code class=
|
|
"literal">meta</code> tags. These will be processed
|
|
according to the directives found in the <a class="link"
|
|
href="#RCL.PROGRAM.FIELDS" title=
|
|
"4.2. Field data processing"><code class=
|
|
"filename">fields</code> configuration file</a>.</p>
|
|
|
|
<p>The handlers that can handle multiple documents per file
|
|
return a single piece of data to identify each document
|
|
inside the file. This piece of data, called an <code class=
|
|
"literal">ipath element</code> will be sent back by
|
|
<span class="application">Recoll</span> to extract the
|
|
document at query time, for previewing, or for creating a
|
|
temporary file to be opened by a viewer.</p>
|
|
|
|
<p>The following section describes the simple handlers, and
|
|
the next one gives a few explanations about the
|
|
<code class="literal">execm</code> ones. You could
|
|
conceivably write a simple handler with only the elements
|
|
in the manual. This will not be the case for the other
|
|
ones, for which you will have to look at the code.</p>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.SIMPLE" id=
|
|
"RCL.PROGRAM.FILTERS.SIMPLE"></a>4.1.1. Simple
|
|
input handlers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> simple
|
|
        handlers are usually shell scripts, but this is in no way
|
|
necessary. Extracting the text from the native format is
|
|
the difficult part. Outputting the format expected by
|
|
<span class="application">Recoll</span> is trivial.
|
|
Happily enough, most document formats have translators or
|
|
text extractors which can be called from the handler. In
|
|
some cases the output of the translating program is
|
|
        completely appropriate, and no intermediate shell script
|
|
is needed.</p>
|
|
|
|
<p>Input handlers are called with a single argument which
|
|
is the source file name. They should output the result to
|
|
stdout.</p>
|
|
|
|
<p>When writing a handler, you should decide if it will
|
|
output plain text or HTML. Plain text is simpler, but you
|
|
will not be able to add metadata or vary the output
|
|
character encoding (this will be defined in a
|
|
configuration file). Additionally, some formatting may be
|
|
easier to preserve when previewing HTML. Actually the
|
|
deciding factor is metadata: <span class=
|
|
"application">Recoll</span> has a way to <a class="link"
|
|
href="#RCL.PROGRAM.FILTERS.HTML" title=
|
|
"4.1.4. Input handler HTML output">extract metadata
|
|
from the HTML header and use it for field
|
|
        searches</a>.</p>
|
|
|
|
<p>The <code class=
|
|
"envar">RECOLL_FILTER_FORPREVIEW</code> environment
|
|
variable (values <code class="literal">yes</code>,
|
|
<code class="literal">no</code>) tells the handler if the
|
|
operation is for indexing or previewing. Some handlers
|
|
use this to output a slightly different format, for
|
|
        example stripping uninteresting repeated keywords (e.g.
|
|
<code class="literal">Subject:</code> for email) when
|
|
indexing. This is not essential.</p>
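        <p>The calling convention described above (one file name
        argument, result on standard output, and the
        <code class="envar">RECOLL_FILTER_FORPREVIEW</code> variable)
        can be sketched as follows. This is a hypothetical plain-text
        handler written in Python for illustration, not one of the
        handlers shipped with <span class="application">Recoll</span>;
        the function names and the choice of
        <code class="literal">Subject:</code> as the stripped keyword
        are assumptions.</p>

```python
import os
import sys

KEYWORD = "Subject:"  # example of a repeated keyword, as mentioned above


def filter_lines(lines, for_preview):
    """Return the lines to output; strip the keyword only when indexing."""
    if for_preview:
        return list(lines)
    result = []
    for line in lines:
        if line.startswith(KEYWORD):
            line = line[len(KEYWORD):].lstrip()
        result.append(line)
    return result


def run(path):
    """Convert one file and write the result to standard output."""
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    with open(path, encoding="utf-8", errors="replace") as source:
        lines = source.read().splitlines()
    sys.stdout.write("\n".join(filter_lines(lines, for_preview)) + "\n")


# The handler receives the source file name as its single argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    run(sys.argv[1])
```

        <p>Because the output is plain text, no metadata can be
        returned and the character encoding has to be declared in the
        configuration rather than in the output itself.</p>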
|
|
|
|
<p>You should look at one of the simple handlers, for
|
|
example <span class=
|
|
"command"><strong>rclps</strong></span> for a starting
|
|
point.</p>
|
|
|
|
        <p>Don't forget to make your handler executable before
        testing!</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.MULTIPLE" id=
|
|
"RCL.PROGRAM.FILTERS.MULTIPLE"></a>4.1.2. "Multiple"
|
|
handlers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>If you can program and want to write an <code class=
|
|
"literal">execm</code> handler, it should not be too
|
|
difficult to make sense of one of the existing modules.
|
|
For example, look at <span class=
|
|
"command"><strong>rclzip</strong></span> which uses Zip
|
|
file paths as identifiers (<code class=
|
|
"literal">ipath</code>), and <span class=
|
|
"command"><strong>rclics</strong></span>, which uses an
|
|
integer index. Also have a look at the comments inside
|
|
the <code class="filename">internfile/mh_execm.h</code>
|
|
file and possibly at the corresponding module.</p>
|
|
|
|
<p><code class="literal">execm</code> handlers sometimes
|
|
need to make a choice for the nature of the <code class=
|
|
"literal">ipath</code> elements that they use in
|
|
communication with the indexer. Here are a few
|
|
guidelines:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
              <p>Use ASCII or UTF-8 (if the identifier is an
              integer, print it as <code class="literal">printf %d</code> would
              do).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>If at all possible, the data should make some
|
|
kind of sense when printed to a log file to help
|
|
with debugging.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class="application">Recoll</span> uses a
|
|
colon (<code class="literal">:</code>) as a
|
|
separator to store a complex path internally (for
|
|
deeper embedding). Colons inside the <code class=
|
|
"literal">ipath</code> elements output by a handler
|
|
will be escaped, but would be a bad choice as a
|
|
handler-specific separator (mostly, again, for
|
|
debugging issues).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>In any case, the main goal is that it should be easy
|
|
for the handler to extract the target document, given the
|
|
file name and the <code class="literal">ipath</code>
|
|
element.</p>
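        <p>As an illustration of this goal, the sketch below uses
        member paths inside a Zip archive as
        <code class="literal">ipath</code> elements, which is the
        choice the text attributes to
        <span class="command"><strong>rclzip</strong></span>. It uses
        only the Python standard <code class="literal">zipfile</code>
        module and does not implement the
        <code class="literal">execm</code> protocol; the function names
        are hypothetical.</p>

```python
import io
import zipfile


def list_ipaths(archive):
    """Return one ipath element per regular member: its path in the archive."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract(archive, ipath):
    """Return the bytes of the document identified by ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


# Build a small archive in memory so that the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("notes/todo.txt", "buy milk")
    zf.writestr("readme.txt", "hello")

print(list_ipaths(buffer))
print(extract(buffer, "notes/todo.txt"))
```

        <p>The identifiers are readable in a log file and are enough,
        together with the archive name, to retrieve a document
        later.</p>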
|
|
|
|
<p><code class="literal">execm</code> handlers will also
|
|
produce a document with a null <code class=
|
|
"literal">ipath</code> element. Depending on the type of
|
|
document, this may have some associated data (e.g. the
|
|
body of an email message), or none (typical for an
|
|
archive file). If it is empty, this document will be
|
|
useful anyway for some operations, as the parent of the
|
|
actual data documents.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.ASSOCIATION" id=
|
|
"RCL.PROGRAM.FILTERS.ASSOCIATION"></a>4.1.3. Telling
|
|
<span class="application">Recoll</span> about the
|
|
handler</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>There are two elements that link a file to the handler
|
|
which should process it: the association of file to MIME
|
|
type and the association of a MIME type with a
|
|
handler.</p>
|
|
|
|
<p>The association of files to MIME types is mostly based
|
|
on name suffixes. The types are defined inside the
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.MIMEMAP" title=
|
|
"5.4.4. The mimemap file"><code class=
|
|
"filename">mimemap</code> file</a>. Example:</p>
|
|
        <pre class="programlisting">
.doc = application/msword
</pre>
|
|
|
|
<p>If no suffix association is found for the file name,
|
|
<span class="application">Recoll</span> will try to
|
|
execute the <span class="command"><strong>file
|
|
-i</strong></span> command to determine a MIME type.</p>
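        <p>The two-step lookup just described can be sketched in
        Python. The dictionary stands in for the
        <code class="filename">mimemap</code> file and holds only the
        entry shown above; the function names are hypothetical, and the
        <code class="literal">-b</code> (brief output) option passed to
        <span class="command"><strong>file</strong></span> is an
        addition made here to simplify parsing, not something the
        manual says <span class="application">Recoll</span> uses.</p>

```python
import os
import subprocess

MIMEMAP = {".doc": "application/msword"}  # the single entry shown above


def mime_from_suffix(filename, mimemap=MIMEMAP):
    """Return the MIME type mapped to the file name suffix, or None."""
    suffix = os.path.splitext(filename)[1]
    return mimemap.get(suffix)


def mime_from_file_command(path):
    """Ask the file command; -b is added here for easier parsing."""
    done = subprocess.run(["file", "-b", "-i", path],
                          capture_output=True, text=True, check=True)
    # The output looks like "text/plain; charset=us-ascii": keep the type.
    return done.stdout.split(";")[0].strip()


def mime_type(path):
    """Suffix association first, the file command as a fallback."""
    return mime_from_suffix(path) or mime_from_file_command(path)


print(mime_from_suffix("report.doc"))
print(mime_from_suffix("notes.unknownsuffix"))
```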
|
|
|
|
<p>The association of file types to handlers is performed
|
|
in the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.MIMECONF" title=
|
|
"5.4.5. The mimeconf file"><code class=
|
|
"filename">mimeconf</code> file</a>. A sample will
|
|
probably be of better help than a long explanation:</p>
|
|
        <pre class="programlisting">
[index]
application/msword = exec antiword -t -i 1 -m UTF-8;\
     mimetype = text/plain ; charset=utf-8

application/ogg = exec rclogg

text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html

application/x-chm = execm rclchm
</pre>
|
|
|
|
<p>The fragment specifies that:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">application/msword</code>
|
|
files are processed by executing the <span class=
|
|
"command"><strong>antiword</strong></span> program,
|
|
which outputs <code class=
|
|
"literal">text/plain</code> encoded in <code class=
|
|
"literal">utf-8</code>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">application/ogg</code>
|
|
files are processed by the <span class=
|
|
"command"><strong>rclogg</strong></span> script,
|
|
with default output type (<code class=
|
|
"literal">text/html</code>, with encoding specified
|
|
in the header, or <code class=
|
|
"literal">utf-8</code> by default).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">text/rtf</code> is
|
|
processed by <span class=
|
|
"command"><strong>unrtf</strong></span>, which
|
|
outputs <code class="literal">text/html</code>. The
|
|
<code class="literal">iso-8859-1</code> encoding is
|
|
specified because it is not the <code class=
|
|
"literal">utf-8</code> default, and not output by
|
|
<span class="command"><strong>unrtf</strong></span>
|
|
in the HTML header section.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">application/x-chm</code>
|
|
              is processed by a persistent handler. This is
|
|
determined by the <code class=
|
|
"literal">execm</code> keyword.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
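<p>To make the structure of such an entry concrete, here is a minimal Python sketch that splits a handler definition value into its command and attributes. It is not Recoll's actual parser: the function name <code class="literal">parse_handler_def</code> and the layout of the returned dictionary are assumptions made for illustration, and continuation lines are assumed to have been joined already. Only the default stated above (<code class="literal">text/html</code> output) is applied.</p>

```python
import shlex


def parse_handler_def(value):
    """Split a mimeconf handler value into command and attributes.

    Illustrative only: the names used here are hypothetical, not Recoll's.
    """
    parts = [part.strip() for part in value.split(";") if part.strip()]
    words = shlex.split(parts[0])
    keyword, command = words[0], words[1:]  # keyword is "exec" or "execm"
    attrs = {}
    for part in parts[1:]:
        name, _, val = part.partition("=")
        attrs[name.strip()] = val.strip()
    return {
        "persistent": keyword == "execm",
        "command": command,
        # Default output type, as stated in the text above.
        "mimetype": attrs.get("mimetype", "text/html"),
        # None means: use the charset from the HTML header, else utf-8.
        "charset": attrs.get("charset"),
    }
```

<p>Applied to the <code class="literal">text/rtf</code> entry of the sample, the sketch would report a non-persistent handler running <span class="command"><strong>unrtf</strong></span> with an explicit <code class="literal">iso-8859-1</code> charset.</p>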
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.HTML" id=
|
|
"RCL.PROGRAM.FILTERS.HTML"></a>4.1.4. Input
|
|
handler HTML output</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The output HTML can be very minimal, as in the
|
|
following example:</p>
|
|
<pre class="programlisting">
&lt;html&gt;
  &lt;head&gt;
    &lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8"&gt;
  &lt;/head&gt;
  &lt;body&gt;
    Some text content
  &lt;/body&gt;
&lt;/html&gt;
</pre>
|
|
|
|
<p>You should take care to escape some characters inside
|
|
the text by transforming them into appropriate entities.
|
|
At the very minimum, "<code class="literal">&</code>"
|
|
should be transformed into "<code class=
|
|
"literal">&amp;</code>", "<code class=
|
|
"literal"><</code>" should be transformed into
|
|
"<code class="literal">&lt;</code>". This is not
|
|
always properly done by translating programs which output
|
|
HTML, and of course never by those which output plain
|
|
text.</p>
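<p>For a handler written in Python, the escaping described above can be delegated to the standard library. The following is only a sketch: the function name <code class="literal">escape_handler_text</code> is hypothetical, and a real handler would still have to wrap the result in the HTML skeleton shown earlier.</p>

```python
import html


def escape_handler_text(text):
    """Escape plain text before placing it in a handler's HTML body."""
    # html.escape replaces the ampersand first, then both angle brackets,
    # so the entities it generates for the brackets are not re-escaped.
    # quote=False leaves quotation marks untouched, which is fine in body text.
    return html.escape(text, quote=False)
```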
|
|
|
|
<p>When encapsulating plain text in an HTML body, the
|
|
display of a preview may be improved by enclosing the
|
|
text inside <code class="literal"><pre></code>
|
|
tags.</p>
|
|
|
|
<p>The character set needs to be specified in the header.
|
|
It does not need to be UTF-8 (<span class=
|
|
"application">Recoll</span> will take care of translating
|
|
it), but it must be accurate for good results.</p>
|
|
|
|
<p><span class="application">Recoll</span> will process
|
|
<code class="literal">meta</code> tags inside the header
|
|
as candidate document fields. Document fields
|
|
can be processed by the indexer in different ways, for
|
|
searching or displaying inside query results. This is
|
|
described in a <a class="link" href="#RCL.PROGRAM.FIELDS"
|
|
title="4.2. Field data processing">following
|
|
section.</a></p>
|
|
|
|
<p>By default, the indexer will process the standard
|
|
header fields if they are present: <code class=
|
|
"literal">title</code>, <code class=
|
|
"literal">meta/description</code>, and <code class=
|
|
"literal">meta/keywords</code> are both indexed and
|
|
stored for query-time display.</p>
|
|
|
|
<p>A predefined non-standard <code class=
|
|
"literal">meta</code> tag will also be processed by
|
|
<span class="application">Recoll</span> without further
|
|
configuration: if a <code class="literal">date</code> tag
|
|
is present and has the right format, it will be used as
|
|
the document date (for display and sorting), in
|
|
preference to the file modification date. The date format
|
|
should be as follows:</p>
|
|
<pre class="programlisting">
&lt;meta name="date" content="YYYY-mm-dd HH:MM:SS"&gt;
or
&lt;meta name="date" content="YYYY-mm-ddTHH:MM:SS"&gt;
</pre>
|
|
|
|
<p>Example:</p>
|
|
<pre class="programlisting">
&lt;meta name="date" content="2013-02-24 17:50:00"&gt;
</pre>
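<p>A handler written in Python could produce the content value for this tag with <code class="literal">strftime</code>. This is a minimal sketch; the function name <code class="literal">meta_date_value</code> is an assumption, and only the first of the two layouts is generated.</p>

```python
from datetime import datetime


def meta_date_value(moment):
    """Format a datetime as the content value of the "date" meta tag."""
    # Produces the first of the two accepted layouts: YYYY-mm-dd HH:MM:SS
    return moment.strftime("%Y-%m-%d %H:%M:%S")
```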
|
|
|
|
<p>Input handlers also have the possibility to "invent"
|
|
field names. These should also be output as meta tags:</p>
|
|
<pre class="programlisting">
&lt;meta name="somefield" content="Some textual data" /&gt;
</pre>
|
|
|
|
<p>You can embed HTML markup inside the content of custom
|
|
fields, for improving the display inside result lists. In
|
|
this case, add a (wildly non-standard) <code class=
|
|
"literal">markup</code> attribute to tell <span class=
|
|
"application">Recoll</span> that the value is HTML and
|
|
should not be escaped for display.</p>
|
|
<pre class="programlisting">
&lt;meta name="somefield" markup="html" content="Some &lt;i&gt;textual&lt;/i&gt; data" /&gt;
</pre>
|
|
|
|
<p>As written above, the processing of fields is
|
|
described in a <a class="link" href="#RCL.PROGRAM.FIELDS"
|
|
title="4.2. Field data processing">further
|
|
section</a>.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.PAGES" id=
|
|
"RCL.PROGRAM.FILTERS.PAGES"></a>4.1.5. Page
|
|
numbers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The indexer will interpret <code class=
|
|
"literal">^L</code> characters in the handler output as
|
|
indicating page breaks, and will record them. At query
|
|
time, this allows starting a viewer on the right page for
|
|
a hit or a snippet. Currently, only the PDF, PostScript
|
|
and DVI handlers generate page breaks.</p>
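<p>The effect of the page break markers can be illustrated with a few lines of Python. This sketch only models the idea (it is not how the indexer stores page positions), and both function names are hypothetical.</p>

```python
FORM_FEED = "\f"  # the ^L character


def page_for_offset(text, offset):
    """Return the 1-based page number holding the character at offset."""
    # Every form feed before the offset ends one page.
    return text.count(FORM_FEED, 0, offset) + 1


def split_pages(text):
    """Return the list of page texts, in order."""
    return text.split(FORM_FEED)
```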
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.PROGRAM.FIELDS" id=
|
|
"RCL.PROGRAM.FIELDS"></a>4.2. Field data
|
|
processing</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><code class="literal">Fields</code> are named pieces of
|
|
information in or about documents, like <code class=
|
|
"literal">title</code>, <code class=
|
|
"literal">author</code>, <code class=
|
|
"literal">abstract</code>.</p>
|
|
|
|
<p>The field values for documents can appear in several
|
|
ways during indexing: either output by input handlers as
|
|
<code class="literal">meta</code> fields in the HTML header
|
|
section, or extracted from file extended attributes, or
|
|
added as attributes of the <code class="literal">Doc</code>
|
|
object when using the API, or again synthesized internally
|
|
by <span class="application">Recoll</span>.</p>
|
|
|
|
<p>The <span class="application">Recoll</span> query
|
|
language allows searching for text in a specific field.</p>
|
|
|
|
<p><span class="application">Recoll</span> defines a number
|
|
of default fields. Additional ones can be output by
|
|
handlers, and described in the <code class=
|
|
"filename">fields</code> configuration file.</p>
|
|
|
|
<p>Fields can be:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">indexed</code>, meaning that
|
|
their terms are separately stored in inverted lists
|
|
(with a specific prefix), and that a field-specific
|
|
search is possible.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="literal">stored</code>, meaning that
|
|
their value is recorded in the index data record for
|
|
the document, and can be returned and displayed with
|
|
search results.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>A field can be either or both indexed and stored. This
|
|
and other aspects of field handling are defined inside the
|
|
<code class="filename">fields</code> configuration
|
|
file.</p>
|
|
|
|
<p>The sequence of events for field processing is as
|
|
follows:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>During indexing, <span class=
|
|
"command"><strong>recollindex</strong></span> scans
|
|
all <code class="literal">meta</code> fields in HTML
|
|
documents (most document types are transformed into
|
|
HTML at some point). It compares the name for each
|
|
element to the configuration defining what should be
|
|
done with fields (the <code class=
|
|
"filename">fields</code> file)</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>If the name for the <code class=
|
|
"literal">meta</code> element matches one for a field
|
|
that should be indexed, the contents are processed
|
|
and the terms are entered into the index with the
|
|
prefix defined in the <code class=
|
|
"filename">fields</code> file.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>If the name for the <code class=
|
|
"literal">meta</code> element matches one for a field
|
|
that should be stored, the content of the element is
|
|
stored with the document data record, from which it
|
|
can be extracted and displayed at query time.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>At query time, if a field search is performed, the
|
|
index prefix is computed and the match is only
|
|
performed against appropriately prefixed terms in the
|
|
index.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>At query time, the field can be displayed inside
|
|
the result list by using the appropriate directive in
|
|
the definition of the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"3.1.15.1. The result list format">result list
|
|
paragraph format</a>. All fields are displayed on the
|
|
fields screen of the preview window (which you can
|
|
reach through the right-click menu). This is
|
|
independent of whether the search which
|
|
produced the results used the field or not.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
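<p>The sequence above can be summarized by a small toy model in Python. It is a simplified sketch, not Recoll code: the configuration is represented as a plain dictionary instead of the <code class="filename">fields</code> file, the prefix value used in any call is up to the caller, and term processing is reduced to whitespace splitting and lowercasing. The function names are hypothetical.</p>

```python
def process_fields(meta, config):
    """Toy model of the indexing steps described above.

    meta:   field name -> text, as found in the HTML meta elements
    config: field name -> {"prefix": str or None, "stored": bool}
    Returns (prefixed index terms, stored field values).
    """
    terms, stored = [], {}
    for name, text in meta.items():
        rules = config.get(name)
        if rules is None:
            continue  # no rule for this field: ignored in this sketch
        if rules.get("prefix"):
            # Indexed field: each term is entered with the field prefix.
            terms.extend(rules["prefix"] + word.lower() for word in text.split())
        if rules.get("stored"):
            # Stored field: the value is kept for display at query time.
            stored[name] = text
    return terms, stored


def field_query_term(field, word, config):
    """Query side: compute the prefixed term to look up for field:word."""
    return config[field]["prefix"] + word.lower()
```

<p>The point of the model is that the same prefix is applied at indexing time and at query time, which is what restricts a field search to the terms of that field.</p>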
|
|
|
|
<p>You can find more information in the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file">section about the
|
|
<code class="filename">fields</code> file</a>, or in
|
|
comments inside the file.</p>
|
|
|
|
<p>You can also have a look at the <a class="ulink" href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/HandleCustomField"
|
|
target="_top">example on the Wiki</a>, detailing how one
|
|
could add a <span class="emphasis"><em>page
|
|
count</em></span> field to PDF documents for displaying
|
|
inside result lists.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.PROGRAM.API" id=
|
|
"RCL.PROGRAM.API"></a>4.3. API</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.API.ELEMENTS" id=
|
|
"RCL.PROGRAM.API.ELEMENTS"></a>4.3.1. Interface
|
|
elements</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>A few elements in the interface are specific and
|
|
need an explanation.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">udi</span></dt>
|
|
|
|
<dd>
|
|
<p>A udi (unique document identifier) identifies a
|
|
document. Because of limitations inside the index
|
|
engine, it is restricted in length (to 200 bytes),
|
|
which is why a regular URI cannot be used. The
|
|
structure and contents of the udi are defined by the
|
|
application and opaque to the index engine. For
|
|
example, the internal file system indexer uses the
|
|
complete document path (file path + internal path),
|
|
truncated to length, the suppressed part being
|
|
replaced by a hash value.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">ipath</span></dt>
|
|
|
|
<dd>
|
|
<p>This data value (set as a field in the Doc
|
|
object) is stored, along with the URL, but not
|
|
indexed by <span class="application">Recoll</span>.
|
|
Its contents are not interpreted, and its use is up
|
|
to the application. For example, the <span class=
|
|
"application">Recoll</span> internal file system
|
|
indexer stores the part of the document access path
|
|
internal to the container file (<code class=
|
|
"literal">ipath</code> in this case is a list of
|
|
subdocument sequential numbers). url and ipath are
|
|
returned in every search result and permit access
|
|
to the original document.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">Stored and indexed
|
|
fields</span></dt>
|
|
|
|
<dd>
|
|
<p>The <code class="filename">fields</code> file
|
|
inside the <span class="application">Recoll</span>
|
|
configuration defines which document fields are
|
|
either "indexed" (searchable), "stored"
|
|
(retrievable with search results), or both.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
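<p>As an illustration of the udi scheme described above (path truncated to the length limit, with the suppressed part replaced by a hash), an external indexer written in Python could build its identifiers roughly as follows. This is a sketch under assumptions: the <code class="literal">|</code> separator, the use of MD5 and the function name <code class="literal">make_udi</code> are illustrative choices, not a description of what the internal indexer actually does.</p>

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit quoted in the text


def make_udi(path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a length-limited identifier from a file path and internal path."""
    full = (path + "|" + ipath).encode("utf-8")  # "|" separator is an assumption
    if len(full) > max_bytes:
        digest_len = 32  # length of an MD5 hex digest
        keep = max_bytes - digest_len
        # Replace the suppressed tail by a hash of that tail.
        tail_hash = hashlib.md5(full[keep:]).hexdigest().encode("ascii")
        full = full[:keep] + tail_hash
    return full
```

<p>Short paths pass through unchanged, while long ones stay distinct as long as their suppressed parts hash to different values.</p>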
|
|
|
|
<p>Data for an external indexer should be stored in a
|
|
separate index, not the one for the <span class=
|
|
"application">Recoll</span> internal file system indexer,
|
|
unless the latter is not used at all. The reason is
|
|
that the main document indexer purge pass would remove
|
|
all the other indexer's documents, as they were not seen
|
|
during indexing. The main indexer documents would also
|
|
probably be a problem for the external indexer purge
|
|
operation.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.PROGRAM.API.PYTHON"
|
|
id="RCL.PROGRAM.API.PYTHON"></a>4.3.2. Python
|
|
interface</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.INTRO" id=
|
|
"RCL.PROGRAM.PYTHON.INTRO"></a>4.3.2.1. Introduction</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> versions
|
|
after 1.11 define a Python programming interface, both
|
|
for searching and indexing.</p>
|
|
|
|
<p>The search interface is used in the Recoll Ubuntu
|
|
Unity Lens and Recoll WebUI.</p>
|
|
|
|
<p>The indexing section of the API has seen little use,
|
|
and is more a proof of concept. In truth it is waiting
|
|
for its killer app...</p>
|
|
|
|
<p>The search API is modeled along the Python database
|
|
API specification. There were two major changes across
|
|
<span class="application">Recoll</span> versions:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The basis for the <span class=
|
|
"application">Recoll</span> API changed from
|
|
Python database API version 1.0 (<span class=
|
|
"application">Recoll</span> versions up to
|
|
1.18.1), to version 2.0 (<span class=
|
|
"application">Recoll</span> 1.18.2 and
|
|
later).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The <code class="literal">recoll</code> module
|
|
became a package (with an internal <code class=
|
|
"literal">recoll</code> module) as of
|
|
<span class="application">Recoll</span> version
|
|
1.19, in order to add more functions. For
|
|
existing code, this only changes the way the
|
|
interface must be imported.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>We will mostly describe the new API and package
|
|
structure here. A paragraph at the end of this section
|
|
will explain a few differences and ways to write code
|
|
compatible with both versions.</p>
|
|
|
|
<p>The Python interface can be found in the source
|
|
package, under <code class=
|
|
"filename">python/recoll</code>.</p>
|
|
|
|
<p>The <code class="filename">python/recoll/</code>
|
|
directory contains the usual <code class=
|
|
"filename">setup.py</code>. After configuring the main
|
|
<span class="application">Recoll</span> code, you can
|
|
use the script to build and install the Python
|
|
module:</p>
|
|
<pre class="screen">
<strong class="userinput"><code>cd recoll-xxx/python/recoll</code></strong>
<strong class="userinput"><code>python setup.py build</code></strong>
<strong class="userinput"><code>python setup.py install</code></strong>
</pre>
|
|
|
|
<p>As of <span class="application">Recoll</span> 1.19,
|
|
the module can be compiled for Python3.</p>
|
|
|
|
<p>The normal <span class="application">Recoll</span>
|
|
installer installs the Python2 API along with the main
|
|
code. The Python3 version must be explicitly built and
|
|
installed.</p>
|
|
|
|
<p>When installing from a repository, and depending on
|
|
the distribution, the Python API can sometimes be found
|
|
in a separate package.</p>
|
|
|
|
<p>The following small sample will run a query and list
|
|
the title and url for each of the results. It would
|
|
work with <span class="application">Recoll</span> 1.19
|
|
and later. The <code class=
|
|
"filename">python/samples</code> source directory
|
|
contains several examples of Python programming with
|
|
<span class="application">Recoll</span>, exercising the
|
|
extension more completely, and especially its data
|
|
extraction features.</p>
|
|
<pre class="programlisting">
from recoll import recoll

db = recoll.connect()
query = db.query()
nres = query.execute("some query")
results = query.fetchmany(20)
for doc in results:
    print(doc.url, doc.title)
</pre>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.PACKAGE" id=
|
|
"RCL.PROGRAM.PYTHON.PACKAGE"></a>4.3.2.2. Recoll
|
|
package</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The <code class="literal">recoll</code> package
|
|
contains two modules:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The <code class="literal">recoll</code> module
|
|
contains functions and classes used to query (or
|
|
update) the index.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The <code class="literal">rclextract</code>
|
|
module contains functions and classes used to
|
|
access document data.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL" id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL"></a>4.3.2.3. The
|
|
recoll module</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect4">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.FUNCTIONS" id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.FUNCTIONS"></a>Functions</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">connect(confdir=None,
|
|
extra_dbs=None, writable = False)</span></dt>
|
|
|
|
<dd>
|
|
The <code class="literal">connect()</code>
|
|
function connects to one or several
|
|
<span class="application">Recoll</span>
|
|
index(es) and returns a <code class=
|
|
"literal">Db</code> object.
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem"><code class=
|
|
"literal">confdir</code> may specify a
|
|
configuration directory. The usual defaults
|
|
apply.</li>
|
|
|
|
<li class="listitem"><code class=
|
|
"literal">extra_dbs</code> is a list of
|
|
additional indexes (Xapian
|
|
directories).</li>
|
|
|
|
<li class="listitem"><code class=
|
|
"literal">writable</code> decides if we can
|
|
index new data through this
|
|
connection.</li>
|
|
</ul>
|
|
</div>This call initializes the recoll module,
|
|
and it should always be performed before any
|
|
other call or object creation.
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect4">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES" id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES"></a>Classes</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect5">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h6 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DB" id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DB"></a>The
|
|
Db class</h6>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>A Db object is created by a <code class=
|
|
"literal">connect()</code> call and holds a
|
|
connection to a Recoll index.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Db.close()</span></dt>
|
|
|
|
<dd>Closes the connection. You can't do
|
|
anything with the <code class=
|
|
"literal">Db</code> object after this.</dd>
|
|
|
|
<dt><span class="term">Db.query(),
|
|
Db.cursor()</span></dt>
|
|
|
|
<dd>These aliases return a blank <code class=
|
|
"literal">Query</code> object for this
|
|
index.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Db.setAbstractParams(maxchars,
|
|
contextwords)</span></dt>
|
|
|
|
<dd>Set the parameters used to build snippets
|
|
(sets of keywords in context text fragments).
|
|
<code class="literal">maxchars</code> defines
|
|
the maximum total size of the abstract.
|
|
<code class="literal">contextwords</code>
|
|
defines how many terms are shown around the
|
|
keyword.</dd>
|
|
|
|
<dt><span class="term">Db.termMatch(match_type,
|
|
expr, field='', maxlen=-1, casesens=False,
|
|
diacsens=False, lang='english')</span></dt>
|
|
|
|
<dd>Expand an expression against the index term
|
|
list. Performs the basic function from the GUI
|
|
term explorer tool. <code class="literal">match_type</code> can be one of
|
|
<code class="literal">wildcard</code>,
|
|
<code class="literal">regexp</code> or
|
|
<code class="literal">stem</code>. Returns a
|
|
list of terms expanded from the input
|
|
expression.</dd>
|
|
</dl>
|
|
</div>
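<p>As a small usage sketch for <code class="literal">termMatch()</code>, the following helper lists index terms beginning with a given prefix, much like a completion feature might. It takes an already connected <code class="literal">Db</code> object as argument. The helper name <code class="literal">expand_prefix</code> is hypothetical, and the result is truncated in Python rather than through the <code class="literal">maxlen</code> parameter, whose exact meaning is not described here.</p>

```python
def expand_prefix(db, prefix, max_terms=10):
    """Return index terms that start with prefix, using Db.termMatch()."""
    # "wildcard" is one of the documented match types; "*" matches any suffix.
    terms = db.termMatch("wildcard", prefix + "*")
    return list(terms)[:max_terms]
```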
|
|
</div>
|
|
|
|
<div class="sect5">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h6 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.QUERY" id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.QUERY"></a>The
|
|
Query class</h6>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>A <code class="literal">Query</code> object
|
|
(equivalent to a cursor in the Python DB API) is
|
|
created by a <code class=
|
|
"literal">Db.query()</code> call. It is used to
|
|
execute index searches.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Query.sortby(fieldname,
|
|
ascending=True)</span></dt>
|
|
|
|
<dd>Sort results by <em class=
|
|
"replaceable"><code>fieldname</code></em>, in
|
|
ascending or descending order. Must be called
|
|
before executing the search.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.execute(query_string, stemming=1,
|
|
stemlang="english")</span></dt>
|
|
|
|
<dd>Starts a search for <em class=
|
|
"replaceable"><code>query_string</code></em>, a
|
|
<span class="application">Recoll</span> search
|
|
language string.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.executesd(SearchData)</span></dt>
|
|
|
|
<dd>Starts a search for the query defined by
|
|
the SearchData object.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.fetchmany(size=query.arraysize)</span></dt>
|
|
|
|
<dd>Fetches the next <code class=
|
|
"literal">Doc</code> objects in the current
|
|
search results, and returns them as an array of
|
|
the required size, which is by default the
|
|
value of the <code class=
|
|
"literal">arraysize</code> data member.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.fetchone()</span></dt>
|
|
|
|
<dd>Fetches the next <code class=
|
|
"literal">Doc</code> object from the current
|
|
search results.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.close()</span></dt>
|
|
|
|
<dd>Closes the query. The object is unusable
|
|
after the call.</dd>
|
|
|
|
<dt><span class="term">Query.scroll(value,
|
|
mode='relative')</span></dt>
|
|
|
|
<dd>Adjusts the position in the current result
|
|
set. <code class="literal">mode</code> can be
|
|
<code class="literal">relative</code> or
|
|
<code class="literal">absolute</code>.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.getgroups()</span></dt>
|
|
|
|
<dd>Retrieves the expanded query terms as a
|
|
list of pairs. Meaningful only after execute() or executesd().
|
|
In each pair, the first entry is a list of user
|
|
terms (of size one for simple terms, or more
|
|
for group and phrase clauses), the second a
|
|
list of query terms as derived from the user
|
|
terms and used in the Xapian Query.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.getxquery()</span></dt>
|
|
|
|
<dd>Return the Xapian query description as a
|
|
Unicode string. Meaningful only after
|
|
execute() or executesd().</dd>
|
|
|
|
<dt><span class="term">Query.highlight(text,
|
|
ishtml = 0, methods = object)</span></dt>
|
|
|
|
<dd>Will insert &lt;span class="rclmatch"&gt;,
|
|
</span> tags around the match areas in
|
|
the input text and return the modified text.
|
|
<code class="literal">ishtml</code> can be set
|
|
to indicate that the input text is HTML and
|
|
that HTML special characters should not be
|
|
escaped. <code class="literal">methods</code>
|
|
if set should be an object with methods
|
|
startMatch(i) and endMatch() which will be
|
|
called for each match and should return a begin
|
|
and end tag.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.makedocabstract(doc, methods =
|
|
object)</span></dt>
|
|
|
|
<dd>Create a snippets abstract for <code class=
|
|
"literal">doc</code> (a <code class=
|
|
"literal">Doc</code> object) by selecting text
|
|
around the match terms. If methods is set, will
|
|
also perform highlighting. See the highlight
|
|
method.</dd>
|
|
|
|
<dt><span class="term">Query.__iter__() and
|
|
Query.next()</span></dt>
|
|
|
|
<dd>So that things like <code class=
|
|
"literal">for doc in query:</code> will
|
|
work.</dd>
|
|
</dl>
|
|
</div>
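<p>The <code class="literal">methods</code> argument of <code class="literal">highlight()</code> is easiest to understand through an example. The sketch below defines an object with the two methods named above and uses terminal escape codes instead of HTML tags. The class and function names are hypothetical, and the meaning of the argument passed to <code class="literal">startMatch()</code> is not relied upon.</p>

```python
class AnsiHighlighter:
    """Marks matches with terminal bold codes instead of HTML tags."""

    def startMatch(self, idx):
        # idx is the argument passed by highlight(); unused in this sketch.
        return "\033[1m"   # ANSI: bold on

    def endMatch(self):
        return "\033[0m"   # ANSI: reset attributes


def highlight_for_terminal(query, text):
    """Return text with match areas wrapped in bold escape codes."""
    return query.highlight(text, methods=AnsiHighlighter())
```

<p>Because <code class="literal">ishtml</code> is left at its default here, HTML special characters in the input are presumably still escaped by <code class="literal">highlight()</code>, so the sketch is best suited to text which does not contain any.</p>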
|
|
|
|
<div class="variablelist">
|
|
<p class="title"><b>Data descriptors</b></p>
|
|
|
|
<dl class="variablelist">
|
|
<dt><span class=
|
|
"term">Query.arraysize</span></dt>
|
|
|
|
<dd>Default number of records processed by
|
|
fetchmany (r/w).</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.rowcount</span></dt>
|
|
|
|
<dd>Number of records returned by the last
|
|
execute.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Query.rownumber</span></dt>
|
|
|
|
<dd>Next index to be fetched from results.
|
|
Normally increments after each fetchone() call,
|
|
but can be set/reset before the call to effect
|
|
seeking (equivalent to using <code class=
|
|
"literal">scroll()</code>). Starts at 0.</dd>
|
|
</dl>
|
|
</div>
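<p>The data descriptors combine naturally with <code class="literal">fetchmany()</code> to walk through a result set page by page. The generator below is a minimal sketch, assuming it is called on a <code class="literal">Query</code> object after a search was executed. The name <code class="literal">iter_result_pages</code> is hypothetical.</p>

```python
def iter_result_pages(query, page_size=20):
    """Yield successive lists of Doc objects from an executed query."""
    query.arraysize = page_size  # default batch size used by fetchmany()
    remaining = query.rowcount - query.rownumber
    while remaining > 0:
        page = query.fetchmany()
        if not page:
            break  # defensive: stop if the result set ends early
        remaining -= len(page)
        yield page
```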
|
|
</div>
|
|
|
|
<div class="sect5">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h6 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DOC" id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DOC"></a>The
|
|
Doc class</h6>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>A <code class="literal">Doc</code> object
|
|
contains index data for a given document. The data
|
|
is extracted from the index when searching, or set
|
|
by the indexer program when updating. The Doc
|
|
object has many attributes to be read or set by its
|
|
user. It matches exactly the Rcl::Doc C++ object.
|
|
Some of the attributes are predefined, but,
|
|
especially when indexing, others can be set, the
|
|
name of which will be processed as field names by
|
|
the indexing configuration. Inputs can be specified
|
|
as Unicode or strings. Outputs are Unicode objects.
|
|
All dates are specified as Unix timestamps, printed
|
|
as strings. Please refer to the <code class=
|
|
"filename">rcldb/rcldoc.h</code> C++ file for a
|
|
description of the predefined attributes.</p>
|
|
|
|
<p>At query time, only the fields that are defined
|
|
as <code class="literal">stored</code> either by
|
|
default or in the <code class=
|
|
"filename">fields</code> configuration file will be
|
|
meaningful in the <code class="literal">Doc</code>
|
|
object. In particular, this will not be the case for
|
|
the document text. See the <code class=
|
|
"literal">rclextract</code> module for accessing
|
|
document contents.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">get(key), []
|
|
operator</span></dt>
|
|
|
|
<dd>Retrieve the named doc attribute.</dd>
|
|
|
|
<dt><span class="term">getbinurl()</span></dt>
|
|
|
|
<dd>Retrieve the URL in byte array format (no
|
|
transcoding), for use as parameter to a system
|
|
call.</dd>
|
|
|
|
<dt><span class="term">items()</span></dt>
|
|
|
|
<dd>Return a dictionary of doc object
|
|
keys/values</dd>
|
|
|
|
<dt><span class="term">keys()</span></dt>
|
|
|
|
<dd>Return the list of doc object keys (attribute
|
|
names).</dd>
|
|
</dl>
|
|
</div>
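<p>The accessors above are enough to turn a <code class="literal">Doc</code> object into plain Python data. The following sketch copies selected attributes into a dictionary and converts a date attribute (a Unix timestamp printed as a string, as stated above) into a <code class="literal">datetime</code>. Both function names are hypothetical, and which attribute names actually appear in <code class="literal">keys()</code> depends on the index and its configuration.</p>

```python
from datetime import datetime, timezone


def doc_summary(doc, wanted=("title", "url")):
    """Collect selected attributes of a Doc object into a plain dict."""
    available = set(doc.keys())
    return {name: doc.get(name) for name in wanted if name in available}


def parse_doc_date(value):
    """Convert a date attribute (Unix timestamp as a string) to a datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)
```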
|
|
</div>
|
|
|
|
<div class="sect5">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h6 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.SEARCHDATA"
|
|
id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.SEARCHDATA">
|
|
</a>The SearchData class</h6>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>A <code class="literal">SearchData</code> object
|
|
allows building a query by combining clauses, for
|
|
execution by <code class=
|
|
"literal">Query.executesd()</code>. It can be used
|
|
as a replacement for the query language approach. The
|
|
interface is going to change a little, so no
|
|
detailed doc for now...</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class=
|
|
"term">addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
qstring=string, slack=0, field='', stemming=1,
|
|
subSearch=SearchData)</span></dt>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RCLEXTRACT" id=
|
|
"RCL.PROGRAM.PYTHON.RCLEXTRACT"></a>4.3.2.4. The
|
|
rclextract module</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Index queries do not provide document content (only
|
|
a partial and imprecise reconstruction is performed to
|
|
show the snippets text). In order to access the actual
|
|
document data, the data extraction part of the indexing
|
|
process must be performed (subdocument access and
|
|
format translation). This is not trivial in general.
|
|
The <code class="literal">rclextract</code> module
|
|
currently provides a single class which can be used to
|
|
access the data content for result documents.</p>
|
|
|
|
<div class="sect4">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RCLEXTRACT.CLASSES" id=
|
|
"RCL.PROGRAM.PYTHON.RCLEXTRACT.CLASSES"></a>Classes</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect5">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h6 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.EXTRACTOR"
|
|
id=
|
|
"RCL.PROGRAM.PYTHON.RECOLL.CLASSES.EXTRACTOR">
|
|
</a>The Extractor class</h6>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class=
|
|
"term">Extractor(doc)</span></dt>
|
|
|
|
<dd>An <code class="literal">Extractor</code>
|
|
object is built from a <code class=
|
|
"literal">Doc</code> object, output from a
|
|
query.</dd>
|
|
|
|
<dt><span class=
|
|
"term">Extractor.textextract(ipath)</span></dt>
|
|
|
|
<dd>
|
|
Extract document defined by <em class=
|
|
"replaceable"><code>ipath</code></em> and
|
|
return a <code class="literal">Doc</code>
|
|
object. The doc.text field has the document
|
|
text converted to either text/plain or
|
|
text/html according to doc.mimetype. The
|
|
typical use would be as follows:
|
|
<pre class="programlisting">
from recoll import rclextract

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing
</pre>
|
|
</dd>
|
|
|
|
<dt><span class=
|
|
"term">Extractor.idoctofile(ipath, targetmtype,
|
|
outfile='')</span></dt>
|
|
|
|
<dd>
|
|
Extracts document into an output file, which
|
|
can be given explicitly or will be created as
|
|
a temporary file to be deleted by the caller.
|
|
Typical use:
|
|
<pre class="programlisting">
from recoll import rclextract

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
</pre>
|
|
</dd>
|
|
</dl>
|
|
</div>
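<p>Since the temporary file created by <code class="literal">idoctofile()</code> must be deleted by the caller, it can be convenient to wrap the call in a context manager. This is a sketch only: the name <code class="literal">extracted_file</code> is hypothetical, and the extractor is passed in as an argument so that the helper does not depend on how it was created.</p>

```python
import contextlib
import os


@contextlib.contextmanager
def extracted_file(extractor, qdoc):
    """Extract qdoc to a temporary file and delete it on exit."""
    # No outfile is given, so idoctofile() creates a temporary file
    # that the caller is responsible for removing.
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
    try:
        yield filename
    finally:
        if os.path.exists(filename):
            os.remove(filename)
```

<p>With an extractor built from a query result, the extracted copy can then be processed inside a <code class="literal">with extracted_file(extractor, qdoc) as filename:</code> block, and is removed when the block ends.</p>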
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.EXAMPLES" id=
|
|
"RCL.PROGRAM.PYTHON.EXAMPLES"></a>4.3.2.5. Example
|
|
code</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The following sample would query the index with a
|
|
user language string. See the <code class=
|
|
"filename">python/samples</code> directory inside the
|
|
<span class="application">Recoll</span> source for
|
|
other examples. The <code class=
|
|
"filename">recollgui</code> subdirectory has a very
|
|
embryonic GUI which demonstrates the highlighting and
|
|
data extraction functions.</p>
|
|
<pre class="programlisting">
#!/usr/bin/env python
# Note: Python 2 syntax (print statements).

from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres &gt; 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abstract = db.makeDocAbstract(doc, query).encode('utf-8')
    print abstract
    print
</pre>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHON.COMPAT" id=
|
|
"RCL.PROGRAM.PYTHON.COMPAT"></a>4.3.2.6. Compatibility
|
|
with the previous version</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The following code fragments can be used to ensure
|
|
that code can run with both the old and the new API (as
|
|
long as it does not use the new abilities of the new
|
|
API of course).</p>
|
|
|
|
<p>Adapting to the new package structure:</p>
|
|
<pre class="programlisting">
|
|
|
|
try:
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except ImportError:
    import recoll
    hasextract = False
|
|
|
|
</pre>
|
|
|
|
<p>Adapting to the change of nature of the <code class=
|
|
"literal">next</code> <code class=
|
|
"literal">Query</code> member. The same test can be
|
|
used to choose to use the <code class=
|
|
"literal">scroll()</code> method (new) or set the
|
|
<code class="literal">next</code> value (old).</p>
|
|
<pre class="programlisting">
|
|
|
|
rownum = query.next if isinstance(query.next, int) else \
    query.rownumber
|
|
|
|
</pre>
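<p>As a minimal sketch, the test above could be wrapped in two small helpers. The names <code class="literal">current_row</code> and <code class="literal">rewind</code>, and the two stand-in classes, are hypothetical and only exist so that the fragment runs without an index. The <code class="literal">scroll(value, mode='absolute')</code> call is an assumption about the new API and should be checked against the installed module.</p>

```python
def current_row(query):
    """Return the current result position with either API version."""
    # Old API: 'next' is an integer attribute.
    # New API: 'next' is a method and the position is in 'rownumber'.
    if isinstance(query.next, int):
        return query.next
    return query.rownumber


def rewind(query, position=0):
    """Move to an absolute position in the result list."""
    if isinstance(query.next, int):
        query.next = position                    # old API: assign the index
    else:
        query.scroll(position, mode='absolute')  # new API (assumed signature)


class OldStyleQuery(object):
    """Hypothetical stand-in for an old-API query object."""
    def __init__(self):
        self.next = 3


class NewStyleQuery(object):
    """Hypothetical stand-in for a new-API query object."""
    def __init__(self):
        self.rownumber = 3

    def next(self):
        raise StopIteration

    def scroll(self, value, mode='relative'):
        if mode == 'absolute':
            self.rownumber = value
        else:
            self.rownumber += value
```

<p>With a real query object, <code class="literal">current_row(query)</code> and <code class="literal">rewind(query)</code> would be called in the same way as with the stand-ins.</p>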
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.INSTALL" id=
|
|
"RCL.INSTALL"></a>Chapter 5. Installation and
|
|
configuration</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.BINARY" id=
|
|
"RCL.INSTALL.BINARY"></a>5.1. Installing a
|
|
binary copy</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> binary copies
|
|
are always distributed as regular packages for your system.
|
|
They can be obtained either through the system's normal
|
|
software distribution framework (e.g. <span class=
|
|
"application">Debian/Ubuntu apt</span>, <span class=
|
|
"application">FreeBSD</span> ports, etc.), or from some
|
|
type of "backports" repository providing versions newer
|
|
than the standard ones, or found on the <span class=
|
|
"application">Recoll</span> WEB site in some cases. The
|
|
most up-to-date information about Recoll packages can
|
|
usually be found on the <a class="ulink" href=
|
|
"http://www.recoll.org/download.html" target=
|
|
"_top"><span class="application">Recoll</span> WEB site
|
|
downloads page</a>.</p>
|
|
|
|
<p>There used to exist another form of binary install, as
|
|
pre-compiled source trees, but these are just less
|
|
convenient than the packages and don't exist any more.</p>
|
|
|
|
<p>The package management tools will usually automatically
|
|
deal with hard dependencies for packages obtained from a
|
|
proper package repository. You will have to deal with them
|
|
by hand for downloaded packages (for example, when
|
|
<span class="command"><strong>dpkg</strong></span>
|
|
complains about missing dependencies).</p>
|
|
|
|
<p>In all cases, you will have to check or install
|
|
<a class="link" href="#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">supporting applications</a>
|
|
for the file types that you want to index beyond those that
|
|
are natively processed by <span class=
|
|
"application">Recoll</span> (text, HTML, email files, and a
|
|
few others).</p>
|
|
|
|
<p>You may also want to look at the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">configuration
|
|
section</a> (but this may not be necessary for a quick test
|
|
with default parameters). Most parameters can be more
|
|
conveniently set from the GUI.</p>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.EXTERNAL" id=
|
|
"RCL.INSTALL.EXTERNAL"></a>5.2. Supporting
|
|
packages</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
|
|
<p>The <span class="application">Windows</span>
|
|
installation of <span class="application">Recoll</span>
|
|
is self-contained, and only needs Python 2.7 to be
|
|
externally installed. <span class=
|
|
"application">Windows</span> users can skip this
|
|
section.</p>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> uses external
|
|
applications to index some file types. You need to install
|
|
them for the file types that you wish to have indexed
|
|
(these are optional run-time dependencies: none is needed
|
|
for building or running <span class=
|
|
"application">Recoll</span> except for indexing their
|
|
specific file type).</p>
|
|
|
|
<p>After an indexing pass, the commands that were found
|
|
missing can be displayed from the <span class=
|
|
"command"><strong>recoll</strong></span> <span class=
|
|
"guilabel">File</span> menu. The list is stored in the
|
|
<code class="filename">missing</code> text file inside the
|
|
configuration directory.</p>
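<p>The stored list can also be read from a script. The following Python 3 sketch simply returns the non-empty lines of the file. It assumes the default <code class="filename">~/.recoll</code> configuration directory when none is given, and makes no assumption about the format of each line. The function name <code class="literal">read_missing_helpers</code> is hypothetical.</p>

```python
import os


def read_missing_helpers(confdir=None):
    """Return the non-empty lines of the 'missing' file, or [] if absent."""
    if confdir is None:
        # Default configuration directory (assumption for this sketch).
        confdir = os.path.expanduser("~/.recoll")
    path = os.path.join(confdir, "missing")
    if not os.path.exists(path):
        return []
    # The line format is not interpreted here: lines are returned as found.
    with open(path, "r", errors="replace") as f:
        return [line.strip() for line in f if line.strip()]
```

<p>Calling <code class="literal">read_missing_helpers()</code> after an indexing pass and printing each returned line would show the same information as the GUI menu entry.</p>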
|
|
|
|
<p>A list of common file types which need external commands
|
|
follows. Many of the handlers need the <span class=
|
|
"command"><strong>iconv</strong></span> command, which is
|
|
not always listed as a dependency.</p>
|
|
|
|
<p>Please note that, due to the relatively dynamic nature
|
|
of this information, the most up to date version is now
|
|
kept on <a class="ulink" href=
|
|
"http://www.recoll.org/features.html#doctypes" target=
|
|
"_top">http://www.recoll.org/features.html</a> along with
|
|
links to the home pages or best source/patches pages, and
|
|
misc tips. The list below is not updated often and may be
|
|
quite stale.</p>
|
|
|
|
<p>For many Linux distributions, most of the commands
|
|
listed can be installed from the package repositories.
|
|
However, the packages are sometimes outdated, or not the
|
|
best version for <span class="application">Recoll</span>,
|
|
so you should take a look at <a class="ulink" href=
|
|
"http://www.recoll.org/features.html#doctypes" target=
|
|
"_top">http://www.recoll.org/features.html</a> if a file
|
|
type is important to you.</p>
|
|
|
|
<p>As of <span class="application">Recoll</span> release
|
|
1.14, a number of XML-based formats that were handled by ad
|
|
hoc handler code now use the <span class=
|
|
"command"><strong>xsltproc</strong></span> command, which
|
|
usually comes with <span class=
|
|
"application">libxslt</span>. These are: abiword, fb2
|
|
(ebooks), kword, openoffice, svg.</p>
|
|
|
|
<p>Now for the list:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Openoffice files need <span class=
|
|
"command"><strong>unzip</strong></span> and
|
|
<span class=
|
|
"command"><strong>xsltproc</strong></span>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>PDF files need <span class=
|
|
"command"><strong>pdftotext</strong></span> which is
|
|
part of <span class="application">Poppler</span>
|
|
(usually comes with the <code class=
|
|
"literal">poppler-utils</code> package). Avoid the
|
|
original one from <span class=
|
|
"application">Xpdf</span>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Postscript files need <span class=
|
|
"command"><strong>pstotext</strong></span>. The
|
|
original version has an issue with shell characters in
|
|
file names, which is corrected in recent packages.
|
|
See <a class="ulink" href=
|
|
"http://www.recoll.org/features.html#doctypes"
|
|
target="_top">http://www.recoll.org/features.html</a>
|
|
for more detail.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>MS Word needs <span class=
|
|
"command"><strong>antiword</strong></span>. It is
|
|
also useful to have <span class=
|
|
"command"><strong>wvWare</strong></span> installed as
|
|
it may be used as a fallback for some files which
|
|
<span class=
|
|
"command"><strong>antiword</strong></span> does not
|
|
handle.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>MS Excel and PowerPoint are processed by internal
|
|
<span class="command"><strong>Python</strong></span>
|
|
handlers.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>MS Open XML (docx) needs <span class=
|
|
"command"><strong>xsltproc</strong></span>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Wordperfect files need <span class=
|
|
"command"><strong>wpd2html</strong></span> from the
|
|
<span class="application">libwpd</span> (or
|
|
<span class="application">libwpd-tools</span> on
|
|
Ubuntu) package.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>RTF files need <span class=
|
|
"command"><strong>unrtf</strong></span>, which, in
|
|
its older versions, has much trouble with non-western
|
|
character sets. Many Linux distributions carry
|
|
outdated <span class=
|
|
"command"><strong>unrtf</strong></span> versions.
|
|
Check <a class="ulink" href=
|
|
"http://www.recoll.org/features.html#doctypes"
|
|
target="_top">http://www.recoll.org/features.html</a>
|
|
for details.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>TeX files need <span class=
|
|
"command"><strong>untex</strong></span> or
|
|
<span class="command"><strong>detex</strong></span>.
|
|
Check <a class="ulink" href=
|
|
"http://www.recoll.org/features.html#doctypes"
|
|
target="_top">http://www.recoll.org/features.html</a>
|
|
for sources if it's not packaged for your
|
|
distribution.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>dvi files need <span class=
|
|
"command"><strong>dvips</strong></span>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>djvu files need <span class=
|
|
"command"><strong>djvutxt</strong></span> and
|
|
<span class="command"><strong>djvused</strong></span>
|
|
from the <span class="application">DjVuLibre</span>
|
|
package.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Audio files: <span class=
|
|
"application">Recoll</span> releases 1.14 and later
|
|
use a single <span class="application">Python</span>
|
|
handler based on <span class=
|
|
"application">mutagen</span> for all audio file
|
|
types.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Pictures: <span class="application">Recoll</span>
|
|
uses the <span class="application">Exiftool</span>
|
|
<span class="application">Perl</span> package to
|
|
extract tag information. Most image file formats are
|
|
supported. Note that there may not be much interest
|
|
in indexing the technical tags (image size, aperture,
|
|
etc.). This is only of interest if you store personal
|
|
tags or textual descriptions inside the image
|
|
files.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>chm: files in Microsoft help format need Python
|
|
and the <span class="application">pychm</span> module
|
|
(which needs <span class=
|
|
"application">chmlib</span>).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>ICS: up to <span class="application">Recoll</span>
|
|
1.13, iCalendar files need <span class=
|
|
"application">Python</span> and the <span class=
|
|
"application">icalendar</span> module. <span class=
|
|
"application">icalendar</span> is not needed for
|
|
newer versions, which use internal code.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Zip archives need <span class=
|
|
"application">Python</span> (and the standard zipfile
|
|
module).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Rar archives need <span class=
|
|
"application">Python</span>, the <span class=
|
|
"application">rarfile</span> Python module and the
|
|
<span class="command"><strong>unrar</strong></span>
|
|
utility.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Midi karaoke files need <span class=
|
|
"application">Python</span> and the <a class="ulink"
|
|
href="http://pypi.python.org/pypi/midi/0.2.1" target=
|
|
"_top"><span class="application">Midi
|
|
module</span></a>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Konqueror webarchive files need Python (the handler uses
the standard tarfile module).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Mimehtml web archive format (support based on the
|
|
email handler, which introduces some mild weirdness,
|
|
but still usable).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Text, HTML, email folders, and Scribus files are
|
|
processed internally. <span class="application">Lyx</span>
|
|
is used to index Lyx files. Many handlers need <span class=
|
|
"command"><strong>iconv</strong></span> and the standard
|
|
<span class="command"><strong>sed</strong></span> and
|
|
<span class="command"><strong>awk</strong></span>.</p>
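<p>Before running a first indexing pass, it can be convenient to check which of the commands named above are present. The following Python 3 sketch does this with the standard <code class="literal">shutil.which</code> function. The <code class="literal">HELPERS</code> table and the <code class="literal">missing_helpers</code> function are hypothetical names, and the table only covers part of the list (see the web site for the current one).</p>

```python
import shutil

# Helper commands named in the list above, keyed by the file type they serve.
HELPERS = {
    "OpenOffice": ["unzip", "xsltproc"],
    "PDF": ["pdftotext"],
    "PostScript": ["pstotext"],
    "MS Word": ["antiword"],
    "WordPerfect": ["wpd2html"],
    "RTF": ["unrtf"],
    "DVI": ["dvips"],
    "DjVu": ["djvutxt", "djvused"],
    "Rar": ["unrar"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Map each file type to the helper commands not found on PATH."""
    result = {}
    for filetype, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            result[filetype] = absent
    return result


if __name__ == "__main__":
    for filetype, commands in sorted(missing_helpers().items()):
        print("%s: missing %s" % (filetype, ", ".join(commands)))
```

<p>This only tests that an executable with the right name exists. It does not check versions, so the remarks above about outdated packages still apply.</p>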
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.BUILDING" id=
|
|
"RCL.INSTALL.BUILDING"></a>5.3. Building from
|
|
source</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.PREREQS" id=
|
|
"RCL.INSTALL.BUILDING.PREREQS"></a>5.3.1. Prerequisites</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>If you can install any or all of the following through
|
|
the package manager for your system, all the better.
|
|
<span class="application">Qt</span> in particular is a very
|
|
big piece of software, but you will most probably be able
|
|
to find a binary package.</p>
|
|
|
|
<p>You may have to compile <span class=
|
|
"application">Xapian</span> but this is easy.</p>
|
|
|
|
<p>The shopping list:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The <code class="literal">autoconf</code>,
|
|
<code class="literal">automake</code> and
|
|
<code class="literal">libtool</code> triad. Only
|
|
<code class="literal">autoconf</code> is needed up
|
|
to <span class="application">Recoll</span>
|
|
1.21.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>C++ compiler. Up to <span class=
|
|
"application">Recoll</span> version 1.13.04, its
|
|
absence can manifest itself by strange messages
|
|
about a missing iconv_open.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class=
|
|
"command"><strong>bison</strong></span> command
|
|
(for <span class="application">Recoll</span> 1.21
|
|
and later).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><span class=
|
|
"command"><strong>xsltproc</strong></span> command.
|
|
For building the documentation (for <span class=
|
|
"application">Recoll</span> 1.21 and later). This
|
|
sometimes comes with the <code class=
|
|
"literal">libxslt</code> package. And also the
|
|
Docbook XML and style sheet files.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Development files for <a class="ulink" href=
|
|
"http://www.xapian.org" target="_top"><span class=
|
|
"application">Xapian core</span></a>.</p>
|
|
|
|
<div class="important" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Important</h3>
|
|
|
|
<p>If you are building Xapian for an older CPU
|
|
(before Pentium 4 or Athlon 64), you need to add
|
|
the <code class="option">--disable-sse</code>
|
|
flag to the configure command. Otherwise, all Xapian
applications will crash with an <code class=
|
|
"literal">illegal instruction</code> error.</p>
|
|
</div>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Development files for <a class="ulink" href=
"http://qt-project.org/downloads" target=
"_top"><span class="application">Qt 4</span></a>.
|
|
<span class="application">Recoll</span> has not
|
|
been tested with <span class="application">Qt
|
|
5</span> yet. <span class=
|
|
"application">Recoll</span> 1.15.9 was the last
|
|
version to support <span class="application">Qt
|
|
3</span>. If you do not want to install or build
|
|
the <span class="application">Qt Webkit</span>
|
|
module, <span class="application">Recoll</span> has
|
|
a configuration option to disable its use (see
|
|
further).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Development files for <span class=
|
|
"application">X11</span> and <span class=
|
|
"application">zlib</span>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Development files for <span class=
|
|
"application">Python</span> (or use <code class=
|
|
"literal">--disable-python-module</code>).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>You may also need <a class="ulink" href=
|
|
"http://www.gnu.org/software/libiconv/" target=
|
|
"_top">libiconv</a>. On <span class=
|
|
"application">Linux</span> systems, the iconv
|
|
interface is part of libc and you should not need
|
|
to do anything special.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Check the <a class="ulink" href=
|
|
"http://www.recoll.org/download.html" target=
|
|
"_top"><span class="application">Recoll</span> download
|
|
page</a> for up to date version information.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.BUILD" id=
|
|
"RCL.INSTALL.BUILDING.BUILD"></a>5.3.2. Building</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> has been built
|
|
on Linux, FreeBSD, Mac OS X, and Solaris. Most versions
after 2005 should be OK, and maybe some older ones too
(Solaris 8 is OK). If you build on another system, and
|
|
need to modify things, <a class="ulink" href=
|
|
"mailto:jfd@recoll.org" target="_top">I would very much
|
|
welcome patches</a>.</p>
|
|
|
|
<p><b>Configure options: </b></p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="option">--without-aspell</code>
|
|
will disable the code for phonetic matching of
|
|
search terms.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--with-fam</code> or
|
|
<code class="option">--with-inotify</code> will
|
|
enable the code for real time indexing. Inotify
|
|
support is enabled by default on recent Linux
|
|
systems.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--with-qzeitgeist</code>
|
|
will enable sending <span class=
|
|
"application">Zeitgeist</span> events about the
|
|
visited search results, and needs the <span class=
|
|
"application">qzeitgeist</span> package.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-webkit</code> is
|
|
available from version 1.17 to implement the result
|
|
list with a <span class="application">Qt</span>
|
|
QTextBrowser instead of a WebKit widget if you do
not want to, or cannot, depend on the latter.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-idxthreads</code>
|
|
is available from version 1.19 to suppress
|
|
multithreading inside the indexing process. You can
|
|
also use the run-time configuration to restrict
|
|
<span class=
|
|
"command"><strong>recollindex</strong></span> to
|
|
using a single thread, but the compile-time option
|
|
may disable a few more unused locks. This only
|
|
applies to the use of multithreading for the core
|
|
index processing (data input). The <span class=
|
|
"application">Recoll</span> monitor mode always
|
|
uses at least two threads of execution.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class=
|
|
"option">--disable-python-module</code> will avoid
|
|
building the <span class=
|
|
"application">Python</span> module.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-xattr</code> will
|
|
prevent fetching data from file extended
|
|
attributes. Beyond a few standard attributes,
|
|
fetching extended attributes data can only be
|
|
useful if some application stores data there,
|
|
and also needs some simple configuration (see
|
|
comments in the <code class=
|
|
"filename">fields</code> configuration file).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--enable-camelcase</code>
|
|
will enable splitting <em class=
|
|
"replaceable"><code>camelCase</code></em> words.
|
|
This is not enabled by default as it has the
|
|
unfortunate side-effect of making some phrase
|
|
searches quite confusing: for example, <code class=
|
|
"literal">"MySQL manual"</code> would be matched by
|
|
<code class="literal">"MySQL manual"</code> and
|
|
<code class="literal">"my sql manual"</code> but
|
|
not <code class="literal">"mysql manual"</code>
|
|
(only inside phrase searches).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--with-file-command</code>
|
|
specifies the version of the 'file' command to use
(e.g. --with-file-command=/usr/local/bin/file). This can
be useful to enable the GNU version on systems
where the native one is bad.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-qtgui</code>
|
|
Disable the Qt interface. Will allow building the
|
|
indexer and the command line search program in
|
|
absence of a Qt environment.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-x11mon</code>
|
|
Disable <span class="application">X11</span>
|
|
connection monitoring inside recollindex. Together
|
|
with --disable-qtgui, this allows building recoll
|
|
without <span class="application">Qt</span> and
|
|
<span class="application">X11</span>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-userdoc</code>
|
|
will avoid building the user manual. This avoids
|
|
having to install the Docbook XML/XSL files and the
|
|
TeX toolchain used for translating the manual to
|
|
PDF.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-pic</code>
|
|
(<span class="application">Recoll</span> versions
|
|
up to 1.21 only) will compile <span class=
"application">Recoll</span> with position-dependent
|
|
code. This is incompatible with building the KIO or
|
|
the <span class="application">Python</span> or
|
|
<span class="application">PHP</span> extensions,
|
|
but might yield very marginally faster code.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Of course the usual <span class=
|
|
"application">autoconf</span> <span class=
|
|
"command"><strong>configure</strong></span>
|
|
options, like <code class="option">--prefix</code>
|
|
apply.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Normal procedure (for source extracted from a tar
|
|
distribution):</p>
|
|
<pre class="screen">
|
|
<strong class=
|
|
"userinput"><code>cd recoll-xxx</code></strong>
|
|
<strong class="userinput"><code>./configure</code></strong>
|
|
<strong class="userinput"><code>make</code></strong>
|
|
<strong class=
|
|
"userinput"><code>(practices usual hardship-repelling invocations)</code></strong>
|
|
|
|
</pre>
|
|
|
|
<p>When building from source cloned from the BitBucket
|
|
repository, you also need to install <span class=
|
|
"application">autoconf</span>, <span class=
|
|
"application">automake</span>, and <span class=
|
|
"application">libtool</span> and you must execute
|
|
<code class="literal">sh autogen.sh</code> in the top
|
|
source directory before running <code class=
|
|
"literal">configure</code>.</p>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.BUILD.SOLARIS" id=
|
|
"RCL.INSTALL.BUILDING.BUILD.SOLARIS"></a>5.3.2.1. Building
|
|
on Solaris</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>We did not test building the GUI on Solaris for
|
|
recent versions. You will need at least Qt 4.4. There
|
|
are some hints on <a class="ulink" href=
"http://www.recoll.org/download-1.14.html" target=
"_top">an old web site page</a>, which may still be
|
|
valid.</p>
|
|
|
|
<p>Someone did test the 1.19 indexer and Python module
builds, and they do work, with a few minor glitches. Be sure
|
|
to use GNU <span class=
|
|
"command"><strong>make</strong></span> and <span class=
|
|
"command"><strong>install</strong></span>.</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.INSTALL" id=
|
|
"RCL.INSTALL.BUILDING.INSTALL"></a>5.3.3. Installation</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Either type <strong class="userinput"><code>make
|
|
install</code></strong> or execute <strong class=
|
|
"userinput"><code>recollinstall <em class=
|
|
"replaceable"><code>prefix</code></em></code></strong>,
|
|
in the root of the source tree. This will copy the
|
|
commands to <code class="filename"><em class=
|
|
"replaceable"><code>prefix</code></em>/bin</code> and the
|
|
sample configuration files, scripts and other shared data
|
|
to <code class="filename"><em class=
|
|
"replaceable"><code>prefix</code></em>/share/recoll</code>.</p>
|
|
|
|
<p>If the installation prefix given to <span class=
|
|
"command"><strong>recollinstall</strong></span> is
|
|
different from either the system default or the value
|
|
which was specified when executing <span class=
|
|
"command"><strong>configure</strong></span> (as in
|
|
<strong class="userinput"><code>configure --prefix
|
|
/some/path</code></strong>), you will have to set the
|
|
<code class="envar">RECOLL_DATADIR</code> environment
|
|
variable to indicate where the shared data is to be found
|
|
(e.g. for (ba)sh: <strong class="userinput"><code>export
|
|
RECOLL_DATADIR=/some/path/share/recoll</code></strong>).</p>
|
|
|
|
<p>You can then proceed to <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">configuration</a>.</p>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.CONFIG" id=
|
|
"RCL.INSTALL.CONFIG"></a>5.4. Configuration
|
|
overview</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Most of the parameters specific to the <span class=
|
|
"command"><strong>recoll</strong></span> GUI are set
|
|
through the <span class="guilabel">Preferences</span> menu
|
|
and stored in the standard Qt place (<code class=
|
|
"filename">$HOME/.config/Recoll.org/recoll.conf</code>).
|
|
You probably do not want to edit this by hand.</p>
|
|
|
|
<p><span class="application">Recoll</span> indexing options
|
|
are set inside text configuration files located in a
|
|
configuration directory. There can be several such
|
|
directories, each of which defines the parameters for one
|
|
index.</p>
|
|
|
|
<p>The configuration files can be edited by hand or through
|
|
the <span class="guilabel">Index configuration</span>
|
|
dialog (<span class="guilabel">Preferences</span> menu).
|
|
The GUI tool will try to respect your formatting and
|
|
comments as much as possible, so it is quite possible to
|
|
use both approaches on the same configuration.</p>
|
|
|
|
<p>The most accurate documentation for the configuration
|
|
parameters is given by comments inside the default files,
|
|
and we will just give a general overview here.</p>
|
|
|
|
<p>For each index, there are at least two sets of
|
|
configuration files. System-wide configuration files are
|
|
kept in a directory named like <code class=
|
|
"filename">/usr/share/recoll/examples</code>, and define
|
|
default values, shared by all indexes. For each index, a
|
|
parallel set of files defines the customized
|
|
parameters.</p>
|
|
|
|
<p>The default location of the customized configuration is
|
|
the <code class="filename">.recoll</code> directory in your
|
|
home. Most people will only use this directory.</p>
|
|
|
|
<p>This location can be changed, or others can be added
|
|
with the <code class="envar">RECOLL_CONFDIR</code>
|
|
environment variable or the <code class="option">-c</code>
|
|
option parameter to <span class=
|
|
"command"><strong>recoll</strong></span> and <span class=
|
|
"command"><strong>recollindex</strong></span>.</p>
|
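<p>The choice of configuration directory described above can be sketched as follows. This is only an illustration: the function name <code class="literal">resolve_confdir</code> is hypothetical, and the precedence shown (the <code class="option">-c</code> value first, then <code class="envar">RECOLL_CONFDIR</code>, then <code class="filename">~/.recoll</code>) is an assumption, as the text above does not state which of the first two wins when both are set.</p>

```python
import os


def resolve_confdir(cli_confdir=None, environ=None):
    """Pick the configuration directory for one run.

    Assumed precedence: -c option, then RECOLL_CONFDIR, then ~/.recoll.
    """
    if environ is None:
        environ = os.environ
    if cli_confdir:
        return cli_confdir
    if environ.get("RECOLL_CONFDIR"):
        return environ["RECOLL_CONFDIR"]
    return os.path.join(os.path.expanduser("~"), ".recoll")
```

<p>A wrapper script managing several indexes could call this once per index and pass the result to <span class="command"><strong>recollindex</strong></span> with <code class="option">-c</code>.</p>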
|
|
|
<p>In addition (as of <span class=
|
|
"application">Recoll</span> version 1.19.7), it is possible
|
|
to specify two additional configuration directories which
|
|
will be stacked before and after the user configuration
|
|
directory. These are defined by the <code class=
|
|
"envar">RECOLL_CONFTOP</code> and <code class=
|
|
"envar">RECOLL_CONFMID</code> environment variables. Values
|
|
from configuration files inside the top directory will
|
|
override user ones, values from configuration files inside
|
|
the middle directory will override system ones and be
|
|
overridden by user ones. These two variables may be of use
|
|
to applications which augment <span class=
|
|
"application">Recoll</span> functionality, and need to add
|
|
configuration data without disturbing the user's files.
|
|
Please note that the two, currently single, values will
|
|
probably be interpreted as colon-separated lists in the
|
|
future: do not use colon characters inside the directory
|
|
paths.</p>
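<p>The resulting priority order (top, then user, then middle, then system) can be pictured with a small Python 3 sketch based on the standard <code class="literal">collections.ChainMap</code> class. This is not how <span class="application">Recoll</span> implements the lookup. The function name <code class="literal">stacked_config</code>, the parameter <code class="literal">someparam</code> and all the values are made up for the illustration.</p>

```python
from collections import ChainMap


def stacked_config(system, user, top=None, middle=None):
    """Combine per-directory settings; the first layer holding a name wins."""
    # Priority order described above: top, user, middle, system.
    return ChainMap(top or {}, user, middle or {}, system)


# Illustrative values only.
system = {"topdirs": "~", "defaultcharset": "iso-8859-1"}
middle = {"defaultcharset": "utf-8", "someparam": "from-middle"}
user = {"topdirs": "~/docs /usr/share/doc"}
top = {"someparam": "from-top"}

config = stacked_config(system, user, top=top, middle=middle)
print(config["topdirs"])         # the user value hides the system one
print(config["defaultcharset"])  # the middle value hides the system one
print(config["someparam"])       # the top value hides the middle one
```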
|
|
|
|
<p>If the <code class="filename">.recoll</code> directory
|
|
does not exist when <span class=
|
|
"command"><strong>recoll</strong></span> or <span class=
|
|
"command"><strong>recollindex</strong></span> are started,
|
|
it will be created with a set of empty configuration files.
|
|
<span class="command"><strong>recoll</strong></span> will
|
|
give you a chance to edit the configuration file before
|
|
starting indexing. <span class=
|
|
"command"><strong>recollindex</strong></span> will proceed
|
|
immediately. To avoid mistakes, the automatic directory
|
|
creation will only occur for the default location, not if
|
|
<code class="option">-c</code> or <code class=
|
|
"envar">RECOLL_CONFDIR</code> were used (in the latter
|
|
cases, you will have to create the directory).</p>
|
|
|
|
<p>All configuration files share the same format. For
|
|
example, a short extract of the main configuration file
|
|
might look as follows:</p>
|
|
<pre class="programlisting">
|
|
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc

[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
|
|
|
|
</pre>
|
|
|
|
<p>There are three kinds of lines:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Comment (starts with <span class=
|
|
"emphasis"><em>#</em></span>) or empty.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Parameter assignment (<span class=
|
|
"emphasis"><em>name = value</em></span>).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Section definition ([<span class=
|
|
"emphasis"><em>somedirname</em></span>]).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Long lines can be broken by ending each incomplete part
|
|
with a backslash (<code class="literal">\</code>).</p>
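<p>The three kinds of lines and the continuation rule can be illustrated with a short Python 3 parser. This is a simplified sketch written for this manual, not the code used by <span class="application">Recoll</span>: the function name <code class="literal">parse_config</code> and the choice of storing top-level parameters under an empty section name are assumptions, and quoting inside values is left alone.</p>

```python
def parse_config(text):
    """Parse Recoll-style configuration text.

    Returns {section: {name: value}}. Parameters set before any section
    definition are stored under the empty string key.
    """
    # Join continuation lines: a backslash at the end of a physical line
    # means that the logical line goes on.
    logical_lines = []
    pending = ""
    for physical in text.splitlines():
        if physical.endswith("\\"):
            pending += physical[:-1]
            continue
        logical_lines.append(pending + physical)
        pending = ""
    if pending:
        logical_lines.append(pending)

    config = {"": {}}
    section = ""
    for line in logical_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # comment or empty line
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()      # section definition
            config.setdefault(section, {})
            continue
        if "=" in line:
            name, value = line.split("=", 1)  # parameter assignment
            config[section][name.strip()] = value.strip()
    return config


sample = (
    "# Space-separated list of directories to index.\n"
    "topdirs = ~/docs /usr/share/doc\n"
    "\n"
    "[~/somedirectory-with-utf8-txt-files]\n"
    "defaultcharset = utf-8\n"
)
config = parse_config(sample)
print(config[""]["topdirs"])
print(config["~/somedirectory-with-utf8-txt-files"]["defaultcharset"])
```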
|
|
|
|
<p>Depending on the type of configuration file, section
|
|
definitions either separate groups of parameters or allow
|
|
redefining some parameters for a directory sub-tree. They
|
|
stay in effect until another section definition, or the end
|
|
of file, is encountered. Some of the parameters used for
|
|
indexing are looked up hierarchically from the current
|
|
directory location upwards. Not all parameters can be
|
|
meaningfully redefined; this is specified for each in the
|
|
next section.</p>
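<p>The upwards lookup can be pictured as follows. This Python 3 sketch assumes that the configuration has already been read into a dictionary mapping section names (absolute directory paths, with any leading tilde already expanded) to their parameters, with top-level parameters under the empty name. The function name <code class="literal">lookup</code> and the sample paths are hypothetical.</p>

```python
import posixpath


def lookup(name, path, config):
    """Find 'name' for 'path', trying the closest enclosing section first."""
    current = posixpath.normpath(path)
    while True:
        section = config.get(current, {})
        if name in section:
            return section[name]
        parent = posixpath.dirname(current)
        if parent == current:       # reached the root: stop climbing
            break
        current = parent
    # Fall back to the top-level (no section) value, if any.
    return config.get("", {}).get(name)


config = {
    "": {"defaultcharset": "iso-8859-1"},
    "/home/me/utf8": {"defaultcharset": "utf-8"},
}
print(lookup("defaultcharset", "/home/me/utf8/sub/dir", config))
print(lookup("defaultcharset", "/home/me/other", config))
```

<p>The first call finds the value in the enclosing <code class="filename">/home/me/utf8</code> section, and the second one falls back to the top-level value.</p>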
|
|
|
|
<p>When found at the beginning of a file path, the tilde
|
|
character (~) is expanded to the name of the user's home
|
|
directory, as a shell would do.</p>
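<p>A script which reads path values from a configuration file can reproduce this behaviour with the standard <code class="literal">os.path.expanduser</code> function. In this Python sketch, the function name <code class="literal">expand_paths</code> is hypothetical, and the value is split on white space only, so quoted elements with embedded spaces are not handled.</p>

```python
import os.path


def expand_paths(value):
    """Expand a leading ~ in each white-space separated path of a value."""
    return [os.path.expanduser(p) for p in value.split()]


# A tilde is only expanded at the beginning of a path.
print(expand_paths("~/docs /usr/share/doc /a/~b"))
```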
|
|
|
|
<p>Some parameters are lists of strings. White space is
|
|
used for separation. List elements with embedded spaces can
|
|
be quoted using double-quotes. Double quotes inside these
|
|
elements can be escaped with a backslash.</p>
|
|
|
|
<p>No value inside a configuration file can contain a
|
|
newline character. Long lines can be continued by escaping
|
|
the physical newline with backslash, even inside quoted
|
|
strings.</p>
|
|
<pre class="programlisting">
astringlist = "some string \
with spaces"
thesame = "some string with spaces"
</pre>
|
|
|
|
<p>Parameters which are not part of string lists can't be
|
|
quoted, and leading and trailing space characters are
|
|
stripped before the value is used.</p>
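<p>The quoting rules above can be summarized with a short
sketch. The paths are hypothetical and only serve to show
the syntax:</p>
<pre class="programlisting">
# String list: the second element has an embedded space,
# the third an embedded double quote escaped with a backslash
topdirs = ~/docs "/data/my documents" "/data/odd\"name"

# Not a string list: no quoting, surrounding blanks are stripped
dbdir = /data/recoll index/xapiandb
</pre>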
|
|
|
|
<p><b>Encoding issues. </b>Most of the configuration
|
|
parameters are plain ASCII. Two particular sets of values
|
|
may cause encoding issues:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>File path parameters may contain non-ASCII
|
|
characters and should use the exact same byte values
|
|
as found in the file system directory. Usually, this
|
|
means that the configuration file should use the
|
|
system default locale encoding.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>The <code class="envar">unac_except_trans</code>
|
|
parameter should be encoded in UTF-8. If your system
|
|
locale is not UTF-8, and you need to also specify
|
|
non-ASCII file paths, this poses a difficulty because
|
|
common text editors cannot handle multiple encodings
|
|
in a single file. In this relatively unlikely case,
|
|
you can edit the configuration file as two separate
|
|
text files with appropriate encodings, and
|
|
concatenate them to create the complete
|
|
configuration.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
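<p>As a rough sketch of the two-file approach, assuming the
default <code class="filename">~/.recoll</code> configuration
directory and two hypothetical part files, the complete
configuration could be assembled from the shell like
this:</p>
<pre class="programlisting">
# paths.part: saved in the system locale encoding (topdirs, skippedPaths...)
# unac.part:  saved in UTF-8, holds only the unac_except_trans line
cat paths.part unac.part &gt; ~/.recoll/recoll.conf
</pre>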
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.ENVIR" id=
|
|
"RCL.INSTALL.CONFIG.ENVIR"></a>5.4.1. Environment
|
|
variables</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_CONFDIR</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Defines the main configuration directory.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_TMPDIR, TMPDIR</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Locations for temporary files, in this order of
|
|
priority. The default if none of these is set is to
|
|
use <code class="filename">/tmp</code>. Big
|
|
temporary files may be created during indexing,
|
|
mostly for decompressing, and also for processing,
|
|
e.g. email attachments.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_CONFTOP,
|
|
RECOLL_CONFMID</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Allow adding configuration directories with
priorities below and above the user directory (see
the Configuration overview section above for
details).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_EXTRA_DBS,
|
|
RECOLL_ACTIVE_EXTRA_DBS</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Help for setting up external indexes. See
|
|
<a class="link" href="#RCL.SEARCH.GUI.MULTIDB"
|
|
title="3.1.10. Multiple indexes">this
|
|
paragraph</a> for explanations.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_DATADIR</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Defines a replacement for the default location of
Recoll data files, normally found in, e.g.,
<code class=
"filename">/usr/share/recoll</code>.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_FILTERSDIR</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Defines a replacement for the default location of
Recoll filters, normally found in, e.g.,
<code class=
"filename">/usr/share/recoll/filters</code>.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">ASPELL_PROG</code></span></dt>
|
|
|
|
<dd>
|
|
<p><span class=
|
|
"command"><strong>aspell</strong></span> program to
|
|
use for creating the spelling dictionary. The
|
|
result has to be compatible with the <code class=
|
|
"filename">libaspell</code> which <span class=
|
|
"application">Recoll</span> is using.</p>
|
|
</dd>
|
|
|
|
|
|
</dl>
|
|
</div>
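<p>These variables are read from the environment of the
<span class="application">Recoll</span> commands. A minimal
illustration, in which the directory names are hypothetical,
would be to set them on the command line when starting the
indexer:</p>
<pre class="programlisting">
# Run the indexer against an alternate configuration directory,
# keeping big temporary files away from /tmp
RECOLL_CONFDIR=~/.recoll-test RECOLL_TMPDIR=/var/tmp recollindex
</pre>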
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF"></a>5.4.2. The
|
|
main configuration file, recoll.conf</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><code class="filename">recoll.conf</code> is the main
|
|
configuration file. It defines things like what to index
|
|
(top directories and things to ignore), and the default
|
|
character set to use for document types which do not
|
|
specify it internally.</p>
|
|
|
|
<p>The default configuration will index your home
|
|
directory. If this is not appropriate, start <span class=
|
|
"command"><strong>recoll</strong></span> to create a
|
|
blank configuration, click <span class=
|
|
"guimenu">Cancel</span>, and edit the configuration file
|
|
before restarting the command. The restart will begin the
initial indexing, which may take some time.</p>
|
|
|
|
<p>Most of the following parameters can be changed from
|
|
the <span class="guilabel">Index Configuration</span>
|
|
menu in the <span class=
|
|
"command"><strong>recoll</strong></span> interface. Some
|
|
can only be set by editing the configuration file.</p>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILES"></a>5.4.2.1. Parameters
|
|
affecting what documents we index:</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"></a><span class="term"><code class="varname">topdirs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Specifies the list of directories or files to
|
|
index (recursively for directories). You can use
|
|
symbolic links as elements of this list. See the
|
|
<code class="varname">followLinks</code> option
|
|
about following symbolic links found under the
|
|
top elements (not followed by default).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">skippedNames</code></span></dt>
|
|
|
|
<dd>
|
|
<p>A space-separated list of wildcard patterns for
|
|
names of files or directories that should be
|
|
completely ignored. The list defined in the
|
|
default file is:</p>
|
|
<pre class="programlisting">
skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
*~ .beagle .git .hg .bzr loop.ps .xsession-errors \
.recoll* xapiandb recollrc recoll.conf
</pre>
|
|
|
|
<p>The list can be redefined at any sub-directory
|
|
in the indexed area.</p>
|
|
|
|
<p>The top-level directories are not affected by
|
|
this list (that is, a directory in <code class=
|
|
"varname">topdirs</code> might match and would
|
|
still be indexed).</p>
|
|
|
|
<p>The list in the default configuration does not
|
|
exclude hidden directories (names beginning with
|
|
a dot), which means that it may index quite a few
|
|
things that you do not want. On the other hand,
|
|
email user agents like <span class=
|
|
"application">thunderbird</span> usually store
|
|
messages in hidden directories, and you probably
|
|
want this indexed. One possible solution is to
|
|
have <code class="filename">.*</code> in
|
|
<code class="varname">skippedNames</code>, and
|
|
add things like <code class=
|
|
"filename">~/.thunderbird</code> or <code class=
|
|
"filename">~/.evolution</code> in <code class=
|
|
"varname">topdirs</code>.</p>
|
|
|
|
<p>Not even the file names are indexed for
|
|
patterns in this list. See the <code class=
|
|
"varname">noContentSuffixes</code> variable for
|
|
an alternative approach which indexes the file
|
|
names.</p>
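<p>The approach to hidden directories suggested above
could look roughly as follows. This is only a sketch: the
pattern list is shortened for illustration and should be
adapted from the default one.</p>
<pre class="programlisting">
# Index the home directory plus selected hidden directories.
# Top-level directories are not affected by skippedNames.
topdirs = ~ ~/.thunderbird ~/.evolution

# Skip every other name beginning with a dot, and a few more
skippedNames = .* #* *~ bin CVS Cache cache* tmp xapiandb
</pre>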
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">noContentSuffixes</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This is a list of file name endings (not
|
|
wildcard expressions, nor dot-delimited
|
|
suffixes). Only the names of matching files will
|
|
be indexed (no attempt at MIME type
|
|
identification, no decompression, no content
|
|
indexing). This can be redefined for
|
|
subdirectories, and edited from the GUI. The
|
|
default value is:</p>
|
|
<pre class="programlisting">
noContentSuffixes = .md5 .map \
.o .lib .dll .a .sys .exe .com \
.mpp .mpt .vsd \
.img .img.gz .img.bz2 .img.xz .image .image.gz .image.bz2 .image.xz \
.dat .bak .rdf .log.gz .log .db .msf .pid \
,v ~ #
</pre>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">skippedPaths</code> and <code class=
|
|
"varname">daemSkippedPaths</code></span></dt>
|
|
|
|
<dd>
|
|
<p>A space-separated list of patterns for
|
|
<span class="emphasis"><em>paths</em></span> of
|
|
files or directories that should be skipped.
|
|
There is no default in the sample configuration
|
|
file, but the code always adds the configuration
|
|
and database directories in there.</p>
|
|
|
|
<p><code class="varname">skippedPaths</code> is
|
|
used both by batch and real time indexing.
|
|
<code class="varname">daemSkippedPaths</code> can
|
|
be used to specify things that should be indexed
|
|
at startup, but not monitored.</p>
|
|
|
|
<p>Example of use for skipping text files only in
|
|
a specific directory:</p>
|
|
<pre class="programlisting">
skippedPaths = ~/somedir/*.txt
</pre>
|
|
</dd>
|
|
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHSFNMPATHNAME"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHSFNMPATHNAME">
|
|
</a><span class="term"><code class=
|
|
"varname">skippedPathsFnmPathname</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The values in the <code class=
|
|
"varname">*skippedPaths</code> variables are
|
|
matched by default with <code class=
|
|
"literal">fnmatch(3)</code>, with the
|
|
FNM_PATHNAME flag. This means that '/' characters
|
|
must be matched explicitly. You can set
|
|
<code class=
|
|
"varname">skippedPathsFnmPathname</code> to 0 to
|
|
disable the use of FNM_PATHNAME (meaning that
|
|
/*/dir3 will match /dir1/dir2/dir3).</p>
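<p>The difference between the two matching modes can be
illustrated with the pattern used above (the directory
names are placeholders):</p>
<pre class="programlisting">
# Default behaviour (FNM_PATHNAME in effect): '*' does not match '/',
# so this skips /dir1/dir3 but not /dir1/dir2/dir3
skippedPaths = /*/dir3

# With the flag disabled, the same pattern also matches /dir1/dir2/dir3
skippedPathsFnmPathname = 0
</pre>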
|
|
</dd>
|
|
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ZIPSKIPPEDNAMES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ZIPSKIPPEDNAMES"></a><span class="term"><code class="varname">zipSkippedNames</code></span></dt>
|
|
|
|
<dd>
|
|
<p>A space-separated list of patterns for names
|
|
of files or directories that should be ignored
|
|
inside zip archives. This is used directly by the
|
|
zip handler, and has a function similar to
|
|
skippedNames, but works independently. Can be
|
|
redefined for filesystem subdirectories. For
|
|
versions up to 1.19, you will need to update the
|
|
Zip handler and install a supplementary Python
|
|
module. The details are described <a class=
|
|
"ulink" href=
|
|
"https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members"
|
|
target="_top">on the <span class=
|
|
"application">Recoll</span> wiki</a>.</p>
|
|
</dd>
|
|
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FOLLOWLINKS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FOLLOWLINKS"></a><span class="term"><code class="varname">followLinks</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Specifies if the indexer should follow
|
|
symbolic links while walking the file tree. The
|
|
default is to ignore symbolic links to avoid
|
|
multiple indexing of linked files. No effort is
|
|
made to avoid duplication when this option is set
|
|
to true. This option can be set individually for
|
|
each of the <code class="varname">topdirs</code>
|
|
members by using sections. It cannot be changed
|
|
below the <code class="varname">topdirs</code>
|
|
level.</p>
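<p>A per-top-directory setting might be written as below.
The directory names are hypothetical, and the value
<code class="literal">1</code> is assumed to stand for
true, as for the other boolean parameters in this
section:</p>
<pre class="programlisting">
topdirs = ~/docs ~/projects

# Follow symbolic links only under the second top directory
[~/projects]
followLinks = 1
</pre>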
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">indexedmimetypes</code></span></dt>
|
|
|
|
<dd>
|
|
<p><span class="application">Recoll</span>
|
|
normally indexes any file which it knows how to
|
|
read. This list lets you restrict the indexed
|
|
MIME types to what you specify. If the variable
|
|
is unspecified or the list empty (the default),
|
|
all supported types are processed. Can be
|
|
redefined for subdirectories.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">excludedmimetypes</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This list lets you exclude some MIME types
|
|
from indexing. Can be redefined for
|
|
subdirectories.</p>
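<p>As an illustration of how
<code class="varname">indexedmimetypes</code> and
<code class="varname">excludedmimetypes</code> can be
redefined for subdirectories, here is a sketch using
hypothetical directory names:</p>
<pre class="programlisting">
# Only index HTML and PDF documents under this subtree
[~/docs/reference]
indexedmimetypes = text/html application/pdf

# Index every supported type here, except JPEG images
[~/docs/photos]
excludedmimetypes = image/jpeg
</pre>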
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">compressedfilemaxkbs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Size limit for compressed (.gz or .bz2) files.
|
|
These need to be decompressed in a temporary
|
|
directory for identification, which can be very
|
|
wasteful if 'uninteresting' big compressed files
|
|
are present. Negative means no limit, 0 means no
|
|
processing of any compressed file. Defaults to
|
|
-1.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">textfilemaxmbs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Maximum size for text files. Very big text
|
|
files are often uninteresting logs. Set to -1 to
|
|
disable (default 20MB).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">textfilepagekbs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>If set to other than -1, text files will be
|
|
indexed as multiple documents of the given page
|
|
size. This may be useful if you do want to index
|
|
very big text files as it will both reduce memory
|
|
usage at index time and help with loading data to
|
|
the preview window. A size of a few megabytes
|
|
would seem reasonable (default: 1MB).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">membermaxkbs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This defines the maximum size in kilobytes for
|
|
an archive member (zip, tar or rar at the
|
|
moment). Bigger entries will be skipped.</p>
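<p>The size limits described in this and the preceding
entries could be combined as follows. The values are
arbitrary examples, not recommendations; note the unit
implied by each parameter name (kilobytes or
megabytes):</p>
<pre class="programlisting">
# Do not decompress compressed files bigger than about 20 MB
compressedfilemaxkbs = 20000
# Ignore text files bigger than 50 MB
textfilemaxmbs = 50
# Index big text files as a series of 2 MB documents
textfilepagekbs = 2000
# Skip archive members bigger than about 50 MB
membermaxkbs = 50000
</pre>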
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">indexallfilenames</code></span></dt>
|
|
|
|
<dd>
|
|
<p><span class="application">Recoll</span>
|
|
indexes file names in a special section of the
|
|
database to allow specific file name searches
|
|
using wild cards. This parameter decides if file
|
|
name indexing is performed only for files with
|
|
MIME types that would qualify them for full text
|
|
indexing, or for all files inside the selected
|
|
subtrees, independently of MIME type.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">usesystemfilecommand</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Decide if we execute a system command
|
|
(<span class=
|
|
"command"><strong>file</strong></span>
|
|
<code class="option">-i</code> by default) as a
|
|
final step for determining the MIME type for a
|
|
file (the main procedure uses suffix associations
|
|
as defined in the <code class=
|
|
"filename">mimemap</code> file). This can be
|
|
useful for files with suffix-less names, but it
|
|
will also cause the indexing of many bogus "text"
|
|
files.</p>
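<p>A possible setup, together with the
<code class="varname">systemfilecommand</code> parameter
described in the next entry, is sketched below. The exact
<span class="command"><strong>xdg-mime</strong></span>
invocation is an assumption and should be checked against
the version installed on your system:</p>
<pre class="programlisting">
# Fall back on an external command when the suffix gives no answer
usesystemfilecommand = 1
systemfilecommand = xdg-mime query filetype
</pre>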
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">systemfilecommand</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Command to use for MIME type
determination if <code class=
"literal">usesystemfilecommand</code> is set.
|
|
Recent versions of <span class=
|
|
"command"><strong>xdg-mime</strong></span>
|
|
sometimes work better than <span class=
|
|
"command"><strong>file</strong></span>.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">processwebqueue</code></span></dt>
|
|
|
|
<dd>
|
|
<p>If this is set, process the directory where
|
|
Web browser plugins copy visited pages for
|
|
indexing.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">webqueuedir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The path to the web indexing queue. This is
|
|
hard-coded in the Firefox plugin as <code class=
|
|
"filename">~/.recollweb/ToIndex</code> so there
|
|
should be no need to change it.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TERMS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TERMS"></a>5.4.2.2. Parameters
|
|
affecting how we generate terms:</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Changing some of these parameters will require a full
|
|
reindex. Also, when using multiple indexes, it may not
|
|
make sense to search indexes that don't share the
|
|
values for these parameters, because they usually
|
|
affect both search and index operations.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">indexStripChars</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Decide if we strip characters of diacritics
|
|
and convert them to lower-case before terms are
|
|
indexed. If we don't, searches sensitive to case
|
|
and diacritics can be performed, but the index
|
|
will be bigger, and some marginal weirdness may
|
|
sometimes occur. The default is a stripped index
|
|
(<code class="literal">indexStripChars =
|
|
1</code>) for now. When using multiple indexes
|
|
for a search, this parameter must be defined
|
|
identically for all. Changing the value implies
|
|
an index reset.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">maxTermExpand</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Maximum expansion count for a single term
|
|
(e.g.: when using wildcards). The default of
|
|
10000 is reasonable and will avoid queries that
|
|
appear frozen while the engine is walking the
|
|
term list.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">maxXapianClauses</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Maximum number of elementary clauses we can
|
|
add to a single Xapian query. In some cases, the
|
|
result of term expansion can be multiplicative,
|
|
and we want to avoid using excessive memory. The
|
|
default of 100 000 should be both high enough in
|
|
most cases and compatible with current typical
|
|
hardware configurations.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">nonumbers</code></span></dt>
|
|
|
|
<dd>
|
|
<p>If this is set to true, no terms will be
generated for numbers. For example "123",
"1.5e6" or "192.168.1.4" would not be indexed
("value123" would still be). Numbers are often
quite interesting to search for, and this should
probably not be set except for special
situations, e.g. scientific documents with huge
amounts of numbers in them. This can only be set
for a whole index, not for a subtree.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">dehyphenate</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Determines if, given an input of <code class=
|
|
"literal">co-worker</code>, we add a term for
|
|
<code class="literal">coworker</code>. This
|
|
possibility is new in version 1.22, and on by
|
|
default. Setting the variable to off allows
|
|
restoring the previous behaviour.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">nocjk</code></span></dt>
|
|
|
|
<dd>
|
|
<p>If this is set to true, specific East Asian
(Chinese, Japanese, Korean) character/word
splitting is turned off. This will save a small
amount of CPU if you have no CJK documents. If
|
|
your document base does include such text but you
|
|
are not interested in searching it, setting
|
|
<code class="varname">nocjk</code> may be a
|
|
significant time and space saver.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">cjkngramlen</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This lets you adjust the size of n-grams used
|
|
for indexing CJK text. The default value of 2 is
|
|
probably appropriate in most cases. A value of 3
|
|
would allow more precision and efficiency on
|
|
longer words, but the index will be approximately
|
|
twice as large.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">indexstemminglanguages</code></span></dt>
|
|
|
|
<dd>
|
|
<p>A list of languages for which the stem
|
|
expansion databases will be built. See
|
|
<span class="citerefentry"><span class=
|
|
"refentrytitle">recollindex</span>(1)</span> or
|
|
use the <span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
<code class="option">-l</code> command for
|
|
possible values. You can add a stem expansion
|
|
database for a different language by using
|
|
<span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
<code class="option">-s</code>, but it will be
|
|
deleted during the next indexing. Only languages
|
|
listed in the configuration file are
|
|
permanent.</p>
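<p>For example, to keep stem expansion databases for two
languages across indexing runs (the language names shown
are assumptions, use the ones reported by
<span class="command"><strong>recollindex</strong></span>
<code class="option">-l</code>):</p>
<pre class="programlisting">
indexstemminglanguages = english french
</pre>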
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">defaultcharset</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The name of the character set used for files
|
|
that do not contain a character set definition
|
|
(i.e. plain text files). This can be redefined for
any sub-directory. If it is not set at all, the
character set used is the one defined by the NLS
environment (<code class="envar">LC_ALL</code>,
|
|
<code class="envar">LC_CTYPE</code>, <code class=
|
|
"envar">LANG</code>), or <code class=
|
|
"literal">iso8859-1</code> if nothing is set.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">unac_except_trans</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This is a list of characters, encoded in
|
|
UTF-8, which should be handled specially when
|
|
converting text to unaccented lowercase. For
|
|
example, in Swedish, the letter <code class=
|
|
"literal">a with diaeresis</code> has full
|
|
alphabet citizenship and should not be turned
|
|
into an <code class="literal">a</code>. Each
|
|
element in the space-separated list has the
|
|
special character as first element and the
|
|
translation following. The handling of both the
|
|
lowercase and upper-case versions of a character
|
|
should be specified, as membership in the list
turns off both standard accent and case
processing. Example for Swedish:</p>
|
|
<pre class="programlisting">
unac_except_trans = &aring;&aring; &Aring;&aring; &auml;&auml; &Auml;&auml; &ouml;&ouml; &Ouml;&ouml;
</pre>
|
|
|
|
<p>Note that the translation is not limited to a
single character: you could very well have
something like <code class=
"literal">&uuml;ue</code> in the list.</p>
|
|
|
|
<p>The default value set for <code class=
|
|
"literal">unac_except_trans</code> can't be
|
|
listed here because I have trouble with SGML and
|
|
UTF-8, but it only contains ligature
|
|
decompositions: German ss, oe, ae, fi, fl.</p>
|
|
|
|
<p>This parameter can't be defined for
|
|
subdirectories, it is global, because there is no
|
|
way to do otherwise when querying. If you have
|
|
document sets which would need different values,
|
|
you will have to index and query them
|
|
separately.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">maildefcharset</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This can be used to define the default
|
|
character set specifically for email messages
|
|
which don't specify it. This is mainly useful for
|
|
readpst (libpst) dumps, which are utf-8 but do
|
|
not say so.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">localfields</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This allows setting fields for all documents
|
|
under a given directory. Typical usage would be
|
|
to set an "rclaptg" field, to be used in
|
|
<code class="filename">mimeview</code> to select
|
|
a specific viewer. If several fields are to be
|
|
set, they should be separated with a semi-colon
|
|
(';') character, which there is currently no way
|
|
to escape. Also note the initial semi-colon.
|
|
Example: <code class="literal">localfields=
|
|
;rclaptg=gnus;other = val</code>, then select the
specific viewer with <code class=
|
|
"literal">mimetype|tag=...</code> in <code class=
|
|
"filename">mimeview</code>.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">testmodifusemtime</code></span></dt>
|
|
|
|
<dd>
|
|
<p>If true, use mtime instead of default ctime to
|
|
determine if a file has been modified (in
|
|
addition to size, which is always used). Setting
|
|
this can reduce re-indexing on systems where
|
|
extended attributes are modified (by some other
|
|
application), but not indexed (changing extended
|
|
attributes only affects ctime). Notes:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>This may prevent detection of change in
|
|
some marginal file rename cases (the target
|
|
would need to have the same size and
|
|
mtime).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>You should probably also set
|
|
noxattrfields to 1 in this case, except if
|
|
you still prefer to perform xattr indexing,
|
|
for example if the local file update
|
|
pattern makes it of value (as in general,
|
|
there is a risk for pure extended
|
|
attributes updates without file
|
|
modification to go undetected).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Perform a full index reset after changing the
|
|
value of this parameter.</p>
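<p>Following the notes above, a configuration which
ignores extended attribute changes altogether might
read:</p>
<pre class="programlisting">
# Detect modifications from mtime and size, not ctime
testmodifusemtime = 1
# Do not translate extended attributes into document fields
noxattrfields = 1
</pre>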
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">noxattrfields</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Recoll versions 1.19 and later automatically
|
|
translate file extended attributes into document
|
|
fields (to be processed according to the
|
|
parameters from the <code class=
|
|
"filename">fields</code> file). Setting this
|
|
variable to 1 will disable the behaviour.</p>
|
|
</dd>
|
|
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS"></a><span class="term"><code class="varname">metadatacmds</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This allows executing external commands for
|
|
each file and storing the output in <span class=
|
|
"application">Recoll</span> document fields. This
|
|
could be used for example to index external tag
|
|
data. The value is a list of field names and
|
|
commands; don't forget the initial semi-colon.
|
|
Example:</p>
|
|
<pre class="programlisting">
[/some/area/of/the/fs]
metadatacmds = ; tags = tmsu tags %f; otherfield = somecmd -xx %f
</pre>
|
|
|
|
<p>As a specially disgusting hack brought by
|
|
<span class="application">Recoll</span> 1.19.7,
|
|
if a "field name" begins with <code class=
|
|
"literal">rclmulti</code>, the data returned by
|
|
the command is expected to contain multiple field
|
|
values, in configuration file format. This allows
|
|
setting several fields by executing a single
|
|
command. Example:</p>
|
|
<pre class="programlisting">
metadatacmds = ; rclmulti1 = somecmd %f
</pre>
|
|
|
|
<p>If <code class="literal">somecmd</code>
|
|
returns data in the form of:</p>
|
|
<pre class="programlisting">
field1 = value1
field2 = value for field2
</pre>
|
|
|
|
<p><code class="literal">field1</code> and
|
|
<code class="literal">field2</code> will be set
|
|
inside the document metadata.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.STORAGE" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.STORAGE"></a>5.4.2.3. Parameters
|
|
affecting where and how we store things:</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">cachedir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>When not explicitly specified, the
|
|
<span class="application">Recoll</span> data
|
|
directories are stored relative to the
|
|
configuration directory. If <code class=
|
|
"literal">cachedir</code> is set, the directories
|
|
are stored under the specified value instead
|
|
(e.g. if <code class="literal">cachedir</code> is
|
|
set to <code class=
|
|
"filename">~/.cache/recoll</code>, the default
|
|
<code class="literal">dbdir</code> would be
|
|
<code class=
|
|
"filename">~/.cache/recoll/xapiandb</code>
|
|
instead of <code class=
|
|
"filename">~/.recoll/xapiandb</code> ). This
|
|
affects the default values for <code class=
|
|
"literal">dbdir</code>, <code class=
|
|
"literal">webcachedir</code>, <code class=
|
|
"literal">mboxcachedir</code>, and <code class=
|
|
"literal">aspellDicDir</code>, which can still be
|
|
individually specified to override <code class=
|
|
"literal">cachedir</code>. Note that if you have
|
|
multiple configurations, each must have a
|
|
different <code class=
|
|
"literal">cachedir</code>.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">dbdir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The name of the Xapian data directory. It will
|
|
be created if needed when the index is
|
|
initialized. If this is not an absolute path, it
|
|
will be interpreted relative to the configuration
|
|
directory. The value can have embedded spaces but
|
|
leading or trailing spaces will be trimmed. You
|
|
cannot use quotes here.</p>
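<p>The interaction between
<code class="literal">cachedir</code> and
<code class="literal">dbdir</code> can be sketched as
follows. The second path is hypothetical and only shows
that an individual location can still be set
explicitly:</p>
<pre class="programlisting">
# Move all data directories out of the configuration directory.
# dbdir then defaults to ~/.cache/recoll/xapiandb
cachedir = ~/.cache/recoll

# Explicit value, overrides the cachedir-based default.
# Embedded spaces are allowed, quotes are not.
dbdir = /data/recoll index/xapiandb
</pre>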
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">idxstatusfile</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The name of the scratch file where the indexer
|
|
process updates its status. Default: <code class=
|
|
"filename">idxstatus.txt</code> inside the
|
|
configuration directory.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">maxfsoccuppc</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Maximum file system occupation before we stop
|
|
indexing. The value is a percentage,
|
|
corresponding to what the "Capacity" df output
|
|
column shows. The default value is 0, meaning no
|
|
checking.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">mboxcachedir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The directory where mbox message offsets cache
|
|
files are held. This is normally
|
|
$RECOLL_CONFDIR/mboxcache, but it may be useful
|
|
to share a directory between different
|
|
configurations.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">mboxcacheminmbs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The minimum mbox file size over which we cache
|
|
the offsets. There is really no sense in caching
|
|
offsets for small files. The default is 5 MB.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">webcachedir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This is only used by the web browser plugin
|
|
indexing code, and defines where the cache for
|
|
visited pages will live. Default: <code class=
|
|
"filename">$RECOLL_CONFDIR/webcache</code></p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">webcachemaxmbs</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This is only used by the web browser plugin
|
|
indexing code, and defines the maximum size for
|
|
the web page cache. Default: 40 MB. Quite
|
|
unfortunately, this is only taken into account
|
|
when creating the cache file. You need to delete
|
|
the file for a change to be taken into
|
|
account.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">idxflushmb</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Threshold (megabytes of new text data) where
|
|
we flush from memory to disk index. Setting this
|
|
can help control memory usage. A value of 0 means
|
|
no explicit flushing, letting Xapian use its own
|
|
default, which is flushing every 10000 (or
|
|
XAPIAN_FLUSH_THRESHOLD) documents, which gives
|
|
little memory usage control, as memory usage also
|
|
depends on average document size. The default
|
|
value is 10, and it is probably a bit low. If
|
|
your system usually has free memory, you can try
|
|
higher values between 20 and 80. In my
|
|
experience, values beyond 100 are always
|
|
counterproductive.</p>
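<p>For instance, on a system which usually has free
memory, a value in the suggested range could be tried (the
number itself is only an example):</p>
<pre class="programlisting">
# Flush to the disk index after every 50 MB of new text data
idxflushmb = 50
</pre>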
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXTHREADS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXTHREADS"></a>5.4.2.4. Parameters
|
|
affecting multithread processing</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>The <span class="application">Recoll</span> indexing
|
|
process <span class=
|
|
"command"><strong>recollindex</strong></span> can use
|
|
multiple threads to speed up indexing on multiprocessor
|
|
              systems. The work done to index files is divided into
|
|
several stages and some of the stages can be executed
|
|
by multiple threads. The stages are:</p>
|
|
|
|
<div class="orderedlist">
|
|
<ol class="orderedlist" type="1">
|
|
<li class="listitem">File system walking: this is
|
|
always performed by the main thread.</li>
|
|
|
|
<li class="listitem">File conversion and data
|
|
extraction.</li>
|
|
|
|
<li class="listitem">Text processing (splitting,
|
|
stemming, etc.)</li>
|
|
|
|
<li class="listitem"><span class=
|
|
"application">Xapian</span> index update.</li>
|
|
</ol>
|
|
</div>
|
|
|
|
<p>You can also read a <a class="ulink" href=
|
|
"http://www.recoll.org/idxthreads/threadingRecoll.html"
|
|
target="_top">longer document</a> about the
|
|
transformation of <span class=
|
|
"application">Recoll</span> indexing to
|
|
multithreading.</p>
|
|
|
|
<p>The threads configuration is controlled by two
|
|
configuration file parameters.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">thrQSizes</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This variable defines the job input queues
|
|
configuration. There are three possible queues
|
|
for stages 2, 3 and 4, and this parameter should
|
|
give the queue depth for each stage (three
|
|
integer values). If a value of -1 is used for a
|
|
given stage, no queue is used, and the thread
|
|
will go on performing the next stage. In
|
|
                    practice, deep queues have not been shown to
|
|
increase performance. A value of 0 for the first
|
|
queue tells <span class=
|
|
"application">Recoll</span> to perform
|
|
autoconfiguration (no need for the two other
|
|
values in this case) - this is the default
|
|
configuration.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">thrTCounts</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This defines the number of threads used for
|
|
each stage. If a value of -1 is used for one of
|
|
the queue depths, the corresponding thread count
|
|
is ignored. It makes no sense to use a value
|
|
other than 1 for the last stage because updating
|
|
the <span class="application">Xapian</span> index
|
|
is necessarily single-threaded (and protected by
|
|
a mutex).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
|
|
<p>The following example would use three queues (of
|
|
depth 2), and 4 threads for converting source
|
|
documents, 2 for processing their text, and one to
|
|
update the index. This was tested to be the best
|
|
              configuration on the test system (a four-processor machine with
|
|
multiple disks).</p>
|
|
<pre class="programlisting">
|
|
thrQSizes = 2 2 2
|
|
thrTCounts = 4 2 1
|
|
</pre>
|
|
|
|
<p>The following example would use a single queue, and
|
|
the complete processing for each document would be
|
|
performed by a single thread (several documents will
|
|
still be processed in parallel in most cases). The
|
|
threads will use mutual exclusion when entering the
|
|
              index update stage. In practice, the performance would
|
|
              be close to the previous case in general, but worse in
|
|
              certain cases (e.g. a Zip archive would be processed
|
|
purely sequentially), so the previous approach is
|
|
              preferred. YMMV... The last two values for thrTCounts are
|
|
ignored.</p>
|
|
<pre class="programlisting">
|
|
thrQSizes = 2 -1 -1
|
|
thrTCounts = 6 1 1
|
|
</pre>
|
|
|
|
<p>The following example would disable multithreading.
|
|
Indexing will be performed by a single thread.</p>
|
|
<pre class="programlisting">
|
|
thrQSizes = -1 -1 -1
|
|
</pre>
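
              <p>Conversely, autoconfiguration, which is the default,
              could be requested explicitly by setting the first queue
              depth to 0. This is a minimal sketch based on the
              <code class="varname">thrQSizes</code> description above;
              the two other values are not needed in this case.</p>
<pre class="programlisting">
thrQSizes = 0
</pre>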
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MISC" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MISC"></a>5.4.2.5. Miscellaneous
|
|
parameters:</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">autodiacsens</code></span></dt>
|
|
|
|
<dd>
|
|
                    <p>If the index is not stripped, decide if we
|
|
automatically trigger diacritics sensitivity if
|
|
the search term has accented characters (not in
|
|
<code class="literal">unac_except_trans</code>).
|
|
Else you need to use the query language and the
|
|
<code class="literal">D</code> modifier to
|
|
specify diacritics sensitivity. Default is
|
|
no.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">autocasesens</code></span></dt>
|
|
|
|
<dd>
|
|
                    <p>If the index is not stripped, decide if we
|
|
automatically trigger character case sensitivity
|
|
if the search term has upper-case characters in
|
|
any but the first position. Else you need to use
|
|
the query language and the <code class=
|
|
"literal">C</code> modifier to specify
|
|
character-case sensitivity. Default is yes.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">loglevel,daemloglevel</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Verbosity level for recoll and recollindex. A
|
|
value of 4 lists quite a lot of debug/information
|
|
messages. 2 only lists errors. The <code class=
|
|
                    "literal">daem</code> version is specific to the
|
|
indexing monitor daemon.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">logfilename,
|
|
daemlogfilename</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Where the messages should go. 'stderr' can be
|
|
used as a special value, and is the default. The
|
|
                    <code class="literal">daem</code> version is
|
|
specific to the indexing monitor daemon.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">checkneedretryindexscript</code></span></dt>
|
|
|
|
<dd>
|
|
<p>This defines the name for a command executed
|
|
by <span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
when starting indexing. If the exit status of the
|
|
command is 0, <span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
                    retries indexing all files which previously could
|
|
not be indexed because of data extraction errors.
|
|
The default value is a script which checks if any
|
|
of the common <code class="filename">bin</code>
|
|
directories have changed (indicating that a
|
|
helper program may have been installed).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">mondelaypatterns</code></span></dt>
|
|
|
|
<dd>
|
|
                    <p>This allows specifying wildcard path patterns
|
|
(processed with fnmatch(3) with 0 flag), to match
|
|
files which change too often and for which a
|
|
delay should be observed before re-indexing. This
|
|
is a space-separated list, each entry being a
|
|
pattern and a time in seconds, separated by a
|
|
colon. You can use double quotes if a path entry
|
|
contains white space. Example:</p>
|
|
<pre class="programlisting">
|
|
mondelaypatterns = *.log:20 "this one has spaces*:10"
|
|
|
|
</pre>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">monixinterval</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Minimum interval (seconds) for processing the
|
|
indexing queue. The real time monitor does not
|
|
process each event when it comes in, but will
|
|
wait this time for the queue to accumulate to
|
|
diminish overhead and in order to aggregate
|
|
multiple events to the same file. Default 30
|
|
                    seconds.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">monauxinterval</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Period (in seconds) at which the real time
|
|
monitor will regenerate the auxiliary databases
|
|
(spelling, stemming) if needed. The default is
|
|
one hour.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">monioniceclass,
|
|
monioniceclassdata</code></span></dt>
|
|
|
|
<dd>
|
|
<p>These allow defining the <span class=
|
|
"application">ionice</span> class and data used
|
|
by the indexer (default class 3, no data).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">filtermaxseconds</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Maximum handler execution time, after which it
|
|
                    is aborted. Some PostScript programs just
|
|
loop...</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">filtermaxmbytes</code></span></dt>
|
|
|
|
<dd>
|
|
<p><span class="application">Recoll</span> 1.20.7
|
|
and later. Maximum handler memory utilisation.
|
|
This uses setrlimit(RLIMIT_AS) on most systems
|
|
(total virtual memory space size limit). Some
|
|
programs may start with 500 MBytes of mapped
|
|
shared libraries, so take this into account when
|
|
choosing a value. The default is a liberal
|
|
2000MB.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">filtersdir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>A directory to search for the external input
|
|
handler scripts used to index some types of
|
|
files. The value should not be changed, except if
|
|
you want to modify one of the default scripts.
|
|
The value can be redefined for any
|
|
sub-directory.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">iconsdir</code></span></dt>
|
|
|
|
<dd>
|
|
<p>The name of the directory where <span class=
|
|
"command"><strong>recoll</strong></span> result
|
|
list icons are stored. You can change this if you
|
|
want different images.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">idxabsmlen</code></span></dt>
|
|
|
|
<dd>
|
|
<p><span class="application">Recoll</span> stores
|
|
an abstract for each indexed file inside the
|
|
database. The text can come from an actual
|
|
'abstract' section in the document or will just
|
|
be the beginning of the document. It is stored in
|
|
the index so that it can be displayed inside the
|
|
result lists without decoding the original file.
|
|
The <code class="varname">idxabsmlen</code>
|
|
parameter defines the size of the stored
|
|
abstract. The default value is 250 bytes. The
|
|
search interface gives you the choice to display
|
|
this stored text or a synthetic abstract built by
|
|
extracting text around the search terms. If you
|
|
always prefer the synthetic abstract, you can
|
|
reduce this value and save a little space.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">idxmetastoredlen</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Maximum stored length for metadata fields.
|
|
This does not affect indexing (the whole field is
|
|
processed anyway), just the amount of data stored
|
|
in the index for the purpose of displaying fields
|
|
inside result lists or previews. The default
|
|
value is 150 bytes which may be too low if you
|
|
have custom fields.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">aspellLanguage</code></span></dt>
|
|
|
|
<dd>
|
|
<p>Language definitions to use when creating the
|
|
aspell dictionary. The value must match a set of
|
|
aspell language definition files. You can type
|
|
"aspell config" to see where these are installed
|
|
(look for data-dir). The default if the variable
|
|
is not set is to use your desktop national
|
|
language environment to guess the value.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">noaspell</code></span></dt>
|
|
|
|
<dd>
|
|
<p>If this is set, the aspell dictionary
|
|
generation is turned off. Useful for cases where
|
|
you don't need the functionality or when it is
|
|
unusable because aspell crashes during dictionary
|
|
generation.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term"><code class=
|
|
"varname">mhmboxquirks</code></span></dt>
|
|
|
|
<dd>
|
|
                    <p>This allows defining location-related quirks
|
|
for the mailbox handler. Currently only the
|
|
<code class="literal">tbird</code> flag is
|
|
defined, and it should be set for directories
|
|
which hold <span class=
|
|
"application">Thunderbird</span> data, as their
|
|
folder format is weird. Example:</p>
|
|
<pre class="programlisting">
|
|
[/path/to/my/mozilla/mail]
|
|
mhmboxquirks = tbird
|
|
</pre>
|
|
|
|
<p>It should be noted that later <span class=
|
|
"application">Recoll</span> versions have
|
|
improved automatic detection of <span class=
|
|
"application">Thunderbird</span> folders, so that
|
|
this should not be needed at all in most
|
|
cases.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
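
              <p>The fragment below sketches how a few of these
              parameters could be combined in a personal <code class=
              "filename">recoll.conf</code>. The log file paths and all
              numeric values are hypothetical and only meant to show
              the syntax.</p>
<pre class="programlisting">
# Verbose messages for recoll and recollindex, in a file instead of stderr
loglevel = 4
logfilename = /tmp/recoll.log
# Errors only for the indexing monitor daemon
daemloglevel = 2
daemlogfilename = /tmp/recollmonitor.log
# Let the real time monitor accumulate events for one minute
monixinterval = 60
# Abort input handlers running for more than 5 minutes or using more than 1000 MB
filtermaxseconds = 300
filtermaxmbytes = 1000
</pre>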
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.FIELDS" id=
|
|
"RCL.INSTALL.CONFIG.FIELDS"></a>5.4.3. The
|
|
fields file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>This file contains information about dynamic fields
|
|
handling in <span class="application">Recoll</span>. Some
|
|
very basic fields have hard-wired behaviour, and, mostly,
|
|
you should not change the original data inside the
|
|
<code class="filename">fields</code> file. But you can
|
|
create custom fields fitting your data and handle them
|
|
            just as if they were native ones.</p>
|
|
|
|
<p>The <code class="filename">fields</code> file has
|
|
several sections, which each define an aspect of fields
|
|
processing. Quite often, you'll have to modify several
|
|
sections to obtain the desired behaviour.</p>
|
|
|
|
            <p>We will only give a short description here; you should
|
|
refer to the comments inside the default file for more
|
|
detailed information.</p>
|
|
|
|
<p>Field names should be lowercase alphabetic ASCII.</p>
|
|
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">[prefixes]</span></dt>
|
|
|
|
<dd>
|
|
<p>A field becomes indexed (searchable) by having a
|
|
prefix defined in this section.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">[stored]</span></dt>
|
|
|
|
<dd>
|
|
<p>A field becomes stored (displayable inside
|
|
results) by having its name listed in this section
|
|
(typically with an empty value).</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">[aliases]</span></dt>
|
|
|
|
<dd>
|
|
<p>This section defines lists of synonyms for the
|
|
canonical names used inside the <code class=
|
|
"literal">[prefixes]</code> and <code class=
|
|
                "literal">[stored]</code> sections.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">[queryaliases]</span></dt>
|
|
|
|
<dd>
|
|
<p>This section also defines aliases for the
|
|
                canonical field names, with the difference that the
|
|
substitution will only be used at query time,
|
|
avoiding any possibility that the value would
|
|
                pick up random metadata from documents.</p>
|
|
</dd>
|
|
|
|
<dt><span class="term">handler-specific
|
|
sections</span></dt>
|
|
|
|
<dd>
|
|
<p>Some input handlers may need specific
|
|
configuration for handling fields. Only the email
|
|
message handler currently has such a section (named
|
|
<code class="literal">[mail]</code>). It allows
|
|
indexing arbitrary email headers in addition to the
|
|
ones indexed by default. Other such sections may
|
|
appear in the future.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
|
|
<p>Here follows a small example of a personal
|
|
<code class="filename">fields</code> file. This would
|
|
extract a specific email header and use it as a
|
|
searchable field, with data displayable inside result
|
|
lists. (Side note: as the email handler does no decoding
|
|
            on the values, only plain ASCII headers can be indexed,
|
|
and only the first occurrence will be used for headers
|
|
that occur several times).</p>
|
|
<pre class="programlisting">
|
|
[prefixes]
|
|
# Index mailmytag contents (with the given prefix)
|
|
mailmytag = XMTAG
|
|
|
|
[stored]
|
|
# Store mailmytag inside the document data record (so that it can be
|
|
# displayed - as %(mailmytag) - in result lists).
|
|
mailmytag =
|
|
|
|
[queryaliases]
|
|
filename = fn
|
|
containerfilename = cfn
|
|
|
|
[mail]
|
|
# Extract the X-My-Tag mail header, and use it internally with the
|
|
# mailmytag field name
|
|
x-my-tag = mailmytag
|
|
</pre>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.FIELDS.XATTR" id=
|
|
"RCL.INSTALL.CONFIG.FIELDS.XATTR"></a>5.4.3.1. Extended
|
|
attributes in the fields file</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><span class="application">Recoll</span> versions
|
|
1.19 and later process user extended file attributes as
|
|
              document fields by default.</p>
|
|
|
|
<p>Attributes are processed as fields of the same name,
|
|
after removing the <code class="literal">user</code>
|
|
prefix on Linux.</p>
|
|
|
|
<p>The <code class="literal">[xattrtofields]</code>
|
|
section of the <code class="filename">fields</code>
|
|
file allows specifying translations from extended
|
|
              attribute names to <span class=
|
|
"application">Recoll</span> field names. An empty
|
|
translation disables use of the corresponding attribute
|
|
data.</p>
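
              <p>As a minimal sketch, the following section would map
              one attribute to a field and disable another. The
              attribute names <code class="literal">project</code> and
              <code class="literal">scratch</code> are hypothetical
              (they would be stored as <code class=
              "literal">user.project</code> and <code class=
              "literal">user.scratch</code> on Linux), and the use of
              <code class="literal">keywords</code> as the target field
              is an assumption.</p>
<pre class="programlisting">
[xattrtofields]
# Use the contents of the (hypothetical) project attribute as keywords
project = keywords
# Empty translation: ignore the (hypothetical) scratch attribute
scratch =
</pre>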
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.MIMEMAP" id=
|
|
"RCL.INSTALL.CONFIG.MIMEMAP"></a>5.4.4. The
|
|
mimemap file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><code class="filename">mimemap</code> specifies the
|
|
file name extension to MIME type mappings.</p>
|
|
|
|
<p>For file names without an extension, or with an
|
|
unknown one, the system's <span class=
|
|
"command"><strong>file</strong></span> <code class=
|
|
"option">-i</code> command will be executed to determine
|
|
the MIME type (this can be switched off inside the main
|
|
configuration file).</p>
|
|
|
|
<p>The mappings can be specified on a per-subtree basis,
|
|
which may be useful in some cases. Example: <span class=
|
|
"application">gaim</span> logs have a <code class=
|
|
"filename">.txt</code> extension but should be handled
|
|
specially, which is possible because they are usually all
|
|
located in one place.</p>
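
            <p>Such a per-subtree mapping could look like the
            following sketch. Both the directory and the MIME type
            name are assumptions made for illustration; check the
            default <code class="filename">mimemap</code> for the
            values actually used.</p>
<pre class="programlisting">
# Only files under this directory get the special type for .txt
[~/.gaim]
.txt = text/x-gaim-log
</pre>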
|
|
|
|
<p>The <code class="varname">recoll_noindex</code>
|
|
<code class="filename">mimemap</code> variable has been
|
|
moved to <code class="filename">recoll.conf</code> and
|
|
renamed to <code class=
|
|
"varname">noContentSuffixes</code>, while keeping the
|
|
same function, as of <span class=
|
|
"application">Recoll</span> version 1.21. For older
|
|
<span class="application">Recoll</span> versions, see the
|
|
documentation for <code class=
|
|
"varname">noContentSuffixes</code> but use <code class=
|
|
"varname">recoll_noindex</code> in <code class=
|
|
"filename">mimemap</code>.</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.MIMECONF" id=
|
|
"RCL.INSTALL.CONFIG.MIMECONF"></a>5.4.5. The
|
|
mimeconf file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><code class="filename">mimeconf</code> specifies how
|
|
the different MIME types are handled for indexing, and
|
|
which icons are displayed in the <span class=
|
|
"command"><strong>recoll</strong></span> result
|
|
lists.</p>
|
|
|
|
<p>Changing the parameters in the [index] section is
|
|
probably not a good idea except if you are a <span class=
|
|
"application">Recoll</span> developer.</p>
|
|
|
|
<p>The [icons] section allows you to change the icons
|
|
which are displayed by <span class=
|
|
"command"><strong>recoll</strong></span> in the result
|
|
            lists (the values are the basenames of the PNG images
|
|
inside the <code class="filename">iconsdir</code>
|
|
directory (specified in <code class=
|
|
            "filename">recoll.conf</code>)).</p>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.MIMEVIEW" id=
|
|
"RCL.INSTALL.CONFIG.MIMEVIEW"></a>5.4.6. The
|
|
mimeview file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><code class="filename">mimeview</code> specifies which
|
|
programs are started when you click on an <span class=
|
|
            "guilabel">Open</span> link in a result list. For example, HTML is
|
|
normally displayed using <span class=
|
|
"application">firefox</span>, but you may prefer
|
|
<span class="application">Konqueror</span>, your
|
|
<span class="application">openoffice.org</span> program
|
|
might be named <span class=
|
|
            "command"><strong>ooffice</strong></span> instead of
|
|
<span class="command"><strong>openoffice</strong></span>
|
|
etc.</p>
|
|
|
|
<p>Changes to this file can be done by direct editing, or
|
|
through the <span class=
|
|
"command"><strong>recoll</strong></span> GUI preferences
|
|
dialog.</p>
|
|
|
|
<p>If <span class="guilabel">Use desktop preferences to
|
|
choose document editor</span> is checked in the
|
|
<span class="application">Recoll</span> GUI preferences,
|
|
all <code class="filename">mimeview</code> entries will
|
|
be ignored except the one labelled <code class=
|
|
"literal">application/x-all</code> (which is set to use
|
|
<span class="command"><strong>xdg-open</strong></span> by
|
|
default).</p>
|
|
|
|
<p>In this case, the <code class=
|
|
"literal">xallexcepts</code> top level variable defines a
|
|
list of MIME type exceptions which will be processed
|
|
according to the local entries instead of being passed to
|
|
the desktop. This is so that specific <span class=
|
|
"application">Recoll</span> options such as a page number
|
|
or a search string can be passed to applications that
|
|
support them, such as the <span class=
|
|
"application">evince</span> viewer.</p>
|
|
|
|
<p>As for the other configuration files, the normal usage
|
|
is to have a <code class="filename">mimeview</code>
|
|
inside your own configuration directory, with just the
|
|
non-default entries, which will override those from the
|
|
central configuration file.</p>
|
|
|
|
<p>All viewer definition entries must be placed under a
|
|
<code class="literal">[view]</code> section.</p>
|
|
|
|
<p>The keys in the file are normally MIME types. You can
|
|
add an application tag to specialize the choice for an
|
|
area of the filesystem (using a <code class=
|
|
"varname">localfields</code> specification in
|
|
<code class="filename">mimeconf</code>). The syntax for
|
|
the key is <em class=
|
|
"replaceable"><code>mimetype</code></em><code class=
|
|
"literal">|</code><em class=
|
|
"replaceable"><code>tag</code></em></p>
|
|
|
|
<p>The <code class="varname">nouncompforviewmts</code>
|
|
entry, (placed at the top level, outside of the
|
|
<code class="literal">[view]</code> section), holds a
|
|
list of MIME types that should not be uncompressed before
|
|
            starting the viewer (if they are found compressed, e.g.
|
|
<em class=
|
|
"replaceable"><code>mydoc.doc.gz</code></em>).</p>
|
|
|
|
<p>The right side of each assignment holds a command to
|
|
be executed for opening the file. The following
|
|
substitutions are performed:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><b>%D. </b>Document date</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%f. </b>File name. This may be the name
|
|
of a temporary file if it was necessary to create
|
|
                one (e.g. to extract a subdocument from a
|
|
container).</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%i. </b>Internal path, for subdocuments
|
|
of containers. The format depends on the container
|
|
type. If this appears in the command line,
|
|
<span class="application">Recoll</span> will not
|
|
create a temporary file to extract the subdocument,
|
|
expecting the called application (possibly a
|
|
script) to be able to handle it.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%M. </b>MIME type</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%p. </b>Page index. Only significant for
|
|
a subset of document types, currently only PDF,
|
|
                PostScript and DVI files. Can be used to start the
|
|
editor at the right page for a match or
|
|
snippet.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p><b>%s. </b>Search term. The value will only
|
|
                be set for documents with indexed page numbers (e.g.
|
|
PDF). The value will be one of the matched search
|
|
terms. It would allow pre-setting the value in the
|
|
"Find" entry inside Evince for example, for easy
|
|
highlighting of the term.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
                <p><b>%u. </b>URL.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>In addition to the predefined values above, all
|
|
strings like <code class="literal">%(fieldname)</code>
|
|
will be replaced by the value of the field named
|
|
<code class="literal">fieldname</code> for the document.
|
|
This could be used in combination with field
|
|
customisation to help with opening the document.</p>
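
            <p>The fragment below is a sketch of a personal
            <code class="filename">mimeview</code> using some of these
            substitutions. It assumes that your <span class=
            "application">evince</span> version accepts the
            <code class="option">--page-index</code> and <code class=
            "option">--find</code> options, and that <span class=
            "application">firefox</span> is installed; adjust the
            commands to your own viewers.</p>
<pre class="programlisting">
# Top level: keep the local entry for PDF even when desktop preferences are used
xallexcepts = application/pdf

[view]
# Open PDF files at the matching page, with the search term preset
application/pdf = evince --page-index=%p --find=%s %f
# Pass an URL instead of a file name
text/html = firefox %u
</pre>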
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.PTRANS" id=
|
|
"RCL.INSTALL.CONFIG.PTRANS"></a>5.4.7. The
|
|
<code class="filename">ptrans</code> file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p><code class="filename">ptrans</code> specifies
|
|
query-time path translations. These can be useful in
|
|
<a class="link" href="#RCL.SEARCH.PTRANS" title=
|
|
"3.5. Path translations">multiple cases</a>.</p>
|
|
|
|
<p>The file has a section for any index which needs
|
|
translations, either the main one or additional query
|
|
indexes. The sections are named with the <span class=
|
|
"application">Xapian</span> index directory names. No
|
|
slash character should exist at the end of the paths (all
|
|
            comparisons are textual). An example should make things
|
|
            sufficiently clear:</p>
|
|
<pre class="programlisting">
|
|
[/home/me/.recoll/xapiandb]
|
|
/this/directory/moved = /to/this/place
|
|
|
|
[/path/to/additional/xapiandb]
|
|
/server/volume1/docdir = /net/server/volume1/docdir
|
|
/server/volume2/docdir = /net/server/volume2/docdir
|
|
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES" id=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES"></a>5.4.8. Examples
|
|
of configuration adjustments</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW" id=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW"></a>5.4.8.1. Adding
|
|
                    an external viewer for a non-indexed type</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Imagine that you have some kind of file which does
|
|
not have indexable content, but for which you would
|
|
like to have a functional <span class=
|
|
"guilabel">Open</span> link in the result list (when
|
|
found by file name). The file names end in <em class=
|
|
"replaceable"><code>.blob</code></em> and can be
|
|
displayed by application <em class=
|
|
"replaceable"><code>blobviewer</code></em>.</p>
|
|
|
|
<p>You need two entries in the configuration files for
|
|
this to work:</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>In <code class=
|
|
"filename">$RECOLL_CONFDIR/mimemap</code>
|
|
(typically <code class=
|
|
"filename">~/.recoll/mimemap</code>), add the
|
|
following line:</p>
|
|
<pre class="programlisting">
|
|
.blob = application/x-blobapp
|
|
</pre>
|
|
|
|
<p>Note that the MIME type is made up here, and
|
|
you could call it <em class=
|
|
"replaceable"><code>diesel/oil</code></em> just
|
|
the same.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>In <code class=
|
|
"filename">$RECOLL_CONFDIR/mimeview</code> under
|
|
the <code class="literal">[view]</code> section,
|
|
add:</p>
|
|
<pre class="programlisting">
|
|
application/x-blobapp = blobviewer %f
|
|
</pre>
|
|
|
|
<p>We are supposing that <em class=
|
|
"replaceable"><code>blobviewer</code></em> wants
|
|
                  a file name parameter here; you would use
|
|
<code class="literal">%u</code> if it liked URLs
|
|
better.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>If you just wanted to change the application used by
|
|
<span class="application">Recoll</span> to display a
|
|
MIME type which it already knows, you would just need
|
|
to edit <code class="filename">mimeview</code>. The
|
|
entries you add in your personal file override those in
|
|
the central configuration, which you do not need to
|
|
alter. <code class="filename">mimeview</code> can also
|
|
              be modified from the GUI.</p>
|
|
</div>
|
|
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX" id=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX"></a>5.4.8.2. Adding
|
|
indexing support for a new file type</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
<p>Let us now imagine that the above <em class=
|
|
"replaceable"><code>.blob</code></em> files actually
|
|
contain indexable text and that you know how to extract
|
|
it with a command line program. Getting <span class=
|
|
"application">Recoll</span> to index the files is easy.
|
|
You need to perform the above alteration, and also to
|
|
add data to the <code class="filename">mimeconf</code>
|
|
file (typically in <code class=
|
|
"filename">~/.recoll/mimeconf</code>):</p>
|
|
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Under the <code class="literal">[index]</code>
|
|
section, add the following line (more about the
|
|
<em class="replaceable"><code>rclblob</code></em>
|
|
indexing script later):</p>
|
|
<pre class="programlisting">
|
|
application/x-blobapp = exec rclblob
|
|
</pre>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Under the <code class="literal">[icons]</code>
|
|
section, you should choose an icon to be
|
|
displayed for the files inside the result lists.
|
|
Icons are normally 64x64 pixels PNG files which
|
|
live in <code class=
|
|
"filename">/usr/share/recoll/images</code>.</p>
|
|
</li>
|
|
|
|
<li class="listitem">
|
|
<p>Under the <code class=
|
|
"literal">[categories]</code> section, you should
|
|
add the MIME type where it makes sense (you can
|
|
also create a category). Categories may be used
|
|
for filtering in advanced search.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
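
              <p>Put together, the first two additions to the personal
              <code class="filename">mimeconf</code> could look like
              the following sketch. The <code class=
              "literal">blob</code> icon name is hypothetical and
              assumes that a matching <code class=
              "filename">blob.png</code> image exists in the icons
              directory.</p>
<pre class="programlisting">
[index]
# Run the rclblob handler to extract text from .blob files
application/x-blobapp = exec rclblob

[icons]
# Basename of the PNG image shown in result lists (hypothetical)
application/x-blobapp = blob
</pre>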
|
|
|
|
<p>The <em class=
|
|
"replaceable"><code>rclblob</code></em> handler should
|
|
be an executable program or script which exists inside
|
|
<code class=
|
|
"filename">/usr/share/recoll/filters</code>. It will be
|
|
given a file name as argument and should output the
|
|
              text or HTML contents on the standard output.</p>
|
|
|
|
<p>The <a class="link" href="#RCL.PROGRAM.FILTERS"
|
|
title=
|
|
"4.1. Writing a document input handler">filter
|
|
programming</a> section describes in more detail how to
|
|
write an input handler.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</body>
|
|
</html>
|