<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
|
|
<html>
|
|
<head>
|
|
<title>RECOLL: a personal text search system for
|
|
Unix/Linux</title>
|
|
<meta name="generator" content="HTML Tidy, see www.w3.org">
|
|
<meta name="Author" content="Jean-Francois Dockes">
|
|
<meta name="Description" content=
|
|
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
|
|
<meta name="Keywords" content=
|
|
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
|
|
<meta http-equiv="Content-language" content="en">
|
|
<meta http-equiv="content-type" content=
|
|
"text/html; charset=iso-8859-1">
|
|
<meta name="robots" content="All,Index,Follow">
|
|
<link type="text/css" rel="stylesheet" href="styles/style.css">
|
|
</head>
|
|
|
|
<body>
|
|
<div class="rightlinks">
|
|
<ul>
|
|
<li><a href="index.html">Home</a></li>
|
|
|
|
<li><a href="pics/index.html">Screenshots</a></li>
|
|
|
|
<li><a href="download.html">Downloads</a></li>
|
|
|
|
<li><a href="doc.html">Documentation</a></li>
|
|
|
|
<li><a href="support.html">Support</a></li>
|
|
|
|
<li><a href="devel.html">Development</a></li>
|
|
</ul>
|
|
</div>
|
|
|
|
<div class="content">
|
|
<h1>Recoll features</h1>
|
|
|
|
<div class="intrapage">
|
|
<table width="100%">
|
|
<tbody>
|
|
<tr>
|
|
<td><a href="#systems">Supported systems</a></td>
|
|
<td><a href="#doctypes">Document types</a></td>
|
|
<td><a href="#other">Other features</a></td>
|
|
<td><a href="#integration">Desktop and web integration</a></td>
|
|
<td><a href="#stemming">Stemming</a></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
|
|
<h2><a name="general">General features</a></h2>
|
|
<ul>
|
|
<li>Easy installation, few dependencies. No database daemon,
web server, desktop environment or exotic language necessary.</li>
|
|
<li>Will run on most Unix-based <a href="features.html#systems">
|
|
systems</a>, and on MS-Windows too.</li>
|
|
<li>Qt 4 GUI, plus command line, Unity Lens, KIO and krunner
|
|
interfaces.</li>
|
|
|
|
<li>Searches most common
|
|
<a href="features.html#doctypes">document types</a>, emails and
|
|
their attachments. Transparently handles decompression
|
|
(gzip, bzip2).</li>
|
|
|
|
<li>Powerful query facilities, with boolean searches,
|
|
phrases, proximity, wildcards, filter on file types and directory
|
|
tree.</li>
|
|
|
|
<li>Multi-language and multi-character set with Unicode based
|
|
internals.</li>
|
|
|
|
<li>Extensive documentation, with a
|
|
complete <a href="usermanual/usermanual.html">user
|
|
manual</a> and manual pages for each command.</li>
|
|
|
|
</ul>
|
|
|
|
<h2><a name="systems">Supported systems</a></h2>
|
|
|
|
<p><span class="application">Recoll</span> has been compiled and
|
|
tested on Linux, MS-Windows 7-10, MacOS X and Solaris (initial
|
|
versions Redhat 7, Fedora Core 5, Suse 10, Gentoo, Debian 3.1,
|
|
Solaris 8). It should compile and run on all subsequent releases
|
|
of these systems and probably a few others too.</p>
|
|
|
|
<p>Qt versions 4.7 and later are supported.</p>
|
|
|
|
<h2><a name="doctypes">Document types</a></h2>
|
|
|
|
<p><span class="application">Recoll</span> can index many document
|
|
types (along with their compressed versions). Some types are
|
|
handled internally (no external application needed). Other types
|
|
need a separate application to be installed to extract the
|
|
text. Types that only need very common utilities
|
|
(awk/sed/groff/Python etc.) are listed in the native section.</p>
|
|
|
|
<p>The MS-Windows installer includes the supporting applications;
the only additional package you will need is a Python
installation.</p>
|
|
|
|
<p>Many formats are processed
|
|
by <span class="application">Python</span> scripts. The Python
|
|
dependency will not always be mentioned. In general, Recoll
|
|
expects Python 2.x to be available (many, but not all, scripts
|
|
are compatible with Python 3). Formats which are processed
|
|
using <span class="application">Python</span> and its standard
|
|
library are listed in the <i>native</i> section.</p>
|
|
|
|
<h4>File types indexed natively</h4>
|
|
|
|
<ul>
|
|
<li><span class="application">text</span>.</li>
|
|
<li><span class="application">html</span>.</li>
|
|
<li><span class="application">maildir</span>,
|
|
<span class="application">mh</span>, and
|
|
<span class="application">mailbox</span> (
|
|
<span class="application">Mozilla</span>,
|
|
<span class="application">Thunderbird</span> and
|
|
<span class="application">Evolution</span> mail ok).
|
|
<em><b>Evolution note</b>: be sure to remove <tt>.cache</tt> from
|
|
the <tt>skippedNames</tt> list in the GUI <tt>Indexing
|
|
preferences/Local Parameters/</tt> pane if you want to
|
|
index local copies of Imap mail.</em>
|
|
</li>
|
|
|
|
<li><span class="application">gaim</span> and
|
|
<span class="application">purple</span> log files.</li>
|
|
|
|
<li><span class="application">Scribus</span> files.</li>
|
|
|
|
<li><span class="application">Man pages</span> (needs
|
|
<span class="application">groff</span>).</li>
|
|
|
|
<li><span class="application">Dia</span> diagrams.</li>
|
|
<li><span class="application">Excel</span>
|
|
and <span class="application">Powerpoint</span>
|
|
for <span class="application">Recoll</span> versions 1.19.12
|
|
and later.</li>
|
|
|
|
<li><span class="application">Tar</span> archives. Tar file
|
|
indexing is disabled by default (because tar archives don't
typically contain the kind of documents that people search
for), so you will need to enable it explicitly, for example with the
following in your
|
|
<span class="filename">$HOME/.recoll/mimeconf</span> file:
|
|
<pre>
[index]
application/x-tar = execm rcltar
</pre>
|
|
</li>
|
|
|
|
<li><span class="application">Zip</span> archives.</li>
|
|
<li><span class="application">Konqueror webarchive</span>
|
|
format with Python (uses the <tt>tarfile</tt> standard
|
|
library module).</li>
|
|
|
|
<li><span class="application">Mimehtml web archive
|
|
format</span> (support based on the mail
|
|
filter, which introduces some mild weirdness, but still
|
|
usable).</li>
|
|
</ul>
|
|
|
|
|
|
|
|
<h4>File types indexed with external helpers</h4>
|
|
|
|
<p>Many document types need the <span class="command">iconv</span>
|
|
command in addition to the applications specifically listed.</p>
|
|
|
|
<h5>The XML ones</h5>
|
|
|
|
<p>The following types need <span class="command">
|
|
xsltproc</span> from the <b>libxslt</b> package for recoll
|
|
versions before 1.22, and in addition, python-libxslt1 and
|
|
python-libxml2 for 1.22 and newer.
|
|
Quite a few also need <span class="command">unzip</span>:</p>
|
|
|
|
<ul>
|
|
<li><span class="application">Abiword</span> files.</li>
|
|
|
|
<li><span class="application">Fb2</span> ebooks.</li>
|
|
|
|
<li><span class="application">Kword</span> files.</li>
|
|
|
|
<li><span class="application">Microsoft Office Open XML</span>
|
|
files.</li>
|
|
|
|
<li><span class="application">OpenOffice</span> files.</li>
|
|
|
|
<li><span class="application">SVG</span> files.</li>
|
|
<li><span class="application">Gnumeric</span> files.</li>
|
|
<li><span class="application">Okular</span> annotations files.</li>
|
|
|
|
</ul>
|
|
|
|
<h5>Other formats</h5>
|
|
|
|
<p>The following need miscellaneous helper programs to decode
|
|
the internal formats.</p>
|
|
|
|
<ul>
|
|
<li><span class="application">pdf</span> with the <span class=
|
|
"command">pdftotext</span> command, which comes with
|
|
<a href="http://poppler.freedesktop.org/">poppler</a>,
|
|
(the package name is quite often <tt>poppler-utils</tt>). <br/>
|
|
Note: the older <span class="command">pdftotext</span> command
which comes with <span class="application">xpdf</span> is
not compatible with <span class="application">Recoll</span>.<br/>
|
|
|
|
<em>New in 1.21</em>: if the <span class="application">
|
|
tesseract</span> OCR application, and the
|
|
<span class="command">pdftoppm</span> command are available
|
|
on the system, the <span class="command">rclpdf</span>
|
|
filter has the capability to run OCR. See the comments at
|
|
the top of <span class="command">rclpdf</span> (usually
|
|
found
|
|
in <span class="filename">/usr/share/recoll/filters</span>)
|
|
for how to enable this and configuration details.<br/>
|
|
<em>Opening PDFs at the right page</em>: the default
|
|
configuration uses <span class="command">evince</span>,
|
|
which has options for direct page access and pre-setting the
|
|
search strings (hits will be highlighted). There is an
|
|
example line in the default mimeview for doing the same
|
|
thing with <span class="command">qpdfview</span>
|
|
(<span class="literal">qpdfview --search %s %f#%p</span>).
|
|
Okular does not have a search string option (but it does
|
|
have a page number one).
|
|
</li>
|
|
|
|
<li><span class="application">msword</span> with <a href=
|
|
"http://www.winfield.demon.nl/">antiword</a>. It is also useful to
|
|
have <a href="http://wvware.sourceforge.net/">wvWare</a> installed
|
|
as it may be used as a fallback for some files which antiword
|
|
does not handle.</li>
|
|
|
|
<li><span class="application">Wordperfect</span> with the
|
|
<span class="command">wpd2html</span> command from <a href=
|
|
"http://libwpd.sourceforge.net">libwpd</a>. On some distributions,
|
|
the command may come with a package named <span
class="literal">libwpd-tools</span> or such, not the base <span
class="literal">libwpd</span> package.</li>
|
|
|
|
<li><span class="application">Lyx</span> files (needs
|
|
<span class="application">Lyx</span> to be installed).</li>
|
|
|
|
<li><span class="application">Powerpoint</span> and <span
|
|
class="application">Excel</span> with the <a href=
|
|
"http://vitus.wagner.pp.ru/software/catdoc/">catdoc</a>
|
|
utilities for Recoll versions before 1.19.12. Recoll 1.19.12 and later use
|
|
internal Python filters for Excel and Powerpoint, and catdoc
|
|
is not needed at all (catdoc did not work on many semi-recent
|
|
Excel and Powerpoint files).</li>
|
|
|
|
<li><span class="application">CHM (Microsoft help)</span> files
|
|
with <span class="command">Python,
|
|
<a href="http://gnochm.sourceforge.net/pychm.html">pychm</a>
|
|
and <a href="http://www.jedrea.com/chmlib/">chmlib</a></span>.</li>
|
|
|
|
<li><span class="application">GNU info</span> files
|
|
with <span class="command">Python</span> and the
|
|
<span class="command">info</span> command.</li>
|
|
|
|
<li><span class="application">EPUB</span> files
|
|
with <span class="command">Python</span> and this
|
|
<a href="http://pypi.python.org/pypi/epub/">Python epub</a>
|
|
decoding module, which is packaged on Fedora, but not Debian.</li>
|
|
|
|
<li><span class="application">Rar</span> archives (needs <span
|
|
class="command">Python</span>), the
|
|
<a href="http://pypi.python.org/pypi/rarfile/">rarfile</a> Python
|
|
module and the <a
|
|
href="http://www.rarlab.com/rar_add.htm">unrar</a>
|
|
utility. The Python module is packaged by Fedora, not by Debian.</li>
|
|
|
|
<li><span class="application">7zip</span> archives (needs
|
|
<span class="command">Python</span> and
|
|
the <a href="https://pypi.python.org/pypi/pylzma">pylzma
|
|
module</a>). This is a recent addition, and you need to
|
|
download the filter from
|
|
the <a href="filters/filters.html">filters pages</a> for
|
|
all Recoll versions prior to 1.21.</li>
|
|
|
|
<li><span class="application">iCalendar</span> (.ics) files
|
|
(needs <span class="command">Python, <a href=
|
|
"http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
|
|
|
|
<li><span class="application">Mozilla calendar data</span> See
|
|
<a href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
|
|
the wiki</a> about this.</li>
|
|
|
|
<li><span class="application">postscript</span> with <a href=
|
|
"http://www.gnu.org/software/ghostscript/ghostscript.html">
|
|
ghostscript</a> and <a href=
|
|
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
|
|
Pstotext 1.9 has a serious issue with special characters in
|
|
file names, and you should either use the version packaged for
|
|
your system which is probably patched, or apply the Debian
|
|
patch which is stored <a href=
|
|
"files/pstotext-1.9_4-debian.patch">here</a> for
|
|
convenience. See http://packages.debian.org/squeeze/pstotext
|
|
and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988
|
|
for references/explanations.
|
|
<blockquote>
|
|
To make things a bit easier, I also
|
|
store <a href="files/pstotext-1.9-patched.tar.gz">an
|
|
already patched version</a>. I added an
|
|
install target to the Makefile... This installs to
|
|
/usr/local, use <i>make install PREFIX=/usr</i> to
|
|
change. So all you need is:
|
|
<pre>
tar xvzf pstotext-1.9-patched.tar.gz
cd pstotext-1.9-patched
make
make install
</pre>
|
|
</blockquote>
|
|
</li>
|
|
|
|
|
|
<li><span class="application">RTF</span> files with
|
|
<a href="http://www.gnu.org/software/unrtf/unrtf.html">
|
|
unrtf</a>. Please note that up to version 0.21.3,
|
|
<span class="command">unrtf</span> mostly does not work with
non-Western-European character sets. Many serious problems
(crashes with serious security implications and infinite
loops) were fixed in unrtf 0.21.8, so you really want to use
this or a newer release. Building unrtf from source is quick
and easy.</li>
|
|
|
|
<li><span class="application">TeX</span> with <span class=
|
|
"command">untex</span>. If there is no untex package for
|
|
your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
|
|
source package is stored on this site</a> (as untex has no
|
|
obvious home). Will also work with <a href=
|
|
"http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
|
|
if this is installed.</li>
|
|
|
|
<li><span class="application">dvi</span> with <a href=
|
|
"http://www.radicaleye.com/dvips.html">dvips</a>.</li>
|
|
|
|
<li><span class="application">djvu</span> with <a href=
|
|
"http://djvu.sourceforge.net">DjVuLibre</a>.</li>
|
|
|
|
<li><span class="application">Audio file tags</span>.
|
|
Recoll releases 1.14 and later use a Python filter based
|
|
on <a href="http://code.google.com/p/mutagen/">mutagen</a>
|
|
for all audio types.</li>
|
|
|
|
<li><span class="application">Image file tags</span> with <a href=
|
|
"http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
|
|
This is a Perl program, so you also need Perl on the
system. This works with almost any image file and
tag format (jpg, png, tiff, gif etc.).</li>
|
|
|
|
<li><span class="application">Midi karaoke files</span> with
|
|
Python, the
|
|
<a href="http://pypi.python.org/pypi/midi/0.2.1">
|
|
midi module</a>, and some help
|
|
from <a href="http://chardet.feedparser.org/">chardet</a>. There
|
|
is probably a <tt>python-chardet</tt> package for your distribution,
|
|
but you will quite probably need to build the midi
|
|
package. This is easy but see the <a href="helpernotes.html#midi">
|
|
notes here</a>.
|
|
</li>
|
|
|
|
<li><span class="application">MediaWiki dump files</span>:
|
|
Thomas Levine has written a handler for these; you will find
it here:
|
|
<a href="https://bitbucket.org/tlevine/recoll/src/0127be78bffdd8a294067966a3ba7b2663d7b0cf/src/filters/rclmwdump?at=default&fileviewer=file-view-default">rclmwdump</a>.</li>
|
|
|
|
</ul>
|
|
|
|
<h2><a name="other">Other features</a></h2>
|
|
|
|
<ul>
|
|
<li>Can use a Firefox extension to index visited Web pages
|
|
history. See <a href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/IndexWebHistory">the
|
|
Wiki</a> for more detail.</li>
|
|
|
|
<li>Processes all email attachments, and more generally any
realistic level of container nesting (for example, an msword attachment to
a message inside a mailbox inside a zip archive).</li>
|
|
|
|
<li>Multiple selectable databases.</li>
|
|
|
|
<li>Powerful query facilities, with boolean searches,
|
|
phrases, filter on file types and directory tree.</li>
|
|
|
|
<li>Xesam-compatible query language.</li>
|
|
|
|
<li>Wildcard searches (with a specific and faster function
|
|
for file names).</li>
|
|
|
|
<li>Support for multiple charsets. Internal processing and
|
|
storage uses Unicode UTF-8.</li>
|
|
|
|
<li><a href="#stemming">Stemming</a> performed at query
|
|
time (can switch stemming language after indexing).</li>
|
|
|
|
<li>Easy installation. No database daemon, web server or
|
|
exotic language necessary.</li>
|
|
|
|
<li>An indexer which runs either as a batch, cron'able
|
|
program, or as a real-time indexing daemon, depending on
|
|
preference.</li>
|
|
</ul>
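<p>The container nesting mentioned in the list above can be pictured as a
recursive walk: each container is opened, and each member is examined in
turn until only leaf documents remain. The Python sketch below is an
illustration built on the standard library only, not Recoll's actual
implementation. The <tt>walk</tt> function, the <tt>.eml</tt> naming
convention and the sample data are assumptions made up for this example,
and a single mail message stands in for a full mailbox.</p>

```python
import email
import io
import zipfile
from email.message import EmailMessage


def walk(name, data, path=()):
    """Yield (path, bytes) for every leaf document inside nested containers."""
    here = path + (name,)
    if zipfile.is_zipfile(io.BytesIO(data)):
        # Zip archive: recurse into every member.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                yield from walk(member, archive.read(member), here)
    elif name.endswith(".eml"):
        # Mail message (hypothetical naming convention): recurse into each part.
        message = email.message_from_bytes(data)
        for index, part in enumerate(message.walk()):
            if part.is_multipart():
                continue
            payload = part.get_payload(decode=True) or b""
            part_name = part.get_filename() or "part%d" % index
            yield from walk(part_name, payload, here)
    else:
        # Leaf document: this is what would be handed to a text extractor.
        yield here, data


# Sample data: a zip holding a text file and a message with an attachment.
msg = EmailMessage()
msg["Subject"] = "report"
msg.set_content("see attachment")
msg.add_attachment(b"quarterly figures", maintype="application",
                   subtype="msword", filename="report.doc")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mail/message.eml", msg.as_bytes())
    archive.writestr("notes.txt", "plain notes")

leaves = {"/".join(p): d for p, d in walk("backup.zip", buffer.getvalue())}
for leaf_path in sorted(leaves):
    print(leaf_path)
```

<p>Each leaf is reported with the full chain of containers it was found in,
which is the information a search result needs in order to open the right
sub-document later.</p>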
|
|
|
|
<h2><a name="integration">Desktop and web integration</a></h2>
|
|
|
|
<p>The <span class="application">Recoll</span> GUI has many
|
|
features that help to specify an efficient search and to manage
|
|
the results. However, it may sometimes be preferable to use a
simpler tool that is better integrated with your desktop
interfaces. Several solutions exist:</p>
|
|
<ul>
|
|
<li>The <span class="application">Recoll</span> KIO module
|
|
allows starting queries and viewing results from the
|
|
Konqueror browser or KDE applications <em>Open</em> dialogs.</li>
|
|
<li>The <a href="http://kde-apps.org">recollrunner</a> krunner
|
|
module allows integrating Recoll search results into a
|
|
krunner query.</li>
|
|
<li>The Ubuntu Unity Recoll Lens lets you access Recoll search
|
|
from the Unity Dash. More
|
|
info <a href="https://bitbucket.org/medoc/recoll/wiki/UnityLens">
|
|
here</a>. </li>
|
|
<li>The <a href="http://github.com/medoc92/recoll-webui">Recoll
|
|
Web UI</a> lets you query a Recoll index from a web browser.</li>
|
|
</ul>
|
|
<p>Recoll also has
|
|
<a href="usermanual/usermanual.html#RCL.PROGRAM.PYTHONAPI">
|
|
<span class="application">Python</span></a> and
|
|
<span class="application">PHP</span> modules which can allow
|
|
easy integration with web or other applications.</p>
|
|
|
|
<h2><a name="stemming"></a>Stemming</h2>
|
|
|
|
<p>Stemming is a process which transforms inflected words
|
|
into their most basic form. For example, <i>flooring</i>,
|
|
<i>floors</i>, <i>floored</i> would probably all be
|
|
transformed to <i>floor</i> by a stemmer for the English
|
|
language.</p>
|
|
|
|
<p>In many search engines, the stemming process occurs during
|
|
indexing. The index will only contain the stemmed form of
|
|
words, with exceptions for terms which are detected as being
|
|
probably proper nouns (i.e. capitalized). At query time, the
|
|
terms entered by the user are stemmed, then matched against
|
|
the index.</p>
|
|
|
|
<p>This process results in a smaller index, but it has the
serious drawback of irrevocably losing information during
indexing.</p>
|
|
|
|
<p>Recoll works in a different way. No stemming is performed
at indexing time, so that all information gets into the index.
The resulting index is bigger, but most people probably don't
care much about this nowadays, because they have a 100 GB disk
95% full of binary data <em>which does not get
indexed</em>.</p>
|
|
|
|
<p>At the end of an indexing pass, Recoll builds one or
|
|
several stemming dictionaries, where all word stems are
|
|
listed in correspondence to the list of their
|
|
derivatives.</p>
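<p>As a rough illustration only (this is not Recoll's code), the
dictionary-building step can be sketched in Python. The naive
suffix-stripping <tt>toy_stem</tt> function and the sample term list are
assumptions made up for this example; a real stemmer is considerably more
careful.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical, deliberately naive English stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms):
    """Map each stem to the sorted list of unstemmed index terms that reduce to it."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return {stem: sorted(terms) for stem, terms in stem_db.items()}


# The index itself keeps the full, unstemmed terms.
index_terms = ["floor", "floors", "floored", "flooring", "door", "doors"]
stem_db = build_stem_dictionary(index_terms)
print(stem_db["floor"])
```

<p>Because the index keeps the original terms, such a dictionary can be
rebuilt, or built for another language, without re-reading the
documents.</p>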
|
|
|
|
<p>At query time, by default, user-entered terms are stemmed,
|
|
then matched against the stem database, and the query is
|
|
expanded to include all derivatives. This will yield search
|
|
results analogous to those obtained by a classical engine.
|
|
The benefit of this approach is that stem expansion can be
|
|
controlled instantly at query time in several ways:</p>
|
|
|
|
<ul>
|
|
<li>It can be selectively turned off for any query term by
capitalizing it (<i>Floor</i>).</li>
|
|
|
|
<li>The stemming language (e.g. English, French...) can be
|
|
selected (this supposes that several stemming databases
|
|
have been built, which can be configured as part of the
|
|
indexing, or done later, in a reasonably fast way).</li>
|
|
</ul>
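<p>The two query-time controls listed above can be sketched as follows.
This is a minimal Python illustration under stated assumptions, not
Recoll's implementation: the <tt>STEM_DBS</tt> contents, the
<tt>SUFFIXES</tt> table and the <tt>expand_term</tt> function are invented
for the example, and lowercasing a capitalized term before the literal
lookup is also an assumption.</p>

```python
# Hypothetical per-language stem databases: stem -> derivatives found in the index.
STEM_DBS = {
    "english": {"floor": ["floor", "floored", "flooring", "floors"]},
    "french": {"chant": ["chanter", "chantons", "chantez"]},
}

# Hypothetical suffix lists standing in for real per-language stemmers.
SUFFIXES = {
    "english": ("ing", "ed", "s"),
    "french": ("ons", "ez", "er"),
}


def toy_stem(word, language):
    """Naive stemmer: strip the first matching suffix for the given language."""
    for suffix in SUFFIXES[language]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_term(term, language="english"):
    """Return the list of index terms that a user-entered term should match."""
    if term[:1].isupper():
        # Capitalized term: stem expansion is turned off, match the term itself.
        return [term.lower()]
    stem = toy_stem(term, language)
    # Unknown stems fall back to the term as typed.
    return STEM_DBS[language].get(stem, [term])


print(expand_term("floors"))             # expanded to all derivatives
print(expand_term("Floor"))              # no expansion
print(expand_term("chantez", "french"))  # expansion with another stemming language
```

<p>Since expansion happens entirely at query time, switching the language
or disabling expansion for one term needs no change to the index.</p>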
|
|
</div>
|
|
</body>
|
|
</html>