407 lines
16 KiB
HTML
407 lines
16 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
|
|
<html>
|
|
<head>
|
|
<title>RECOLL: a personal text search system for
|
|
Unix/Linux</title>
|
|
<meta name="generator" content="HTML Tidy, see www.w3.org">
|
|
<meta name="Author" content="Jean-Francois Dockes">
|
|
<meta name="Description" content=
|
|
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
|
|
<meta name="Keywords" content=
|
|
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
|
|
<meta http-equiv="Content-language" content="en">
|
|
<meta http-equiv="content-type" content=
|
|
"text/html; charset=iso-8859-1">
|
|
<meta name="robots" content="All,Index,Follow">
|
|
<link type="text/css" rel="stylesheet" href="styles/style.css">
|
|
</head>
|
|
|
|
<body>
|
|
<div class="rightlinks">
|
|
<ul>
|
|
<li><a href="index.html">Home</a></li>
|
|
|
|
<li><a href="pics/index.html">Screenshots</a></li>
|
|
|
|
<li><a href="download.html">Downloads</a></li>
|
|
|
|
<li><a href="usermanual/index.html">User manual</a></li>
|
|
|
|
<li><a href="index.html#support">Support</a></li>
|
|
|
|
<li><a href="devel.html">Development</a></li>
|
|
</ul>
|
|
</div>
|
|
|
|
<div class="content">
|
|
<h1 class="intro">Recoll features</h1>
|
|
|
|
<table width=100%>
|
|
<tbody>
|
|
<tr>
|
|
<td><a href="#systems">Supported systems</a></td>
|
|
<td><a href="#doctypes">Document types</a></td>
|
|
<td><a href="#other">Other features</a></td>
|
|
<td><a href="#integration">Desktop and web integration</a></td>
|
|
<td><a href="#stemming">Stemming</a></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<ul>
|
|
<li>Easy installation, few dependancies. No database daemon,
|
|
web server, desktop environment or exotic language necessary.</li>
|
|
<li>Will run on most Unix-based <a href="features.html#systems">
|
|
systems</a></li>
|
|
<li>Qt 4 GUI, plus command line, KIO and krunner interfaces.</li>
|
|
|
|
<li>Searches most common
|
|
<a href="features.html#doctypes">document types</a>, emails and
|
|
their attachments. Transparently handles decompression
|
|
(gzip, bzip2).</li>
|
|
|
|
<li>Powerful query facilities, with boolean searches,
|
|
phrases, proximity, wildcards, filter on file types and directory
|
|
tree.</li>
|
|
|
|
<li>Multi-language and multi-character set with Unicode based
|
|
internals.</li>
|
|
|
|
<li>Extensive documentation, with a
|
|
complete <a href="usermanual/usermanual.html">user
|
|
manual</a> and manual pages for each command.</li>
|
|
|
|
</ul>
|
|
|
|
<h2><a name="systems">Supported systems</a></h2>
|
|
|
|
<p><span class="application">Recoll</span> has been compiled and
|
|
tested on Linux, Darwin and Solaris (initial versions Redhat 7,
|
|
Fedora Core 5, Suse 10, Gentoo, Debian 3.1, Solaris 8). It
|
|
should compile and run on all subsequent releases of these
|
|
systems and probably a few others too.</p>
|
|
|
|
<p>Qt versions from 3.1 to 4.7</p>
|
|
|
|
<h2><a name="doctypes">Document types</a></h2>
|
|
|
|
<p>Recoll can index many document types (along with their
|
|
compressed versions). Some types are handled internally (no
|
|
external application needed). Other types need a separate
|
|
application to be installed to extract the text. Types that
|
|
only need very common utilities (awk/sed/groff/Python etc.)
|
|
are listed in the native section.</p>
|
|
|
|
<h4>File types indexed natively</h4>
|
|
|
|
<ul>
|
|
<li><span class="literal">text</span>.</li>
|
|
|
|
<li><span class="literal">html</span>.</li>
|
|
|
|
<li><span class="literal">maildir</span> and
|
|
<span class="literal">mailbox</span> (
|
|
<span class="literal">Mozilla</span>,
|
|
<span class="literal">Thunderbird</span> and
|
|
<span class="literal">Evolution</span>mail ok).
|
|
</li>
|
|
|
|
<li><span class="literal">gaim</span> and
|
|
<span class="literal">purple</span> log files.</li>
|
|
|
|
<li><span class="literal">Lyx</span> files (needs
|
|
<span class="literal">Lyx</span> to be installed).</li>
|
|
|
|
<li><span class="literal">Scribus</span> files.</li>
|
|
|
|
<li><span class="literal">Man pages</span> (needs
|
|
<span class="command">groff</span>).</li>
|
|
|
|
<li><span class="literal">Dia</span> diagrams.</li>
|
|
</ul>
|
|
|
|
<h4>File types indexed with external helpers</h4>
|
|
|
|
<p>Many document types need the <span class="command">iconv</span>
|
|
command in addition to the applications specifically listed.</p>
|
|
|
|
<h5>The XML ones</h5>
|
|
|
|
<p>The following types need <span class="command">
|
|
xsltproc</span> from the <b>libxslt</b> package.
|
|
Quite a few also need <span class="command">unzip</span>:</p>
|
|
|
|
<ul>
|
|
<li><span class="literal">Abiword</span> files.</li>
|
|
|
|
<li><span class="literal">Fb2</span> ebooks.</li>
|
|
|
|
<li><span class="literal">Kword</span> files.</li>
|
|
|
|
<li><span class="literal">Microsoft Office Open XML</span>
|
|
files.</li>
|
|
|
|
<li><span class="literal">OpenOffice</span> files.</li>
|
|
|
|
<li><span class="literal">SVG</span> files.</li>
|
|
<li><span class="literal">Gnumeric</span> files.</li>
|
|
<li><span class="literal">Okular</span> annotations files.</li>
|
|
|
|
</ul>
|
|
|
|
<h5>Other formats</h5>
|
|
|
|
<p>The following need miscellaneous helper programs to decode
|
|
the internal formats.</p>
|
|
|
|
<ul>
|
|
<li><span class="literal">pdf</span> with the <span class=
|
|
"command">pdftotext</span> command, which can be installed
|
|
as part of <a href="http://www.foolabs.com/xpdf/">xpdf</a>
|
|
or <a href="http://poppler.freedesktop.org/">poppler</a>,
|
|
depending on your distribution.</li>
|
|
|
|
<li><span class="literal">msword</span> with <a href=
|
|
"http://www.winfield.demon.nl/">antiword</a>. It is also useful to
|
|
have <a href="http://wvware.sourceforge.net/">wvWare</a> installed
|
|
as it may be be used as a fallback for some files which antiword
|
|
does not handle.</li>
|
|
|
|
<li><span class="literal">Powerpoint</span> and <span
|
|
class="literal">Excel</span> with the <a href=
|
|
"http://vitus.wagner.pp.ru/software/catdoc/">catdoc</a> utilities.</li>
|
|
|
|
<li><span class="literal">CHM (Microsoft help)</span> files
|
|
with <span class="command">Python,
|
|
<a href="http://gnochm.sourceforge.net/pychm.html">pychm</a>
|
|
and <a href="http://www.jedrea.com/chmlib/">chmlib</a></span>.</li>
|
|
|
|
<li><span class="literal">GNU info</span> files
|
|
with <span class="command">Python</span> and the
|
|
<span class="command">info</span> command.</li>
|
|
|
|
<li><span class="literal">Zip</span> archives (needs <span
|
|
class="command">Python</span>).</li>
|
|
|
|
<li><span class="literal">Rar</span> archives (needs <span
|
|
class="command">Python</span>), the
|
|
<a href="http://pypi.python.org/pypi/rarfile/">rarfile</a> Python
|
|
module and the <a
|
|
href="http://www.rarlab.com/rar_add.htm">unrar</a> utility.</li>
|
|
|
|
<li><span class="literal">iCalendar</span>(.ics) files
|
|
(needs <span class="command">Python, <a href=
|
|
"http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
|
|
|
|
<li><span class="literal">Mozilla calendar data</span> See
|
|
<a href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
|
|
the wiki</a> about this.</li>
|
|
|
|
<li><span class="literal">Wordperfect</span> with the
|
|
<span class="command">wpd2html</span> command from <a href=
|
|
"http://libwpd.sourceforge.net">libwpd</a>. On some distributions,
|
|
the command may come with an package named <span
|
|
class="literal">libwpd-tools</span> or such, not the base <a
|
|
span="literal">libwpd</a> package.</li>
|
|
|
|
<li><span class="literal">postscript</span> with <a href=
|
|
"http://www.gnu.org/software/ghostscript/ghostscript.html">
|
|
ghostscript</a> and <a href=
|
|
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
|
|
Pstotext 1.9 has a serious issue with special characters in
|
|
file names, and you should either use the version packaged for
|
|
your system which is probably patched, or apply the Debian
|
|
patch which is stored <a href=
|
|
"files/pstotext-1.9_4-debian.patch">here</a> for
|
|
convenience. See http://packages.debian.org/squeeze/pstotext
|
|
and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988
|
|
for references/explanations.
|
|
<blockquote>
|
|
To make things a bit easier, I also
|
|
store <a href="files/pstotext-1.9-patched.tar.gz">an
|
|
already patched version</a>. I added an
|
|
install target to the Makefile... This installs to
|
|
/usr/local, use <i>make install PREFIX=/usr</i> to
|
|
change. So all you need is:
|
|
<pre>
|
|
tar xvzf pstotext-1.9-patched.tar.gz
|
|
cd pstotext-1.9-patched
|
|
make
|
|
make install
|
|
</pre>
|
|
</blockquote>
|
|
</li>
|
|
|
|
|
|
<li><span class="literal">RTF</span> files with <a href=
|
|
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>. Please
|
|
note that up to version
|
|
0.21, <span class="command">unrtf</span> mostly does not work
|
|
with non western-european character sets. If you have a need
|
|
for indexing, ie, russian or chinese RTF files, I have
|
|
produced a modified version which works much better (as
|
|
indicated by my tests and a few external ones). You can
|
|
download the <a href="unrtf/unrtf-0.22.2beta.tar.gz">source
|
|
here</a>. The development is hosted
|
|
on <a href="http://www.bitbucket.org/medoc/unrtf-int">
|
|
bitbucket.org</a>.</li>
|
|
|
|
<li><span class="literal">TeX</span> with <span class=
|
|
"command">untex</span>. If there is no untex package for
|
|
your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
|
|
source package is stored on this site</a> (as untex has no
|
|
obvious home). Will also work with <a href=
|
|
"http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
|
|
if this is installed.</li>
|
|
|
|
<li><span class="literal">dvi</span> with <a href=
|
|
"http://www.radicaleye.com/dvips.html">dvips</a>.</li>
|
|
|
|
<li><span class="literal">djvu</span> with <a href=
|
|
"http://djvu.sourceforge.net">DjVuLibre</a>.</li>
|
|
|
|
<li>Audio file tags: Recoll releases 1.13 and older use <a
|
|
href="http://id3lib.sourceforge.net/">id3info (id3lib)</a>
|
|
(compiling id3lib on recent systems may need a small patch,
|
|
see <a href="id3lib.html">here.</a>) or the ogg and flac
|
|
tools.<br>
|
|
Recoll releases 1.14 and later use a Python filter based
|
|
on <a href="http://code.google.com/p/mutagen/">mutagen</a>
|
|
for all audio types.</li>
|
|
|
|
<li>Image file tags with <a href=
|
|
"http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
|
|
This is a perl program, so you also need perl on the
|
|
system. This works with about any possible image file and
|
|
tag format (jpg, png, tiff, gif etc.).</li>
|
|
|
|
<li>Midi karaoke files with Python, the
|
|
<a href="http://pypi.python.org/pypi/midi/0.2.1">
|
|
midi module</a>, and some help
|
|
from <a href="http://chardet.feedparser.org/">chardet</a>. There
|
|
is probably a <tt>chardet</tt> package for your distribution,
|
|
but you will quite probably need to build the midi
|
|
package. This is easy but see the
|
|
to <a href="helpernotes.html#midi">notes here</a>.
|
|
</li>
|
|
|
|
<li>Konqueror webarchive format with Python (uses the tarfile
|
|
module).</li>
|
|
|
|
<li>mimehtml web archive format (support based on the mail
|
|
filter, which introduces some mild weirdness, but still
|
|
usable).</li>
|
|
|
|
</ul>
|
|
|
|
<h2><a name="other">Other features</a></h2>
|
|
|
|
<ul>
|
|
<li>Can use <b>Beagle</b> browser plug-ins to index web
|
|
history. See the <a href=
|
|
"http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">the
|
|
Wiki</a> for more detail.</li>
|
|
|
|
<li>Processes all email attachments, and more generally any
|
|
realistic level of container imbrication (the "msword attachment to
|
|
a message inside a mailbox in a zip" thingy...) .</li>
|
|
|
|
<li>Multiple selectable databases.</li>
|
|
|
|
<li>Powerful query facilities, with boolean searches,
|
|
phrases, filter on file types and directory tree.</li>
|
|
|
|
<li>Xesam-compatible query language.</li>
|
|
|
|
<li>Wildcard searches (with a specific and faster function
|
|
for file names).</li>
|
|
|
|
<li>Support for multiple charsets. Internal processing and
|
|
storage uses Unicode UTF-8.</li>
|
|
|
|
<li><a href="#Stemming">Stemming</a> performed at query
|
|
time (can switch stemming language after indexing).</li>
|
|
|
|
<li>Easy installation. No database daemon, web server or
|
|
exotic language necessary.</li>
|
|
|
|
<li>An indexer which runs either as a thread inside the
|
|
GUI, as an external, batch, cron'able program, or as a
|
|
real-time indexing daemon.</li>
|
|
</ul>
|
|
|
|
<h2><a name="integration">Desktop and web integration</a></h2>
|
|
|
|
<p>The <span class="application">Recoll</span> GUI has many
|
|
features that help to specify an efficient search and to manage
|
|
the results. However it maybe sometimes preferable to use a
|
|
simpler tool with a better integration with your desktop
|
|
interfaces. Several solutions exist, at the moment mostly for
|
|
the KDE desktop:</p>
|
|
<ul>
|
|
<li>The <span class="application">Recoll</span> KIO module
|
|
allows starting queries and viewing results from the
|
|
Konqueror browser or KDE applications <em>Open</em> dialogs.</li>
|
|
<li>The <a href="http://kde-apps.org">recollrunner</a> krunner
|
|
module allows integrating Recoll search results into a
|
|
krunner query.</li>
|
|
</ul>
|
|
<p>Recoll also has
|
|
<a href="usermanual/rcl.program.api.html#RCL.PROGRAM.API.PYTHON">
|
|
<span class="application">Python</span></a> and
|
|
<span class="application">PHP</span> modules which can allow
|
|
easy integration with web or other applications.</p>
|
|
|
|
<h2><a name="stemming"></a>Stemming</h2>
|
|
|
|
<p>Stemming is a process which transforms inflected words
|
|
into their most basic form. For example, <i>flooring</i>,
|
|
<i>floors</i>, <i>floored</i> would probably all be
|
|
transformed to <i>floor</i> by a stemmer for the English
|
|
language.</p>
|
|
|
|
<p>In many search engines, the stemming process occurs during
|
|
indexing. The index will only contain the stemmed form of
|
|
words, with exceptions for terms which are detected as being
|
|
probably proper nouns (ie: capitalized). At query time, the
|
|
terms entered by the user are stemmed, then matched against
|
|
the index.</p>
|
|
|
|
<p>This process results into a smaller index, but it has the
|
|
grave inconvenient of irrevocably losing information during
|
|
indexing.</p>
|
|
|
|
<p>Recoll works in a different way. No stemming is performed
|
|
at query time, so that all information gets into the index.
|
|
The resulting index is bigger, but most people probably don't
|
|
care much about this nowadays, because they have a 100Gb disk
|
|
95% full of binary data <em>which does not get
|
|
indexed</em>.</p>
|
|
|
|
<p>At the end of an indexing pass, Recoll builds one or
|
|
several stemming dictionaries, where all word stems are
|
|
listed in correspondence to the list of their
|
|
derivatives.</p>
|
|
|
|
<p>At query time, by default, user-entered terms are stemmed,
|
|
then matched against the stem database, and the query is
|
|
expanded to include all derivatives. This will yield search
|
|
results analogous to those obtained by a classical engine.
|
|
The benefits of this approach is that stem expansion can be
|
|
controlled instantly at query time in several ways:</p>
|
|
|
|
<ul>
|
|
<li>It can be selectively turned-off for any query term by
|
|
capitalizing it (<i>Floor</i>).</li>
|
|
|
|
<li>The stemming language (ie: english, french...) can be
|
|
selected (this supposes that several stemming databases
|
|
have been built, which can be configured as part of the
|
|
indexing, or done later, in a reasonably fast way).</li>
|
|
</ul>
|
|
</div>
|
|
</body>
|
|
</html>
|
|
|