
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
  <head>
    <title>RECOLL: a personal text search system for
    Unix/Linux</title>
    <meta name="generator" content="HTML Tidy, see www.w3.org">
    <meta name="Author" content="Jean-Francois Dockes">
    <meta name="Description" content=
    "recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
    <meta name="Keywords" content=
    "full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
    <meta http-equiv="Content-language" content="en">
    <meta http-equiv="content-type" content=
    "text/html; charset=iso-8859-1">
    <meta name="robots" content="All,Index,Follow">
    <link type="text/css" rel="stylesheet" href="styles/style.css">
  </head>

  <body>
    <div class="rightlinks">
      <ul>
        <li><a href="index.html">Home</a></li>

        <li><a href="pics/index.html">Screenshots</a></li>

        <li><a href="download.html">Downloads</a></li>

        <li><a href="doc.html">Documentation</a></li>

        <li><a href="support.html">Support</a></li>

        <li><a href="devel.html">Development</a></li>
      </ul>
    </div>

    <div class="content">
      <h1>Recoll features</h1>

      <div class="intrapage">
	<table width="100%">
	  <tbody>
	    <tr>
	      <td><a href="#systems">Supported systems</a></td>
              <td><a href="#doctypes">Document types</a></td>
	      <td><a href="#other">Other features</a></td>
	      <td><a href="#integration">Desktop and web integration</a></td>
	      <td><a href="#stemming">Stemming</a></td>
	    </tr>
	  </tbody>
	</table>
       </div>

      <h2><a name="general">General features</a></h2>
      <ul>
        <li>Easy installation, few dependencies. No database daemon,
	  web server, desktop environment or exotic language necessary.</li>
	<li>Will run on most Unix-based <a href="features.html#systems">
            systems</a>, and on MS-Windows too.</li>
        <li>Qt 4 GUI, plus command line, Unity Lens, KIO and krunner
          interfaces.</li>

        <li>Searches most common
	  <a href="features.html#doctypes">document types</a>, emails and
	    their attachments. Transparently handles decompression
	    (gzip, bzip2).</li>

        <li>Powerful query facilities, with boolean searches,
	  phrases, proximity, wildcards, filter on file types and directory
	  tree.</li>

        <li>Multi-language and multi-character set with Unicode-based
	  internals.</li>

	<li>Extensive documentation, with a
	  complete <a href="usermanual/usermanual.html">user
	    manual</a> and manual pages for each command.</li>

      </ul>

      <h2><a name="systems">Supported systems</a></h2>

      <p><span class="application">Recoll</span> has been compiled and
      tested on Linux, MS-Windows 7-10, MacOS X and Solaris (initial
      versions Redhat 7, Fedora Core 5, Suse 10, Gentoo, Debian 3.1,
      Solaris 8). It should compile and run on all subsequent releases
      of these systems and probably a few others too.</p>

      <p>Qt versions 4.7 and later are supported.</p>

      <h2><a name="doctypes">Document types</a></h2>

      <p><span class="application">Recoll</span> can index many document
        types (along with their compressed versions). Some types are
        handled internally (no external application needed). Other types
        need a separate application to be installed to extract the
        text. Types that only need very common utilities
        (awk/sed/groff/Python etc.)  are listed in the native section.</p>

      <p>The MS-Windows installer includes the supporting applications;
        the only additional package you will need is a Python language
        installation.</p>

      <p>Many formats are processed
        by <span class="application">Python</span> scripts. The Python
        dependency will not always be mentioned. In general, Recoll
        expects Python 2.x to be available (many, but not all, scripts
        are compatible with Python 3). Formats which are processed
        using <span class="application">Python</span> and its standard
        library are listed in the <i>native</i> section.</p>

      <h4>File types indexed natively</h4>

      <ul>
        <li><span class="application">text</span>.</li>
        <li><span class="application">html</span>.</li>
        <li><span class="application">maildir</span>,
          <span class="application">mh</span>, and
          <span class="application">mailbox</span> (
          <span class="application">Mozilla</span>,
          <span class="application">Thunderbird</span> and
          <span class="application">Evolution</span> mail ok).
          <em><b>Evolution note</b>: be sure to remove <tt>.cache</tt> from
            the <tt>skippedNames</tt> list in the GUI <tt>Indexing
              preferences/Local Parameters/</tt> pane if you want to
              index local copies of IMAP mail.</em>
        </li>

        <li><span class="application">gaim</span> and
          <span class="application">purple</span> log files.</li>

        <li><span class="application">Scribus</span> files.</li>

        <li><span class="application">Man pages</span> (needs
          <span class="application">groff</span>).</li>

        <li><span class="application">Dia</span> diagrams.</li>
        <li><span class="application">Excel</span>
          and <span class="application">Powerpoint</span>
          for <span class="application">Recoll</span> versions 1.19.12
          and later.</li>

        <li><span class="application">Tar</span> archives. Tar file
        indexing is disabled by default (because tar archives don't
        typically contain the kind of documents that people search
        for). You will need to enable it explicitly, for example with
        the following in your
          <span class="filename">$HOME/.recoll/mimeconf</span> file:
          <pre>
[index]
application/x-tar = execm rcltar
</pre>
        </li>

        <li><span class="application">Zip</span> archives.</li>
        <li><span class="application">Konqueror webarchive</span>
          format with Python (uses the <tt>tarfile</tt> standard
          library module).</li>

        <li><span class="application">Mimehtml web archive
            format</span> (support based on the mail
          filter, which introduces some mild weirdness, but still
          usable).</li>
      </ul>


      <h4>File types indexed with external helpers</h4>

      <p>Many document types need the <span class="command">iconv</span>
      command in addition to the applications specifically listed.</p>

      <h5>The XML ones</h5>

      <p>The following types need <span class="command">
          xsltproc</span> from the <b>libxslt</b> package for recoll
        versions before 1.22, and in addition, python-libxslt1 and
        python-libxml2 for 1.22 and newer.
        Quite a few also need <span class="command">unzip</span>:</p>

      <ul>
        <li><span class="application">Abiword</span> files.</li>

        <li><span class="application">Fb2</span> ebooks.</li>

        <li><span class="application">Kword</span> files.</li>

        <li><span class="application">Microsoft Office Open XML</span>
        files.</li>

        <li><span class="application">OpenOffice</span> files.</li>

        <li><span class="application">SVG</span> files.</li>
        <li><span class="application">Gnumeric</span> files.</li>
        <li><span class="application">Okular</span> annotation files.</li>

      </ul>

      <h5>Other formats</h5>

      <p>The following need miscellaneous helper programs to decode
        the internal formats.</p>

      <ul>
        <li><span class="application">pdf</span> with the <span class=
        "command">pdftotext</span> command, which comes with
          <a href="http://poppler.freedesktop.org/">poppler</a>,
          (the package name is quite often <tt>poppler-utils</tt>). <br/>
          Note: the older <span class="command">pdftotext</span> command
            which comes with <span class="application">xpdf</span> is
            not compatible with <span class="application">
              Recoll</span>.<br/>

          <em>New in 1.21</em>: if the <span class="application">
            tesseract</span> OCR application, and the
          <span class="command">pdftoppm</span> command are available
          on the system, the <span class="command">rclpdf</span>
          filter has the capability to run OCR. See the comments at
          the top of <span class="command">rclpdf</span> (usually
          found
          in <span class="filename">/usr/share/recoll/filters</span>)
          for how to enable this and configuration details.<br/>
          <em>Opening PDFs at the right page</em>: the default
          configuration uses <span class="command">evince</span>,
          which has options for direct page access and pre-setting the
          search strings (hits will be highlighted). There is an
          example line in the default mimeview for doing the same
          thing with <span class="command">qpdfview</span>
          (<span class="literal">qpdfview --search %s %f#%p</span>).
          Okular does not have a search string option (but it does
          have a page number one).
        </li>

        <li><span class="application">msword</span> with <a href=
        "http://www.winfield.demon.nl/">antiword</a>.  It is also useful to
        have <a href="http://wvware.sourceforge.net/">wvWare</a> installed
        as it may be used as a fallback for some files which antiword
        does not handle.</li>

        <li><span class="application">Wordperfect</span> with the
         <span class="command">wpd2html</span> command from <a href=
        "http://libwpd.sourceforge.net">libwpd</a>. On some distributions,
        the command may come with a package named <span
        class="literal">libwpd-tools</span> or such, not the base <span
        class="literal">libwpd</span> package.</li>

        <li><span class="application">Lyx</span> files (needs
          <span class="application">Lyx</span> to be installed).</li>

        <li><span class="application">Powerpoint</span> and <span
        class="application">Excel</span> with the <a href=
        "http://vitus.wagner.pp.ru/software/catdoc/">catdoc</a>
        utilities for recoll versions before 1.19.12. Recoll 1.19.12 and
        later use internal Python filters for Excel and Powerpoint, and
        catdoc is not needed at all (catdoc did not work on many
        semi-recent Excel and Powerpoint files).</li>

        <li><span class="application">CHM (Microsoft help)</span> files
          with <span class="command">Python,
            <a href="http://gnochm.sourceforge.net/pychm.html">pychm</a>
          and <a href="http://www.jedrea.com/chmlib/">chmlib</a></span>.</li>

        <li><span class="application">GNU info</span> files
        with <span class="command">Python</span> and the
        <span class="command">info</span> command.</li>

        <li><span class="application">EPUB</span> files
          with <span class="command">Python</span> and this
          <a href="http://pypi.python.org/pypi/epub/">Python epub</a>
            decoding module, which is packaged on Fedora, but not Debian.</li>

        <li><span class="application">Rar</span> archives (needs <span
        class="command">Python</span>), the
        <a href="http://pypi.python.org/pypi/rarfile/">rarfile</a> Python
        module and the <a
        href="http://www.rarlab.com/rar_add.htm">unrar</a>
        utility. The Python module is packaged by Fedora, not by Debian.</li>

        <li><span class="application">7zip</span> archives (needs
          <span class="command">Python</span> and
          the <a href="https://pypi.python.org/pypi/pylzma">pylzma
            module</a>). This is a recent addition, and you need to
            download the filter from
          the <a href="filters/filters.html">filters pages</a> for
          all Recoll versions prior to 1.21.</li>

        <li><span class="application">iCalendar</span> (.ics) files
        (needs <span class="command">Python, <a href=
        "http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>

        <li><span class="application">Mozilla calendar data</span>. See
        <a href=
        "http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
        the wiki</a> about this.</li>

        <li><span class="application">postscript</span> with <a href=
        "http://www.gnu.org/software/ghostscript/ghostscript.html">
            ghostscript</a> and <a href=
        "http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
        Pstotext 1.9 has a serious issue with special characters in
        file names, and you should either use the version packaged for
        your system which is probably patched, or apply the Debian
        patch which is stored <a href=
        "files/pstotext-1.9_4-debian.patch">here</a> for
        convenience. See http://packages.debian.org/squeeze/pstotext
        and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988
        for references/explanations.
          <blockquote>
            To make things a bit easier, I also
            store <a href="files/pstotext-1.9-patched.tar.gz">an
            already patched version</a>. I added an
            install target to the Makefile... This installs to
            /usr/local, use <i>make install PREFIX=/usr</i> to
            change. So all you need is:
            <pre>
              tar xvzf pstotext-1.9-patched.tar.gz
              cd pstotext-1.9-patched
              make
              make install
            </pre>
          </blockquote>
        </li>


        <li><span class="application">RTF</span> files with
          <a href="http://www.gnu.org/software/unrtf/unrtf.html">
            unrtf</a>. Please note that up to version 0.21.3,
          <span class="command">unrtf</span> mostly does not work with
          non-Western-European character sets. Many serious problems
          (crashes with serious security implications and infinite
          loops) were fixed in unrtf 0.21.8, so you really want to use
          this or a newer release. Building unrtf from source is quick
          and easy.</li>

        <li><span class="application">TeX</span> with <span class=
        "command">untex</span>. If there is no untex package for
        your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
        source package is stored on this site</a> (as untex has no
        obvious home). Will also work with <a href=
        "http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
        if this is installed.</li>

        <li><span class="application">dvi</span> with <a href=
        "http://www.radicaleye.com/dvips.html">dvips</a>.</li>

        <li><span class="application">djvu</span> with <a href=
        "http://djvu.sourceforge.net">DjVuLibre</a>.</li>

        <li><span class="application">Audio file tags</span>.
          Recoll releases 1.14 and later use a Python filter based
          on <a href="http://code.google.com/p/mutagen/">mutagen</a>
          for all audio types.</li>

        <li><span class="application">Image file tags</span> with <a href=
        "http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
        This is a Perl program, so you also need Perl on the
        system. This works with just about any image file and
        tag format (jpg, png, tiff, gif etc.).</li>

        <li><span class="application">Midi karaoke files</span> with
          Python, the
          <a href="http://pypi.python.org/pypi/midi/0.2.1">
            midi module</a>, and some help
          from <a href="http://chardet.feedparser.org/">chardet</a>. There
          is probably a <tt>python-chardet</tt> package for your distribution,
          but you will quite probably need to build the midi
          package. This is easy but see the <a href="helpernotes.html#midi">
            notes here</a>.
        </li>

        <li><span class="application">MediaWiki dump files</span>:
          Thomas Levine has written a handler for these, which you
          will find here:
          <a href="https://bitbucket.org/tlevine/recoll/src/0127be78bffdd8a294067966a3ba7b2663d7b0cf/src/filters/rclmwdump?at=default&amp;fileviewer=file-view-default">rclmwdump</a>.</li>

      </ul>

      <h2><a name="other">Other features</a></h2>

      <ul>
        <li>Can use a Firefox extension to index visited Web pages
        history. See <a href=
        "http://bitbucket.org/medoc/recoll/wiki/IndexWebHistory">the
        Wiki</a> for more detail.</li>

        <li>Processes all email attachments, and more generally any
         realistic level of container nesting (for example, an msword
         attachment to a message inside a mailbox in a zip archive).</li>

        <li>Multiple selectable databases.</li>

        <li>Powerful query facilities, with boolean searches,
        phrases, filter on file types and directory tree.</li>

        <li>Xesam-compatible query language.</li>

        <li>Wildcard searches (with a specific and faster function
        for file names).</li>

        <li>Support for multiple charsets. Internal processing and
        storage uses Unicode UTF-8.</li>

        <li><a href="#stemming">Stemming</a> performed at query
        time (can switch stemming language after indexing).</li>

        <li>Easy installation. No database daemon, web server or
        exotic language necessary.</li>

        <li>An indexer which runs either as a batch, cron'able
          program, or as a real-time indexing daemon, depending on
          preference.</li>
      </ul>

      <h2><a name="integration">Desktop and web integration</a></h2>

      <p>The <span class="application">Recoll</span> GUI has many
	features that help to specify an efficient search and to manage
	the results. However, it may sometimes be preferable to use a
	simpler tool that is better integrated with your desktop
	interfaces. Several solutions exist:</p>
      <ul>
	<li>The <span class="application">Recoll</span> KIO module
	  allows starting queries and viewing results from the
	  Konqueror browser or KDE applications <em>Open</em> dialogs.</li>
	<li>The <a href="http://kde-apps.org">recollrunner</a> krunner
	  module allows integrating Recoll search results into a
	  krunner query.</li>
        <li>The Ubuntu Unity Recoll Lens lets you access Recoll search
          from the Unity Dash. More
          info <a href="https://bitbucket.org/medoc/recoll/wiki/UnityLens">
            here</a>. </li>
        <li>The <a href="http://github.com/medoc92/recoll-webui">Recoll
            Web UI</a> lets you query a Recoll index from a web browser.</li>
      </ul>
      <p>Recoll also has
	<a href="usermanual/usermanual.html#RCL.PROGRAM.PYTHONAPI">
	  <span class="application">Python</span></a> and
	<span class="application">PHP</span> modules which can allow
	easy integration with web or other applications.</p>

      <h2><a name="stemming">Stemming</a></h2>

      <p>Stemming is a process which transforms inflected words
      into their most basic form. For example, <i>flooring</i>,
      <i>floors</i>, <i>floored</i> would probably all be
      transformed to <i>floor</i> by a stemmer for the English
      language.</p>

      <p>In many search engines, the stemming process occurs during
      indexing. The index will only contain the stemmed form of
      words, with exceptions for terms which are detected as being
      probably proper nouns (i.e. capitalized). At query time, the
      terms entered by the user are stemmed, then matched against
      the index.</p>

      <p>This process results in a smaller index, but it has the
      serious drawback of irrevocably losing information during
      indexing.</p>

      <p>Recoll works in a different way. No stemming is performed
      at indexing time, so that all information gets into the index.
      The resulting index is bigger, but most people probably don't
      care much about this nowadays, because they have a 100 GB disk
      95% full of binary data <em>which does not get
      indexed</em>.</p>

      <p>At the end of an indexing pass, Recoll builds one or
      several stemming dictionaries, where all word stems are
      listed in correspondence to the list of their
      derivatives.</p>

      <p>At query time, by default, user-entered terms are stemmed,
      then matched against the stem database, and the query is
      expanded to include all derivatives. This will yield search
      results analogous to those obtained by a classical engine.
      The benefit of this approach is that stem expansion can be
      controlled instantly at query time in several ways:</p>

      <ul>
        <li>It can be selectively turned off for any query term by
        capitalizing it (<i>Floor</i>).</li>

        <li>The stemming language (e.g. English, French...) can be
        selected (this assumes that several stemming databases
        have been built, which can be configured as part of the
        indexing, or done later, in a reasonably fast way).</li>
      </ul>
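
      <p>The query-time expansion described above can be sketched in a
      few lines of Python. This is only an illustration of the idea, not
      Recoll's actual code: the suffix-stripping <i>toy_stem</i>
      function and the names <i>build_stem_dictionary</i> and
      <i>expand_query_term</i> are hypothetical, and lowercasing a
      capitalized term before matching is an assumption made for the
      example.</p>

```python
from collections import defaultdict

# Hypothetical toy stemmer, for illustration only: it strips a few
# English suffixes. A real stemmer handles many more cases.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Return a crude stem of a word, lowercased."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms):
    """Map each stem to the set of unstemmed index terms that share it."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Expand one user-entered term into the index terms to search for."""
    if term[:1].isupper():
        # Capitalized term: stem expansion is turned off (assumption:
        # the term is lowercased before being matched against the index).
        return {term.lower()}
    # Otherwise include every derivative sharing the stem, plus the term.
    return stem_db.get(toy_stem(term), set()) | {term}


# The index itself keeps the full, unstemmed terms.
index_terms = ["floor", "floors", "floored", "flooring", "flood"]
# Built once, at the end of an indexing pass.
stem_db = build_stem_dictionary(index_terms)

print(sorted(expand_query_term("floors", stem_db)))
print(sorted(expand_query_term("Floor", stem_db)))
```

      <p>Because the index keeps the unstemmed terms, the stem
      dictionary can be rebuilt for another language, or bypassed for a
      single term, without indexing the documents again.</p>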
    </div>
  </body>
</html>