<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>Recoll updated filters</title>

<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux
based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search, desktop search, unix, linux">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">

<link type="text/css" rel="stylesheet" href="../styles/style.css">
</head>

<body>

<div class="rightlinks">
<ul>
<li><a href="../index.html">Home</a></li>
<li><a href="../download.html">Downloads</a></li>
<li><a href="../usermanual/index.html">User manual</a></li>
<li><a href="../usermanual/RCL.INSTALL.html">Installation</a></li>
<li><a href="../index.html#support">Support</a></li>
</ul>
</div>

<div class="content">

<h1>Updated filters for Recoll</h1>

<p>The following describes new and updated filters, which will be
part of the next release but can be installed on an older
release if you need them.</p>

<p>For updated filters, you just need to copy the script to the
filters directory, which is typically either <span
class="filename">/usr/share/recoll/filters</span> or <span
class="filename">/usr/local/share/recoll/filters</span>. Please check
that the script is executable after copying it, and make it so if
needed (chmod a+x <i>scriptname</i>).</p>

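<p>The copy-and-chmod step above can be sketched in Python as follows.
This is only an illustration: <tt>install_filter</tt> is a hypothetical
helper name, not part of Recoll, and both paths are supplied by the
caller.</p>

```python
import os
import shutil
import stat


def install_filter(script_path, filters_dir):
    """Copy a filter script into filters_dir and make it executable.

    Hypothetical helper: both arguments are paths chosen by the caller.
    """
    # shutil.copy returns the path of the newly created file.
    dest = shutil.copy(script_path, filters_dir)
    # Equivalent of "chmod a+x": add execute permission for user,
    # group and others while keeping the other mode bits unchanged.
    mode = os.stat(dest).st_mode
    os.chmod(dest, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return dest
```

<p>For example, <tt>install_filter("rclpdf", "/usr/share/recoll/filters")</tt>
would install a downloaded <tt>rclpdf</tt> script. Writing to a system
directory usually requires root privileges.</p>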
<p>For new filters, you'll need to copy the script file as
above, possibly install the supporting application, and usually
edit the
<span class="filename">mimemap</span>,
<span class="filename">mimeview</span> and
<span class="filename">mimeconf</span> files, either in the
shared directory
(<span class="filename">/usr[/local]/share/recoll/examples</span>), or
in your personal configuration directory
(<span class="filename">$HOME/.recoll</span> or
<span class="filename">$RECOLL_CONFDIR</span>).</p>

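<p>Several entries below ask you to add a line to the <tt>[index]</tt>
section of <span class="filename">mimeconf</span>. The following is a
minimal sketch of that edit as plain text manipulation. The function
name <tt>add_index_entry</tt> is hypothetical, and the sketch assumes
that the section header appears on a line of its own. It does not
check whether the MIME type already has an entry, so review the
result before saving it.</p>

```python
def add_index_entry(conf_text, mime_type, command):
    """Return conf_text with 'mime_type = command' in the [index] section.

    Hypothetical helper: works on the file contents as a string and
    does not remove an existing entry for the same MIME type.
    """
    entry = f"{mime_type} = {command}"
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "[index]":
            # Place the new entry right after the existing header.
            lines.insert(i + 1, entry)
            break
    else:
        # No [index] section yet: append one at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += ["[index]", entry]
    return "\n".join(lines) + "\n"
```

<p>Calling <tt>add_index_entry(text, "application/x-tar", "execm rcltar")</tt>
on the contents of a personal <span class="filename">mimeconf</span>
returns the updated text, which can then be written back to the file.</p>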
<p>Alternatively, you can replace your system files with
these updated and complete versions:
<a href="mimemap">mimemap</a>,
<a href="mimeconf">mimeconf</a>,
<a href="mimeview">mimeview</a>.</p>

<p>There is a slightly more detailed description of the filter
installation procedure on the
<a href="https://bitbucket.org/medoc/recoll/wiki/FilterRetrofit.wiki">Recoll Wiki</a>.</p>

<p>The following entries are in reverse chronological order. Each
lists the latest Recoll release on which the update makes sense
(newer releases have an up-to-date version of the filter).</p>

<p>However, if you are running a Recoll version older than 1.17,
you should really upgrade.</p>

<h2>PDF documents</h2>
<p>Fixed <a href="rclpdf">rclpdf</a> filter, compatible with
newer poppler pdftotext versions, which now properly escape
text inside the HTML &lt;head&gt; section (but not the body,
curiously).</p>

<h2>Scribus documents</h2>
<p>An improved <a href="rclscribus">rclscribus</a> filter,
thanks to Morten Langlo.</p>

<h2>7zip archives</h2>
<p>A new <a href="rcl7z">rcl7z</a> filter by Fran&ccedil;ois Botha
for 7zip archives. Needs the
<a href="https://pypi.python.org/pypi/pylzma">pylzma Python
module</a>.</p>

<h2>Attachments to PDF documents (1.20 and older)</h2>

<p>A new <a href="rclmpdf">rclmpdf</a> filter for processing
PDF files with attachments. This replaces the old <b>rclpdf</b>
filter. You need to add it to ~/.recoll/mimeconf until it is
made standard (this is still a bit experimental, and a big
change from the previous filter):
<pre><tt>
[index]
application/pdf = execm rclmpdf
</tt></pre>
Note the <tt>execm</tt> instead of <tt>exec</tt>.</p>

<h2><a name="soff1">Open/Libre-Office documents (1.19 and older)</a></h2>

<p><a href="rclsoff">rclsoff</a>: the previous version did not
produce white space between tab-separated input words, leading
to search failures.</p>

<h2>Purple logs (1.20 and older)</h2>

<p>New <a href="rclpurple">rclpurple</a> filter for the log files of
Pidgin and other chat applications. Handles newer log
formats.</p>

<h2>PowerPoint documents (1.19 and older)</h2>

<p>The <b>rclppt</b> filter was based on <b>catppt</b>, but this
seems to fail quite often on newer PPT
documents. The new version is based on code from
the <b>libreoffice</b> <b>mso-dump</b> project. It is both
reasonably fast and quite thorough.
</p>

<p>Installation:<ul>
<li>As <tt>recollindex</tt> was executing <b>catppt</b>
directly in the default configuration, you will also need to add
the following to
the <tt>mimeconf</tt> file (e.g. ~/.recoll/mimeconf):
<pre>
[index]
application/vnd.ms-powerpoint = exec rclppt
</pre>
</li>
<li>Copy the following three files to the Recoll filters directory (e.g.
<i>/usr/share/recoll/filters</i>) and make sure
that <tt>ppt-dump.py</tt> and <tt>rclppt</tt> are executable.
<ul>
<li><a href="rclppt">rclppt</a></li>
<li><a href="ppt-dump.py">ppt-dump.py</a></li>
<li><a href="msodump.zip">msodump.zip</a></li>
</ul>
</li>
</ul>
</p>

<h2>EPUB documents (1.17 and older)</h2>

<p>New <a href="rclepub">rclepub</a> filter for EPUB documents.
This needs
the <a href="http://pypi.python.org/pypi/epub/0.5.0">Python epub
decoding module</a>.</p>

<h2>CHM files (1.17.1 and older)</h2>
<p><a href="rclchm">rclchm</a>. The previous version of the
filter mishandled files which had encoded internal URLs (not
very frequent, but it happens).</p>

<h2>Updated Open Document filter (1.17 and older)</h2>

<p>The <a href="rclsoff">new filter</a> will correctly handle
exported Google Docs documents and also Open/LibreOffice ones in
some cases. The previous filters concatenated all the text
inside the exported Google Docs documents without any spacing.</p>

<h2>TAR archives (1.17 and older)</h2>

<p>New <a href="rcltar">rcltar</a> filter for tar archives. The
indexing of tar archives is disabled by default in the sample
configuration (stored here). This is an <tt>execm</tt>
filter, not an <tt>exec</tt> one. To enable it, you'll need to add an<br>
<tt>application/x-tar = execm rcltar</tt><br>
line in the [index] section of your
$HOME/.recoll/mimeconf.</p>

<h2>XML files (1.17 and older)</h2>

<p>By default, the current Recoll version does not index XML
content (except for known formats like dia, svg etc.). This
new <a href="rclxml">rclxml</a> filter will extract the data
from any XML file. Only text data is extracted, no attribute
values. The other option is to treat XML files as plain text
(see comment in mimeconf), and index everything, including
a lot of garbage.</p>

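<p>To illustrate what "text data only, no attribute values" means, here
is a small Python sketch. It is not the actual <tt>rclxml</tt>
implementation, only an assumption-laden example of the same idea
using the standard <tt>xml.etree.ElementTree</tt> module, and
<tt>xml_text</tt> is a hypothetical name.</p>

```python
import xml.etree.ElementTree as ET


def xml_text(xml_string):
    """Return the text content of an XML document, ignoring attributes.

    Illustration only; not the code used by the rclxml filter.
    """
    root = ET.fromstring(xml_string)
    # itertext() yields element text and tails in document order;
    # attribute values are never part of what it returns.
    return " ".join(t.strip() for t in root.itertext() if t.strip())
```

<p>With an input such as
<tt>&lt;doc title="hidden"&gt;&lt;p&gt;Hello&lt;/p&gt;&lt;p&gt;world&lt;/p&gt;&lt;/doc&gt;</tt>,
the function returns the two words from the element bodies and leaves
out the <tt>title</tt> attribute value.</p>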
<h2>DIA files (1.16 and older)</h2>
<p><a href="rcldia">rcldia</a> is a new filter
for <a href="http://projects.gnome.org/dia/">Dia</a> files,
contributed by Stefan Friedel.</p>

<h2>Okular annotations (1.16 and older)</h2>
<p><a href="rclokulnote">rclokulnote</a>. Okular lets you create
annotations for PDF documents and stores them in XML format
somewhere under ~/.kde. This filter does not do a nice job of
formatting the data, but will at least let you find it.</p>

<h2>Gnumeric (1.16 and older)</h2>
<p><a href="rclgnm">rclgnm</a>. Needs xsltproc and
gunzip. As <tt>.gnumeric</tt> was in the list of
explicitly ignored suffixes, you can't just add the MIME
and indexer script lines to your local mimemap and mimeconf; you
also need to define recoll_noindex in the local mimemap (to
override the system one, which
contains <tt>.gnumeric</tt>). The simplest approach may be to
just replace the system files with those above.</p>

<h2>Rar archive support (1.15 and older)</h2>
<p><a href="rclrar">rclrar</a>. This is up to date in Recoll
1.16.2 but may be added to Recoll 1.15. It needs the Python
rarfile module.</p>

<h2>Mimehtml support (1.15)</h2>
<p>This is based on the internal mail filter; you just need to
download and install the configuration files (mimemap and
mimeconf). It will only work with 1.15 and later.</p>

<h2>Konqueror webarchive (.war) filter (1.15)</h2>
<p><a href="rclwar">rclwar</a></p>

<h2>Updated zip archive filter (1.15)</h2>
<p>The filter is corrected to handle UTF-8 paths in zip archives:
<a href="rclzip">rclzip</a>. Up to date in Recoll 1.16, but
may be useful with Recoll 1.15.</p>

<h2>Updated audio tag filter (1.14)</h2>
<p>The mutagen-based rclaudio filter delivered with Recoll 1.14.2
used a very recent mutagen interface which will probably only work
with mutagen versions after 1.17 (it at least works with 1.19 and
does not with 1.15).
You can download the <a href="rclaudio">corrected script
here</a>. Not useful with Recoll 1.5 or 1.6.
</p>

</div>
</body>
</html>