release 3636

Jean-Francois Dockes 2014-05-24 14:26:53 +02:00
parent 16b63b4e14
commit b7511f6f17
2 changed files with 209 additions and 185 deletions

@ -81,7 +81,7 @@ Chapter 5. Installation and configuration
text file inside the configuration directory.
A list of common file types which need external commands follows. Many of
the filters need the iconv command, which is not always listed as a
the handlers need the iconv command, which is not always listed as a
dependency.
Please note that, due to the relatively dynamic nature of this
@ -96,7 +96,7 @@ Chapter 5. Installation and configuration
http://www.recoll.org/features.html if a file type is important to you.
As of Recoll release 1.14, a number of XML-based formats that were handled
by ad hoc filter code now use the xsltproc command, which usually comes
by ad hoc handler code now use the xsltproc command, which usually comes
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
Now for the list:
@ -114,7 +114,7 @@ Chapter 5. Installation and configuration
it may be used as a fallback for some files which antiword does not
handle.
o MS Excel and PowerPoint need catdoc.
o MS Excel and PowerPoint are processed by internal Python handlers.
o MS Open XML (docx) needs xsltproc.
@ -133,11 +133,8 @@ Chapter 5. Installation and configuration
o djvu files need djvutxt and djvused from the DjVuLibre package.
o Audio files: Recoll releases before 1.13 used the id3info command from
the id3lib package to extract mp3 tag information, metaflac (standard
flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
Releases 1.14 and later use a single Python filter based on mutagen
for all audio file types.
o Audio files: Recoll releases 1.14 and later use a single Python
handler based on mutagen for all audio file types.
o Pictures: Recoll uses the Exiftool Perl package to extract tag
information. Most image file formats are supported. Note that there
@ -145,7 +142,7 @@ Chapter 5. Installation and configuration
aperture, etc.). This is only of interest if you store personal tags
or textual descriptions inside the image files.
o chm: files in microsoft help format need Python and the pychm module
o chm: files in Microsoft help format need Python and the pychm module
(which needs chmlib).
o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
@ -161,11 +158,11 @@ Chapter 5. Installation and configuration
o Konqueror webarchive format with Python (uses the Tarfile module).
o mimehtml web archive format (support based on the email filter, which
o Mimehtml web archive format (support based on the email handler, which
introduces some mild weirdness, but still usable).
Text, HTML, email folders, and Scribus files are processed internally. Lyx
is used to index Lyx files. Many filters need iconv and the standard sed
is used to index Lyx files. Many handlers need iconv and the standard sed
and awk.
----------------------------------------------------------------------
@ -515,10 +512,10 @@ Chapter 5. Installation and configuration
A space-separated list of patterns for names of files or
directories that should be ignored inside zip archives. This is
used directly by the zip filter, and has a function similar to
used directly by the zip handler, and has a function similar to
skippedNames, but works independently. Can be redefined for
filesystem subdirectories. For versions up to 1.19, you will need
to update the Zip filter and install a supplementary Python
to update the Zip handler and install a supplementary Python
module. The details are described on the Recoll wiki.
followLinks
@ -533,11 +530,16 @@ Chapter 5. Installation and configuration
indexedmimetypes
Recoll normally indexes any file which it knows how to read. This
list lets you restrict the indexed mime types to what you specify.
list lets you restrict the indexed MIME types to what you specify.
If the variable is unspecified or the list empty (the default),
all supported types are processed. Can be redefined for
subdirectories.
excludedmimetypes
This list lets you exclude some MIME types from indexing. Can be
redefined for subdirectories.
compressedfilemaxkbs
Size limit for compressed (.gz or .bz2) files. These need to be
@ -570,14 +572,14 @@ Chapter 5. Installation and configuration
Recoll indexes file names in a special section of the database to
allow specific file names searches using wild cards. This
parameter decides if file name indexing is performed only for
files with mime types that would qualify them for full text
files with MIME types that would qualify them for full text
indexing, or for all files inside the selected subtrees,
independently of mime type.
independently of MIME type.
usesystemfilecommand
Decide if we use the file -i system command as a final step for
determining the mime type for a file (the main procedure uses
determining the MIME type for a file (the main procedure uses
suffix associations as defined in the mimemap file). This can be
useful for files with suffix-less names, but it will also cause
the indexing of many bogus "text" files.
@ -790,6 +792,9 @@ Chapter 5. Installation and configuration
This is only used by the web browser plugin indexing code, and
defines the maximum size for the web page cache. Default: 40 MB.
Quite unfortunately, this is only taken into account when creating
the cache file. You need to delete the file for a change to be
taken into account.
idxflushmb
@ -929,15 +934,15 @@ Chapter 5. Installation and configuration
filtermaxseconds
Maximum filter execution time, after which it is aborted. Some
Maximum handler execution time, after which it is aborted. Some
postscript programs just loop...
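As an illustration of the time limit described for filtermaxseconds, here
is a minimal Python sketch of running an external command and giving up
after a maximum number of seconds. It is not the indexer's own code, and
the run_with_limit name is hypothetical.

```python
import subprocess
import sys


def run_with_limit(argv, max_seconds):
    """Run argv and return its standard output as bytes.

    Returns None if the command is still running after max_seconds.
    """
    try:
        done = subprocess.run(argv, stdout=subprocess.PIPE, timeout=max_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising.
        return None
    return done.stdout


# A well-behaved command finishes long before the limit.
print(run_with_limit([sys.executable, "-c", "print('done')"], 10))
```

A command which loops forever would make run_with_limit return None once
the limit has elapsed.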
filtersdir
A directory to search for the external filter scripts used to
index some types of files. The value should not be changed, except
if you want to modify one of the default scripts. The value can be
redefined for any sub-directory.
A directory to search for the external input handler scripts used
to index some types of files. The value should not be changed,
except if you want to modify one of the default scripts. The value
can be redefined for any sub-directory.
iconsdir
@ -1018,17 +1023,17 @@ Chapter 5. Installation and configuration
This section defines lists of synonyms for the canonical names
used inside the [prefixes] and [stored] sections
filter-specific sections
handler-specific sections
Some filters may need specific configuration for handling fields.
Only the email message filter currently has such a section (named
[mail]). It allows indexing arbitrary email headers in addition to
the ones indexed by default. Other such sections may appear in the
future.
Some input handlers may need specific configuration for handling
fields. Only the email message handler currently has such a
section (named [mail]). It allows indexing arbitrary email headers
in addition to the ones indexed by default. Other such sections
may appear in the future.
Here follows a small example of a personal fields file. This would extract
a specific email header and use it as a searchable field, with data
displayable inside result lists. (Side note: as the email filter does no
displayable inside result lists. (Side note: as the email handler does no
decoding on the values, only plain ascii headers can be indexed, and only
the first occurrence will be used for headers that occur several times).
@ -1060,10 +1065,10 @@ Chapter 5. Installation and configuration
5.4.3. The mimemap file
mimemap specifies the file name extension to mime type mappings.
mimemap specifies the file name extension to MIME type mappings.
For file names without an extension, or with an unknown one, the system's
file -i command will be executed to determine the mime type (this can be
file -i command will be executed to determine the MIME type (this can be
switched off inside the main configuration file).
The mappings can be specified on a per-subtree basis, which may be useful
@ -1084,7 +1089,7 @@ Chapter 5. Installation and configuration
5.4.4. The mimeconf file
mimeconf specifies how the different mime types are handled for indexing,
mimeconf specifies how the different MIME types are handled for indexing,
and which icons are displayed in the recoll result lists.
Changing the parameters in the [index] section is probably not a good idea
@ -1108,7 +1113,7 @@ Chapter 5. Installation and configuration
Recoll GUI preferences, all mimeview entries will be ignored except the
one labelled application/x-all (which is set to use xdg-open by default).
In this case, the xallexcepts top level variable defines a list of mime
In this case, the xallexcepts top level variable defines a list of MIME
type exceptions which will be processed according to the local entries
instead of being passed to the desktop. This is so that specific Recoll
options such as a page number or a search string can be passed to
@ -1121,13 +1126,13 @@ Chapter 5. Installation and configuration
All viewer definition entries must be placed under a [view] section.
The keys in the file are normally mime types. You can add an application
The keys in the file are normally MIME types. You can add an application
tag to specialize the choice for an area of the filesystem (using a
localfields specification in mimeconf). The syntax for the key is
mimetype|tag
The nouncompforviewmts entry, (placed at the top level, outside of the
[view] section), holds a list of mime types that should not be
[view] section), holds a list of MIME types that should not be
uncompressed before starting the viewer (if they are found compressed, ie:
mydoc.doc.gz).
@ -1147,7 +1152,7 @@ Chapter 5. Installation and configuration
will not create a temporary file to extract the subdocument, expecting
the called application (possibly a script) to be able to handle it.
o %M. Mime type
o %M. MIME type
o %p. Page index. Only significant for a subset of document types,
currently only PDF, Postscript and DVI files. Can be used to start the
@ -1200,7 +1205,7 @@ Chapter 5. Installation and configuration
.blob = application/x-blobapp
Note that the mime type is made up here, and you could call it
Note that the MIME type is made up here, and you could call it
diesel/oil just the same.
o In $RECOLL_CONFDIR/mimeview under the [view] section, add:
@ -1211,7 +1216,7 @@ Chapter 5. Installation and configuration
would use %u if it liked URLs better.
If you just wanted to change the application used by Recoll to display a
mime type which it already knows, you would just need to edit mimeview.
MIME type which it already knows, you would just need to edit mimeview.
The entries you add in your personal file override those in the central
configuration, which you do not need to alter. mimeview can also be
modified from the GUI.
@ -1233,17 +1238,17 @@ Chapter 5. Installation and configuration
for the files inside the result lists. Icons are normally 64x64 pixels
PNG files which live in /usr/[local/]share/recoll/images.
o Under the [categories] section, you should add the mime type where it
o Under the [categories] section, you should add the MIME type where it
makes sense (you can also create a category). Categories may be used
for filtering in advanced search.
The rclblob filter should be an executable program or script which exists
The rclblob handler should be an executable program or script which exists
inside /usr/[local/]share/recoll/filters. It will be given a file name as
argument and should output the text or html contents on the standard
output.
The filter programming section describes in more detail how to write a
filter.
The filter programming section describes in more detail how to write an
input handler.
----------------------------------------------------------------------

@ -134,15 +134,15 @@ More documentation can be found in the doc/ directory or at http://www.recoll.or
4. Programming interface
4.1. Writing a document filter
4.1. Writing a document input handler
4.1.1. Simple filters
4.1.1. Simple input handlers
4.1.2. "Multiple" filters
4.1.2. "Multiple" handlers
4.1.3. Telling Recoll about the filter
4.1.3. Telling Recoll about the handler
4.1.4. Filter HTML output
4.1.4. Input handler HTML output
4.1.5. Page numbers
@ -259,7 +259,7 @@ Chapter 1. Introduction
Recoll stores all internal data in Unicode UTF-8 format, and it can index
files with different character sets, encodings, and languages into the
same index. It has input filters for many document types.
same index. It can process many document types.
Stemming is the process by which Recoll reduces words to their radicals so
that searching does not depend, for example, on a word being singular or
@ -418,13 +418,13 @@ Chapter 2. Indexing
Excluding types can be done by adding wildcard name patterns to the
skippedNames list, which can be done from the GUI Index configuration
menu. It is also possible to exclude a mime type independantly of the file
name by associating it with the rclnull filter. This can be done by
editing the mimeconf configuration file.
menu. For versions 1.20 and later, you can alternatively set the
excludedmimetypes list in the configuration file. This can be redefined
for subdirectories.
In order to define a positive list, You need to edit the main
configuration file (recoll.conf) and set the indexedmimetypes
configuration variable. Example:
You can also define an exclusive list of MIME types to be indexed (no
others will be indexed) by setting the indexedmimetypes configuration
variable. Example:
indexedmimetypes = text/html application/pdf
@ -436,10 +436,11 @@ Chapter 2. Indexing
(When using sections like this, don't forget that they remain in effect
until the end of the file or another section indicator). There is no GUI
way to edit the parameter, because this option runs contrary to Recoll
main goal which is to help you find information, independantly of how it
may be stored.
until the end of the file or another section indicator).
excludedmimetypes and indexedmimetypes can be set either by editing the
main configuration file (recoll.conf), or from the GUI index configuration
tool.
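The effect of the two lists can be sketched in a few lines of Python. This
is only an illustration: the should_index name is hypothetical, and the
sketch assumes that the exclusion list is applied on top of the positive
list when both are set, which is not stated above.

```python
def should_index(mime_type, indexed=(), excluded=()):
    """Tell if a document of the given MIME type would be indexed.

    indexed:  the indexedmimetypes list (empty means all supported types).
    excluded: the excludedmimetypes list.
    Assumption: exclusion is checked in addition to the positive list.
    """
    if indexed and mime_type not in indexed:
        return False
    return mime_type not in excluded


indexed = "text/html application/pdf".split()
print(should_index("application/pdf", indexed=indexed))
print(should_index("text/plain", indexed=indexed))
```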
2.1.4. Recovery
@ -702,7 +703,7 @@ Chapter 2. Indexing
mime_type
If set, this overrides any other determination of the file mime
If set, this overrides any other determination of the file MIME
type.
charset
@ -1018,11 +1019,11 @@ Chapter 3. Searching
you prefer to completely customize the choice of applications, you can
uncheck the Use desktop preferences option in the GUI preferences dialog,
and click the Choose editor applications button to adjust the predefined
Recoll choices. The tool accepts multiple selections of mime types (e.g.
Recoll choices. The tool accepts multiple selections of MIME types (e.g.
to set up the editor for the dozens of office file types).
Even when Use desktop preferences is checked, there is a small list of
exceptions, for mime types where the Recoll choice should override the
exceptions, for MIME types where the Recoll choice should override the
desktop one. These are applications which are well integrated with Recoll,
especially evince for viewing PDF and Postscript files because of its
support for opening the document at a specific page and passing a search
@ -1242,7 +1243,7 @@ Chapter 3. Searching
specifying multiple clauses which are combined to build the search.
2. The second tab lets you filter the results according to file size, date of
modification, mime type, or location.
modification, MIME type, or location.
Click on the Start Search button in the advanced search dialog, or type
Enter in any text field to start the search. The button in the main window
@ -1305,8 +1306,8 @@ Chapter 3. Searching
can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
respectively.
o The next section allows filtering the results by their mime types, or
mime categories (ie: media/text/message/etc.).
o The next section allows filtering the results by their MIME types, or
MIME categories (ie: media/text/message/etc.).
You can transfer the types between two boxes, to define which will be
included or excluded by the search.
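The suffix multipliers mentioned above for size values (k/K, m/M, g/G, t/T)
can be illustrated with a short Python sketch. The parse_size name is
hypothetical and this is not the GUI's own parsing code.

```python
# Decimal multipliers: 1E3, 1E6, 1E9 and 1E12.
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a size like '10k' or '2M' to a number of bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    factor = MULTIPLIERS.get(text[-1].lower())
    if factor is None:
        # No suffix: the value is already a byte count.
        return int(text)
    return int(text[:-1]) * factor


print(parse_size("10k"), parse_size("2M"))
```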
@ -1647,7 +1648,7 @@ Chapter 3. Searching
an appropriate application.
o Exceptions: when using the desktop preferences for opening documents,
these are mime types that will still be opened according to Recoll
these are MIME types that will still be opened according to Recoll
preferences. This is useful for passing parameters like page numbers
or search strings to applications that support them (e.g. evince).
This cannot be done with xdg-open which only supports passing one
@ -1789,7 +1790,7 @@ Chapter 3. Searching
o %D. Date
o %I. Icon image name. This is normally determined from the mime type.
o %I. Icon image name. This is normally determined from the MIME type.
The associations are defined inside the mimeconf configuration file.
If a thumbnail for the file is found at the standard Freedesktop
location, this will be displayed instead.
@ -1798,7 +1799,7 @@ Chapter 3. Searching
o %L. Precooked Preview, Edit, and possibly Snippets links
o %M. Mime type
o %M. MIME type
o %N. result Number inside the result page
@ -1824,7 +1825,7 @@ Chapter 3. Searching
stored by default, apart from the values above (only author and filename),
so this feature will need some custom local configuration to be useful. An
example candidate would be the recipient field which is generated by the
message filters.
message input handlers.
The default value for the paragraph format string is:
@ -1949,6 +1950,8 @@ Chapter 3. Searching
-m : dump the whole document meta[] array for each result
-A : output the document abstracts
-S fld : sort by field <fld>
-s stemlang : set stemming language to use (must exist in index...)
Use -s "" to turn off stem expansion
-D : sort descending
-i <dbdir> : additional index, several can be given
-e use url encoding (%xx) for urls
@ -2139,7 +2142,7 @@ Chapter 3. Searching
Periods can also be specified with small letters (ie: p2y).
o mime or format for specifying the mime type. This one is quite special
o mime or format for specifying the MIME type. This one is quite special
because you can specify several values which will be OR'ed (the normal
default for the language is AND). Ex: mime:text/plain mime:text/html.
Specifying an explicit boolean operator before a mime specification is
@ -2149,11 +2152,11 @@ Chapter 3. Searching
with an OR default. You do need to use OR with ext terms for example.
o type or rclcat for specifying the category (as in
text/media/presentation/etc.). The classification of mime types in
text/media/presentation/etc.). The classification of MIME types in
categories is defined in the Recoll configuration (mimeconf), and can
be modified or extended. The default category names are those which
permit filtering results in the main GUI screen. Categories are OR'ed
like mime types above. This can't be negated with - either.
like MIME types above. This can't be negated with - either.
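Because several mime or type clauses are OR'ed without an explicit
operator, building such a query from a program is mostly a matter of
joining strings. The following Python sketch shows the idea; the function
names are hypothetical and values containing spaces are not handled.

```python
def or_clause(field, values):
    """Return clauses such as 'mime:text/plain mime:text/html'."""
    return " ".join("{}:{}".format(field, value) for value in values)


def build_query(terms, mime_types=(), categories=()):
    """Append MIME type and category restrictions to the search terms."""
    parts = [terms]
    if mime_types:
        parts.append(or_clause("mime", mime_types))
    if categories:
        parts.append(or_clause("type", categories))
    return " ".join(part for part in parts if part)


print(build_query("recoll", mime_types=["text/plain", "text/html"]))
```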
Words inside phrases and capitalized words are not stem-expanded.
Wildcards may be used anywhere inside a term. Specifying a wild-card on
@ -2161,9 +2164,9 @@ Chapter 3. Searching
one if the expansion is truncated because of excessive size). Also see
More about wildcards.
The document filters used while indexing have the possibility to create
other fields with arbitrary names, and aliases may be defined in the
configuration, so that the exact field search possibilities may be
The document input handlers used while indexing have the possibility to
create other fields with arbitrary names, and aliases may be defined in
the configuration, so that the exact field search possibilities may be
different for you if someone took care of the customisation.
3.5.1. Modifiers
@ -2378,81 +2381,91 @@ Chapter 4. Programming interface
Recoll has an Application Programming Interface, usable both for indexing
and searching, currently accessible from the Python language.
Another less radical way to extend the application is to write filters for
new types of documents.
Another less radical way to extend the application is to write input
handlers for new types of documents.
The processing of metadata attributes for documents (fields) is highly
configurable.
4.1. Writing a document filter
4.1. Writing a document input handler
Recoll filters cooperate to translate from the multitude of input document
formats, simple ones as opendocument, acrobat), or compound ones such as
Zip or Email, into the final Recoll indexing input format, which may be
text/plain or text/html. Most filters are executable programs or scripts.
A few filters are coded in C++ and live inside recollindex. This latter
Terminology
The small programs or pieces of code which handle the processing of the
different document types for Recoll used to be called filters, which is
still reflected in the name of the directory which holds them and many
configuration variables. They were named this way because one of their
primary functions is to filter out the formatting directives and keep the
text content. However these modules may have other behaviours, and the
term input handler is now progressively replacing it in the documentation.
The term filter is still used in many places, though.
Recoll input handlers cooperate to translate from the multitude of input
document formats, simple ones such as opendocument or acrobat, or compound
ones such as Zip or Email, into the final Recoll indexing input format, which
is plain text. Most input handlers are executable programs or scripts. A
few handlers are coded in C++ and live inside recollindex. This latter
kind will not be described here.
There are currently (1.18 and since 1.13) two kinds of external executable
filters:
input handlers:
o Simple filters (exec filters) run once and exit. They can be bare
programs like antiword, or scripts using other programs. They are very
simple to write, because they just need to print the converted
document to the standard output. Their output can be text/plain or
text/html.
o Simple exec handlers run once and exit. They can be bare programs like
antiword, or scripts using other programs. They are very simple to
write, because they just need to print the converted document to the
standard output. Their output can be plain text or HTML. HTML is
usually preferred because it can store metadata fields and it allows
preserving some of the formatting for the GUI preview.
o Multiple filters (execm filters), run as long as their master process
(recollindex) is active. They can process multiple files (sparing the
o Multiple execm handlers can process multiple files (sparing the
process startup time which can be very significant), or multiple
documents per file (e.g.: for zip or chm files). They communicate with
the indexer through a simple protocol, but are nevertheless a bit more
complicated than the older kind. Most of new filters are written in
complicated than the older kind. Most of the new handlers are written in
Python, using a common module to handle the protocol. There is an
exception, rclimg which is written in Perl. The subdocuments output by
these filters can be directly indexable (text or HTML), or they can be
other simple or compound documents that will need to be processed by
another filter.
these handlers can be directly indexable (text or HTML), or they can
be other simple or compound documents that will need to be processed
by another handler.
In both cases, filters deal with regular file system files, and can
In both cases, handlers deal with regular file system files, and can
process either a single document, or a linear list of documents in each
file. Recoll is responsible for performing up-to-date checks, dealing with
more complex embedding, and other upper-level issues.
In the extreme case of a simple filter returning a document in text/plain
format, no metadata can be transferred from the filter to the indexer.
Generic metadata, like document size or modification date, will be
gathered and stored by the indexer.
A simple handler returning a document in text/plain format cannot transfer
any metadata to the indexer. Generic metadata, like document size or
modification date, will be gathered and stored by the indexer.
Filters that produce text/html format can return an arbitrary amount of
Handlers that produce text/html format can return an arbitrary amount of
metadata inside HTML meta tags. These will be processed according to the
directives found in the fields configuration file.
The filters that can handle multiple documents per file return a single
The handlers that can handle multiple documents per file return a single
piece of data to identify each document inside the file. This piece of
data, called an ipath element, will be sent back by Recoll to extract the
document at query time, for previewing, or for creating a temporary file
to be opened by a viewer.
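The ipath idea can be illustrated with the standard Python zipfile module,
using the archive member names as identifiers. This is only a sketch of the
concept: it does not implement the execm protocol, and the function names
are hypothetical.

```python
import io
import zipfile


def list_ipaths(archive):
    """Return one ipath element per document stored in the archive.

    archive can be a file name or a file object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract_document(archive, ipath):
    """Return the bytes of the document designated by ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


# Usage with a small in-memory archive (a file name works the same way).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("notes/todo.txt", "buy milk")
    zf.writestr("readme.txt", "hello")
for ipath in list_ipaths(buffer):
    print(ipath, extract_document(buffer, ipath))
```

The first function corresponds to what happens at indexing time, and the
second one to the extraction performed later for a preview or a viewer.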
The following section describes the simple filters, and the next one gives
a few explanations about the execm ones. You could conceivably write a
simple filter with only the elements in the manual. This will not be the
case for the other ones, for which you will have to look at the code.
The following section describes the simple handlers, and the next one
gives a few explanations about the execm ones. You could conceivably write
a simple handler with only the elements in the manual. This will not be
the case for the other ones, for which you will have to look at the code.
4.1.1. Simple filters
4.1.1. Simple input handlers
Recoll simple filters are usually shell-scripts, but this is in no way
Recoll simple handlers are usually shell-scripts, but this is in no way
necessary. Extracting the text from the native format is the difficult
part. Outputting the format expected by Recoll is trivial. Happily enough,
most document formats have translators or text extractors which can be
called from the filter. In some cases the output of the translating
called from the handler. In some cases the output of the translating
program is completely appropriate, and no intermediate shell-script is
needed.
Filters are called with a single argument which is the source file name.
They should output the result to stdout.
Input handlers are called with a single argument which is the source file
name. They should output the result to stdout.
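As a rough sketch of this calling convention, here is what a trivial
handler for plain text files could look like in Python. The to_html and
main names and the rclexample script name are hypothetical, and the input
is assumed to be UTF-8.

```python
import html
import sys


def to_html(text):
    """Wrap text in a minimal HTML document, escaping markup characters."""
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
        "</head><body><pre>{}</pre></body></html>"
    ).format(html.escape(text))


def main(argv):
    """Convert the file named by the single argument, writing to stdout."""
    if len(argv) != 2:
        sys.stderr.write("usage: rclexample <file>\n")
        return 1
    with open(argv[1], encoding="utf-8", errors="replace") as source:
        sys.stdout.write(to_html(source.read()))
    return 0


# A script installed as a handler would end with:
#     sys.exit(main(sys.argv))
```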
When writing a filter, you should decide if it will output plain text or
When writing a handler, you should decide if it will output plain text or
HTML. Plain text is simpler, but you will not be able to add metadata or
vary the output character encoding (this will be defined in a
configuration file). Additionally, some formatting may be easier to
@ -2461,25 +2474,26 @@ Chapter 4. Programming interface
field searches.
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
the filter if the operation is for indexing or previewing. Some filters
the handler if the operation is for indexing or previewing. Some handlers
use this to output a slightly different format, for example stripping
uninteresting repeated keywords (ie: Subject: for email) when indexing.
This is not essential.
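A handler written in Python could test the variable as in the following
sketch. The helper names are hypothetical, and treating an unset variable
as "no" is an assumption.

```python
import os


def for_preview(environ=os.environ):
    """Tell if the handler was started for previewing (value 'yes')."""
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def render_header(name, value, preview):
    """Keep the repeated keyword (e.g. 'Subject: ') only when previewing."""
    if preview:
        return "{}: {}".format(name, value)
    return value


print(render_header("Subject", "Meeting notes", for_preview()))
```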
You should look at one of the simple filters, for example rclps for a
You should look at one of the simple handlers, for example rclps for a
starting point.
Don't forget to make your filter executable before testing !
Don't forget to make your handler executable before testing!
4.1.2. "Multiple" filters
4.1.2. "Multiple" handlers
If you can program and want to write an execm filter, it should not be too
difficult to make sense of one of the existing modules. For example, look
at rclzip which uses Zip file paths as identifiers (ipath), and rclics,
which uses an integer index. Also have a look at the comments inside the
internfile/mh_execm.h file and possibly at the corresponding module.
If you can program and want to write an execm handler, it should not be
too difficult to make sense of one of the existing modules. For example,
look at rclzip which uses Zip file paths as identifiers (ipath), and
rclics, which uses an integer index. Also have a look at the comments
inside the internfile/mh_execm.h file and possibly at the corresponding
module.
execm filters sometimes need to make a choice for the nature of the ipath
execm handlers sometimes need to make a choice for the nature of the ipath
elements that they use in communication with the indexer. Here are a few
guidelines:
@ -2491,34 +2505,34 @@ Chapter 4. Programming interface
o Recoll uses a colon (:) as a separator to store a complex path
internally (for deeper embedding). Colons inside the ipath elements
output by a filter will be escaped, but would be a bad choice as a
filter-specific separator (mostly, again, for debugging issues).
output by a handler will be escaped, but would be a bad choice as a
handler-specific separator (mostly, again, for debugging issues).
In any case, the main goal is that it should be easy for the filter to
In any case, the main goal is that it should be easy for the handler to
extract the target document, given the file name and the ipath element.
execm filters will also produce a document with a null ipath element.
execm handlers will also produce a document with a null ipath element.
Depending on the type of document, this may have some associated data
(e.g. the body of an email message), or none (typical for an archive
file). If it is empty, this document will be useful anyway for some
operations, as the parent of the actual data documents.
4.1.3. Telling Recoll about the filter
4.1.3. Telling Recoll about the handler
There are two elements that link a file to the filter which should process
it: the association of file to mime type and the association of a mime
type with a filter.
There are two elements that link a file to the handler which should
process it: the association of file to MIME type and the association of a
MIME type with a handler.
The association of files to mime types is mostly based on name suffixes.
The association of files to MIME types is mostly based on name suffixes.
The types are defined inside the mimemap file. Example:
.doc = application/msword
If no suffix association is found for the file name, Recoll will try to
execute the file -i command to determine a mime type.
execute the file -i command to determine a MIME type.
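The two-step lookup can be sketched in Python as follows. This is an
illustration, not the indexer's code: the function names are hypothetical,
lower-casing the suffix is an assumption, and the fallback supposes a file
command which accepts the -b (brief) and -i options.

```python
import os
import subprocess

# Suffix associations, as they would be read from the mimemap file.
MIMEMAP = {".doc": "application/msword"}


def system_mime_type(path):
    """Ask the system's file command for the MIME type of path."""
    output = subprocess.check_output(["file", "-b", "-i", path])
    # Typical output: 'text/plain; charset=us-ascii'
    return output.decode().split(";")[0].strip()


def guess_mime_type(path, mimemap=MIMEMAP, fallback=system_mime_type):
    """Use the suffix association if there is one, else the fallback."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix in mimemap:
        return mimemap[suffix]
    return fallback(path)


print(guess_mime_type("report.doc"))
```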
The association of file types to filters is performed in the mimeconf
The association of file types to handlers is performed in the mimeconf
file. A sample will probably be of better help than a long explanation:
@ -2545,10 +2559,10 @@ Chapter 4. Programming interface
iso-8859-1 encoding is specified because it is not the utf-8 default,
and not output by unrtf in the HTML header section.
o application/x-chm is processed by a persistant filter. This is
o application/x-chm is processed by a persistent handler. This is
determined by the execm keyword.
4.1.4. Filter HTML output
4.1.4. Input handler HTML output
The output HTML could be very minimal like the following example:
@ -2600,8 +2614,8 @@ Chapter 4. Programming interface
<meta name="date" content="2013-02-24 17:50:00">
Filters also have the possibility to "invent" field names. This should
also be output as meta tags:
Input handlers also have the possibility to "invent" field names. This
should also be output as meta tags:
<meta name="somefield" content="Some textual data" />
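When the handler is written in Python, the meta tags can be generated with
a small helper which takes care of escaping the values. The meta_tag and
html_head names are hypothetical; only the html module from the standard
library is used.

```python
import html


def meta_tag(name, content):
    """Return a meta tag for one field, escaping quotes and markup."""
    return '<meta name="{}" content="{}" />'.format(
        html.escape(name, quote=True), html.escape(content, quote=True)
    )


def html_head(fields):
    """Return an HTML head section holding one meta tag per field."""
    tags = "\n".join(meta_tag(name, value) for name, value in fields)
    return "<head>\n{}\n</head>".format(tags)


print(html_head([("somefield", "Some textual data")]))
```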
@ -2617,10 +2631,10 @@ Chapter 4. Programming interface
4.1.5. Page numbers
The indexer will interpret ^L characters in the filter output as
The indexer will interpret ^L characters in the handler output as
indicating page breaks, and will record them. At query time, this allows
starting a viewer on the right page for a hit or a snippet. Currently,
only the PDF, Postscript and DVI filters generate page breaks.
only the PDF, Postscript and DVI handlers generate page breaks.
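The ^L character is the ASCII form feed, written "\f" in Python. The
following sketch shows how a handler could separate pages, and how a page
number can be computed from a position in the text. The function names are
hypothetical and this is not the indexer's own code.

```python
FORM_FEED = "\f"  # the ^L character


def join_pages(pages):
    """Build the handler output from a list of page texts."""
    return FORM_FEED.join(pages)


def page_for_offset(text, offset):
    """Return the 1-based number of the page holding the given offset."""
    return text.count(FORM_FEED, 0, offset) + 1


text = join_pages(["first page", "second page", "third page"])
print(page_for_offset(text, text.index("second")))
```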
4.2. Field data processing
@ -2628,14 +2642,14 @@ Chapter 4. Programming interface
author, abstract.
The field values for documents can appear in several ways during indexing:
either output by filters as meta fields in the HTML header section, or
extracted from file extended attributes, or added as attributes of the Doc
object when using the API, or again synthetized internally by Recoll.
either output by input handlers as meta fields in the HTML header section,
or extracted from file extended attributes, or added as attributes of the
Doc object when using the API, or synthesized internally by Recoll.
The Recoll query language allows searching for text in a specific field.
Recoll defines a number of default fields. Additional ones can be output
by filters, and described in the fields configuration file.
by handlers, and described in the fields configuration file.
Fields can be:
@ -2794,7 +2808,7 @@ Chapter 4. Programming interface
The Db class
A Db object is created by a connect() function and holds a connection to a
A Db object is created by a connect() call and holds a connection to a
Recoll index.
Methods
@ -3088,7 +3102,7 @@ Chapter 5. Installation and configuration
text file inside the configuration directory.
A list of common file types which need external commands follows. Many of
the filters need the iconv command, which is not always listed as a
the handlers need the iconv command, which is not always listed as a
dependency.
Please note that, due to the relatively dynamic nature of this
@ -3103,7 +3117,7 @@ Chapter 5. Installation and configuration
http://www.recoll.org/features.html if a file type is important to you.
As of Recoll release 1.14, a number of XML-based formats that were handled
by ad hoc filter code now use the xsltproc command, which usually comes
by ad hoc handler code now use the xsltproc command, which usually comes
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
Now for the list:
@ -3121,7 +3135,7 @@ Chapter 5. Installation and configuration
it may be used as a fallback for some files which antiword does not
handle.
o MS Excel and PowerPoint need catdoc.
o MS Excel and PowerPoint are processed by internal Python handlers.
o MS Open XML (docx) needs xsltproc.
@ -3140,11 +3154,8 @@ Chapter 5. Installation and configuration
o djvu files need djvutxt and djvused from the DjVuLibre package.
o Audio files: Recoll releases before 1.13 used the id3info command from
the id3lib package to extract mp3 tag information, metaflac (standard
flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
Releases 1.14 and later use a single Python filter based on mutagen
for all audio file types.
o Audio files: Recoll releases 1.14 and later use a single Python
handler based on mutagen for all audio file types.
o Pictures: Recoll uses the Exiftool Perl package to extract tag
information. Most image file formats are supported. Note that there
@ -3152,7 +3163,7 @@ Chapter 5. Installation and configuration
aperture, etc.). This is only of interest if you store personal tags
or textual descriptions inside the image files.
o chm: files in microsoft help format need Python and the pychm module
o chm: files in Microsoft help format need Python and the pychm module
(which needs chmlib).
o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
@ -3168,11 +3179,11 @@ Chapter 5. Installation and configuration
o Konqueror webarchive format with Python (uses the Tarfile module).
o mimehtml web archive format (support based on the email filter, which
o Mimehtml web archive format (support based on the email handler, which
introduces some mild weirdness, but still usable).
Text, HTML, email folders, and Scribus files are processed internally. Lyx
is used to index Lyx files. Many filters need iconv and the standard sed
is used to index Lyx files. Many handlers need iconv and the standard sed
and awk.
5.3. Building from source
@ -3495,10 +3506,10 @@ Chapter 5. Installation and configuration
A space-separated list of patterns for names of files or
directories that should be ignored inside zip archives. This is
used directly by the zip filter, and has a function similar to
used directly by the zip handler, and has a function similar to
skippedNames, but works independently. Can be redefined for
filesystem subdirectories. For versions up to 1.19, you will need
to update the Zip filter and install a supplementary Python
to update the Zip handler and install a supplementary Python
module. The details are described on the Recoll wiki.
followLinks
@ -3513,11 +3524,16 @@ Chapter 5. Installation and configuration
indexedmimetypes
Recoll normally indexes any file which it knows how to read. This
list lets you restrict the indexed mime types to what you specify.
list lets you restrict the indexed MIME types to what you specify.
If the variable is unspecified or the list empty (the default),
all supported types are processed. Can be redefined for
subdirectories.
excludedmimetypes
This list lets you exclude some MIME types from indexing. Can be
redefined for subdirectories.
compressedfilemaxkbs
Size limit for compressed (.gz or .bz2) files. These need to be
@ -3550,14 +3566,14 @@ Chapter 5. Installation and configuration
Recoll indexes file names in a special section of the database to
allow specific file name searches using wildcards. This
parameter decides if file name indexing is performed only for
files with mime types that would qualify them for full text
files with MIME types that would qualify them for full text
indexing, or for all files inside the selected subtrees,
independently of mime type.
independently of MIME type.
usesystemfilecommand
Decide if we use the file -i system command as a final step for
determining the mime type for a file (the main procedure uses
determining the MIME type for a file (the main procedure uses
suffix associations as defined in the mimemap file). This can be
useful for files with suffix-less names, but it will also cause
the indexing of many bogus "text" files.
@ -3770,6 +3786,9 @@ Chapter 5. Installation and configuration
This is only used by the web browser plugin indexing code, and
defines the maximum size for the web page cache. Default: 40 MB.
Unfortunately, the value is only read when the cache file is
created: you need to delete the file for a change to take
effect.
idxflushmb
@ -3909,15 +3928,15 @@ Chapter 5. Installation and configuration
filtermaxseconds
Maximum filter execution time, after which it is aborted. Some
Maximum handler execution time, after which it is aborted. Some
postscript programs just loop...
filtersdir
A directory to search for the external filter scripts used to
index some types of files. The value should not be changed, except
if you want to modify one of the default scripts. The value can be
redefined for any sub-directory.
A directory to search for the external input handler scripts used
to index some types of files. The value should not be changed,
except if you want to modify one of the default scripts. The value
can be redefined for any sub-directory.
iconsdir
@ -3998,17 +4017,17 @@ Chapter 5. Installation and configuration
This section defines lists of synonyms for the canonical names
used inside the [prefixes] and [stored] sections
filter-specific sections
handler-specific sections
Some filters may need specific configuration for handling fields.
Only the email message filter currently has such a section (named
[mail]). It allows indexing arbitrary email headers in addition to
the ones indexed by default. Other such sections may appear in the
future.
Some input handlers may need specific configuration for handling
fields. Only the email message handler currently has such a
section (named [mail]). It allows indexing arbitrary email headers
in addition to the ones indexed by default. Other such sections
may appear in the future.
Here follows a small example of a personal fields file. This would extract
a specific email header and use it as a searchable field, with data
displayable inside result lists. (Side note: as the email filter does no
displayable inside result lists. (Side note: as the email handler does no
decoding on the values, only plain ASCII headers can be indexed, and only
the first occurrence will be used for headers that occur several times.)
@ -4040,10 +4059,10 @@ Chapter 5. Installation and configuration
5.4.3. The mimemap file
mimemap specifies the file name extension to mime type mappings.
mimemap specifies the file name extension to MIME type mappings.
For file names without an extension, or with an unknown one, the system's
file -i command will be executed to determine the mime type (this can be
file -i command will be executed to determine the MIME type (this can be
switched off inside the main configuration file).
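The two-step procedure can be sketched in Python as follows. This is an illustration, not Recoll's actual code: guess_mime_type and its parameters are hypothetical names, the suffix table stands in for the contents of mimemap, and use_file_command plays the role of the switch in the main configuration file. The -b option (brief output) is an addition which keeps the output of file easy to parse.

```python
import os
import subprocess


def guess_mime_type(path, suffix_map, use_file_command=True):
    """Look up the file name suffix first, then optionally run `file -i`."""
    suffix = os.path.splitext(path)[1]
    if suffix in suffix_map:
        return suffix_map[suffix]
    if not use_file_command:
        return None
    try:
        result = subprocess.run(
            ["file", "-b", "-i", path],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # The command is missing or failed: no type can be determined.
        return None
    # Output looks like "text/plain; charset=us-ascii": keep the type only.
    return result.stdout.split(";")[0].strip() or None
```

For example, guess_mime_type("report.doc", {".doc": "application/msword"}) is answered from the table alone, without running any command.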
The mappings can be specified on a per-subtree basis, which may be useful
@ -4064,7 +4083,7 @@ Chapter 5. Installation and configuration
5.4.4. The mimeconf file
mimeconf specifies how the different mime types are handled for indexing,
mimeconf specifies how the different MIME types are handled for indexing,
and which icons are displayed in the recoll result lists.
Changing the parameters in the [index] section is probably not a good idea
@ -4088,7 +4107,7 @@ Chapter 5. Installation and configuration
Recoll GUI preferences, all mimeview entries will be ignored except the
one labelled application/x-all (which is set to use xdg-open by default).
In this case, the xallexcepts top level variable defines a list of mime
In this case, the xallexcepts top level variable defines a list of MIME
type exceptions which will be processed according to the local entries
instead of being passed to the desktop. This is so that specific Recoll
options such as a page number or a search string can be passed to
@ -4101,13 +4120,13 @@ Chapter 5. Installation and configuration
All viewer definition entries must be placed under a [view] section.
The keys in the file are normally mime types. You can add an application
The keys in the file are normally MIME types. You can add an application
tag to specialize the choice for an area of the filesystem (using a
localfields specification in mimeconf). The syntax for the key is
mimetype|tag
The nouncompforviewmts entry, (placed at the top level, outside of the
[view] section), holds a list of mime types that should not be
[view] section), holds a list of MIME types that should not be
uncompressed before starting the viewer (if they are found compressed, e.g.
mydoc.doc.gz).
@ -4127,7 +4146,7 @@ Chapter 5. Installation and configuration
will not create a temporary file to extract the subdocument, expecting
the called application (possibly a script) to be able to handle it.
o %M. Mime type
o %M. MIME type
o %p. Page index. Only significant for a subset of document types,
currently only PDF, Postscript and DVI files. Can be used to start the
@ -4180,7 +4199,7 @@ Chapter 5. Installation and configuration
.blob = application/x-blobapp
Note that the mime type is made up here, and you could call it
Note that the MIME type is made up here, and you could call it
diesel/oil just the same.
o In $RECOLL_CONFDIR/mimeview under the [view] section, add:
@ -4191,7 +4210,7 @@ Chapter 5. Installation and configuration
would use %u if it liked URLs better.
If you just wanted to change the application used by Recoll to display a
mime type which it already knows, you would just need to edit mimeview.
MIME type which it already knows, you would just need to edit mimeview.
The entries you add in your personal file override those in the central
configuration, which you do not need to alter. mimeview can also be
modified from the GUI.
@ -4213,14 +4232,14 @@ Chapter 5. Installation and configuration
for the files inside the result lists. Icons are normally 64x64 pixels
PNG files which live in /usr/[local/]share/recoll/images.
o Under the [categories] section, you should add the mime type where it
o Under the [categories] section, you should add the MIME type where it
makes sense (you can also create a category). Categories may be used
for filtering in advanced search.
The rclblob filter should be an executable program or script which exists
The rclblob handler should be an executable program or script which exists
inside /usr/[local/]share/recoll/filters. It will be given a file name as
argument and should output the text or html contents on the standard
output.
The filter programming section describes in more detail how to write a
filter.
The filter programming section describes in more detail how to write an
input handler.
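To give an idea of the general shape of such a program, here is a minimal Python sketch of a handler like rclblob. It assumes, purely for the sake of the example, that the made-up blob format is plain UTF-8 text, so that "extracting" the contents is just reading the file. A real handler would parse the actual format at that point.

```python
import html
import sys


def convert(path):
    """Read the file and return its contents wrapped in minimal HTML."""
    # Assumption: the input is UTF-8 text; undecodable bytes are replaced.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body><pre>" + html.escape(text) + "</pre></body></html>\n"
    )


def main(argv):
    """Entry point: expects the file name as the only argument."""
    if len(argv) != 2:
        sys.stderr.write("usage: rclblob <file>\n")
        return 1
    sys.stdout.write(convert(argv[1]))
    return 0

# A real script would end with: sys.exit(main(sys.argv))
```

The file name arrives as the single argument and the result goes to the standard output, as described above. The non-zero return value for a usage error is a choice made for this sketch.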