release 3636

This commit is contained in:
Jean-Francois Dockes 2014-05-24 14:26:53 +02:00
parent 16b63b4e14
commit b7511f6f17
2 changed files with 209 additions and 185 deletions

@ -81,7 +81,7 @@ Chapter 5. Installation and configuration
text file inside the configuration directory. text file inside the configuration directory.
A list of common file types which need external commands follows. Many of A list of common file types which need external commands follows. Many of
the filters need the iconv command, which is not always listed as a the handlers need the iconv command, which is not always listed as a
dependency. dependency.
Please note that, due to the relatively dynamic nature of this Please note that, due to the relatively dynamic nature of this
@ -96,7 +96,7 @@ Chapter 5. Installation and configuration
http://www.recoll.org/features.html if a file type is important to you. http://www.recoll.org/features.html if a file type is important to you.
As of Recoll release 1.14, a number of XML-based formats that were handled As of Recoll release 1.14, a number of XML-based formats that were handled
by ad hoc filter code now use the xsltproc command, which usually comes by ad hoc handler code now use the xsltproc command, which usually comes
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg. with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
Now for the list: Now for the list:
@ -114,7 +114,7 @@ Chapter 5. Installation and configuration
it may be used as a fallback for some files which antiword does not it may be used as a fallback for some files which antiword does not
handle. handle.
o MS Excel and PowerPoint need catdoc. o MS Excel and PowerPoint are processed by internal Python handlers.
o MS Open XML (docx) needs xsltproc. o MS Open XML (docx) needs xsltproc.
@ -133,11 +133,8 @@ Chapter 5. Installation and configuration
o djvu files need djvutxt and djvused from the DjVuLibre package. o djvu files need djvutxt and djvused from the DjVuLibre package.
o Audio files: Recoll releases before 1.13 used the id3info command from o Audio files: Recoll releases 1.14 and later use a single Python
the id3lib package to extract mp3 tag information, metaflac (standard handler based on mutagen for all audio file types.
flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
Releases 1.14 and later use a single Python filter based on mutagen
for all audio file types.
o Pictures: Recoll uses the Exiftool Perl package to extract tag o Pictures: Recoll uses the Exiftool Perl package to extract tag
information. Most image file formats are supported. Note that there information. Most image file formats are supported. Note that there
@ -145,7 +142,7 @@ Chapter 5. Installation and configuration
aperture, etc.). This is only of interest if you store personal tags aperture, etc.). This is only of interest if you store personal tags
or textual descriptions inside the image files. or textual descriptions inside the image files.
o chm: files in microsoft help format need Python and the pychm module o chm: files in Microsoft help format need Python and the pychm module
(which needs chmlib). (which needs chmlib).
o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
@ -161,11 +158,11 @@ Chapter 5. Installation and configuration
o Konqueror webarchive format with Python (uses the Tarfile module). o Konqueror webarchive format with Python (uses the Tarfile module).
o mimehtml web archive format (support based on the email filter, which o Mimehtml web archive format (support based on the email handler, which
introduces some mild weirdness, but still usable). introduces some mild weirdness, but still usable).
Text, HTML, email folders, and Scribus files are processed internally. Lyx Text, HTML, email folders, and Scribus files are processed internally. Lyx
is used to index Lyx files. Many filters need iconv and the standard sed is used to index Lyx files. Many handlers need iconv and the standard sed
and awk. and awk.
---------------------------------------------------------------------- ----------------------------------------------------------------------
@ -515,10 +512,10 @@ Chapter 5. Installation and configuration
A space-separated list of patterns for names of files or A space-separated list of patterns for names of files or
directories that should be ignored inside zip archives. This is directories that should be ignored inside zip archives. This is
used directly by the zip filter, and has a function similar to used directly by the zip handler, and has a function similar to
skippedNames, but works independently. Can be redefined for skippedNames, but works independently. Can be redefined for
filesystem subdirectories. For versions up to 1.19, you will need filesystem subdirectories. For versions up to 1.19, you will need
to update the Zip filter and install a supplementary Python to update the Zip handler and install a supplementary Python
module. The details are described on the Recoll wiki. module. The details are described on the Recoll wiki.
followLinks followLinks
@ -533,11 +530,16 @@ Chapter 5. Installation and configuration
indexedmimetypes indexedmimetypes
Recoll normally indexes any file which it knows how to read. This Recoll normally indexes any file which it knows how to read. This
list lets you restrict the indexed mime types to what you specify. list lets you restrict the indexed MIME types to what you specify.
If the variable is unspecified or the list empty (the default), If the variable is unspecified or the list empty (the default),
all supported types are processed. Can be redefined for all supported types are processed. Can be redefined for
subdirectories. subdirectories.
excludedmimetypes
This list lets you exclude some MIME types from indexing. Can be
redefined for subdirectories.
compressedfilemaxkbs compressedfilemaxkbs
Size limit for compressed (.gz or .bz2) files. These need to be Size limit for compressed (.gz or .bz2) files. These need to be
@ -570,14 +572,14 @@ Chapter 5. Installation and configuration
Recoll indexes file names in a special section of the database to Recoll indexes file names in a special section of the database to
allow specific file names searches using wild cards. This allow specific file names searches using wild cards. This
parameter decides if file name indexing is performed only for parameter decides if file name indexing is performed only for
files with mime types that would qualify them for full text files with MIME types that would qualify them for full text
indexing, or for all files inside the selected subtrees, indexing, or for all files inside the selected subtrees,
independently of mime type. independently of MIME type.
usesystemfilecommand usesystemfilecommand
Decide if we use the file -i system command as a final step for Decide if we use the file -i system command as a final step for
determining the mime type for a file (the main procedure uses determining the MIME type for a file (the main procedure uses
suffix associations as defined in the mimemap file). This can be suffix associations as defined in the mimemap file). This can be
useful for files with suffix-less names, but it will also cause useful for files with suffix-less names, but it will also cause
the indexing of many bogus "text" files. the indexing of many bogus "text" files.
@ -790,6 +792,9 @@ Chapter 5. Installation and configuration
This is only used by the web browser plugin indexing code, and This is only used by the web browser plugin indexing code, and
defines the maximum size for the web page cache. Default: 40 MB. defines the maximum size for the web page cache. Default: 40 MB.
Quite unfortunately, this is only taken into account when creating
the cache file. You need to delete the file for a change to be
taken into account.
idxflushmb idxflushmb
@ -929,15 +934,15 @@ Chapter 5. Installation and configuration
filtermaxseconds filtermaxseconds
Maximum filter execution time, after which it is aborted. Some Maximum handler execution time, after which it is aborted. Some
postscript programs just loop... postscript programs just loop...
filtersdir filtersdir
A directory to search for the external filter scripts used to A directory to search for the external input handler scripts used
index some types of files. The value should not be changed, except to index some types of files. The value should not be changed,
if you want to modify one of the default scripts. The value can be except if you want to modify one of the default scripts. The value
redefined for any sub-directory. can be redefined for any sub-directory.
iconsdir iconsdir
@ -1018,17 +1023,17 @@ Chapter 5. Installation and configuration
This section defines lists of synonyms for the canonical names This section defines lists of synonyms for the canonical names
used inside the [prefixes] and [stored] sections used inside the [prefixes] and [stored] sections
filter-specific sections handler-specific sections
Some filters may need specific configuration for handling fields. Some input handlers may need specific configuration for handling
Only the email message filter currently has such a section (named fields. Only the email message handler currently has such a
[mail]). It allows indexing arbitrary email headers in addition to section (named [mail]). It allows indexing arbitrary email headers
the ones indexed by default. Other such sections may appear in the in addition to the ones indexed by default. Other such sections
future. may appear in the future.
Here follows a small example of a personal fields file. This would extract Here follows a small example of a personal fields file. This would extract
a specific email header and use it as a searchable field, with data a specific email header and use it as a searchable field, with data
displayable inside result lists. (Side note: as the email filter does no displayable inside result lists. (Side note: as the email handler does no
decoding on the values, only plain ascii headers can be indexed, and only decoding on the values, only plain ascii headers can be indexed, and only
the first occurrence will be used for headers that occur several times). the first occurrence will be used for headers that occur several times).
@ -1060,10 +1065,10 @@ Chapter 5. Installation and configuration
5.4.3. The mimemap file 5.4.3. The mimemap file
mimemap specifies the file name extension to mime type mappings. mimemap specifies the file name extension to MIME type mappings.
For file names without an extension, or with an unknown one, the system's For file names without an extension, or with an unknown one, the system's
file -i command will be executed to determine the mime type (this can be file -i command will be executed to determine the MIME type (this can be
switched off inside the main configuration file). switched off inside the main configuration file).
The mappings can be specified on a per-subtree basis, which may be useful The mappings can be specified on a per-subtree basis, which may be useful
@ -1084,7 +1089,7 @@ Chapter 5. Installation and configuration
5.4.4. The mimeconf file 5.4.4. The mimeconf file
mimeconf specifies how the different mime types are handled for indexing, mimeconf specifies how the different MIME types are handled for indexing,
and which icons are displayed in the recoll result lists. and which icons are displayed in the recoll result lists.
Changing the parameters in the [index] section is probably not a good idea Changing the parameters in the [index] section is probably not a good idea
@ -1108,7 +1113,7 @@ Chapter 5. Installation and configuration
Recoll GUI preferences, all mimeview entries will be ignored except the Recoll GUI preferences, all mimeview entries will be ignored except the
one labelled application/x-all (which is set to use xdg-open by default). one labelled application/x-all (which is set to use xdg-open by default).
In this case, the xallexcepts top level variable defines a list of mime In this case, the xallexcepts top level variable defines a list of MIME
type exceptions which will be processed according to the local entries type exceptions which will be processed according to the local entries
instead of being passed to the desktop. This is so that specific Recoll instead of being passed to the desktop. This is so that specific Recoll
options such as a page number or a search string can be passed to options such as a page number or a search string can be passed to
@ -1121,13 +1126,13 @@ Chapter 5. Installation and configuration
All viewer definition entries must be placed under a [view] section. All viewer definition entries must be placed under a [view] section.
The keys in the file are normally mime types. You can add an application The keys in the file are normally MIME types. You can add an application
tag to specialize the choice for an area of the filesystem (using a tag to specialize the choice for an area of the filesystem (using a
localfields specification in mimeconf). The syntax for the key is localfields specification in mimeconf). The syntax for the key is
mimetype|tag mimetype|tag
The nouncompforviewmts entry, (placed at the top level, outside of the The nouncompforviewmts entry, (placed at the top level, outside of the
[view] section), holds a list of mime types that should not be [view] section), holds a list of MIME types that should not be
uncompressed before starting the viewer (if they are found compressed, ie: uncompressed before starting the viewer (if they are found compressed, ie:
mydoc.doc.gz). mydoc.doc.gz).
@ -1147,7 +1152,7 @@ Chapter 5. Installation and configuration
will not create a temporary file to extract the subdocument, expecting will not create a temporary file to extract the subdocument, expecting
the called application (possibly a script) to be able to handle it. the called application (possibly a script) to be able to handle it.
o %M. Mime type o %M. MIME type
o %p. Page index. Only significant for a subset of document types, o %p. Page index. Only significant for a subset of document types,
currently only PDF, Postscript and DVI files. Can be used to start the currently only PDF, Postscript and DVI files. Can be used to start the
@ -1200,7 +1205,7 @@ Chapter 5. Installation and configuration
.blob = application/x-blobapp .blob = application/x-blobapp
Note that the mime type is made up here, and you could call it Note that the MIME type is made up here, and you could call it
diesel/oil just the same. diesel/oil just the same.
o In $RECOLL_CONFDIR/mimeview under the [view] section, add: o In $RECOLL_CONFDIR/mimeview under the [view] section, add:
@ -1211,7 +1216,7 @@ Chapter 5. Installation and configuration
would use %u if it liked URLs better. would use %u if it liked URLs better.
If you just wanted to change the application used by Recoll to display a If you just wanted to change the application used by Recoll to display a
mime type which it already knows, you would just need to edit mimeview. MIME type which it already knows, you would just need to edit mimeview.
The entries you add in your personal file override those in the central The entries you add in your personal file override those in the central
configuration, which you do not need to alter. mimeview can also be configuration, which you do not need to alter. mimeview can also be
modified from the Gui. modified from the Gui.
@ -1233,17 +1238,17 @@ Chapter 5. Installation and configuration
for the files inside the result lists. Icons are normally 64x64 pixels for the files inside the result lists. Icons are normally 64x64 pixels
PNG files which live in /usr/[local/]share/recoll/images. PNG files which live in /usr/[local/]share/recoll/images.
o Under the [categories] section, you should add the mime type where it o Under the [categories] section, you should add the MIME type where it
makes sense (you can also create a category). Categories may be used makes sense (you can also create a category). Categories may be used
for filtering in advanced search. for filtering in advanced search.
The rclblob filter should be an executable program or script which exists The rclblob handler should be an executable program or script which exists
inside /usr/[local/]share/recoll/filters. It will be given a file name as inside /usr/[local/]share/recoll/filters. It will be given a file name as
argument and should output the text or html contents on the standard argument and should output the text or html contents on the standard
output. output.
The filter programming section describes in more detail how to write a The filter programming section describes in more detail how to write an
filter. input handler.
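
As an illustration of the contract described above (a file name as argument, text or HTML on standard output), here is a minimal sketch of what a simple rclblob handler could look like. It assumes that the made-up .blob format is just UTF-8 text; the function names blob_to_html and main are hypothetical and not part of Recoll.

```python
import html
import sys


def blob_to_html(data: bytes, title: str) -> str:
    """Wrap the file content in a small HTML document.

    Assumption: a .blob file is plain UTF-8 text. A real handler would
    decode or convert the actual format here.
    """
    text = data.decode("utf-8", errors="replace")
    return (
        "<html><head>\n"
        f"<title>{html.escape(title)}</title>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        "</head><body>\n"
        f"<pre>{html.escape(text)}</pre>\n"
        "</body></html>\n"
    )


def main(argv) -> int:
    """Read the file named by the single argument, print HTML to stdout."""
    if len(argv) != 2:
        sys.stderr.write("usage: rclblob <filename>\n")
        return 1
    with open(argv[1], "rb") as f:
        sys.stdout.write(blob_to_html(f.read(), argv[1]))
    return 0


# An installed script would end with: sys.exit(main(sys.argv))
```

Producing HTML rather than plain text leaves room for a title and for meta tags, at the cost of having to escape the content.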
---------------------------------------------------------------------- ----------------------------------------------------------------------

@ -134,15 +134,15 @@ More documentation can be found in the doc/ directory or at http://www.recoll.or
4. Programming interface 4. Programming interface
4.1. Writing a document filter 4.1. Writing a document input handler
4.1.1. Simple filters 4.1.1. Simple input handlers
4.1.2. "Multiple" filters 4.1.2. "Multiple" handlers
4.1.3. Telling Recoll about the filter 4.1.3. Telling Recoll about the handler
4.1.4. Filter HTML output 4.1.4. Input handler HTML output
4.1.5. Page numbers 4.1.5. Page numbers
@ -259,7 +259,7 @@ Chapter 1. Introduction
Recoll stores all internal data in Unicode UTF-8 format, and it can index Recoll stores all internal data in Unicode UTF-8 format, and it can index
files with different character sets, encodings, and languages into the files with different character sets, encodings, and languages into the
same index. It has input filters for many document types. same index. It can process many document types.
Stemming is the process by which Recoll reduces words to their radicals so Stemming is the process by which Recoll reduces words to their radicals so
that searching does not depend, for example, on a word being singular or that searching does not depend, for example, on a word being singular or
@ -418,13 +418,13 @@ Chapter 2. Indexing
Excluding types can be done by adding wildcard name patterns to the Excluding types can be done by adding wildcard name patterns to the
skippedNames list, which can be done from the GUI Index configuration skippedNames list, which can be done from the GUI Index configuration
menu. It is also possible to exclude a mime type independently of the file menu. For versions 1.20 and later, you can alternatively set the
name by associating it with the rclnull filter. This can be done by excludedmimetypes list in the configuration file. This can be redefined
editing the mimeconf configuration file. for subdirectories.
In order to define a positive list, you need to edit the main You can also define an exclusive list of MIME types to be indexed (no
configuration file (recoll.conf) and set the indexedmimetypes others will be indexed), by setting the indexedmimetypes configuration
configuration variable. Example: variable. Example:
indexedmimetypes = text/html application/pdf indexedmimetypes = text/html application/pdf
@ -436,10 +436,11 @@ Chapter 2. Indexing
(When using sections like this, don't forget that they remain in effect (When using sections like this, don't forget that they remain in effect
until the end of the file or another section indicator). There is no GUI until the end of the file or another section indicator).
way to edit the parameter, because this option runs contrary to Recoll
main goal which is to help you find information, independently of how it excludedmimetypes or indexedmimetypes can be set either by editing the
may be stored. main configuration file (recoll.conf), or from the GUI index configuration
tool.
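
The combined effect of the two lists can be sketched as follows. This is only an illustration, not Recoll's actual code: the function name should_index is hypothetical, and both the exact-string matching and the rule that an exclusion wins over an inclusion are assumptions, since the text above does not say how the lists interact.

```python
def should_index(mime_type, indexedmimetypes="", excludedmimetypes=""):
    """Decide whether a MIME type would be indexed.

    Both parameters are space-separated lists, as in recoll.conf.
    Assumption: exclusion takes precedence over inclusion.
    """
    indexed = set(indexedmimetypes.split())
    excluded = set(excludedmimetypes.split())
    if mime_type in excluded:
        return False
    # An empty indexedmimetypes list (the default) means "all supported types".
    return not indexed or mime_type in indexed
```

With indexedmimetypes set to "text/html application/pdf" as in the example above, should_index("image/png", "text/html application/pdf") returns False, while an empty configuration accepts every type.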
2.1.4. Recovery 2.1.4. Recovery
@ -702,7 +703,7 @@ Chapter 2. Indexing
mime_type mime_type
If set, this overrides any other determination of the file mime If set, this overrides any other determination of the file MIME
type. type.
charset charset
@ -1018,11 +1019,11 @@ Chapter 3. Searching
you prefer to completely customize the choice of applications, you can you prefer to completely customize the choice of applications, you can
uncheck the Use desktop preferences option in the GUI preferences dialog, uncheck the Use desktop preferences option in the GUI preferences dialog,
and click the Choose editor applications button to adjust the predefined and click the Choose editor applications button to adjust the predefined
Recoll choices. The tool accepts multiple selections of mime types (e.g. Recoll choices. The tool accepts multiple selections of MIME types (e.g.
to set up the editor for the dozens of office file types). to set up the editor for the dozens of office file types).
Even when Use desktop preferences is checked, there is a small list of Even when Use desktop preferences is checked, there is a small list of
exceptions, for mime types where the Recoll choice should override the exceptions, for MIME types where the Recoll choice should override the
desktop one. These are applications which are well integrated with Recoll, desktop one. These are applications which are well integrated with Recoll,
especially evince for viewing PDF and Postscript files because of its especially evince for viewing PDF and Postscript files because of its
support for opening the document at a specific page and passing a search support for opening the document at a specific page and passing a search
@ -1242,7 +1243,7 @@ Chapter 3. Searching
specifying multiple clauses which are combined to build the search. specifying multiple clauses which are combined to build the search.
2. The second tab lets you filter the results according to file size, date of 2. The second tab lets you filter the results according to file size, date of
modification, mime type, or location. modification, MIME type, or location.
Click on the Start Search button in the advanced search dialog, or type Click on the Start Search button in the advanced search dialog, or type
Enter in any text field to start the search. The button in the main window Enter in any text field to start the search. The button in the main window
@ -1305,8 +1306,8 @@ Chapter 3. Searching
can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12 can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
respectively. respectively.
o The next section allows filtering the results by their mime types, or o The next section allows filtering the results by their MIME types, or
mime categories (ie: media/text/message/etc.). MIME categories (ie: media/text/message/etc.).
You can transfer the types between two boxes, to define which will be You can transfer the types between two boxes, to define which will be
included or excluded by the search. included or excluded by the search.
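
The size suffix multipliers mentioned above (k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12) can be illustrated with a small sketch. The function name parse_size is hypothetical, and restricting the numeric part to integers is an assumption made for the example.

```python
# Decimal multipliers, as listed in the text (not powers of 1024).
_MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(value: str) -> int:
    """Convert a size such as '10k' or '2M' to a number of bytes."""
    value = value.strip()
    if not value:
        raise ValueError("empty size")
    suffix = value[-1].lower()
    if suffix in _MULTIPLIERS:
        return int(value[:-1]) * _MULTIPLIERS[suffix]
    return int(value)
```

For example, parse_size("10k") gives 10000 and parse_size("2M") gives 2000000.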
@ -1647,7 +1648,7 @@ Chapter 3. Searching
an appropriate application. an appropriate application.
o Exceptions: when using the desktop preferences for opening documents, o Exceptions: when using the desktop preferences for opening documents,
these are mime types that will still be opened according to Recoll these are MIME types that will still be opened according to Recoll
preferences. This is useful for passing parameters like page numbers preferences. This is useful for passing parameters like page numbers
or search strings to applications that support them (e.g. evince). or search strings to applications that support them (e.g. evince).
This cannot be done with xdg-open which only supports passing one This cannot be done with xdg-open which only supports passing one
@ -1789,7 +1790,7 @@ Chapter 3. Searching
o %D. Date o %D. Date
o %I. Icon image name. This is normally determined from the mime type. o %I. Icon image name. This is normally determined from the MIME type.
The associations are defined inside the mimeconf configuration file. The associations are defined inside the mimeconf configuration file.
If a thumbnail for the file is found at the standard Freedesktop If a thumbnail for the file is found at the standard Freedesktop
location, this will be displayed instead. location, this will be displayed instead.
@ -1798,7 +1799,7 @@ Chapter 3. Searching
o %L. Precooked Preview, Edit, and possibly Snippets links o %L. Precooked Preview, Edit, and possibly Snippets links
o %M. Mime type o %M. MIME type
o %N. result Number inside the result page o %N. result Number inside the result page
@ -1824,7 +1825,7 @@ Chapter 3. Searching
stored by default, apart from the values above (only author and filename), stored by default, apart from the values above (only author and filename),
so this feature will need some custom local configuration to be useful. An so this feature will need some custom local configuration to be useful. An
example candidate would be the recipient field which is generated by the example candidate would be the recipient field which is generated by the
message filters. message input handlers.
The default value for the paragraph format string is: The default value for the paragraph format string is:
@ -1949,6 +1950,8 @@ Chapter 3. Searching
-m : dump the whole document meta[] array for each result -m : dump the whole document meta[] array for each result
-A : output the document abstracts -A : output the document abstracts
-S fld : sort by field <fld> -S fld : sort by field <fld>
-s stemlang : set stemming language to use (must exist in index...)
Use -s "" to turn off stem expansion
-D : sort descending -D : sort descending
-i <dbdir> : additional index, several can be given -i <dbdir> : additional index, several can be given
-e use url encoding (%xx) for urls -e use url encoding (%xx) for urls
@ -2139,7 +2142,7 @@ Chapter 3. Searching
Periods can also be specified with small letters (ie: p2y). Periods can also be specified with small letters (ie: p2y).
o mime or format for specifying the mime type. This one is quite special o mime or format for specifying the MIME type. This one is quite special
because you can specify several values which will be OR'ed (the normal because you can specify several values which will be OR'ed (the normal
default for the language is AND). Ex: mime:text/plain mime:text/html. default for the language is AND). Ex: mime:text/plain mime:text/html.
Specifying an explicit boolean operator before a mime specification is Specifying an explicit boolean operator before a mime specification is
@ -2149,11 +2152,11 @@ Chapter 3. Searching
with an OR default. You do need to use OR with ext terms for example. with an OR default. You do need to use OR with ext terms for example.
o type or rclcat for specifying the category (as in o type or rclcat for specifying the category (as in
text/media/presentation/etc.). The classification of mime types in text/media/presentation/etc.). The classification of MIME types in
categories is defined in the Recoll configuration (mimeconf), and can categories is defined in the Recoll configuration (mimeconf), and can
be modified or extended. The default category names are those which be modified or extended. The default category names are those which
permit filtering results in the main GUI screen. Categories are OR'ed permit filtering results in the main GUI screen. Categories are OR'ed
like mime types above. This can't be negated with - either. like MIME types above. This can't be negated with - either.
Words inside phrases and capitalized words are not stem-expanded. Words inside phrases and capitalized words are not stem-expanded.
Wildcards may be used anywhere inside a term. Specifying a wild-card on Wildcards may be used anywhere inside a term. Specifying a wild-card on
@ -2161,9 +2164,9 @@ Chapter 3. Searching
one if the expansion is truncated because of excessive size). Also see one if the expansion is truncated because of excessive size). Also see
More about wildcards. More about wildcards.
The document filters used while indexing have the possibility to create The document input handlers used while indexing have the possibility to
other fields with arbitrary names, and aliases may be defined in the create other fields with arbitrary names, and aliases may be defined in
configuration, so that the exact field search possibilities may be the configuration, so that the exact field search possibilities may be
different for you if someone took care of the customisation. different for you if someone took care of the customisation.
3.5.1. Modifiers 3.5.1. Modifiers
@ -2378,81 +2381,91 @@ Chapter 4. Programming interface
Recoll has an Application Programming Interface, usable both for indexing Recoll has an Application Programming Interface, usable both for indexing
and searching, currently accessible from the Python language. and searching, currently accessible from the Python language.
Another less radical way to extend the application is to write filters for Another less radical way to extend the application is to write input
new types of documents. handlers for new types of documents.
The processing of metadata attributes for documents (fields) is highly The processing of metadata attributes for documents (fields) is highly
configurable. configurable.
4.1. Writing a document filter 4.1. Writing a document input handler
Recoll filters cooperate to translate from the multitude of input document Terminology
formats, simple ones (such as opendocument, acrobat), or compound ones such as
Zip or Email, into the final Recoll indexing input format, which may be The small programs or pieces of code which handle the processing of the
text/plain or text/html. Most filters are executable programs or scripts. different document types for Recoll used to be called filters, which is
A few filters are coded in C++ and live inside recollindex. This latter still reflected in the name of the directory which holds them and many
configuration variables. They were named this way because one of their
primary functions is to filter out the formatting directives and keep the
text content. However these modules may have other behaviours, and the
term input handler is now progressively substituted in the documentation.
filter is still used in many places though.
Recoll input handlers cooperate to translate from the multitude of input
document formats, simple ones (such as opendocument, acrobat), or compound ones
such as Zip or Email, into the final Recoll indexing input format, which
is plain text. Most input handlers are executable programs or scripts. A
few handlers are coded in C++ and live inside recollindex. This latter
kind will not be described here. kind will not be described here.
There are currently (1.18 and since 1.13) two kinds of external executable
input handlers:

  o Simple exec handlers run once and exit. They can be bare programs like
    antiword, or scripts using other programs. They are very simple to
    write, because they just need to print the converted document to the
    standard output. Their output can be plain text or HTML. HTML is
    usually preferred because it can store metadata fields and it allows
    preserving some of the formatting for the GUI preview.

  o Multiple execm handlers can process multiple files (sparing the
    process startup time which can be very significant), or multiple
    documents per file (e.g.: for zip or chm files). They communicate with
    the indexer through a simple protocol, but are nevertheless a bit more
    complicated than the older kind. Most of the new handlers are written
    in Python, using a common module to handle the protocol. One exception
    is rclimg, which is written in Perl. The subdocuments output by these
    handlers can be directly indexable (text or HTML), or they can be
    other simple or compound documents that will need to be processed by
    another handler.
In both cases, handlers deal with regular file system files, and can
process either a single document, or a linear list of documents in each
file. Recoll is responsible for performing up-to-date checks, dealing with
more complex embedding, and other upper-level issues.

A simple handler returning a document in text/plain format can transfer
no metadata to the indexer. Generic metadata, like document size or
modification date, will be gathered and stored by the indexer.

Handlers that produce text/html format can return an arbitrary amount of
metadata inside HTML meta tags. These will be processed according to the
directives found in the fields configuration file.

The handlers that can handle multiple documents per file return a single
piece of data to identify each document inside the file. This piece of
data, called an ipath element, will be sent back by Recoll to extract the
document at query time, for previewing, or for creating a temporary file
to be opened by a viewer.

The following section describes the simple handlers, and the next one
gives a few explanations about the execm ones. You could conceivably write
a simple handler with only the elements in the manual. This will not be
the case for the other ones, for which you will have to look at the code.
4.1.1. Simple input handlers

Recoll simple handlers are usually shell-scripts, but this is in no way
necessary. Extracting the text from the native format is the difficult
part. Outputting the format expected by Recoll is trivial. Happily enough,
most document formats have translators or text extractors which can be
called from the handler. In some cases the output of the translating
program is completely appropriate, and no intermediate shell-script is
needed.

Input handlers are called with a single argument which is the source file
name. They should output the result to stdout.
When writing a handler, you should decide if it will output plain text or
HTML. Plain text is simpler, but you will not be able to add metadata or
vary the output character encoding (this will be defined in a
configuration file). Additionally, some formatting may be easier to
@ -2461,25 +2474,26 @@ Chapter 4. Programming interface
field searches.

The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
the handler if the operation is for indexing or previewing. Some handlers
use this to output a slightly different format, for example stripping
uninteresting repeated keywords (e.g. Subject: for email) when indexing.
This is not essential.
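A handler written in Python could test the variable as sketched below. The
function names are hypothetical, and treating an unset variable as "no" is
an assumption made for the example, not documented behaviour.

```python
import os


def is_for_preview(environ=os.environ):
    """True if the handler is being run to build a preview."""
    # Assumption: an unset variable means that we are indexing.
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def format_header(name, value, for_preview):
    """Keep the repeated keyword for previews, drop it when indexing."""
    return "%s: %s" % (name, value) if for_preview else value
```

With this, format_header("Subject", "Hello", is_for_preview()) keeps the
Subject: keyword in the preview text only.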
You should look at one of the simple handlers, for example rclps, for a
starting point.

Don't forget to make your handler executable before testing!

4.1.2. "Multiple" handlers

If you can program and want to write an execm handler, it should not be
too difficult to make sense of one of the existing modules. For example,
look at rclzip which uses Zip file paths as identifiers (ipath), and
rclics, which uses an integer index. Also have a look at the comments
inside the internfile/mh_execm.h file and possibly at the corresponding
module.

execm handlers sometimes need to make a choice for the nature of the ipath
elements that they use in communication with the indexer. Here are a few
guidelines:
@ -2491,34 +2505,34 @@ Chapter 4. Programming interface
  o Recoll uses a colon (:) as a separator to store a complex path
    internally (for deeper embedding). Colons inside the ipath elements
    output by a handler will be escaped, but would be a bad choice as a
    handler-specific separator (mostly, again, for debugging issues).

In any case, the main goal is that it should be easy for the handler to
extract the target document, given the file name and the ipath element.

execm handlers will also produce a document with a null ipath element.
Depending on the type of document, this may have some associated data
(e.g. the body of an email message), or none (typical for an archive
file). If it is empty, this document will be useful anyway for some
operations, as the parent of the actual data documents.
4.1.3. Telling Recoll about the handler

There are two elements that link a file to the handler which should
process it: the association of file to MIME type and the association of a
MIME type with a handler.

The association of files to MIME types is mostly based on name suffixes.
The types are defined inside the mimemap file. Example:

    .doc = application/msword

If no suffix association is found for the file name, Recoll will try to
execute the file -i command to determine a MIME type.
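The two-step lookup can be sketched as follows. This is an illustration
and not the Recoll code: the MIMEMAP dictionary only holds the example
entry above, the function names are hypothetical, and the fallback assumes
a file command that accepts -i for MIME output and -b to omit the file
name, as on most Linux systems.

```python
import os
import subprocess

# Stand-in for the mimemap file, holding only the entry from the example.
MIMEMAP = {".doc": "application/msword"}


def mime_from_suffix(path, mimemap=MIMEMAP):
    """Return the MIME type associated with the file suffix, or None."""
    return mimemap.get(os.path.splitext(path)[1])


def mime_from_file_command(path):
    """Ask the system file command for a MIME type."""
    # Typical output: "text/plain; charset=us-ascii"
    result = subprocess.run(
        ["file", "-b", "-i", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split(";")[0].strip()


def guess_mime(path):
    """Suffix association first, file -i as the fallback."""
    return mime_from_suffix(path) or mime_from_file_command(path)
```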
The association of file types to handlers is performed in the mimeconf
file. A sample will probably be of better help than a long explanation:
@ -2545,10 +2559,10 @@ Chapter 4. Programming interface
    iso-8859-1 encoding is specified because it is not the utf-8 default,
    and not output by unrtf in the HTML header section.

  o application/x-chm is processed by a persistent handler. This is
    determined by the execm keyword.

4.1.4. Input handler HTML output

The output HTML could be very minimal like the following example:
@ -2600,8 +2614,8 @@ Chapter 4. Programming interface
    <meta name="date" content="2013-02-24 17:50:00">

Input handlers also have the possibility to "invent" field names. This
should also be output as meta tags:

    <meta name="somefield" content="Some textual data" />
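A handler written in Python could build such tags with a small helper like
the one sketched here. The function names are hypothetical. The escaping
is a precaution chosen for the example, so that quotes or angle brackets
in a field value cannot break the tag.

```python
import html


def meta_tag(name, content):
    """Return one HTML meta tag, with attribute values escaped."""
    return '<meta name="%s" content="%s">' % (
        html.escape(name, quote=True),
        html.escape(content, quote=True),
    )


def html_document(fields, body_text):
    """Wrap extracted text and a dictionary of fields in minimal HTML."""
    head = "\n".join(meta_tag(k, v) for k, v in fields.items())
    return "<html><head>\n%s\n</head>\n<body><pre>%s</pre></body></html>" % (
        head, html.escape(body_text))
```

For example, html_document({"somefield": "Some textual data"}, text) puts
the invented field in the header section, ahead of the document text.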
@ -2617,10 +2631,10 @@ Chapter 4. Programming interface
4.1.5. Page numbers

The indexer will interpret ^L characters in the handler output as
indicating page breaks, and will record them. At query time, this allows
starting a viewer on the right page for a hit or a snippet. Currently,
only the PDF, Postscript and DVI handlers generate page breaks.
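To make the mechanism concrete, the sketch below shows how form feed (^L)
positions in handler output can be turned into page numbers. It
illustrates the idea only and says nothing about how the indexer stores
page breaks internally. Both function names are hypothetical.

```python
import bisect


def page_starts(text):
    """Return the offset at which each page of the text starts."""
    starts = [0]
    for pos, ch in enumerate(text):
        if ch == "\f":  # ^L, form feed: the next character opens a new page
            starts.append(pos + 1)
    return starts


def page_for_offset(text, offset):
    """Return the 1-based number of the page holding the given offset."""
    return bisect.bisect_right(page_starts(text), offset)
```

A viewer could then be started on page_for_offset(text, hit_position) for
a hit found at hit_position in the extracted text.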
4.2. Field data processing
@ -2628,14 +2642,14 @@ Chapter 4. Programming interface
author, abstract.

The field values for documents can appear in several ways during indexing:
either output by input handlers as meta fields in the HTML header section,
or extracted from file extended attributes, or added as attributes of the
Doc object when using the API, or again synthesized internally by Recoll.

The Recoll query language allows searching for text in a specific field.

Recoll defines a number of default fields. Additional ones can be output
by handlers, and described in the fields configuration file.

Fields can be:
@ -2794,7 +2808,7 @@ Chapter 4. Programming interface
The Db class

A Db object is created by a connect() call and holds a connection to a
Recoll index.

Methods
@ -3088,7 +3102,7 @@ Chapter 5. Installation and configuration
text file inside the configuration directory.

A list of common file types which need external commands follows. Many of
the handlers need the iconv command, which is not always listed as a
dependency.

Please note that, due to the relatively dynamic nature of this
@ -3103,7 +3117,7 @@ Chapter 5. Installation and configuration
http://www.recoll.org/features.html if a file type is important to you.

As of Recoll release 1.14, a number of XML-based formats that were handled
by ad hoc handler code now use the xsltproc command, which usually comes
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.

Now for the list:
@ -3121,7 +3135,7 @@ Chapter 5. Installation and configuration
    it may be used as a fallback for some files which antiword does not
    handle.

  o MS Excel and PowerPoint are processed by internal Python handlers.

  o MS Open XML (docx) needs xsltproc.
@ -3140,11 +3154,8 @@ Chapter 5. Installation and configuration
  o djvu files need djvutxt and djvused from the DjVuLibre package.

  o Audio files: Recoll releases 1.14 and later use a single Python
    handler based on mutagen for all audio file types.

  o Pictures: Recoll uses the Exiftool Perl package to extract tag
    information. Most image file formats are supported. Note that there
@ -3152,7 +3163,7 @@ Chapter 5. Installation and configuration
    aperture, etc.). This is only of interest if you store personal tags
    or textual descriptions inside the image files.

  o chm: files in Microsoft help format need Python and the pychm module
    (which needs chmlib).

  o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
@ -3168,11 +3179,11 @@ Chapter 5. Installation and configuration
  o Konqueror webarchive format with Python (uses the Tarfile module).

  o Mimehtml web archive format (support based on the email handler, which
    introduces some mild weirdness, but still usable).

Text, HTML, email folders, and Scribus files are processed internally. Lyx
is used to index Lyx files. Many handlers need iconv and the standard sed
and awk.

5.3. Building from source
@ -3495,10 +3506,10 @@ Chapter 5. Installation and configuration
    A space-separated list of patterns for names of files or
    directories that should be ignored inside zip archives. This is
    used directly by the zip handler, and has a function similar to
    skippedNames, but works independently. Can be redefined for
    filesystem subdirectories. For versions up to 1.19, you will need
    to update the Zip handler and install a supplementary Python
    module. The details are described on the Recoll wiki.

followLinks
@ -3513,11 +3524,16 @@ Chapter 5. Installation and configuration
indexedmimetypes

    Recoll normally indexes any file which it knows how to read. This
    list lets you restrict the indexed MIME types to what you specify.
    If the variable is unspecified or the list empty (the default),
    all supported types are processed. Can be redefined for
    subdirectories.

excludedmimetypes

    This list lets you exclude some MIME types from indexing. Can be
    redefined for subdirectories.

compressedfilemaxkbs

    Size limit for compressed (.gz or .bz2) files. These need to be
@ -3550,14 +3566,14 @@ Chapter 5. Installation and configuration
    Recoll indexes file names in a special section of the database to
    allow specific file name searches using wild cards. This
    parameter decides if file name indexing is performed only for
    files with MIME types that would qualify them for full text
    indexing, or for all files inside the selected subtrees,
    independently of MIME type.

usesystemfilecommand

    Decide if we use the file -i system command as a final step for
    determining the MIME type for a file (the main procedure uses
    suffix associations as defined in the mimemap file). This can be
    useful for files with suffix-less names, but it will also cause
    the indexing of many bogus "text" files.
@ -3770,6 +3786,9 @@ Chapter 5. Installation and configuration
    This is only used by the web browser plugin indexing code, and
    defines the maximum size for the web page cache. Default: 40 MB.
    Quite unfortunately, this is only taken into account when creating
    the cache file. You need to delete the file for a change to be
    taken into account.

idxflushmb
@ -3909,15 +3928,15 @@ Chapter 5. Installation and configuration
filtermaxseconds

    Maximum handler execution time, after which it is aborted. Some
    postscript programs just loop...

filtersdir

    A directory to search for the external input handler scripts used
    to index some types of files. The value should not be changed,
    except if you want to modify one of the default scripts. The value
    can be redefined for any sub-directory.

iconsdir
@ -3998,17 +4017,17 @@ Chapter 5. Installation and configuration
    This section defines lists of synonyms for the canonical names
    used inside the [prefixes] and [stored] sections

handler-specific sections

    Some input handlers may need specific configuration for handling
    fields. Only the email message handler currently has such a
    section (named [mail]). It allows indexing arbitrary email headers
    in addition to the ones indexed by default. Other such sections
    may appear in the future.

Here follows a small example of a personal fields file. This would extract
a specific email header and use it as a searchable field, with data
displayable inside result lists. (Side note: as the email handler does no
decoding on the values, only plain ASCII headers can be indexed, and only
the first occurrence will be used for headers that occur several times.)
@ -4040,10 +4059,10 @@ Chapter 5. Installation and configuration
5.4.3. The mimemap file

mimemap specifies the file name extension to MIME type mappings.

For file names without an extension, or with an unknown one, the system's
file -i command will be executed to determine the MIME type (this can be
switched off inside the main configuration file).

The mappings can be specified on a per-subtree basis, which may be useful
@ -4064,7 +4083,7 @@ Chapter 5. Installation and configuration
5.4.4. The mimeconf file

mimeconf specifies how the different MIME types are handled for indexing,
and which icons are displayed in the recoll result lists.

Changing the parameters in the [index] section is probably not a good idea
@ -4088,7 +4107,7 @@ Chapter 5. Installation and configuration
Recoll GUI preferences, all mimeview entries will be ignored except the
one labelled application/x-all (which is set to use xdg-open by default).

In this case, the xallexcepts top level variable defines a list of MIME
type exceptions which will be processed according to the local entries
instead of being passed to the desktop. This is so that specific Recoll
options such as a page number or a search string can be passed to
@ -4101,13 +4120,13 @@ Chapter 5. Installation and configuration
All viewer definition entries must be placed under a [view] section.

The keys in the file are normally MIME types. You can add an application
tag to specialize the choice for an area of the filesystem (using a
localfields specification in mimeconf). The syntax for the key is
mimetype|tag

The nouncompforviewmts entry (placed at the top level, outside of the
[view] section) holds a list of MIME types that should not be
uncompressed before starting the viewer (if they are found compressed,
e.g. mydoc.doc.gz).
@ -4127,7 +4146,7 @@ Chapter 5. Installation and configuration
    will not create a temporary file to extract the subdocument, expecting
    the called application (possibly a script) to be able to handle it.

  o %M. MIME type

  o %p. Page index. Only significant for a subset of document types,
    currently only PDF, Postscript and DVI files. Can be used to start the
@ -4180,7 +4199,7 @@ Chapter 5. Installation and configuration
    .blob = application/x-blobapp

    Note that the MIME type is made up here, and you could call it
    diesel/oil just the same.

  o In $RECOLL_CONFDIR/mimeview under the [view] section, add:
@ -4191,7 +4210,7 @@ Chapter 5. Installation and configuration
    would use %u if it liked URLs better.

If you just wanted to change the application used by Recoll to display a
MIME type which it already knows, you would just need to edit mimeview.
The entries you add in your personal file override those in the central
configuration, which you do not need to alter. mimeview can also be
modified from the GUI.
@ -4213,14 +4232,14 @@ Chapter 5. Installation and configuration
    for the files inside the result lists. Icons are normally 64x64 pixels
    PNG files which live in /usr/[local/]share/recoll/images.

  o Under the [categories] section, you should add the MIME type where it
    makes sense (you can also create a category). Categories may be used
    for filtering in advanced search.

The rclblob handler should be an executable program or script which exists
inside /usr/[local/]share/recoll/filters. It will be given a file name as
argument and should output the text or HTML contents on the standard
output.

The filter programming section describes in more detail how to write an
input handler.