diff --git a/src/INSTALL b/src/INSTALL index 483b4cf7..65928c07 100644 --- a/src/INSTALL +++ b/src/INSTALL @@ -81,7 +81,7 @@ Chapter 5. Installation and configuration text file inside the configuration directory. A list of common file types which need external commands follows. Many of - the filters need the iconv command, which is not always listed as a + the handlers need the iconv command, which is not always listed as a dependancy. Please note that, due to the relatively dynamic nature of this @@ -96,7 +96,7 @@ Chapter 5. Installation and configuration http://www.recoll.org/features.html if a file type is important to you. As of Recoll release 1.14, a number of XML-based formats that were handled - by ad hoc filter code now use the xsltproc command, which usually comes + by ad hoc handler code now use the xsltproc command, which usually comes with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg. Now for the list: @@ -114,7 +114,7 @@ Chapter 5. Installation and configuration it may be be used as a fallback for some files which antiword does not handle. - o MS Excel and PowerPoint need catdoc. + o MS Excel and PowerPoint are processed by internal Python handlers. o MS Open XML (docx) needs xsltproc. @@ -133,11 +133,8 @@ Chapter 5. Installation and configuration o djvu files need djvutxt and djvused from the DjVuLibre package. - o Audio files: Recoll releases before 1.13 used the id3info command from - the id3lib package to extract mp3 tag information, metaflac (standard - flac tools) for flac files, and ogginfo (vorbis tools) for ogg files. - Releases 1.14 and later use a single Python filter based on mutagen - for all audio file types. + o Audio files: Recoll releases 1.14 and later use a single Python + handler based on mutagen for all audio file types. o Pictures: Recoll uses the Exiftool Perl package to extract tag information. Most image file formats are supported. Note that there @@ -145,7 +142,7 @@ Chapter 5. 
Installation and configuration aperture, etc.). This is only of interest if you store personal tags or textual descriptions inside the image files. - o chm: files in microsoft help format need Python and the pychm module + o chm: files in Microsoft help format need Python and the pychm module (which needs chmlib). o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar @@ -161,11 +158,11 @@ Chapter 5. Installation and configuration o Konqueror webarchive format with Python (uses the Tarfile module). - o mimehtml web archive format (support based on the email filter, which + o Mimehtml web archive format (support based on the email handler, which introduces some mild weirdness, but still usable). Text, HTML, email folders, and Scribus files are processed internally. Lyx - is used to index Lyx files. Many filters need iconv and the standard sed + is used to index Lyx files. Many handlers need iconv and the standard sed and awk. ---------------------------------------------------------------------- @@ -515,10 +512,10 @@ Chapter 5. Installation and configuration A space-separated list of patterns for names of files or directories that should be ignored inside zip archives. This is - used directly by the zip filter, and has a function similar to + used directly by the zip handler, and has a function similar to skippedNames, but works independantly. Can be redefined for filesystem subdirectories. For versions up to 1.19, you will need - to update the Zip filter and install a supplementary Python + to update the Zip handler and install a supplementary Python module. The details are described on the Recoll wiki. followLinks @@ -533,11 +530,16 @@ Chapter 5. Installation and configuration indexedmimetypes Recoll normally indexes any file which it knows how to read. This - list lets you restrict the indexed mime types to what you specify. + list lets you restrict the indexed MIME types to what you specify. 
If the variable is unspecified or the list empty (the default), all supported types are processed. Can be redefined for subdirectories. + excludedmimetypes + + This list lets you exclude some MIME types from indexing. Can be + redefined for subdirectories. + compressedfilemaxkbs Size limit for compressed (.gz or .bz2) files. These need to be @@ -570,14 +572,14 @@ Chapter 5. Installation and configuration Recoll indexes file names in a special section of the database to allow specific file names searches using wild cards. This parameter decides if file name indexing is performed only for - files with mime types that would qualify them for full text + files with MIME types that would qualify them for full text indexing, or for all files inside the selected subtrees, - independently of mime type. + independently of MIME type. usesystemfilecommand Decide if we use the file -i system command as a final step for - determining the mime type for a file (the main procedure uses + determining the MIME type for a file (the main procedure uses suffix associations as defined in the mimemap file). This can be useful for files with suffix-less names, but it will also cause the indexing of many bogus "text" files. @@ -790,6 +792,9 @@ Chapter 5. Installation and configuration This is only used by the web browser plugin indexing code, and defines the maximum size for the web page cache. Default: 40 MB. + Quite unfortunately, this is only taken into account when creating + the cache file. You need to delete the file for a change to be + taken into account. idxflushmb @@ -929,15 +934,15 @@ Chapter 5. Installation and configuration filtermaxseconds - Maximum filter execution time, after which it is aborted. Some + Maximum handler execution time, after which it is aborted. Some postscript programs just loop... filtersdir - A directory to search for the external filter scripts used to - index some types of files. 
The value should not be changed, except - if you want to modify one of the default scripts. The value can be - redefined for any sub-directory. + A directory to search for the external input handler scripts used + to index some types of files. The value should not be changed, + except if you want to modify one of the default scripts. The value + can be redefined for any sub-directory. iconsdir @@ -1018,17 +1023,17 @@ Chapter 5. Installation and configuration This section defines lists of synonyms for the canonical names used inside the [prefixes] and [stored] sections - filter-specific sections + handler-specific sections - Some filters may need specific configuration for handling fields. - Only the email message filter currently has such a section (named - [mail]). It allows indexing arbitrary email headers in addition to - the ones indexed by default. Other such sections may appear in the - future. + Some input handlers may need specific configuration for handling + fields. Only the email message handler currently has such a + section (named [mail]). It allows indexing arbitrary email headers + in addition to the ones indexed by default. Other such sections + may appear in the future. Here follows a small example of a personal fields file. This would extract a specific email header and use it as a searchable field, with data - displayable inside result lists. (Side note: as the email filter does no + displayable inside result lists. (Side note: as the email handler does no decoding on the values, only plain ascii headers can be indexed, and only the first occurrence will be used for headers that occur several times). @@ -1060,10 +1065,10 @@ Chapter 5. Installation and configuration 5.4.3. The mimemap file - mimemap specifies the file name extension to mime type mappings. + mimemap specifies the file name extension to MIME type mappings. 
For file names without an extension, or with an unknown one, the system's - file -i command will be executed to determine the mime type (this can be + file -i command will be executed to determine the MIME type (this can be switched off inside the main configuration file). The mappings can be specified on a per-subtree basis, which may be useful @@ -1084,7 +1089,7 @@ Chapter 5. Installation and configuration 5.4.4. The mimeconf file - mimeconf specifies how the different mime types are handled for indexing, + mimeconf specifies how the different MIME types are handled for indexing, and which icons are displayed in the recoll result lists. Changing the parameters in the [index] section is probably not a good idea @@ -1108,7 +1113,7 @@ Chapter 5. Installation and configuration Recoll GUI preferences, all mimeview entries will be ignored except the one labelled application/x-all (which is set to use xdg-open by default). - In this case, the xallexcepts top level variable defines a list of mime + In this case, the xallexcepts top level variable defines a list of MIME type exceptions which will be processed according to the local entries instead of being passed to the desktop. This is so that specific Recoll options such as a page number or a search string can be passed to @@ -1121,13 +1126,13 @@ Chapter 5. Installation and configuration All viewer definition entries must be placed under a [view] section. - The keys in the file are normally mime types. You can add an application + The keys in the file are normally MIME types. You can add an application tag to specialize the choice for an area of the filesystem (using a localfields specification in mimeconf). 
The syntax for the key is mimetype|tag The nouncompforviewmts entry, (placed at the top level, outside of the - [view] section), holds a list of mime types that should not be + [view] section), holds a list of MIME types that should not be uncompressed before starting the viewer (if they are found compressed, ie: mydoc.doc.gz). @@ -1147,7 +1152,7 @@ Chapter 5. Installation and configuration will not create a temporary file to extract the subdocument, expecting the called application (possibly a script) to be able to handle it. - o %M. Mime type + o %M. MIME type o %p. Page index. Only significant for a subset of document types, currently only PDF, Postscript and DVI files. Can be used to start the @@ -1200,7 +1205,7 @@ Chapter 5. Installation and configuration .blob = application/x-blobapp - Note that the mime type is made up here, and you could call it + Note that the MIME type is made up here, and you could call it diesel/oil just the same. o In $RECOLL_CONFDIR/mimeview under the [view] section, add: @@ -1211,7 +1216,7 @@ Chapter 5. Installation and configuration would use %u if it liked URLs better. If you just wanted to change the application used by Recoll to display a - mime type which it already knows, you would just need to edit mimeview. + MIME type which it already knows, you would just need to edit mimeview. The entries you add in your personal file override those in the central configuration, which you do not need to alter. mimeview can also be modified from the Gui. @@ -1233,17 +1238,17 @@ Chapter 5. Installation and configuration for the files inside the result lists. Icons are normally 64x64 pixels PNG files which live in /usr/[local/]share/recoll/images. - o Under the [categories] section, you should add the mime type where it + o Under the [categories] section, you should add the MIME type where it makes sense (you can also create a category). Categories may be used for filtering in advanced search. 
- The rclblob filter should be an executable program or script which exists + The rclblob handler should be an executable program or script which exists inside /usr/[local/]share/recoll/filters. It will be given a file name as argument and should output the text or html contents on the standard output. - The filter programming section describes in more detail how to write a - filter. + The filter programming section describes in more detail how to write an + input handler. ---------------------------------------------------------------------- diff --git a/src/README b/src/README index 39a0faeb..f972a648 100644 --- a/src/README +++ b/src/README @@ -134,15 +134,15 @@ More documentation can be found in the doc/ directory or at http://www.recoll.or 4. Programming interface - 4.1. Writing a document filter + 4.1. Writing a document input handler - 4.1.1. Simple filters + 4.1.1. Simple input handlers - 4.1.2. "Multiple" filters + 4.1.2. "Multiple" handlers - 4.1.3. Telling Recoll about the filter + 4.1.3. Telling Recoll about the handler - 4.1.4. Filter HTML output + 4.1.4. Input handler HTML output 4.1.5. Page numbers @@ -259,7 +259,7 @@ Chapter 1. Introduction Recoll stores all internal data in Unicode UTF-8 format, and it can index files with different character sets, encodings, and languages into the - same index. It has input filters for many document types. + same index. It can process many document types. Stemming is the process by which Recoll reduces words to their radicals so that searching does not depend, for example, on a word being singular or @@ -418,13 +418,13 @@ Chapter 2. Indexing Excluding types can be done by adding wildcard name patterns to the skippedNames list, which can be done from the GUI Index configuration - menu. It is also possible to exclude a mime type independantly of the file - name by associating it with the rclnull filter. This can be done by - editing the mimeconf configuration file. + menu.
For versions 1.20 and later, you can alternatively set the + excludedmimetypes list in the configuration file. This can be redefined + for subdirectories. - In order to define a positive list, You need to edit the main - configuration file (recoll.conf) and set the indexedmimetypes - configuration variable. Example: + You can also define an exclusive list of MIME types to be indexed (no + others will be indexed) by setting the indexedmimetypes configuration + variable. Example: indexedmimetypes = text/html application/pdf @@ -436,10 +436,11 @@ Chapter 2. Indexing (When using sections like this, don't forget that they remain in effect - until the end of the file or another section indicator). There is no GUI - way to edit the parameter, because this option runs contrary to Recoll - main goal which is to help you find information, independantly of how it - may be stored. + until the end of the file or another section indicator). + + excludedmimetypes or indexedmimetypes can be set either by editing the + main configuration file (recoll.conf), or from the GUI index configuration + tool. 2.1.4. Recovery @@ -702,7 +703,7 @@ Chapter 2. Indexing mime_type - If set, this overrides any other determination of the file mime + If set, this overrides any other determination of the file MIME type. charset @@ -1018,11 +1019,11 @@ Chapter 3. Searching you prefer to completely customize the choice of applications, you can uncheck the Use desktop preferences option in the GUI preferences dialog, and click the Choose editor applications button to adjust the predefined - Recoll choices. The tool accepts multiple selections of mime types (e.g. + Recoll choices. The tool accepts multiple selections of MIME types (e.g. to set up the editor for the dozens of office file types).
Even when Use desktop preferences is checked, there is a small list of - exceptions, for mime types where the Recoll choice should override the + exceptions, for MIME types where the Recoll choice should override the desktop one. These are applications which are well integrated with Recoll, especially evince for viewing PDF and Postscript files because of its support for opening the document at a specific page and passing a search @@ -1242,7 +1243,7 @@ Chapter 3. Searching specifying multiple clauses which are combined to build the search. 2. The second tab lets filter the results according to file size, date of - modification, mime type, or location. + modification, MIME type, or location. Click on the Start Search button in the advanced search dialog, or type Enter in any text field to start the search. The button in the main window @@ -1305,8 +1306,8 @@ Chapter 3. Searching can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12 respectively. - o The next section allows filtering the results by their mime types, or - mime categories (ie: media/text/message/etc.). + o The next section allows filtering the results by their MIME types, or + MIME categories (ie: media/text/message/etc.). You can transfer the types between two boxes, to define which will be included or excluded by the search. @@ -1647,7 +1648,7 @@ Chapter 3. Searching an appropriate application. o Exceptions: when using the desktop preferences for opening documents, - these are mime types that will still be opened according to Recoll + these are MIME types that will still be opened according to Recoll preferences. This is useful for passing parameters like page numbers or search strings to applications that support them (e.g. evince). This cannot be done with xdg-open which only supports passing one @@ -1789,7 +1790,7 @@ Chapter 3. Searching o %D. Date - o %I. Icon image name. This is normally determined from the mime type. + o %I. Icon image name. 
This is normally determined from the MIME type. The associations are defined inside the mimeconf configuration file. If a thumbnail for the file is found at the standard Freedesktop location, this will be displayed instead. @@ -1798,7 +1799,7 @@ Chapter 3. Searching o %L. Precooked Preview, Edit, and possibly Snippets links - o %M. Mime type + o %M. MIME type o %N. result Number inside the result page @@ -1824,7 +1825,7 @@ Chapter 3. Searching stored by default, apart from the values above (only author and filename), so this feature will need some custom local configuration to be useful. An example candidate would be the recipient field which is generated by the - message filters. + message input handlers. The default value for the paragraph format string is: @@ -1949,6 +1950,8 @@ Chapter 3. Searching -m : dump the whole document meta[] array for each result -A : output the document abstracts -S fld : sort by field + -s stemlang : set stemming language to use (must exist in index...) + Use -s "" to turn off stem expansion -D : sort descending -i : additional index, several can be given -e use url encoding (%xx) for urls @@ -2139,7 +2142,7 @@ Chapter 3. Searching Periods can also be specified with small letters (ie: p2y). - o mime or format for specifying the mime type. This one is quite special + o mime or format for specifying the MIME type. This one is quite special because you can specify several values which will be OR'ed (the normal default for the language is AND). Ex: mime:text/plain mime:text/html. Specifying an explicit boolean operator before a mime specification is @@ -2149,11 +2152,11 @@ Chapter 3. Searching with an OR default. You do need to use OR with ext terms for example. o type or rclcat for specifying the category (as in - text/media/presentation/etc.). The classification of mime types in + text/media/presentation/etc.). 
The classification of MIME types in categories is defined in the Recoll configuration (mimeconf), and can be modified or extended. The default category names are those which permit filtering results in the main GUI screen. Categories are OR'ed - like mime types above. This can't be negated with - either. + like MIME types above. This can't be negated with - either. Words inside phrases and capitalized words are not stem-expanded. Wildcards may be used anywhere inside a term. Specifying a wild-card on @@ -2161,9 +2164,9 @@ Chapter 3. Searching one if the expansion is truncated because of excessive size). Also see More about wildcards. - The document filters used while indexing have the possibility to create - other fields with arbitrary names, and aliases may be defined in the - configuration, so that the exact field search possibilities may be + The document input handlers used while indexing have the possibility to + create other fields with arbitrary names, and aliases may be defined in + the configuration, so that the exact field search possibilities may be different for you if someone took care of the customisation. 3.5.1. Modifiers @@ -2378,81 +2381,91 @@ Chapter 4. Programming interface Recoll has an Application Programming Interface, usable both for indexing and searching, currently accessible from the Python language. - Another less radical way to extend the application is to write filters for - new types of documents. + Another less radical way to extend the application is to write input + handlers for new types of documents. The processing of metadata attributes for documents (fields) is highly configurable. -4.1. Writing a document filter +4.1. Writing a document input handler - Recoll filters cooperate to translate from the multitude of input document - formats, simple ones as opendocument, acrobat), or compound ones such as - Zip or Email, into the final Recoll indexing input format, which may be - text/plain or text/html. 
Most filters are executable programs or scripts. - A few filters are coded in C++ and live inside recollindex. This latter + Terminology + + The small programs or pieces of code which handle the processing of the + different document types for Recoll used to be called filters, which is + still reflected in the name of the directory which holds them and many + configuration variables. They were named this way because one of their + primary functions is to filter out the formatting directives and keep the + text content. However, these modules may have other behaviours, and the + term input handler is progressively replacing it in the documentation. + The term filter is still used in many places, though. + + Recoll input handlers cooperate to translate from the multitude of input + document formats, simple ones (such as opendocument or acrobat), or compound + ones such as Zip or Email, into the final Recoll indexing input format, which + is plain text. Most input handlers are executable programs or scripts. A + few handlers are coded in C++ and live inside recollindex. This latter kind will not be described here. There are currently (1.18 and since 1.13) two kinds of external executable - filters: + input handlers: - o Simple filters (exec filters) run once and exit. They can be bare - programs like antiword, or scripts using other programs. They are very - simple to write, because they just need to print the converted - document to the standard output. Their output can be text/plain or - text/html. + o Simple exec handlers run once and exit. They can be bare programs like + antiword, or scripts using other programs. They are very simple to + write, because they just need to print the converted document to the + standard output. Their output can be plain text or HTML. HTML is + usually preferred because it can store metadata fields and it allows + preserving some of the formatting for the GUI preview.
- o Multiple filters (execm filters), run as long as their master process - (recollindex) is active. They can process multiple files (sparing the + o Multiple execm handlers can process multiple files (sparing the process startup time which can be very significant), or multiple documents per file (e.g.: for zip or chm files). They communicate with the indexer through a simple protocol, but are nevertheless a bit more - complicated than the older kind. Most of new filters are written in + complicated than the older kind. Most new handlers are written in Python, using a common module to handle the protocol. There is an exception, rclimg which is written in Perl. The subdocuments output by - these filters can be directly indexable (text or HTML), or they can be - other simple or compound documents that will need to be processed by - another filter. + these handlers can be directly indexable (text or HTML), or they can + be other simple or compound documents that will need to be processed + by another handler. - In both cases, filters deal with regular file system files, and can + In both cases, handlers deal with regular file system files, and can process either a single document, or a linear list of documents in each file. Recoll is responsible for performing up to date checks, deal with more complex embedding and other upper level issues. - In the extreme case of a simple filter returning a document in text/plain - format, no metadata can be transferred from the filter to the indexer. - Generic metadata, like document size or modification date, will be - gathered and stored by the indexer. + A simple handler returning a document in text/plain format can transfer + no metadata to the indexer. Generic metadata, like document size or + modification date, will be gathered and stored by the indexer.
- Filters that produce text/html format can return an arbitrary amount of + Handlers that produce text/html format can return an arbitrary amount of metadata inside HTML meta tags. These will be processed according to the directives found in the fields configuration file. - The filters that can handle multiple documents per file return a single + The handlers that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called an ipath element will be sent back by Recoll to extract the document at query time, for previewing, or for creating a temporary file to be opened by a viewer. - The following section describes the simple filters, and the next one gives - a few explanations about the execm ones. You could conceivably write a - simple filter with only the elements in the manual. This will not be the - case for the other ones, for which you will have to look at the code. + The following section describes the simple handlers, and the next one + gives a few explanations about the execm ones. You could conceivably write + a simple handler with only the elements in the manual. This will not be + the case for the other ones, for which you will have to look at the code. - 4.1.1. Simple filters + 4.1.1. Simple input handlers - Recoll simple filters are usually shell-scripts, but this is in no way + Recoll simple handlers are usually shell-scripts, but this is in no way necessary. Extracting the text from the native format is the difficult part. Outputting the format expected by Recoll is trivial. Happily enough, most document formats have translators or text extractors which can be - called from the filter. In some cases the output of the translating + called from the handler. In some cases the output of the translating program is completely appropriate, and no intermediate shell-script is needed. - Filters are called with a single argument which is the source file name. 
- They should output the result to stdout. + Input handlers are called with a single argument which is the source file + name. They should output the result to stdout. - When writing a filter, you should decide if it will output plain text or + When writing a handler, you should decide if it will output plain text or HTML. Plain text is simpler, but you will not be able to add metadata or vary the output character encoding (this will be defined in a configuration file). Additionally, some formatting may be easier to @@ -2461,25 +2474,26 @@ Chapter 4. Programming interface field searches.. The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells - the filter if the operation is for indexing or previewing. Some filters + the handler if the operation is for indexing or previewing. Some handlers use this to output a slightly different format, for example stripping uninteresting repeated keywords (ie: Subject: for email) when indexing. This is not essential. - You should look at one of the simple filters, for example rclps for a + You should look at one of the simple handlers, for example rclps for a starting point. - Don't forget to make your filter executable before testing ! + Don't forget to make your handler executable before testing ! - 4.1.2. "Multiple" filters + 4.1.2. "Multiple" handlers - If you can program and want to write an execm filter, it should not be too - difficult to make sense of one of the existing modules. For example, look - at rclzip which uses Zip file paths as identifiers (ipath), and rclics, - which uses an integer index. Also have a look at the comments inside the - internfile/mh_execm.h file and possibly at the corresponding module. + If you can program and want to write an execm handler, it should not be + too difficult to make sense of one of the existing modules. For example, + look at rclzip which uses Zip file paths as identifiers (ipath), and + rclics, which uses an integer index. 
Also have a look at the comments + inside the internfile/mh_execm.h file and possibly at the corresponding + module. - execm filters sometimes need to make a choice for the nature of the ipath + execm handlers sometimes need to make a choice for the nature of the ipath elements that they use in communication with the indexer. Here are a few guidelines: @@ -2491,34 +2505,34 @@ Chapter 4. Programming interface o Recoll uses a colon (:) as a separator to store a complex path internally (for deeper embedding). Colons inside the ipath elements - output by a filter will be escaped, but would be a bad choice as a - filter-specific separator (mostly, again, for debugging issues). + output by a handler will be escaped, but would be a bad choice as a + handler-specific separator (mostly, again, for debugging issues). - In any case, the main goal is that it should be easy for the filter to + In any case, the main goal is that it should be easy for the handler to extract the target document, given the file name and the ipath element. - execm filters will also produce a document with a null ipath element. + execm handlers will also produce a document with a null ipath element. Depending on the type of document, this may have some associated data (e.g. the body of an email message), or none (typical for an archive file). If it is empty, this document will be useful anyway for some operations, as the parent of the actual data documents. - 4.1.3. Telling Recoll about the filter + 4.1.3. Telling Recoll about the handler - There are two elements that link a file to the filter which should process - it: the association of file to mime type and the association of a mime - type with a filter. + There are two elements that link a file to the handler which should + process it: the association of file to MIME type and the association of a + MIME type with a handler. - The association of files to mime types is mostly based on name suffixes. 
+ The association of files to MIME types is mostly based on name suffixes. The types are defined inside the mimemap file. Example: .doc = application/msword If no suffix association is found for the file name, Recoll will try to - execute the file -i command to determine a mime type. + execute the file -i command to determine a MIME type. - The association of file types to filters is performed in the mimeconf + The association of file types to handlers is performed in the mimeconf file. A sample will probably be of better help than a long explanation: @@ -2545,10 +2559,10 @@ Chapter 4. Programming interface iso-8859-1 encoding is specified because it is not the utf-8 default, and not output by unrtf in the HTML header section. - o application/x-chm is processed by a persistant filter. This is + o application/x-chm is processed by a persistent handler. This is determined by the execm keyword. - 4.1.4. Filter HTML output + 4.1.4. Input handler HTML output The output HTML could be very minimal like the following example: @@ -2600,8 +2614,8 @@ Chapter 4. Programming interface - Filters also have the possibility to "invent" field names. This should - also be output as meta tags: + Input handlers also have the possibility to "invent" field names. These + should also be output as meta tags: @@ -2617,10 +2631,10 @@ Chapter 4. Programming interface 4.1.5. Page numbers - The indexer will interpret ^L characters in the filter output as + The indexer will interpret ^L characters in the handler output as indicating page breaks, and will record them. At query time, this allows starting a viewer on the right page for a hit or a snippet. Currently, - only the PDF, Postscript and DVI filters generate page breaks. + only the PDF, Postscript and DVI handlers generate page breaks. 4.2. Field data processing @@ -2628,14 +2642,14 @@ Chapter 4. Programming interface author, abstract.
The field values for documents can appear in several ways during indexing: - either output by filters as meta fields in the HTML header section, or - extracted from file extended attributes, or added as attributes of the Doc - object when using the API, or again synthetized internally by Recoll. + either output by input handlers as meta fields in the HTML header section, + or extracted from file extended attributes, or added as attributes of the + Doc object when using the API, or again synthesized internally by Recoll. The Recoll query language allows searching for text in a specific field. Recoll defines a number of default fields. Additional ones can be output - by filters, and described in the fields configuration file. + by handlers, and described in the fields configuration file. Fields can be: @@ -2794,7 +2808,7 @@ Chapter 4. Programming interface The Db class - A Db object is created by a connect() function and holds a connection to a + A Db object is created by a connect() call and holds a connection to a Recoll index. Methods @@ -3088,7 +3102,7 @@ Chapter 5. Installation and configuration text file inside the configuration directory. A list of common file types which need external commands follows. Many of - the filters need the iconv command, which is not always listed as a + the handlers need the iconv command, which is not always listed as a dependancy. Please note that, due to the relatively dynamic nature of this @@ -3103,7 +3117,7 @@ Chapter 5. Installation and configuration http://www.recoll.org/features.html if a file type is important to you. As of Recoll release 1.14, a number of XML-based formats that were handled - by ad hoc filter code now use the xsltproc command, which usually comes + by ad hoc handler code now use the xsltproc command, which usually comes with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg. Now for the list: @@ -3121,7 +3135,7 @@ Chapter 5.
Installation and configuration it may be be used as a fallback for some files which antiword does not handle. - o MS Excel and PowerPoint need catdoc. + o MS Excel and PowerPoint are processed by internal Python handlers. o MS Open XML (docx) needs xsltproc. @@ -3140,11 +3154,8 @@ Chapter 5. Installation and configuration o djvu files need djvutxt and djvused from the DjVuLibre package. - o Audio files: Recoll releases before 1.13 used the id3info command from - the id3lib package to extract mp3 tag information, metaflac (standard - flac tools) for flac files, and ogginfo (vorbis tools) for ogg files. - Releases 1.14 and later use a single Python filter based on mutagen - for all audio file types. + o Audio files: Recoll releases 1.14 and later use a single Python + handler based on mutagen for all audio file types. o Pictures: Recoll uses the Exiftool Perl package to extract tag information. Most image file formats are supported. Note that there @@ -3152,7 +3163,7 @@ Chapter 5. Installation and configuration aperture, etc.). This is only of interest if you store personal tags or textual descriptions inside the image files. - o chm: files in microsoft help format need Python and the pychm module + o chm: files in Microsoft help format need Python and the pychm module (which needs chmlib). o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar @@ -3168,11 +3179,11 @@ Chapter 5. Installation and configuration o Konqueror webarchive format with Python (uses the Tarfile module). - o mimehtml web archive format (support based on the email filter, which + o Mimehtml web archive format (support based on the email handler, which introduces some mild weirdness, but still usable). Text, HTML, email folders, and Scribus files are processed internally. Lyx - is used to index Lyx files. Many filters need iconv and the standard sed + is used to index Lyx files. Many handlers need iconv and the standard sed and awk. 5.3. 
Building from source @@ -3495,10 +3506,10 @@ Chapter 5. Installation and configuration A space-separated list of patterns for names of files or directories that should be ignored inside zip archives. This is - used directly by the zip filter, and has a function similar to + used directly by the zip handler, and has a function similar to skippedNames, but works independantly. Can be redefined for filesystem subdirectories. For versions up to 1.19, you will need - to update the Zip filter and install a supplementary Python + to update the Zip handler and install a supplementary Python module. The details are described on the Recoll wiki. followLinks @@ -3513,11 +3524,16 @@ Chapter 5. Installation and configuration indexedmimetypes Recoll normally indexes any file which it knows how to read. This - list lets you restrict the indexed mime types to what you specify. + list lets you restrict the indexed MIME types to what you specify. If the variable is unspecified or the list empty (the default), all supported types are processed. Can be redefined for subdirectories. + excludedmimetypes + + This list lets you exclude some MIME types from indexing. Can be + redefined for subdirectories. + compressedfilemaxkbs Size limit for compressed (.gz or .bz2) files. These need to be @@ -3550,14 +3566,14 @@ Chapter 5. Installation and configuration Recoll indexes file names in a special section of the database to allow specific file names searches using wild cards. This parameter decides if file name indexing is performed only for - files with mime types that would qualify them for full text + files with MIME types that would qualify them for full text indexing, or for all files inside the selected subtrees, - independently of mime type. + independently of MIME type. 
usesystemfilecommand Decide if we use the file -i system command as a final step for - determining the mime type for a file (the main procedure uses + determining the MIME type for a file (the main procedure uses suffix associations as defined in the mimemap file). This can be useful for files with suffix-less names, but it will also cause the indexing of many bogus "text" files. @@ -3770,6 +3786,9 @@ Chapter 5. Installation and configuration This is only used by the web browser plugin indexing code, and defines the maximum size for the web page cache. Default: 40 MB. + Quite unfortunately, this is only taken into account when creating + the cache file. You need to delete the file for a change to be + taken into account. idxflushmb @@ -3909,15 +3928,15 @@ Chapter 5. Installation and configuration filtermaxseconds - Maximum filter execution time, after which it is aborted. Some + Maximum handler execution time, after which it is aborted. Some postscript programs just loop... filtersdir - A directory to search for the external filter scripts used to - index some types of files. The value should not be changed, except - if you want to modify one of the default scripts. The value can be - redefined for any sub-directory. + A directory to search for the external input handler scripts used + to index some types of files. The value should not be changed, + except if you want to modify one of the default scripts. The value + can be redefined for any sub-directory. iconsdir @@ -3998,17 +4017,17 @@ Chapter 5. Installation and configuration This section defines lists of synonyms for the canonical names used inside the [prefixes] and [stored] sections - filter-specific sections + handler-specific sections - Some filters may need specific configuration for handling fields. - Only the email message filter currently has such a section (named - [mail]). It allows indexing arbitrary email headers in addition to - the ones indexed by default. 
Other such sections may appear in the - future. + Some input handlers may need specific configuration for handling + fields. Only the email message handler currently has such a + section (named [mail]). It allows indexing arbitrary email headers + in addition to the ones indexed by default. Other such sections + may appear in the future. Here follows a small example of a personal fields file. This would extract a specific email header and use it as a searchable field, with data - displayable inside result lists. (Side note: as the email filter does no + displayable inside result lists. (Side note: as the email handler does no decoding on the values, only plain ascii headers can be indexed, and only the first occurrence will be used for headers that occur several times). @@ -4040,10 +4059,10 @@ Chapter 5. Installation and configuration 5.4.3. The mimemap file - mimemap specifies the file name extension to mime type mappings. + mimemap specifies the file name extension to MIME type mappings. For file names without an extension, or with an unknown one, the system's - file -i command will be executed to determine the mime type (this can be + file -i command will be executed to determine the MIME type (this can be switched off inside the main configuration file). The mappings can be specified on a per-subtree basis, which may be useful @@ -4064,7 +4083,7 @@ Chapter 5. Installation and configuration 5.4.4. The mimeconf file - mimeconf specifies how the different mime types are handled for indexing, + mimeconf specifies how the different MIME types are handled for indexing, and which icons are displayed in the recoll result lists. Changing the parameters in the [index] section is probably not a good idea @@ -4088,7 +4107,7 @@ Chapter 5. Installation and configuration Recoll GUI preferences, all mimeview entries will be ignored except the one labelled application/x-all (which is set to use xdg-open by default). 
- In this case, the xallexcepts top level variable defines a list of mime + In this case, the xallexcepts top level variable defines a list of MIME type exceptions which will be processed according to the local entries instead of being passed to the desktop. This is so that specific Recoll options such as a page number or a search string can be passed to @@ -4101,13 +4120,13 @@ Chapter 5. Installation and configuration All viewer definition entries must be placed under a [view] section. - The keys in the file are normally mime types. You can add an application + The keys in the file are normally MIME types. You can add an application tag to specialize the choice for an area of the filesystem (using a localfields specification in mimeconf). The syntax for the key is mimetype|tag The nouncompforviewmts entry, (placed at the top level, outside of the - [view] section), holds a list of mime types that should not be + [view] section), holds a list of MIME types that should not be uncompressed before starting the viewer (if they are found compressed, ie: mydoc.doc.gz). @@ -4127,7 +4146,7 @@ Chapter 5. Installation and configuration will not create a temporary file to extract the subdocument, expecting the called application (possibly a script) to be able to handle it. - o %M. Mime type + o %M. MIME type o %p. Page index. Only significant for a subset of document types, currently only PDF, Postscript and DVI files. Can be used to start the @@ -4180,7 +4199,7 @@ Chapter 5. Installation and configuration .blob = application/x-blobapp - Note that the mime type is made up here, and you could call it + Note that the MIME type is made up here, and you could call it diesel/oil just the same. o In $RECOLL_CONFDIR/mimeview under the [view] section, add: @@ -4191,7 +4210,7 @@ Chapter 5. Installation and configuration would use %u if it liked URLs better. 
If you just wanted to change the application used by Recoll to display a - mime type which it already knows, you would just need to edit mimeview. + MIME type which it already knows, you would just need to edit mimeview. The entries you add in your personal file override those in the central configuration, which you do not need to alter. mimeview can also be modified from the Gui. @@ -4213,14 +4232,14 @@ Chapter 5. Installation and configuration for the files inside the result lists. Icons are normally 64x64 pixels PNG files which live in /usr/[local/]share/recoll/images. - o Under the [categories] section, you should add the mime type where it + o Under the [categories] section, you should add the MIME type where it makes sense (you can also create a category). Categories may be used for filtering in advanced search. - The rclblob filter should be an executable program or script which exists + The rclblob handler should be an executable program or script which exists inside /usr/[local/]share/recoll/filters. It will be given a file name as argument and should output the text or html contents on the standard output. - The filter programming section describes in more detail how to write a - filter. + The filter programming section describes in more detail how to write an + input handler.
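Illustration (not part of the patch): the last hunk describes the rclblob handler as a program which is given a file name as argument and writes text or HTML to standard output. A minimal sketch of such a handler in Python could look like the following. It rests on several assumptions: the .blob format is made up (as the text itself says), it is treated here as plain UTF-8 text, and the function name blob_to_html and the exact HTML layout are hypothetical choices, not something taken from the Recoll sources.

```python
#!/usr/bin/env python3
"""Hypothetical rclblob handler sketch: file name in, minimal HTML out."""
import html
import sys


def blob_to_html(data: bytes, title: str = "") -> str:
    """Wrap the raw file contents in a minimal HTML document."""
    # Assumption: the made-up .blob format is plain UTF-8 text.
    # Undecodable bytes are replaced rather than aborting the handler.
    text = data.decode("utf-8", errors="replace")
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
        f"<title>{html.escape(title)}</title>"
        "</head><body><pre>"
        f"{html.escape(text)}"
        "</pre></body></html>"
    )


if __name__ == "__main__" and len(sys.argv) == 2:
    # Called with exactly one argument: the name of the file to convert.
    with open(sys.argv[1], "rb") as f:
        sys.stdout.write(blob_to_html(f.read(), title=sys.argv[1]))
```

Escaping the text with html.escape keeps characters such as < and & in the data from being read as markup. A real handler would replace the decoding step with whatever parsing the actual format requires.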