doc: clarifications in the synonyms section

Jean-Francois Dockes 2016-04-17 08:07:52 +02:00
parent 1b0e8bb5b6
commit 766e7d4804
2 changed files with 142 additions and 101 deletions


@ -20,8 +20,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h1 class="title"><a name="idp48862656" id= <h1 class="title"><a name="idp59627200" id=
"idp48862656"></a>Recoll user manual</h1> "idp59627200"></a>Recoll user manual</h1>
</div> </div>
<div> <div>
@ -109,13 +109,13 @@ alink="#0000FF">
multiple indexes</a></span></dt> multiple indexes</a></span></dt>
<dt><span class="sect2">2.1.3. <a href= <dt><span class="sect2">2.1.3. <a href=
"#idp54428144">Document types</a></span></dt> "#idp65068656">Document types</a></span></dt>
<dt><span class="sect2">2.1.4. <a href= <dt><span class="sect2">2.1.4. <a href=
"#idp54447824">Indexing failures</a></span></dt> "#idp65088336">Indexing failures</a></span></dt>
<dt><span class="sect2">2.1.5. <a href= <dt><span class="sect2">2.1.5. <a href=
"#idp48768496">Recovery</a></span></dt> "#idp65095792">Recovery</a></span></dt>
</dl> </dl>
</dd> </dd>
@ -293,9 +293,8 @@ alink="#0000FF">
line</a></span></dt> line</a></span></dt>
<dt><span class="sect1">3.4. <a href= <dt><span class="sect1">3.4. <a href=
"#RCL.SEARCH.SYNONYMS">Using Synonyms (<span class= "#RCL.SEARCH.SYNONYMS">Using Synonyms
"application">Recoll</span> 1.22 and (1.22)</a></span></dt>
later)</a></span></dt>
<dt><span class="sect1">3.5. <a href= <dt><span class="sect1">3.5. <a href=
"#RCL.SEARCH.PTRANS">Path translations</a></span></dt> "#RCL.SEARCH.PTRANS">Path translations</a></span></dt>
@ -500,12 +499,10 @@ alink="#0000FF">
are specific to Unix, and not valid on <span class= are specific to Unix, and not valid on <span class=
"application">Windows</span>. Some described features are "application">Windows</span>. Some described features are
also not available on <span class= also not available on <span class=
"application">Windows</span>.</p> "application">Windows</span>. The manual will be
progressively updated. Until this happens, most references to
<p>The manual will be progressively updated for <span class= shared files can be translated by looking under the Recoll
"application">Windows</span>. Until this happens, most installation directory (esp. the <code class=
references to files can be translated by looking under the
Recoll installation directory (esp. the <code class=
"filename">Share</code> subdirectory). The user configuration "filename">Share</code> subdirectory). The user configuration
is stored by default under <code class= is stored by default under <code class=
"filename">AppData/Local/Recoll</code> inside the user "filename">AppData/Local/Recoll</code> inside the user
@ -546,12 +543,18 @@ alink="#0000FF">
the <span class="guilabel">Top directories</span> the <span class="guilabel">Top directories</span>
section).</p> section).</p>
<p>Also be aware that you may need to install the <p>Also be aware that, on Unix/Linux, you may need to
appropriate <a class="link" href="#RCL.INSTALL.EXTERNAL" install the appropriate <a class="link" href=
title="5.2.&nbsp;Supporting packages">supporting "#RCL.INSTALL.EXTERNAL" title=
applications</a> for document types that need them (for "5.2.&nbsp;Supporting packages">supporting applications</a>
example <span class="application">antiword</span> for for document types that need them (for example <span class=
<span class="application">Microsoft Word</span> files).</p> "application">antiword</span> for <span class=
"application">Microsoft Word</span> files).</p>
<p>The <span class="application">Recoll</span> installation
for <span class="application">Windows</span> is
self-contained and includes most useful auxiliary programs.
You will just need to install Python 2.7.</p>
</div> </div>
<div class="sect1"> <div class="sect1">
@ -978,8 +981,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idp54428144" id= <h3 class="title"><a name="idp65068656" id=
"idp54428144"></a>2.1.3.&nbsp;Document types</h3> "idp65068656"></a>2.1.3.&nbsp;Document types</h3>
</div> </div>
</div> </div>
</div> </div>
@ -1072,8 +1075,8 @@ indexedmimetypes = application/pdf
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idp54447824" id= <h3 class="title"><a name="idp65088336" id=
"idp54447824"></a>2.1.4.&nbsp;Indexing "idp65088336"></a>2.1.4.&nbsp;Indexing
failures</h3> failures</h3>
</div> </div>
</div> </div>
@ -1113,8 +1116,8 @@ indexedmimetypes = application/pdf
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idp48768496" id= <h3 class="title"><a name="idp65095792" id=
"idp48768496"></a>2.1.5.&nbsp;Recovery</h3> "idp65095792"></a>2.1.5.&nbsp;Recovery</h3>
</div> </div>
</div> </div>
</div> </div>
@ -4562,36 +4565,46 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r
<h2 class="title" style="clear: both"><a name= <h2 class="title" style="clear: both"><a name=
"RCL.SEARCH.SYNONYMS" id= "RCL.SEARCH.SYNONYMS" id=
"RCL.SEARCH.SYNONYMS"></a>3.4.&nbsp;Using Synonyms "RCL.SEARCH.SYNONYMS"></a>3.4.&nbsp;Using Synonyms
(<span class="application">Recoll</span> 1.22 and (1.22)</h2>
later)</h2>
</div> </div>
</div> </div>
</div> </div>
<p>There are a number of different uses for synonyms in <p><b>Term synonyms:&nbsp;</b>there are a number of ways to
text search. They can be used at index time (either to use term synonyms for searching text:</p>
increase or decrease the number of indexed terms), or at
query time, to reduce user terms to a set of canonical
ones, or to expand queries to match texts containing
synonyms of the user terms.</p>
<p>Only the last approach is used in <span class= <div class="itemizedlist">
"application">Recoll</span>. Synonym groups can be defined <ul class="itemizedlist" style="list-style-type: disc;">
so that a user query term which is found to be part of a <li class="listitem">
synonym group will be optionally expanded into an OR query <p>At index creation time, they can be used to alter
for all synonyms.</p> the indexed terms, either increasing or decreasing
their number, by expanding the original terms to all
synonyms, or by reducing all synonym terms to a
canonical one.</p>
</li>
<p>What is it good for ? The synonyms function is probably <li class="listitem">
not going to help you find your letters to Mr. Smith. It is <p>At query time, they can be used to match texts
best used for domain-specific searches. For example, it was containing terms which are synonyms of the ones
initially suggested by a user performing searches among specified by the user, either by expanding the query
historical documents: the synonyms file would contains for all synonyms, or by reducing the user entry to
nicknames and aliases for each of the persons of canonical terms (the latter only works if the
interest.</p> corresponding processing has been performed while
creating the index).</p>
</li>
</ul>
</div>
<p>In practise, synonym groups are defined inside ordinary <p><span class="application">Recoll</span> only uses
text files. Each line in the file defines a group. synonyms at query time. A user query term which part of a
Example:</p> synonym group will be optionally expanded into an
<code class="literal">OR</code> query for all terms in the
group.</p>
<p>Synonym groups are defined inside ordinary text files.
Each line in the file defines a group.</p>
<p>Example:</p>
<pre class="programlisting"> <pre class="programlisting">
hi hello "good morning" hi hello "good morning"
@ -4601,29 +4614,39 @@ bye goodbye "see you" \
</pre> </pre>
<p>As usual lines beginning with a <code class= <p>As usual, lines beginning with a <code class=
"literal">#</code> are comments, empty lines are ignored, "literal">#</code> are comments, empty lines are ignored,
and lines can be continued by ending them with a and lines can be continued by ending them with a
backslash.</p> backslash.</p>
<p>The synonyms are searched for matches with user terms
after these are stem-expanded, but the contents of the
synonyms file itself is not subjected to stem expansion
(1.22). This means that a match will not be found if the
form present in the synonyms file is not present anywhere
in the document set.</p>
<p>Multi-word synonyms are supported, but be aware that <p>Multi-word synonyms are supported, but be aware that
these will generate phrase queries, which may degrade these will generate phrase queries, which may degrade
performance (and also, no stemming).</p> performance and will disable stemming expansion for the
phrase terms.</p>
<p>A synonyms file can be specified in the GUI preferences, <p>The synonyms file can be specified in the <span class=
or as an option to <span class= "guilabel">Search parameters</span> tab of the <span class=
"command"><strong>recollq</strong></span>.</p> "guilabel">GUI configuration</span> <span class=
"guilabel">Preferences</span> menu entry, or as an option
for command-line searches.</p>
<p>This feature is new in <span class= <p>Once the file is defined, the use of synonyms can be
"application">Recoll</span> 1.22 and will probably need to enabled or disabled directly from the <span class=
be refined after some user feedback.</p> "guilabel">Preferences</span> menu.</p>
<p>The synonyms are searched for matches with user terms
after the latter are stem-expanded, but the contents of the
synonyms file itself is not subjected to stem expansion.
This means that a match will not be found if the form
present in the synonyms file is not present anywhere in the
document set.</p>
<p>The synonyms function is probably not going to help you
find your letters to Mr. Smith. It is best used for
domain-specific searches. For example, it was initially
suggested by a user performing searches among historical
documents: the synonyms file would contains nicknames and
aliases for each of the persons of interest.</p>
</div> </div>
<div class="sect1"> <div class="sect1">


@ -57,10 +57,8 @@
<application>MS-Windows</application>. Many references in this <application>MS-Windows</application>. Many references in this
manual, especially file locations, are specific to Unix, and not manual, especially file locations, are specific to Unix, and not
valid on &WIN;. Some described features are also not available on valid on &WIN;. Some described features are also not available on
&WIN;.</para> &WIN;. The manual will be progressively updated. Until this happens,
most references to shared files can be translated by looking under
<para>The manual will be progressively updated for &WIN;. Until this
happens, most references to files can be translated by looking under
the Recoll installation directory (esp. the the Recoll installation directory (esp. the
<filename>Share</filename> subdirectory). The user configuration is <filename>Share</filename> subdirectory). The user configuration is
stored by default under <filename>AppData/Local/Recoll</filename> stored by default under <filename>AppData/Local/Recoll</filename>
@ -87,11 +85,16 @@
</menuchoice>, then adjust the <guilabel>Top </menuchoice>, then adjust the <guilabel>Top
directories</guilabel> section).</para> directories</guilabel> section).</para>
<para>Also be aware that you may need to install the <para>Also be aware that, on Unix/Linux, you may need to install the
appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting
applications</link> for document types that need them (for applications</link> for document types that need them (for
example <application>antiword</application> for example <application>antiword</application> for
<application>Microsoft Word</application> files).</para> <application>Microsoft Word</application> files).</para>
<para>The &RCL; installation for &WIN; is self-contained and includes
most useful auxiliary programs. You will just need to install Python
2.7.</para>
</sect1> </sect1>
<sect1 id="RCL.INTRODUCTION.SEARCH"> <sect1 id="RCL.INTRODUCTION.SEARCH">
@ -3062,28 +3065,32 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r
</sect1> </sect1>
<sect1 id="RCL.SEARCH.SYNONYMS"> <sect1 id="RCL.SEARCH.SYNONYMS">
<title>Using Synonyms (&RCL; 1.22 and later)</title> <title>Using Synonyms (1.22)</title>
<para>There are a number of different uses for synonyms in text <formalpara><title>Term synonyms:</title>
search. They can be used at index time (either to increase or decrease <para>there are a number of ways to use term synonyms for searching text:
the number of indexed terms), or at query time, to reduce user terms to <itemizedlist>
a set of canonical ones, or to expand queries to match texts containing <listitem><para>At index creation time, they can be used to alter the
synonyms of the user terms.</para> indexed terms, either increasing or decreasing their number, by
expanding the original terms to all synonyms, or by
reducing all synonym terms to a canonical one.</para></listitem>
<listitem><para>At query time, they can be used to match texts
containing terms which are synonyms of the ones specified by the user,
either by expanding the query for all synonyms, or by reducing the user
entry to canonical terms (the latter only works if the corresponding
processing has been performed while creating the index).</para></listitem>
</itemizedlist>
</para>
</formalpara>
<para>Only the last approach is used in &RCL;. Synonym groups can be <para>&RCL; only uses synonyms at query time. A user query term which
defined so that a user query term which is found to be part of a part of a synonym group will be optionally expanded into an
synonym group will be optionally expanded into an OR query for all <literal>OR</literal> query for all terms in the group.</para>
synonyms.</para>
<para>What is it good for ? The synonyms function is probably not going <para>Synonym groups are defined inside ordinary text files. Each line
to help you find your letters to Mr. Smith. It is best used for in the file defines a group.</para>
domain-specific searches. For example, it was initially suggested by a
user performing searches among historical documents: the synonyms file
would contains nicknames and aliases for each of the persons of
interest.</para>
<para>In practise, synonym groups are defined inside ordinary text <para>Example:
files. Each line in the file defines a group. Example:
<programlisting> <programlisting>
hi hello "good morning" hi hello "good morning"
@ -3091,26 +3098,37 @@ hi hello "good morning"
bye goodbye "see you" \ bye goodbye "see you" \
"au revoir" "au revoir"
</programlisting> </programlisting>
As usual lines beginning with a <literal>#</literal> are comments, </para>
<para>As usual, lines beginning with a <literal>#</literal> are comments,
empty lines are ignored, and lines can be continued by ending them with empty lines are ignored, and lines can be continued by ending them with
a backslash. a backslash.
</para> </para>
<para>The synonyms are searched for matches with user terms after these
are stem-expanded, but the contents of the synonyms file itself is not
subjected to stem expansion (1.22). This means that a match
will not be found if the form present in the synonyms file is not
present anywhere in the document set.</para>
<para>Multi-word synonyms are supported, but be aware that these will <para>Multi-word synonyms are supported, but be aware that these will
generate phrase queries, which may degrade performance (and also, no generate phrase queries, which may degrade performance and will disable
stemming).</para> stemming expansion for the phrase terms.</para>
<para>A synonyms file can be specified in the GUI preferences, or as an <para>The synonyms file can be specified in the <guilabel>Search
option to <command>recollq</command>.</para> parameters</guilabel> tab of the <guilabel>GUI configuration</guilabel>
<guilabel>Preferences</guilabel> menu entry, or as an option for
command-line searches.</para>
<para>Once the file is defined, the use of synonyms can be enabled or
disabled directly from the <guilabel>Preferences</guilabel>
menu.</para>
<para>This feature is new in &RCL; 1.22 and will probably need to be <para>The synonyms are searched for matches with user terms after the
refined after some user feedback.</para> latter are stem-expanded, but the contents of the synonyms file itself
is not subjected to stem expansion. This means that a match will not be
found if the form present in the synonyms file is not present anywhere
in the document set.</para>
<para>The synonyms function is probably not going to help you find your
letters to Mr. Smith. It is best used for domain-specific searches. For
example, it was initially suggested by a user performing searches among
historical documents: the synonyms file would contains nicknames and
aliases for each of the persons of interest.</para>
</sect1> </sect1>