This commit is contained in:
Jean-Francois Dockes 2015-08-06 08:26:39 +02:00
commit e9e1c6ea6d
14 changed files with 158 additions and 84 deletions


@ -1018,6 +1018,14 @@ Chapter 5. Installation and configuration
Maximum handler execution time, after which it is aborted. Some
postscript programs just loop...
filtermaxmbytes
Recoll 1.20.7 and later. Maximum handler memory utilisation. This
uses setrlimit(RLIMIT_AS) on most systems (total virtual memory
space size limit). Some programs may start with 500 MBytes of
mapped shared libraries, so take this into account when choosing a
value. The default is a liberal 2000MB.
filtersdir
A directory to search for the external input handler scripts used
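As a rough illustration of what the filtermaxmbytes limit described above amounts to, here is a minimal C++ sketch of capping a handler's address space with setrlimit(RLIMIT_AS) before exec. This is only a sketch of the general mechanism, not Recoll's actual process-execution code, and the helper name is made up.

    // Minimal sketch (not Recoll's ExecCmd code): start a handler with its
    // total virtual memory capped via RLIMIT_AS, as filtermaxmbytes does.
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    static int run_with_vm_limit(const char *cmd, char *const argv[], long maxmbytes)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            // Child: set the address space limit, then exec the handler.
            struct rlimit lim;
            lim.rlim_cur = lim.rlim_max = (rlim_t)maxmbytes * 1024 * 1024;
            if (setrlimit(RLIMIT_AS, &lim) != 0)
                perror("setrlimit(RLIMIT_AS)");
            execvp(cmd, argv);
            _exit(127); // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);
        return status;
    }

Because RLIMIT_AS covers everything mapped into the process, including shared libraries, the limit must leave room for them, which is why the default is a generous 2000 MB.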


@ -1858,25 +1858,23 @@ Chapter 3. Searching
third option has been available in recent releases and is probably now
the best one: use PRE tags with line wrapping.
o Use desktop preferences to choose document editor: if this is checked,
the xdg-open utility will be used to open files when you click the
Open link in the result list, instead of the application defined in
mimeview. xdg-open will in turn use your desktop preferences to choose
an appropriate application.
o Choose editor applications: this opens a dialog which allows you to
select the application to be used to open each MIME type. The default
is normally to use the xdg-open utility, but you can override it.
o Exceptions: when using the desktop preferences for opening documents,
these are MIME types that will still be opened according to Recoll
preferences. This is useful for passing parameters like page numbers
or search strings to applications that support them (e.g. evince).
This cannot be done with xdg-open which only supports passing one
parameter.
o Exceptions: even when xdg-open is used by default for opening
documents, you can set exceptions for MIME types that will still be
opened according to Recoll preferences. This is useful for passing
parameters like page numbers or search strings to applications that
support them (e.g. evince). This cannot be done with xdg-open which
only supports passing one parameter.
o Choose editor applications: this will let you choose the command
started by the Open links inside the result list, for specific
document types.
o Document filter choice style: this will let you choose if the document
categories are displayed as a list, a set of buttons, or a menu.
o Display category filter as toolbar... this will let you choose if the
document categories are displayed as a list or a set of buttons.
o Start with simple search mode: this lets you choose the value of the
simple search type on program startup. Either a fixed value (e.g.
Query Language), or the value in use when the program last exited.
o Auto-start simple search on white space entry: if this is checked, a
search will be executed each time you enter a space in the simple
@ -2159,7 +2157,10 @@ Chapter 3. Searching
recollq is not built by default. You can use the Makefile in the query
directory to build it. This is a very simple program, and if you can
program a little C++, you may find it useful to tailor its output format
to your needs.
to your needs. Note that recollq is only really useful on systems where the
Qt libraries (or even the X11 ones) are not available. Otherwise, just use
recoll -t, which takes the exact same parameters and options as described
for recollq.
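For completeness, here is a small hedged C++ sketch of driving recoll -t (or recollq) from another program through a pipe and reading its output line by line. The query term is just an example; the output is whatever the tool prints, as described above.

    // Hedged sketch: run "recoll -t" (same interface as recollq) through a
    // pipe and copy its output. The query term is arbitrary.
    #include <cstdio>
    #include <iostream>

    int main()
    {
        FILE *fp = popen("recoll -t someterm", "r");
        if (!fp) {
            perror("popen");
            return 1;
        }
        char line[4096];
        while (fgets(line, sizeof(line), fp))
            std::cout << line;   // pass through whatever recoll -t prints
        return pclose(fp) == 0 ? 0 : 1;
    }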
recollq has a man page (not installed by default, look in the doc/man
directory). The Usage string is as follows:
@ -4286,6 +4287,14 @@ Chapter 5. Installation and configuration
Maximum handler execution time, after which it is aborted. Some
postscript programs just loop...
filtermaxmbytes
Recoll 1.20.7 and later. Maximum handler memory utilisation. This
uses setrlimit(RLIMIT_AS) on most systems (total virtual memory
space size limit). Some programs may start with 500 MBytes of
mapped shared libraries, so take this into account when choosing a
value. The default is a liberal 2000MB.
filtersdir
A directory to search for the external input handler scripts used


@ -709,6 +709,12 @@ bool TextSplit::text_to_words(const string &in)
// confusing.
// ie "MySQL manual" is matched by "MySQL manual" and
// "my sql manual" but not "mysql manual"
// A possibility would be to emit both my and sql at the
// same position. All non-phrase searches would work, and
// both "MySQL manual" and "mysql manual" phrases would
// match too. "my sql manual" would not match, but this is
// not an issue.
case A_ULETTER:
if (m_span.length() &&
charclasses[(unsigned char)m_span[m_span.length() - 1]] ==
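To make the idea in the comment above concrete, here is a purely hypothetical sketch using the Xapian API directly (this is not what TextSplit currently does, and Recoll's indexing path goes through its own term processing chain): emit the camel-cased span and its sub-words at overlapping positions.

    // Hypothetical illustration of the commented idea, not current behaviour:
    // index "MySQL manual" so both "mysql manual" and "MySQL manual" phrase
    // searches match, while "my sql manual" still does not.
    #include <xapian.h>

    static void index_example(Xapian::Document& doc)
    {
        doc.add_posting("mysql", 1);   // the whole span
        doc.add_posting("my", 1);      // sub-words share the span's position
        doc.add_posting("sql", 1);
        doc.add_posting("manual", 2);  // next word of the phrase
    }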


@ -2917,7 +2917,11 @@ MimeType=*/*
use the <filename>Makefile</filename> in the
<filename>query</filename> directory to build it. This is a very
simple program, and if you can program a little C++, you may find it
useful to tailor its output format to your needs.</para>
useful to tailor its output format to your needs. Note that recollq is
only really useful on systems where the Qt libraries (or even the X11
ones) are not available. Otherwise, just use <literal>recoll
-t</literal>, which takes the exact same parameters and options as
described for <command>recollq</command>.</para>
<para><command>recollq</command> has a man page (not installed by
default, look in the <filename>doc/man</filename> directory). The


@ -114,19 +114,26 @@ def main (args):
except getopt.GetoptError:
error("error parsing input options\n")
usage(exname)
return
return False
status = True
try:
dumper = PPTDumper(args[0], globals.params)
if not dumper.dump():
error("ppt-dump: dump error " + args[0] + "\n")
status = False
except:
error("ppt-dump: FAILURE (bad format?) " + args[0] + "\n")
status = False
if globals.params.dumpText:
print(globals.textdump.replace("\r", "\n"))
return(status)
if __name__ == '__main__':
main(sys.argv)
if main(sys.argv):
sys.exit(0)
else:
sys.exit(1)
# vim:set filetype=python shiftwidth=4 softtabstop=4 expandtab:


@ -28,9 +28,8 @@
#include <string>
#include <iostream>
#include <map>
#ifndef NO_NAMESPACES
using namespace std;
#endif /* NO_NAMESPACES */
#include "cstr.h"
#include "internfile.h"
@ -550,6 +549,10 @@ bool FileInterner::dijontorcl(Rcl::Doc& doc)
// doc with an ipath, not the last one which is usually text/plain. We
// also set the author and modification time from the last doc which
// has them.
//
// The stack can contain objects with an ipath element (corresponding
// to actual embedded documents), and, at the top, elements without an
// ipath element, corresponding to format translations of the last doc.
//
// The docsize is fetched from the first element without an ipath
// (first non container). If the last element directly returns
@ -579,7 +582,8 @@ void FileInterner::collectIpathAndMT(Rcl::Doc& doc) const
const map<string, string>& docdata = (*hit)->get_meta_data();
if (getKeyValue(docdata, cstr_dj_keyipath, ipathel)) {
if (!ipathel.empty()) {
// We have a non-empty ipath
// Non-empty ipath. This stack element is for an
// actual embedded document, not a format translation.
hasipath = true;
getKeyValue(docdata, cstr_dj_keymt, doc.mimetype);
getKeyValue(docdata, cstr_dj_keyfn, doc.meta[Rcl::Doc::keyfn]);
@ -593,8 +597,18 @@ void FileInterner::collectIpathAndMT(Rcl::Doc& doc) const
getKeyValue(docdata, cstr_dj_keydocsize, doc.fbytes);
doc.ipath += cstr_isep;
}
getKeyValue(docdata, cstr_dj_keyauthor, doc.meta[Rcl::Doc::keyau]);
getKeyValue(docdata, cstr_dj_keymd, doc.dmtime);
// We set the author field from the innermost doc which has
// one: allows finding, e.g. an image attachment having no
// metadata by a search on the sender name. Only do this for
// actually embedded documents (avoid replacing values from
// metacmds for the topmost one). For a topmost doc, author
will be merged by dijontorcl() later on. About the same for
// dmtime, but an external value will be replaced, not
// augmented if dijontorcl() finds an internal value.
if (hasipath) {
getKeyValue(docdata, cstr_dj_keyauthor, doc.meta[Rcl::Doc::keyau]);
getKeyValue(docdata, cstr_dj_keymd, doc.dmtime);
}
}
// Trim empty tail elements in ipath.
@ -878,12 +892,6 @@ FileInterner::Status FileInterner::internfile(Rcl::Doc& doc, const string& ipath
return FIAgain;
}
// Temporary while we fix backend things
static string urltolocalpath(string url)
{
return url.substr(7, string::npos);
}
bool FileInterner::tempFileForMT(TempFile& otemp, RclConfig* cnf,
const string& mimetype)
{


@ -235,8 +235,11 @@ Usage(void)
int main(int argc, char **argv)
{
// If "-t" is present at all, we don't do the GUI thing and pass the
// whole to recollq for command line / pipe usage.
// if we are named recollq or option "-t" is present at all, we
// don't do the GUI thing and pass the whole to recollq for
// command line / pipe usage.
if (!strcmp(argv[0], "recollq"))
exit(recollq(&theconfig, argc, argv));
for (int i = 0; i < argc; i++) {
if (!strcmp(argv[i], "-t")) {
exit(recollq(&theconfig, argc, argv));


@ -85,8 +85,7 @@ void RclMain::showFragButs()
connect(fragbuts, SIGNAL(fragmentsChanged()),
this, SLOT(onFragmentsChanged()));
} else {
delete fragbuts;
fragbuts = 0;
deleteZ(fragbuts);
}
} else {
// Close and reopen, in hope that makes us visible...
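The deleteZ() call above replaces an explicit delete-and-zero pair. It is presumably a small helper along the following lines; this is a sketch, not necessarily Recoll's exact definition.

    // Sketch of a delete-and-zero helper like the deleteZ() used above.
    template <class T> inline void deleteZ(T *&p)
    {
        delete p;
        p = 0;
    }

Zeroing the pointer in the same place as the delete avoids later tests or dereferences of a stale pointer.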


@ -279,6 +279,9 @@ void RclMain::init()
QKeySequence seq("Ctrl+Shift+s");
QShortcut *sc = new QShortcut(seq, this);
connect(sc, SIGNAL (activated()), sSearch, SLOT (takeFocus()));
QKeySequence seql("Ctrl+l");
sc = new QShortcut(seql, this);
connect(sc, SIGNAL (activated()), sSearch, SLOT (takeFocus()));
connect(&m_watcher, SIGNAL(fileChanged(QString)),
this, SLOT(idxStatus()));


@ -12,8 +12,15 @@
using namespace std;
// #define LOG_PARSER
#ifdef LOG_PARSER
#define LOGP(X) {cerr << X;}
#else
#define LOGP(X)
#endif
int yylex(yy::parser::semantic_type *, yy::parser::location_type *,
WasaParserDriver *);
WasaParserDriver *);
void yyerror(char const *);
static void qualify(Rcl::SearchDataClauseDist *, const string &);
@ -46,8 +53,8 @@ static void addSubQuery(WasaParserDriver *d,
%type <sd> query
%type <str> complexfieldname
/* Non operator tokens need precedence because of the possibility of
concatenation which needs to have lower prec than OR */
/* Non operator tokens need precedence because of the possibility of
concatenation which needs to have lower prec than OR */
%left <str> WORD
%left <str> QUOTED
%left <str> QUALIFIERS
@ -60,13 +67,14 @@ static void addSubQuery(WasaParserDriver *d,
topquery: query
{
LOGP("END PARSING\n");
d->m_result = $1;
}
query:
query query %prec UCONCAT
{
//cerr << "q: query query" << endl;
LOGP("q: query query\n");
Rcl::SearchData *sd = new Rcl::SearchData(Rcl::SCLT_AND, d->m_stemlang);
addSubQuery(d, sd, $1);
addSubQuery(d, sd, $2);
@ -74,7 +82,7 @@ query query %prec UCONCAT
}
| query AND query
{
//cerr << "q: query AND query" << endl;
LOGP("q: query AND query\n");
Rcl::SearchData *sd = new Rcl::SearchData(Rcl::SCLT_AND, d->m_stemlang);
addSubQuery(d, sd, $1);
addSubQuery(d, sd, $3);
@ -82,7 +90,7 @@ query query %prec UCONCAT
}
| query OR query
{
//cerr << "q: query OR query" << endl;
LOGP("q: query OR query\n");
Rcl::SearchData *top = new Rcl::SearchData(Rcl::SCLT_AND, d->m_stemlang);
Rcl::SearchData *sd = new Rcl::SearchData(Rcl::SCLT_OR, d->m_stemlang);
addSubQuery(d, sd, $1);
@ -92,13 +100,13 @@ query query %prec UCONCAT
}
| '(' query ')'
{
//cerr << "q: ( query )" << endl;
LOGP("q: ( query )\n");
$$ = $2;
}
|
fieldexpr %prec UCONCAT
{
//cerr << "q: fieldexpr" << endl;
LOGP("q: fieldexpr\n");
Rcl::SearchData *sd = new Rcl::SearchData(Rcl::SCLT_AND, d->m_stemlang);
d->addClause(sd, $1);
$$ = sd;
@ -107,12 +115,12 @@ fieldexpr %prec UCONCAT
fieldexpr: term
{
// cerr << "fe: simple fieldexpr: " << $1->gettext() << endl;
LOGP("fe: simple fieldexpr: " << $1->gettext() << endl);
$$ = $1;
}
| complexfieldname EQUALS term
{
// cerr << "fe: " << *$1 << " = " << $3->gettext() << endl;
LOGP("fe: " << *$1 << " = " << $3->gettext() << endl);
$3->setfield(*$1);
$3->setrel(Rcl::SearchDataClause::REL_EQUALS);
$$ = $3;
@ -120,7 +128,7 @@ fieldexpr: term
}
| complexfieldname CONTAINS term
{
// cerr << "fe: " << *$1 << " : " << $3->gettext() << endl;
LOGP("fe: " << *$1 << " : " << $3->gettext() << endl);
$3->setfield(*$1);
$3->setrel(Rcl::SearchDataClause::REL_CONTAINS);
$$ = $3;
@ -128,7 +136,7 @@ fieldexpr: term
}
| complexfieldname SMALLER term
{
// cerr << "fe: " << *$1 << " < " << $3->gettext() << endl;
LOGP("fe: " << *$1 << " < " << $3->gettext() << endl);
$3->setfield(*$1);
$3->setrel(Rcl::SearchDataClause::REL_LT);
$$ = $3;
@ -136,7 +144,7 @@ fieldexpr: term
}
| complexfieldname SMALLEREQ term
{
// cerr << "fe: " << *$1 << " <= " << $3->gettext() << endl;
LOGP("fe: " << *$1 << " <= " << $3->gettext() << endl);
$3->setfield(*$1);
$3->setrel(Rcl::SearchDataClause::REL_LTE);
$$ = $3;
@ -144,7 +152,7 @@ fieldexpr: term
}
| complexfieldname GREATER term
{
// cerr << "fe: " << *$1 << " > " << $3->gettext() << endl;
LOGP("fe: " << *$1 << " > " << $3->gettext() << endl);
$3->setfield(*$1);
$3->setrel(Rcl::SearchDataClause::REL_GT);
$$ = $3;
@ -152,7 +160,7 @@ fieldexpr: term
}
| complexfieldname GREATEREQ term
{
// cerr << "fe: " << *$1 << " >= " << $3->gettext() << endl;
LOGP("fe: " << *$1 << " >= " << $3->gettext() << endl);
$3->setfield(*$1);
$3->setrel(Rcl::SearchDataClause::REL_GTE);
$$ = $3;
@ -160,7 +168,7 @@ fieldexpr: term
}
| '-' fieldexpr
{
// cerr << "fe: - fieldexpr[" << $2->gettext() << "]" << endl;
LOGP("fe: - fieldexpr[" << $2->gettext() << "]" << endl);
$2->setexclude(true);
$$ = $2;
}
@ -170,13 +178,13 @@ fieldexpr: term
complexfieldname:
WORD
{
// cerr << "cfn: WORD" << endl;
LOGP("cfn: WORD" << endl);
$$ = $1;
}
|
complexfieldname CONTAINS WORD
{
// cerr << "cfn: complexfieldname ':' WORD" << endl;
LOGP("cfn: complexfieldname ':' WORD" << endl);
$$ = new string(*$1 + string(":") + *$3);
delete $1;
delete $3;
@ -185,7 +193,7 @@ complexfieldname CONTAINS WORD
term:
WORD
{
//cerr << "term[" << *$1 << "]" << endl;
LOGP("term[" << *$1 << "]" << endl);
$$ = new Rcl::SearchDataClauseSimple(Rcl::SCLT_AND, *$1);
delete $1;
}
@ -197,13 +205,13 @@ WORD
qualquote:
QUOTED
{
// cerr << "QUOTED[" << *$1 << "]" << endl;
LOGP("QUOTED[" << *$1 << "]" << endl);
$$ = new Rcl::SearchDataClauseDist(Rcl::SCLT_PHRASE, *$1, 0);
delete $1;
}
| QUOTED QUALIFIERS
{
// cerr << "QUOTED[" << *$1 << "] QUALIFIERS[" << *$2 << "]" << endl;
LOGP("QUOTED[" << *$1 << "] QUALIFIERS[" << *$2 << "]" << endl);
Rcl::SearchDataClauseDist *cl =
new Rcl::SearchDataClauseDist(Rcl::SCLT_PHRASE, *$1, 0);
qualify(cl, *$2);
@ -318,8 +326,9 @@ static int parseString(WasaParserDriver *d, yy::parser::semantic_type *yylval)
break;
case '"':
/* End of string. Look for qualifiers */
while ((c = d->GETCHAR()) && !isspace(c))
while ((c = d->GETCHAR()) && (isalnum(c) || c == '.'))
d->qualifiers().push_back(c);
d->UNGETCHAR(c);
goto out;
default:
value->push_back(c);


@ -91,11 +91,11 @@ bool SearchData::maybeAddAutoPhrase(Rcl::Db& db, double freqThreshold)
string field;
vector<string> words;
// Walk the clause list. If we find any non simple clause or different
// field names, bail out.
// Walk the clause list. If this is not an AND list, or if we find any
// non-simple clause or different field names, bail out.
for (qlist_it_t it = m_query.begin(); it != m_query.end(); it++) {
SClType tp = (*it)->m_tp;
if (tp != SCLT_AND && tp != SCLT_OR) {
if (tp != SCLT_AND) {
LOGDEB2(("SearchData::maybeAddAutoPhrase: wrong tp %d\n", tp));
return false;
}


@ -121,10 +121,10 @@ subdirectory, because of all the places they're referred from
<p><a href="recoll-1.20.6.tar.gz">recoll-1.20.6.tar.gz</a>.</p>
<h3>Release 1.21.0</h3>
<h3>Release 1.21.1</h3>
<p>Not the right choice if you are after complete stability:
<a href="recoll-1.21.0.tar.gz">recoll-1.21.0.tar.gz</a>. See what's
<a href="recoll-1.21.1.tar.gz">recoll-1.21.1.tar.gz</a>. See what's
new in the <a href="release-1.21.html">release notes</a>.</p>
<!--


@ -7,12 +7,12 @@
== Introduction
Recoll is a big process which executes many others, mostly for extracting
text from documents. Some of the executed processes are quite short-lived,
and the time used by the process execution machinery can actually dominate
the time used to translate data. This document explores possible approaches
to improving performance without adding excessive complexity or damaging
reliability.
The Recoll indexer, *recollindex*, is a big process which executes many
others, mostly for extracting text from documents. Some of the executed
processes are quite short-lived, and the time used by the process execution
machinery can actually dominate the time used to translate data. This
document explores possible approaches to improving performance without
adding excessive complexity or damaging reliability.
Studying fork/exec performance is not exactly a new venture, and there are
many texts which address the subject. While researching, though, I found
@ -32,9 +32,10 @@ identical processes.
space initialized from an executable file, inheriting some of the resources
under various conditions.
As processes became bigger the copy-before-discard operation wasted
significant resources, and was optimized using two methods (at very
different points in time):
This was all fine with the small processes of the first Unix systems, but
as time progressed, processes became bigger and the copy-before-discard
operation was found to waste significant resources. It was optimized using
two methods (at very different points in time):
- The first approach was to supplement +fork()+ with the +vfork()+ call, which
is similar but does not duplicate the address space: the new process
@ -176,7 +177,7 @@ a single thread, and +fork()+ if it ran multiple ones.
After another careful look at the code, I could see few issues with
using +vfork()+ in the multithreaded indexer, so this was committed.
The only change necessary was to get rid on an implementation of the
The only change necessary was to get rid of an implementation of the
lacking Linux +closefrom()+ call (used to close all open descriptors above a
given value). The previous Recoll implementation listed the +/proc/self/fd+
directory to look for open descriptors but this was unsafe because of
@ -200,13 +201,14 @@ same times as the +fork()+/+vfork()+ options.
The tests were performed on an Intel Core i5 750 (4 cores, 4 threads).
The last line is just for the fun: *recollindex* 1.18 (single-threaded)
needed almost 6 times as long to process the same files...
It would be painful to play it safe and discard the 60% reduction in
execution time offered by using +vfork()+.
execution time offered by using +vfork()+, so this was adopted for Recoll
1.21. To this day, no problems have been discovered, but we are still
crossing fingers...
To this day, no problems were discovered, but, still crossing fingers...
The last line in the table is just for the fun: *recollindex* 1.18
(single-threaded) needed almost 6 times as long to process the same
files...
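To make the pattern discussed in this note concrete, here is a minimal hedged C++ sketch of spawning a command with vfork()/exec, including a brute-force closefrom()-style loop that avoids listing /proc/self/fd in the vfork()'d child. It illustrates the general technique only and is not Recoll's actual ExecCmd implementation.

    // Minimal sketch of the vfork()/exec pattern, not Recoll's ExecCmd code.
    // The child only closes descriptors and execs, which is what keeps
    // vfork() safe to use here.
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pid_t spawn(const char *cmd, char *const argv[])
    {
        long maxfd = sysconf(_SC_OPEN_MAX);  // computed before vfork()
        pid_t pid = vfork();
        if (pid == 0) {
            // Brute-force closefrom(): no /proc listing needed.
            for (long fd = 3; fd < maxfd; fd++)
                close(fd);
            execvp(cmd, argv);
            _exit(127);   // exec failed; must use _exit() after vfork()
        }
        return pid;       // parent resumes here once the child has exec'd
    }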
////
Objections to vfork:


@ -1,7 +1,7 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Recoll 1.20 series release notes</title>
<title>Recoll 1.21 series release notes</title>
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description"
content="recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
@ -23,7 +23,7 @@
</div>
<div class="content">
<h1>Release notes for Recoll 1.20.x</h1>
<h1>Release notes for Recoll 1.21.x</h1>
<h2>Caveats</h2>
@ -55,8 +55,23 @@
see the manual</a>). If you do so, you must then reset the
index.</p>
<h2>Minor releases</h2>
<ul>
<li>1.21.1:
<ul>
<li>Force memory usage limits on external filters.</li>
<li>GUI: add Ctrl+l as a shortcut to return focus to the
search entry (compat with web browsers).</li>
<li>Result list popup allows saving results from the web cache
to files.</li>
<li>The web history indexer also processes non-html files
(e.g.: pdfs).</li>
</ul>
</li>
</ul>
<h2>Changes in Recoll 1.21</h2>
<h2>Changes in Recoll 1.21.0</h2>
<ul>
<li>Allow saving queries to files and reloading them
@ -71,9 +86,10 @@
<li>Improve indexing speed by always using vfork() for
spawning external commands.</li>
<li>The pdf filter gains the capability to run OCR (tesseract) on
image-only files.</li>
<li>Improved check about when we should try to uncompress
stuff. Will eliminate some of the most dreadful case of
image-only files. This happens automatically on image-only
pdfs if tesseract is available.</li>
<li>Improved checks about when we should try to uncompress
stuff. Will eliminate some of the most dreadful cases of
recollindex having an impact on system performance.</li>
<li>Warn if non-existent paths are listed in the configuration
file (help with typos).</li>