<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>

<body>

<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="doc.html">Documentation</a></li>
</ul>
</div>

<div class="content">

<h1>Recoll: Indexing performance and index sizes</h1>

<p>The time needed to index a given set of documents and the
resulting index size depend on many factors. The index size
depends mostly on file size and on the proportion of actual text
content. The speed of indexing depends on CPU speed, available
memory, average file size and file format.</p>

<p>We try here to give a number of reference points which can
be used to roughly estimate the resources needed to create and
store an index. Obviously, your data set will never exactly match
one of the samples, so the results cannot be predicted precisely.</p>

<p>The following data was obtained on a machine with a 1800 MHz
AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB IDE
disk, running SUSE 10.1.</p>

<p><b>recollindex</b> (version 1.8.2 with Xapian 1.0.0) is
executed with the default flush threshold value.
The process memory usage is the one reported by <b>ps</b>.</p>

<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random PDFs harvested on Google</td>
<td>1.7 GB, 3564 files</td>
<td>27 min</td>
<td>230 MB</td>
<td>225 MB</td>
</tr>
<tr>
<td>IETF mailing list archive</td>
<td>211 MB, 44,000 messages</td>
<td>8 min</td>
<td>350 MB</td>
<td>90 MB</td>
</tr>
<tr>
<td>Partial Wikipedia dump</td>
<td>15 GB, one million files</td>
<td>6 h 30 min</td>
<td>10 GB</td>
<td>324 MB</td>
</tr>
<tr>
<!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
<td>Random PDFs harvested on Google<br>
Recoll 1.9, <em>idxflushmb</em> set to 10</td>
<td>1.7 GB, 3564 files</td>
<td>25 min</td>
<td>262 MB</td>
<td>65 MB</td>
</tr>
</tbody>
</table>

<p>Notice how the index size for the mail archive is bigger than
the data size. Large numbers of small, pure-text documents will do
this. The expansion factor would of course be even worse
relative to compressed folders (the test was run on uncompressed
data).</p>

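<p>As a rough aid for reading the table, the index-to-data size
ratio and the indexing throughput can be derived from the first
three rows. The following is only a sketch: the figures are copied
from the table, sizes are converted assuming 1 GB = 1000 MB, and
the names <em>SAMPLES</em> and <em>summarize</em> are made up for
this illustration.</p>

```python
# Figures copied from the table; sizes in MB, assuming 1 GB = 1000 MB.
SAMPLES = [
    # (label, data size in MB, indexing time in minutes, index size in MB)
    ("Random PDFs", 1700, 27, 230),
    ("IETF mailing list archive", 211, 8, 350),
    ("Partial Wikipedia dump", 15000, 390, 10000),
]


def summarize(samples):
    """Return (label, index/data ratio, MB indexed per minute) for each sample."""
    rows = []
    for label, data_mb, minutes, index_mb in samples:
        rows.append((label, index_mb / data_mb, data_mb / minutes))
    return rows


for label, ratio, rate in summarize(SAMPLES):
    print(f"{label}: index/data = {ratio:.2f}, about {rate:.0f} MB/min")
```

<p>With these inputs the mail archive is the only sample with a
ratio above 1 (about 1.66), while the PDF set comes out at about
0.14. These ratios describe the samples only and should not be
read as predictions for other data sets.</p>
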
<p>The last test was performed with Recoll 1.9.0, which has an
adjustable flush threshold (<em>idxflushmb</em> parameter), here
set to 10 MB. Notice the much lower peak memory usage, with no
performance degradation. The resulting index is bigger, though.
The exact reason is not known to me; it is possibly due to
additional fragmentation.</p>

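<p>The trade-off between the two PDF runs can be put into numbers
with a few lines of arithmetic. This is a minimal sketch using only
the table figures; the dictionary and function names are
hypothetical, and the two runs also differ in Recoll version, so the
changes cannot be attributed to the flush threshold alone.</p>

```python
# Both runs index the same 1.7 GB PDF set; figures are copied from the table.
default_run = {"minutes": 27, "index_mb": 230, "peak_mb": 225}  # Recoll 1.8.2, default threshold
flush10_run = {"minutes": 25, "index_mb": 262, "peak_mb": 65}   # Recoll 1.9.0, idxflushmb set to 10


def percent_change(before, after):
    """Relative change from before to after, in percent (negative means a decrease)."""
    return (after - before) / before * 100.0


for key in ("minutes", "index_mb", "peak_mb"):
    change = percent_change(default_run[key], flush10_run[key])
    print(f"{key}: {default_run[key]} to {flush10_run[key]} ({change:+.0f}%)")
```

<p>For these two runs, peak memory drops by about 71% while the
index grows by about 14%.</p>
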
</div>
</body>
</html>