
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
  <head>
    <title>RECOLL: a personal text search system for
    Unix/Linux</title>
    <meta name="generator" content="HTML Tidy, see www.w3.org">
    <meta name="Author" content="Jean-Francois Dockes">
    <meta name="Description" content=
    "recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
    <meta name="Keywords" content=
      "full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
    <meta http-equiv="Content-language" content="en">
    <meta http-equiv="content-type" content=
    "text/html; charset=iso-8859-1">
    <meta name="robots" content="All,Index,Follow">
    <link type="text/css" rel="stylesheet" href="styles/style.css">
  </head>

  <body>

    <div class="rightlinks">
      <ul>
	<li><a href="index.html">Home</a></li>
	<li><a href="pics/index.html">Screenshots</a></li>
	<li><a href="download.html">Downloads</a></li>
	<li><a href="doc.html">Documentation</a></li>
      </ul>
    </div>

    <div class="content">

      <h1>Recoll: Indexing performance and index sizes</h1>

      <p>The time needed to index a given set of documents and the
	resulting index size depend on many factors: file size and the
	proportion of actual text content for the index size; CPU
	speed, available memory, average file size and file format for
	the indexing speed.</p>

      <p>This page gives a number of reference points which can
	be used to roughly estimate the resources needed to create and
	store an index. Obviously, your data set will never exactly
	match one of the samples, so the results cannot be predicted
	precisely.</p>

      <p>The following very old data was obtained on a machine with a
	1800 MHz AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB
	IDE disk, running SuSE 10.1. More recent data follows.</p>

      <p><b>recollindex</b> (version 1.8.2 with Xapian 1.0.0) was
	run with the default flush threshold value.
	The process memory usage is the value reported by <b>ps</b>.</p>

      <table border=1>
	<thead>
	  <tr>
	    <th>Data</th>
	    <th>Data size</th>
	    <th>Indexing time</th>
	    <th>Index size</th>
	    <th>Peak process memory usage</th>
	  </tr>
	</thead>
	<tbody>
	  <tr>
	    <td>Random PDFs harvested on Google</td>
	    <td>1.7 GB, 3564 files</td>
	    <td>27 min</td>
	    <td>230 MB</td>
	    <td>225 MB</td>
	  </tr>
	  <tr>
	    <td>IETF mailing list archive</td>
	    <td>211 MB, 44,000 messages</td>
	    <td>8 min</td>
	    <td>350 MB</td>
	    <td>90 MB</td>
	  </tr>
	  <tr>
	    <td>Partial Wikipedia dump</td>
	    <td>15 GB, one million files</td>
	    <td>6 h 30 min</td>
	    <td>10 GB</td>
	    <td>324 MB</td>
	  </tr>
	  <tr>
	    <!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
	    <td>Random PDFs harvested on Google<br>
	    Recoll 1.9, <em>idxflushmb</em> set to 10</td>
	    <td>1.7 GB, 3564 files</td>
	    <td>25 min</td>
	    <td>262 MB</td>
	    <td>65 MB</td>
	  </tr>
	</tbody>
      </table>

      <p>Notice that the index for the mail archive is bigger than
	the data itself. Large numbers of small, pure-text documents
	have this effect. The expansion factor would of course be even
	worse when measured against compressed folders (the test was
	run on uncompressed data).</p>
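      <p>As a rough illustration, the index-to-data ratios implied by
	the table can be computed as in the sketch below. The sample
	labels and the helper names <em>index_ratio</em> and
	<em>estimate_index_mb</em> are made up for this example, and the
	conversion 1 GB = 1000 MB is an assumption. Scaling a ratio
	linearly to another data set is only a first guess, for the
	reasons given at the top of this page.</p>

```python
# Reference points copied from the table above:
# (label, data size in MB, index size in MB).
# Sizes quoted in GB are converted with 1 GB = 1000 MB (an assumption).
SAMPLES = [
    ("PDFs, Recoll 1.8.2", 1700, 230),
    ("IETF mailing list archive", 211, 350),
    ("Partial Wikipedia dump", 15000, 10000),
    ("PDFs, Recoll 1.9, idxflushmb 10", 1700, 262),
]


def index_ratio(data_mb, index_mb):
    """Return the index size as a fraction of the data size."""
    return index_mb / data_mb


def estimate_index_mb(data_mb, ratio):
    """Scale a reference ratio linearly to another data size (rough guess)."""
    return data_mb * ratio


if __name__ == "__main__":
    for label, data_mb, index_mb in SAMPLES:
        ratio = index_ratio(data_mb, index_mb)
        print(f"{label}: index is {ratio:.2f} times the data size")
```

      <p>The mail archive is the only sample with a ratio above 1,
	which is the effect described in the previous paragraph.</p>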

      <p>The last test was performed with Recoll 1.9.0, which has an
	adjustable flush threshold (<em>idxflushmb</em> parameter), here
	set to 10 MB. Notice the much lower peak memory usage, with no
	performance degradation. The resulting index is bigger, though.
	I do not know the exact reason; additional fragmentation is
	one possibility.</p>

      <p>There is more recent performance data (2012) at the end of
        the <a href="idxthreads/threadingRecoll.html">article about
          converting Recoll indexing to multithreading</a>.</p>

      <p>Update, March 2016: I took another sample of PDF performance
        data on a more modern machine, with Recoll multithreading turned
        on. The machine has an Intel Core i7-4770T CPU (4 physical
        cores, with hyper-threading for a total of 8 threads), 8 GB
        of RAM, and SSD storage. Incidentally, the PC is fanless: this
        is not a "beast" computer.</p>

      <table border=1>
	<thead>
	  <tr>
	    <th>Data</th>
	    <th>Data size</th>
	    <th>Indexing time</th>
	    <th>Index size</th>
	    <th>Peak process memory usage</th>
	  </tr>
	</thead>
	<tbody>
	  <tr>
	    <td>Random PDFs harvested on Google<br>
	    Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
	    parameters 6/4/1</td>
	    <td>11 GB, 5320 files</td>
	    <td>3 min 15 s</td>
	    <td>400 MB</td>
	    <td>545 MB</td>
	  </tr>
	</tbody>
      </table>

      <p>The indexing process used 21 min of CPU time during these
        3 min 15 s of real time, so the cores were not left idle for
        long. The improvement compared to the numbers above is quite
        spectacular (a factor of 11, approximately), mostly due to the
        multiprocessing, but also to the faster CPU and the SSD
        storage. Note that the peak memory value is for the
        recollindex process alone, and does not take into account the
        multiple Python and pdftotext instances (which are relatively
        small, but things add up).</p>
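      <p>The arithmetic behind these figures can be sketched as
        follows. The sketch assumes that the factor of 11 compares
        files indexed per minute against the Recoll 1.9 row of the
        first table; this is my reading of the numbers, not something
        the measurements state explicitly. Comparing data volume
        instead would give a different figure, because the two PDF
        samples differ in average file size. The function names are
        made up for the example.</p>

```python
def minutes(mn, s=0):
    """Convert minutes and seconds to fractional minutes."""
    return mn + s / 60.0


def files_per_minute(files, wall_minutes):
    """Indexing throughput in files per minute of real time."""
    return files / wall_minutes


# Figures quoted in the tables and text above.
old_files, old_wall = 3564, minutes(25)      # Recoll 1.9, idxflushmb 10
new_files, new_wall = 5320, minutes(3, 15)   # Recoll 1.21.5, multithreaded
new_cpu = minutes(21)                        # CPU time used by recollindex

# Average number of hardware threads kept busy by the indexing process.
busy_threads = new_cpu / new_wall

# Throughput ratio between the new and the old run (assumed basis: files).
speedup = files_per_minute(new_files, new_wall) / files_per_minute(old_files, old_wall)

if __name__ == "__main__":
    print(f"busy threads: {busy_threads:.2f}")
    print(f"speedup in files per minute: {speedup:.1f}")
```

      <p>On this basis the process kept between 6 and 7 of the 8
        hardware threads busy on average.</p>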

      <h5>Improving indexing performance with hardware:</h5>
      <p>I think that the following multi-step approach has a good
        chance of improving performance:</p>
        <ul>
          <li>Check that multithreading is enabled (it is by default
            with recent Recoll versions).</li>
          <li>Increase the flush threshold until the machine begins to
            have memory issues. Maybe add memory.</li>
          <li>Store the index on an SSD. If possible, also store the
            data on an SSD. In fact, when using many threads, having
            the data on an SSD is probably the more important of the
            two.</li>
          <li>If you have many files which need temporary copies
            (email attachments, archive members, compressed files), use
            a memory-backed temporary directory. Add memory.</li>
          <li>Add more CPUs.</li>
        </ul>
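      <p>For the first two steps, the settings used in the 2016 test
        could be written to the Recoll configuration roughly as in the
        sketch below. It assumes that the "thread parameters 6/4/1"
        correspond to the <em>thrTCounts</em> configuration variable
        and that <em>thrQSizes</em> must be set alongside it. The queue
        depths shown are placeholders, not measured values, and the
        helper name <em>recoll_conf_fragment</em> is made up for the
        example. Check the Recoll manual before copying any of
        this.</p>

```python
def recoll_conf_fragment(flush_mb, thread_counts, queue_sizes):
    """Build a recoll.conf fragment for the flush threshold and threading.

    flush_mb      -- value for idxflushmb, in megabytes
    thread_counts -- threads per indexing stage (assumed to map to thrTCounts)
    queue_sizes   -- queue depth per stage (thrQSizes; placeholder values here)
    """
    lines = [
        f"idxflushmb = {flush_mb}",
        "thrQSizes = " + " ".join(str(n) for n in queue_sizes),
        "thrTCounts = " + " ".join(str(n) for n in thread_counts),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 200 MB and 6/4/1 come from the 2016 test; 2/2/2 is illustrative only.
    print(recoll_conf_fragment(200, (6, 4, 1), (2, 2, 2)), end="")
```

      <p>The flush threshold is the value to raise step by step while
        watching memory usage, as described in the list above.</p>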

      <p>At some point, index writing may become the
        bottleneck. As far as I can tell, the only possible approach
        then is to partition the index.</p>

    </div>
  </body>
</html>