1290 lines
43 KiB
HTML
1290 lines
43 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
|
|
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
|
|
<meta name="generator" content="AsciiDoc 8.6.9" />
|
|
<title>Converting Recoll indexing to multithreading</title>
|
|
<style type="text/css">
|
|
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
|
|
|
|
/* Default font. */
|
|
body {
|
|
font-family: Georgia,serif;
|
|
}
|
|
|
|
/* Title font. */
|
|
h1, h2, h3, h4, h5, h6,
|
|
div.title, caption.title,
|
|
thead, p.table.header,
|
|
#toctitle,
|
|
#author, #revnumber, #revdate, #revremark,
|
|
#footer {
|
|
font-family: Arial,Helvetica,sans-serif;
|
|
}
|
|
|
|
body {
|
|
margin: 1em 5% 1em 5%;
|
|
}
|
|
|
|
a {
|
|
color: blue;
|
|
text-decoration: underline;
|
|
}
|
|
a:visited {
|
|
color: fuchsia;
|
|
}
|
|
|
|
em {
|
|
font-style: italic;
|
|
color: navy;
|
|
}
|
|
|
|
strong {
|
|
font-weight: bold;
|
|
color: #083194;
|
|
}
|
|
|
|
h1, h2, h3, h4, h5, h6 {
|
|
color: #527bbd;
|
|
margin-top: 1.2em;
|
|
margin-bottom: 0.5em;
|
|
line-height: 1.3;
|
|
}
|
|
|
|
h1, h2, h3 {
|
|
border-bottom: 2px solid silver;
|
|
}
|
|
h2 {
|
|
padding-top: 0.5em;
|
|
}
|
|
h3 {
|
|
float: left;
|
|
}
|
|
h3 + * {
|
|
clear: left;
|
|
}
|
|
h5 {
|
|
font-size: 1.0em;
|
|
}
|
|
|
|
div.sectionbody {
|
|
margin-left: 0;
|
|
}
|
|
|
|
hr {
|
|
border: 1px solid silver;
|
|
}
|
|
|
|
p {
|
|
margin-top: 0.5em;
|
|
margin-bottom: 0.5em;
|
|
}
|
|
|
|
ul, ol, li > p {
|
|
margin-top: 0;
|
|
}
|
|
ul > li { color: #aaa; }
|
|
ul > li > * { color: black; }
|
|
|
|
.monospaced, code, pre {
|
|
font-family: "Courier New", Courier, monospace;
|
|
font-size: inherit;
|
|
color: navy;
|
|
padding: 0;
|
|
margin: 0;
|
|
}
|
|
pre {
|
|
white-space: pre-wrap;
|
|
}
|
|
|
|
#author {
|
|
color: #527bbd;
|
|
font-weight: bold;
|
|
font-size: 1.1em;
|
|
}
|
|
#email {
|
|
}
|
|
#revnumber, #revdate, #revremark {
|
|
}
|
|
|
|
#footer {
|
|
font-size: small;
|
|
border-top: 2px solid silver;
|
|
padding-top: 0.5em;
|
|
margin-top: 4.0em;
|
|
}
|
|
#footer-text {
|
|
float: left;
|
|
padding-bottom: 0.5em;
|
|
}
|
|
#footer-badges {
|
|
float: right;
|
|
padding-bottom: 0.5em;
|
|
}
|
|
|
|
#preamble {
|
|
margin-top: 1.5em;
|
|
margin-bottom: 1.5em;
|
|
}
|
|
div.imageblock, div.exampleblock, div.verseblock,
|
|
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
|
|
div.admonitionblock {
|
|
margin-top: 1.0em;
|
|
margin-bottom: 1.5em;
|
|
}
|
|
div.admonitionblock {
|
|
margin-top: 2.0em;
|
|
margin-bottom: 2.0em;
|
|
margin-right: 10%;
|
|
color: #606060;
|
|
}
|
|
|
|
div.content { /* Block element content. */
|
|
padding: 0;
|
|
}
|
|
|
|
/* Block element titles. */
|
|
div.title, caption.title {
|
|
color: #527bbd;
|
|
font-weight: bold;
|
|
text-align: left;
|
|
margin-top: 1.0em;
|
|
margin-bottom: 0.5em;
|
|
}
|
|
div.title + * {
|
|
margin-top: 0;
|
|
}
|
|
|
|
td div.title:first-child {
|
|
margin-top: 0.0em;
|
|
}
|
|
div.content div.title:first-child {
|
|
margin-top: 0.0em;
|
|
}
|
|
div.content + div.title {
|
|
margin-top: 0.0em;
|
|
}
|
|
|
|
div.sidebarblock > div.content {
|
|
background: #ffffee;
|
|
border: 1px solid #dddddd;
|
|
border-left: 4px solid #f0f0f0;
|
|
padding: 0.5em;
|
|
}
|
|
|
|
div.listingblock > div.content {
|
|
border: 1px solid #dddddd;
|
|
border-left: 5px solid #f0f0f0;
|
|
background: #f8f8f8;
|
|
padding: 0.5em;
|
|
}
|
|
|
|
div.quoteblock, div.verseblock {
|
|
padding-left: 1.0em;
|
|
margin-left: 1.0em;
|
|
margin-right: 10%;
|
|
border-left: 5px solid #f0f0f0;
|
|
color: #888;
|
|
}
|
|
|
|
div.quoteblock > div.attribution {
|
|
padding-top: 0.5em;
|
|
text-align: right;
|
|
}
|
|
|
|
div.verseblock > pre.content {
|
|
font-family: inherit;
|
|
font-size: inherit;
|
|
}
|
|
div.verseblock > div.attribution {
|
|
padding-top: 0.75em;
|
|
text-align: left;
|
|
}
|
|
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
|
|
div.verseblock + div.attribution {
|
|
text-align: left;
|
|
}
|
|
|
|
div.admonitionblock .icon {
|
|
vertical-align: top;
|
|
font-size: 1.1em;
|
|
font-weight: bold;
|
|
text-decoration: underline;
|
|
color: #527bbd;
|
|
padding-right: 0.5em;
|
|
}
|
|
div.admonitionblock td.content {
|
|
padding-left: 0.5em;
|
|
border-left: 3px solid #dddddd;
|
|
}
|
|
|
|
div.exampleblock > div.content {
|
|
border-left: 3px solid #dddddd;
|
|
padding-left: 0.5em;
|
|
}
|
|
|
|
div.imageblock div.content { padding-left: 0; }
|
|
span.image img { border-style: none; vertical-align: text-bottom; }
|
|
a.image:visited { color: white; }
|
|
|
|
dl {
|
|
margin-top: 0.8em;
|
|
margin-bottom: 0.8em;
|
|
}
|
|
dt {
|
|
margin-top: 0.5em;
|
|
margin-bottom: 0;
|
|
font-style: normal;
|
|
color: navy;
|
|
}
|
|
dd > *:first-child {
|
|
margin-top: 0.1em;
|
|
}
|
|
|
|
ul, ol {
|
|
list-style-position: outside;
|
|
}
|
|
ol.arabic {
|
|
list-style-type: decimal;
|
|
}
|
|
ol.loweralpha {
|
|
list-style-type: lower-alpha;
|
|
}
|
|
ol.upperalpha {
|
|
list-style-type: upper-alpha;
|
|
}
|
|
ol.lowerroman {
|
|
list-style-type: lower-roman;
|
|
}
|
|
ol.upperroman {
|
|
list-style-type: upper-roman;
|
|
}
|
|
|
|
div.compact ul, div.compact ol,
|
|
div.compact p, div.compact p,
|
|
div.compact div, div.compact div {
|
|
margin-top: 0.1em;
|
|
margin-bottom: 0.1em;
|
|
}
|
|
|
|
tfoot {
|
|
font-weight: bold;
|
|
}
|
|
td > div.verse {
|
|
white-space: pre;
|
|
}
|
|
|
|
div.hdlist {
|
|
margin-top: 0.8em;
|
|
margin-bottom: 0.8em;
|
|
}
|
|
div.hdlist tr {
|
|
padding-bottom: 15px;
|
|
}
|
|
dt.hdlist1.strong, td.hdlist1.strong {
|
|
font-weight: bold;
|
|
}
|
|
td.hdlist1 {
|
|
vertical-align: top;
|
|
font-style: normal;
|
|
padding-right: 0.8em;
|
|
color: navy;
|
|
}
|
|
td.hdlist2 {
|
|
vertical-align: top;
|
|
}
|
|
div.hdlist.compact tr {
|
|
margin: 0;
|
|
padding-bottom: 0;
|
|
}
|
|
|
|
.comment {
|
|
background: yellow;
|
|
}
|
|
|
|
.footnote, .footnoteref {
|
|
font-size: 0.8em;
|
|
}
|
|
|
|
span.footnote, span.footnoteref {
|
|
vertical-align: super;
|
|
}
|
|
|
|
#footnotes {
|
|
margin: 20px 0 20px 0;
|
|
padding: 7px 0 0 0;
|
|
}
|
|
|
|
#footnotes div.footnote {
|
|
margin: 0 0 5px 0;
|
|
}
|
|
|
|
#footnotes hr {
|
|
border: none;
|
|
border-top: 1px solid silver;
|
|
height: 1px;
|
|
text-align: left;
|
|
margin-left: 0;
|
|
width: 20%;
|
|
min-width: 100px;
|
|
}
|
|
|
|
div.colist td {
|
|
padding-right: 0.5em;
|
|
padding-bottom: 0.3em;
|
|
vertical-align: top;
|
|
}
|
|
div.colist td img {
|
|
margin-top: 0.3em;
|
|
}
|
|
|
|
@media print {
|
|
#footer-badges { display: none; }
|
|
}
|
|
|
|
#toc {
|
|
margin-bottom: 2.5em;
|
|
}
|
|
|
|
#toctitle {
|
|
color: #527bbd;
|
|
font-size: 1.1em;
|
|
font-weight: bold;
|
|
margin-top: 1.0em;
|
|
margin-bottom: 0.1em;
|
|
}
|
|
|
|
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
|
|
margin-top: 0;
|
|
margin-bottom: 0;
|
|
}
|
|
div.toclevel2 {
|
|
margin-left: 2em;
|
|
font-size: 0.9em;
|
|
}
|
|
div.toclevel3 {
|
|
margin-left: 4em;
|
|
font-size: 0.9em;
|
|
}
|
|
div.toclevel4 {
|
|
margin-left: 6em;
|
|
font-size: 0.9em;
|
|
}
|
|
|
|
span.aqua { color: aqua; }
|
|
span.black { color: black; }
|
|
span.blue { color: blue; }
|
|
span.fuchsia { color: fuchsia; }
|
|
span.gray { color: gray; }
|
|
span.green { color: green; }
|
|
span.lime { color: lime; }
|
|
span.maroon { color: maroon; }
|
|
span.navy { color: navy; }
|
|
span.olive { color: olive; }
|
|
span.purple { color: purple; }
|
|
span.red { color: red; }
|
|
span.silver { color: silver; }
|
|
span.teal { color: teal; }
|
|
span.white { color: white; }
|
|
span.yellow { color: yellow; }
|
|
|
|
span.aqua-background { background: aqua; }
|
|
span.black-background { background: black; }
|
|
span.blue-background { background: blue; }
|
|
span.fuchsia-background { background: fuchsia; }
|
|
span.gray-background { background: gray; }
|
|
span.green-background { background: green; }
|
|
span.lime-background { background: lime; }
|
|
span.maroon-background { background: maroon; }
|
|
span.navy-background { background: navy; }
|
|
span.olive-background { background: olive; }
|
|
span.purple-background { background: purple; }
|
|
span.red-background { background: red; }
|
|
span.silver-background { background: silver; }
|
|
span.teal-background { background: teal; }
|
|
span.white-background { background: white; }
|
|
span.yellow-background { background: yellow; }
|
|
|
|
span.big { font-size: 2em; }
|
|
span.small { font-size: 0.6em; }
|
|
|
|
span.underline { text-decoration: underline; }
|
|
span.overline { text-decoration: overline; }
|
|
span.line-through { text-decoration: line-through; }
|
|
|
|
div.unbreakable { page-break-inside: avoid; }
|
|
|
|
|
|
/*
|
|
* xhtml11 specific
|
|
*
|
|
* */
|
|
|
|
div.tableblock {
|
|
margin-top: 1.0em;
|
|
margin-bottom: 1.5em;
|
|
}
|
|
div.tableblock > table {
|
|
border: 3px solid #527bbd;
|
|
}
|
|
thead, p.table.header {
|
|
font-weight: bold;
|
|
color: #527bbd;
|
|
}
|
|
p.table {
|
|
margin-top: 0;
|
|
}
|
|
/* Because the table frame attribute is overriden by CSS in most browsers. */
|
|
div.tableblock > table[frame="void"] {
|
|
border-style: none;
|
|
}
|
|
div.tableblock > table[frame="hsides"] {
|
|
border-left-style: none;
|
|
border-right-style: none;
|
|
}
|
|
div.tableblock > table[frame="vsides"] {
|
|
border-top-style: none;
|
|
border-bottom-style: none;
|
|
}
|
|
|
|
|
|
/*
|
|
* html5 specific
|
|
*
|
|
* */
|
|
|
|
table.tableblock {
|
|
margin-top: 1.0em;
|
|
margin-bottom: 1.5em;
|
|
}
|
|
thead, p.tableblock.header {
|
|
font-weight: bold;
|
|
color: #527bbd;
|
|
}
|
|
p.tableblock {
|
|
margin-top: 0;
|
|
}
|
|
table.tableblock {
|
|
border-width: 3px;
|
|
border-spacing: 0px;
|
|
border-style: solid;
|
|
border-color: #527bbd;
|
|
border-collapse: collapse;
|
|
}
|
|
th.tableblock, td.tableblock {
|
|
border-width: 1px;
|
|
padding: 4px;
|
|
border-style: solid;
|
|
border-color: #527bbd;
|
|
}
|
|
|
|
table.tableblock.frame-topbot {
|
|
border-left-style: hidden;
|
|
border-right-style: hidden;
|
|
}
|
|
table.tableblock.frame-sides {
|
|
border-top-style: hidden;
|
|
border-bottom-style: hidden;
|
|
}
|
|
table.tableblock.frame-none {
|
|
border-style: hidden;
|
|
}
|
|
|
|
th.tableblock.halign-left, td.tableblock.halign-left {
|
|
text-align: left;
|
|
}
|
|
th.tableblock.halign-center, td.tableblock.halign-center {
|
|
text-align: center;
|
|
}
|
|
th.tableblock.halign-right, td.tableblock.halign-right {
|
|
text-align: right;
|
|
}
|
|
|
|
th.tableblock.valign-top, td.tableblock.valign-top {
|
|
vertical-align: top;
|
|
}
|
|
th.tableblock.valign-middle, td.tableblock.valign-middle {
|
|
vertical-align: middle;
|
|
}
|
|
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
|
|
vertical-align: bottom;
|
|
}
|
|
|
|
|
|
/*
|
|
* manpage specific
|
|
*
|
|
* */
|
|
|
|
body.manpage h1 {
|
|
padding-top: 0.5em;
|
|
padding-bottom: 0.5em;
|
|
border-top: 2px solid silver;
|
|
border-bottom: 2px solid silver;
|
|
}
|
|
body.manpage h2 {
|
|
border-style: none;
|
|
}
|
|
body.manpage div.sectionbody {
|
|
margin-left: 3em;
|
|
}
|
|
|
|
@media print {
|
|
body.manpage div#toc { display: none; }
|
|
}
|
|
|
|
|
|
</style>
|
|
<script type="text/javascript">
|
|
/*<+'])');
|
|
// Function that scans the DOM tree for header elements (the DOM2
|
|
// nodeIterator API would be a better technique but not supported by all
|
|
// browsers).
|
|
var iterate = function (el) {
|
|
for (var i = el.firstChild; i != null; i = i.nextSibling) {
|
|
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
|
|
var mo = re.exec(i.tagName);
|
|
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
|
|
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
|
|
}
|
|
iterate(i);
|
|
}
|
|
}
|
|
}
|
|
iterate(el);
|
|
return result;
|
|
}
|
|
|
|
var toc = document.getElementById("toc");
|
|
if (!toc) {
|
|
return;
|
|
}
|
|
|
|
// Delete existing TOC entries in case we're reloading the TOC.
|
|
var tocEntriesToRemove = [];
|
|
var i;
|
|
for (i = 0; i < toc.childNodes.length; i++) {
|
|
var entry = toc.childNodes[i];
|
|
if (entry.nodeName.toLowerCase() == 'div'
|
|
&& entry.getAttribute("class")
|
|
&& entry.getAttribute("class").match(/^toclevel/))
|
|
tocEntriesToRemove.push(entry);
|
|
}
|
|
for (i = 0; i < tocEntriesToRemove.length; i++) {
|
|
toc.removeChild(tocEntriesToRemove[i]);
|
|
}
|
|
|
|
// Rebuild TOC entries.
|
|
var entries = tocEntries(document.getElementById("content"), toclevels);
|
|
for (var i = 0; i < entries.length; ++i) {
|
|
var entry = entries[i];
|
|
if (entry.element.id == "")
|
|
entry.element.id = "_toc_" + i;
|
|
var a = document.createElement("a");
|
|
a.href = "#" + entry.element.id;
|
|
a.appendChild(document.createTextNode(entry.text));
|
|
var div = document.createElement("div");
|
|
div.appendChild(a);
|
|
div.className = "toclevel" + entry.toclevel;
|
|
toc.appendChild(div);
|
|
}
|
|
if (entries.length == 0)
|
|
toc.parentNode.removeChild(toc);
|
|
},
|
|
|
|
|
|
/////////////////////////////////////////////////////////////////////
|
|
// Footnotes generator
|
|
/////////////////////////////////////////////////////////////////////
|
|
|
|
/* Based on footnote generation code from:
|
|
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
|
|
*/
|
|
|
|
footnotes: function () {
|
|
// Delete existing footnote entries in case we're reloading the footnodes.
|
|
var i;
|
|
var noteholder = document.getElementById("footnotes");
|
|
if (!noteholder) {
|
|
return;
|
|
}
|
|
var entriesToRemove = [];
|
|
for (i = 0; i < noteholder.childNodes.length; i++) {
|
|
var entry = noteholder.childNodes[i];
|
|
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
|
|
entriesToRemove.push(entry);
|
|
}
|
|
for (i = 0; i < entriesToRemove.length; i++) {
|
|
noteholder.removeChild(entriesToRemove[i]);
|
|
}
|
|
|
|
// Rebuild footnote entries.
|
|
var cont = document.getElementById("content");
|
|
var spans = cont.getElementsByTagName("span");
|
|
var refs = {};
|
|
var n = 0;
|
|
for (i=0; i<spans.length; i++) {
|
|
if (spans[i].className == "footnote") {
|
|
n++;
|
|
var note = spans[i].getAttribute("data-note");
|
|
if (!note) {
|
|
// Use [\s\S] in place of . so multi-line matches work.
|
|
// Because JavaScript has no s (dotall) regex flag.
|
|
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
|
|
spans[i].innerHTML =
|
|
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
|
|
"' title='View footnote' class='footnote'>" + n + "</a>]";
|
|
spans[i].setAttribute("data-note", note);
|
|
}
|
|
noteholder.innerHTML +=
|
|
"<div class='footnote' id='_footnote_" + n + "'>" +
|
|
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
|
|
n + "</a>. " + note + "</div>";
|
|
var id =spans[i].getAttribute("id");
|
|
if (id != null) refs["#"+id] = n;
|
|
}
|
|
}
|
|
if (n == 0)
|
|
noteholder.parentNode.removeChild(noteholder);
|
|
else {
|
|
// Process footnoterefs.
|
|
for (i=0; i<spans.length; i++) {
|
|
if (spans[i].className == "footnoteref") {
|
|
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
|
|
href = href.match(/#.*/)[0]; // Because IE return full URL.
|
|
n = refs[href];
|
|
spans[i].innerHTML =
|
|
"[<a href='#_footnote_" + n +
|
|
"' title='View footnote' class='footnote'>" + n + "</a>]";
|
|
}
|
|
}
|
|
}
|
|
},
|
|
|
|
install: function(toclevels) {
|
|
var timerId;
|
|
|
|
function reinstall() {
|
|
asciidoc.footnotes();
|
|
if (toclevels) {
|
|
asciidoc.toc(toclevels);
|
|
}
|
|
}
|
|
|
|
function reinstallAndRemoveTimer() {
|
|
clearInterval(timerId);
|
|
reinstall();
|
|
}
|
|
|
|
timerId = setInterval(reinstall, 500);
|
|
if (document.addEventListener)
|
|
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
|
|
else
|
|
window.onload = reinstallAndRemoveTimer;
|
|
}
|
|
|
|
}
|
|
asciidoc.install();
|
|
/*]]>*/
|
|
</script>
|
|
</head>
|
|
<body class="article">
|
|
<div id="header">
|
|
<h1>Converting Recoll indexing to multithreading</h1>
|
|
<span id="author">Jean-François Dockès</span><br />
|
|
<span id="email"><code><<a href="mailto:jfd@recoll.org">jfd@recoll.org</a>></code></span><br />
|
|
<span id="revdate">2012-12-03</span>
|
|
</div>
|
|
<div id="content">
|
|
<div class="sect1">
|
|
<h2 id="_abstract">Abstract</h2>
|
|
<div class="sectionbody">
|
|
<div class="paragraph"><p>This relates lessons learned while modifying <strong>Recoll</strong> indexing to be
|
|
multithreaded. I am by no means a threaded applications expert, so that a
|
|
few of the observations I made whole doing this may be of use to other
|
|
novices.</p></div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_introduction">Introduction</h2>
|
|
<div class="sectionbody">
|
|
<div class="paragraph"><p><a href="http://www.recoll.org"><strong>Recoll</strong></a> is a document indexing application, it
|
|
allows you to find documents by specifying search terms.</p></div>
|
|
<div class="paragraph"><p>The documents need to be <em>indexed</em> for searches to be fast. In a nutshell,
|
|
we convert the different document formats to text, then split the text into
|
|
terms and remember where those occur. This is a time-consuming operation.</p></div>
|
|
<div class="paragraph"><p>Up to version 1.18 <strong>Recoll</strong> indexing is single-threaded: routines which
|
|
call each other sequentially.</p></div>
|
|
<div class="paragraph"><p>In most personal indexer contexts, it is also CPU-bound. There is a lot of
|
|
conversion work necessary for turning those PDF (or other) files into
|
|
appropriately cleaned up pure text, then split it into terms and update the
|
|
index. Given the relatively modest amount of data, and the speed of
|
|
storage, I/O issues are secondary.</p></div>
|
|
<div class="paragraph"><p>Looking at the <em>CPU idle</em> <strong>top</strong> output stuck at 75% on my quad-core CPU,
|
|
while waiting for the indexing to finish, was frustrating, and I was
|
|
tempted to find a way to keep those other cores at temperature and shorten
|
|
the waiting.</p></div>
|
|
<div class="paragraph"><p>For some usages, the best way to accomplish this may be to just partition
|
|
the index and independantly start indexing on different configurations,
|
|
using multiple processes to better utilize the available processing power.</p></div>
|
|
<div class="paragraph"><p>This is not an universal solution though, as it is complicated to set up,
|
|
not optimal in general for indexing performance, and not always optimal
|
|
either at query time.</p></div>
|
|
<div class="paragraph"><p>The most natural way to improve indexing times is to increase CPU
|
|
utilization by using multiple threads inside an indexing process.</p></div>
|
|
<div class="paragraph"><p>Something similar had been done with earlier versions of the <strong>Recoll</strong> GUI,
|
|
which had an internal indexing thread. This had been a frequent source of
|
|
trouble though, and linking the GUI and indexing process lifetimes was a
|
|
bad idea, so, in recent versions, the indexing is always performed by an
|
|
external process. Still, this experience had put in light most of the
|
|
problem areas, and prepared the code for further work.</p></div>
|
|
<div class="paragraph"><p>It should be noted that, as <code>recollindex</code> is both <em>nice</em>'d and <em>ionice</em>'d
|
|
as a lowest priority process, it will only use free computing power on the
|
|
machine, and will step down as soon as anything else wants to work.</p></div>
|
|
<div class="sidebarblock">
|
|
<div class="content">
|
|
<div class="paragraph"><p>The only case where you may notice that the indexing is at work
|
|
is when the machine is short on memory and things (such as
|
|
your Web browser) get swapped-out while you are not actively using
|
|
them. You then notice a long delay when you want to start, because they
|
|
need to be swapped back in. There is little which can be done about
|
|
this. Setting <em>idxflushmb</em> to a low value may help in some cases (depending
|
|
on the document sizes). May I also suggest in this case that, if your
|
|
machine can take more memory, it may be a good idea to procure some, as
|
|
memory is nowadays quite cheap, and memory-starved machines are not fun.</p></div>
|
|
</div></div>
|
|
<div class="paragraph"><p>In general, augmenting the machine utilisation by <code>recollindex</code> just does
|
|
not change its responsiveness. My PC has a an Intel Pentium Core i5 750 (4
|
|
cores, no hyperthreading), which is far from being a high performance CPU
|
|
(nowadays…), and I often forget that I am running indexing tests, it is
|
|
just not noticeable. The machine does have a lot of memory though (12GB).</p></div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_the_recoll_indexing_processing_flow">The Recoll indexing processing flow</h2>
|
|
<div class="sectionbody">
|
|
<div class="imageblock" style="float:right;">
|
|
<div class="content">
|
|
<img src="nothreads.png" alt="Basic flow" />
|
|
</div>
|
|
</div>
|
|
<div class="paragraph"><p>There are 4 main steps in the <code>recollindex</code> processing pipeline:</p></div>
|
|
<div class="olist arabic"><ol class="arabic">
|
|
<li>
|
|
<p>
|
|
Find the file
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Convert it to text
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Process the text (split, strip etc.) and create a <strong>Xapian</strong> document
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Update the index
|
|
</p>
|
|
</li>
|
|
</ol></div>
|
|
<div class="paragraph"><p>The first step, walking the file system (or some other data source), is
|
|
usually much faster than the others, and we just leave it alone to be
|
|
performed by the main thread. It outputs file names (and the associated
|
|
<strong>POSIX</strong> <em>stat</em> data).</p></div>
|
|
<div class="paragraph"><p>The last step, <strong>Xapian</strong> index updating, can only be single-threaded.</p></div>
|
|
<div class="paragraph"><p>The first idea is to change the indexing pipeline so that each step is
|
|
performed by an independant worker thread, passing its output to the next
|
|
thread, in assembly-line fashion.</p></div>
|
|
<div class="paragraph"><p>In order to achieve this, we need to decouple the different phases. They
|
|
are normally linked by procedure calls, which we replace with a job
|
|
control object: the <em>WorkQueue</em>.</p></div>
|
|
<div class="sect2">
|
|
<h3 id="_the_workqueue">The WorkQueue</h3>
|
|
<div class="paragraph"><p>The <em>WorkQueue</em> object is implemented by a reasonably simple class, which
|
|
manages an input queue on which client append jobs, and a set of worker
|
|
threads, which retrieve and perform the jobs, and whose lifetime are
|
|
managed by the <em>WorkQueue</em> object. The
|
|
<a href="https://bitbucket.org/medoc/recoll/src/f06f3aba912045d6ad52e9a0fd930b95e363fd10/src/utils/workqueue.h?at=default">implementation</a> is straightforward with
|
|
<strong>POSIX</strong> threads synchronization functions and C++ <strong>STL</strong> data structures.</p></div>
|
|
<div class="paragraph"><p>In practise it proved quite simple to modify existing code to create a job
|
|
object and put it on the queue, instead of calling the downstream routine
|
|
with the job parameters, <em>while keeping the capacity to call the downstream
|
|
routine directly</em>. The kind of coupling is determined either by compilation
|
|
flags (for global disabling/enabling of multithreading), or according to
|
|
configuration data, which allows experimenting with different threads
|
|
arrangements just by changing parameters in a file, without recompiling.</p></div>
|
|
<div class="paragraph"><p>Each <em>WorkQueue</em> accepts two parameters: the length of the input queue
|
|
(before a client will block when trying to add a job), and the number of
|
|
worker threads. Both parameters can be set in the <strong>Recoll</strong> configuration
|
|
file for each of the three queues used in the indexing pipeline. Setting
|
|
the queue length to -1 will disable the corresponding queue (using a direct
|
|
call instead).</p></div>
|
|
<div style="clear:both;"></div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_the_assembly_line">The Assembly Line</h2>
|
|
<div class="sectionbody">
|
|
<div class="imageblock" style="float:right;">
|
|
<div class="content">
|
|
<img src="assembly.png" alt="Assembly line" />
|
|
</div>
|
|
</div>
|
|
<div class="paragraph"><p>So the first idea is to create 3 explicit threads to manage the file
|
|
conversion, the term generation, and the <strong>Xapian</strong> index update. The first
|
|
thread prepares a file, passes it on to the term generation thread, and
|
|
immediately goes back to work on the next file, etc.</p></div>
|
|
<div class="paragraph"><p>The presumed advantage of this method is that the different stages, which
|
|
perform disjointed processing, should share little, so that we can hope to
|
|
minimize the changes necessitated by the threads interactions.</p></div>
|
|
<div class="paragraph"><p>However some changes to the code were needed to make this work (and a few
|
|
bugs were missed, which only became apparent at later stages, confirming
|
|
that the <em>low interaction</em> idea was not completely false).</p></div>
|
|
<div class="sect2">
|
|
<h3 id="_converting_to_multithreading_what_to_look_for">Converting to multithreading: what to look for</h3>
|
|
<div class="paragraph"><p>I am probably stating the obvious here, but when preparing a program for
|
|
multi-threading, problems can only arise where non-constant data is
|
|
accessed by different threads.</p></div>
|
|
<div class="paragraph"><p>Once you have solved the core problems posed by the obvious data that needs
|
|
to be shared, you will be left to deal with less obvious, hidden,
|
|
interactions inside the program.</p></div>
|
|
<div class="paragraph"><p>Classically this would concern global or static data, but in a C++ program,
|
|
class members will be a concern if a single object can be accessed by
|
|
several threads.</p></div>
|
|
<div class="paragraph"><p>Hunting for static data inside a program of non trivial size is not always
|
|
obvious. Two approaches can be used: hunting for the <em>static</em> keyword in
|
|
source code, or looking at global and static data symbols in <strong>nm</strong> output.</p></div>
|
|
<div class="paragraph"><p>Once found, there are mostly three types of static/global data:</p></div>
|
|
<div class="ulist"><ul>
|
|
<li>
|
|
<p>
|
|
Things that need to be eliminated: for example, routines can be made
|
|
reentrant by letting the caller supply a storage buffer instead of using
|
|
an internal static one (which was a bad idea in the first place
|
|
anyway).
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Things that need to be protected: sometimes, the best approach is just
|
|
to protect the access with a mutex lock. It is trivial to encapsulate
|
|
the locks in C++ objects to use the "Resource Acquisition is
|
|
Initialization" idiom, easily making sure that locks are freed when
|
|
exiting the critical section. A very basic
|
|
<a href="https://bitbucket.org/medoc/recoll/src/f06f3aba9120/src/utils/ptmutex.h?at=default">example of implementation</a>
|
|
can be found in the <strong>Recoll</strong> source code.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Things which can stay: this is mostly initialization data such as value
|
|
tables which are computed once, and then stay logically constant during
|
|
program execution. In order to be sure of a correct single-threaded
|
|
initialization, it is best to explicitly initialize the modules or
|
|
functions that use this kind of data in the main thread when the program
|
|
starts.
|
|
</p>
|
|
</li>
|
|
</ul></div>
|
|
</div>
|
|
<div class="sect2">
|
|
<h3 id="_assembly_line_approach_the_results">Assembly line approach: the results</h3>
|
|
<div class="paragraph"><p>Unfortunately, the assembly line approach yields very modest improvements
|
|
when used inside <strong>Recoll</strong> indexing. The reason, is that this method needs
|
|
stages of equivalent complexity to be efficient. If one of the stages
|
|
dominates the others, its thread will be the only one active at any time,
|
|
and little will be gained.</p></div>
|
|
<div class="paragraph"><p>What is especially problematic is that the balance between tasks need not
|
|
only exist on average, but also for the majority of individual jobs.</p></div>
|
|
<div class="paragraph"><p>For <strong>Recoll</strong> indexing, even if the data preparation and index update steps
|
|
are often of the same order of magnitude <em>on average</em>, their balance
|
|
depends a lot on the kind of data being processed, so that things are
|
|
usually unbalanced at any given time: the index update thread is mostly
|
|
idle while processing PDF files, and the data preparation has little to do
|
|
when working on HTML or plain text.</p></div>
|
|
<div class="paragraph"><p>In practice, very modest indexing time improvements from 5% to 15% were
|
|
achieved with this method.</p></div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_the_next_step_multi_stage_parallelism">The next step: multi-stage parallelism</h2>
|
|
<div class="sectionbody">
|
|
<div class="imageblock" style="float:right;">
|
|
<div class="content">
|
|
<img src="multipara.png" alt="Multi-stage parallelism" />
|
|
</div>
|
|
</div>
|
|
<div class="paragraph"><p>Given the limitations of the assembly line approach, the next step in the
|
|
transformation of <strong>Recoll</strong> indexing was to enable full parallelism wherever
|
|
possible.</p></div>
|
|
<div class="paragraph"><p>Of the four processing steps (see figures), two are not candidates for
|
|
parallelization:</p></div>
|
|
<div class="ulist"><ul>
|
|
<li>
|
|
<p>
|
|
File system walking is so fast compared to the other steps that using
|
|
several threads would make no sense (it would also quite probably become
|
|
IO bound if we tried anyway).
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The <strong>Xapian</strong> library index updating code is not designed for
|
|
multi-threading and must stay protected from multiple accesses.
|
|
</p>
|
|
</li>
|
|
</ul></div>
|
|
<div class="paragraph"><p>The two other steps are good candidates.</p></div>
|
|
<div class="paragraph"><p>Most of the work to make <strong>Recoll</strong> code reentrant had been performed for the
|
|
previous transformation. Going full-parallel only implied protecting the
|
|
data structures that needed to be shared by the threads performing a given
|
|
processing step.</p></div>
|
|
<div class="paragraph"><p>Just for the anecdotic value, a list of the elements that needed mutexes:</p></div>
|
|
<div class="ulist"><ul>
|
|
<li>
|
|
<p>
|
|
Filter subprocesses cache: some file conversion subprocesses may be
|
|
expensive (starting a Python process is no piece of cake), so they are
|
|
cached for reuse after they are done translating a file. The shared cache
|
|
needs protection.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Status updates: an object used to update the current file name and indexing
|
|
status to a shared file.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Missing store: the list of missing helper programs
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The readonly <strong>Xapian</strong> database object: a Xapian::Database object which is
|
|
used for checking the validity of current index data against a file’s
|
|
last modification date.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Document existence map: a bit array used to store an existence bit about
|
|
every document, and purge the disappeared at the end of the indexing
|
|
pass. This is accessed both from the file conversion and database update
|
|
code, so it also needed protection in the previous assembly line
|
|
approach.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Mbox offsets cache. Used to store the offsets of individual messages
|
|
inside <strong>mbox</strong> files.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
<strong>iconv</strong> control blocks: these are cached for reuse in several places, and
|
|
need protection. Actually, it might be better in multithreading context
|
|
to just suppress the reuse and locking. Rough tests seem to indicate that
|
|
the impact on overall performance is small, but this might change with
|
|
higher parallelism (or not…).
|
|
</p>
|
|
</li>
|
|
</ul></div>
|
|
<div class="paragraph"><p>The <strong>Recoll</strong> configuration also used to be managed by a single shared
|
|
object, which is mutable as values may depend on what area of the
|
|
file-system we are exploring, so that the object is stateful and updated as
|
|
we change directories. The choice made here was to duplicate the object
|
|
where needed (each indexing thread gets its own). This gave rise to the
|
|
sneakiest bug in the whole transformation (see further down).</p></div>
|
|
<div class="paragraph"><p>Having a dynamic way to define the threads configuration makes it easy to
|
|
experiment. For example, the following data defines the configuration that
|
|
was finally found to be best overall on my hardware:</p></div>
|
|
<div class="literalblock">
|
|
<div class="content">
|
|
<pre><code>thrQSizes = 2 2 2
|
|
thrTCounts = 4 2 1</code></pre>
|
|
</div></div>
|
|
<div class="paragraph"><p>This is using 3 queues of depth 2, 4 threads working on file conversion, 2
|
|
on text splitting and other document processing, and 1 on Xapian updating
|
|
(no choice here).</p></div>
|
|
<div style="clear:both;"></div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_bench_results">Bench results</h2>
|
|
<div class="sectionbody">
|
|
<div class="paragraph"><p>So the big question after all the work: was it worth it ? I could only get
|
|
a real answer when the program stopped crashing, so this took some time and
|
|
a little faith, but the answer is positive, as far as I’m
|
|
concerned. Performance has improved significantly and this was a fun
|
|
project.</p></div>
|
|
<div class="tableblock">
|
|
<table rules="all"
|
|
width="70%"
|
|
frame="border"
|
|
cellspacing="0" cellpadding="4">
|
|
<caption class="title">Table 1. Results on a variety of file system areas:</caption>
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<thead>
|
|
<tr>
|
|
<th align="left" valign="top">Area </th>
|
|
<th align="left" valign="top">Seconds before </th>
|
|
<th align="left" valign="top">Seconds after</th>
|
|
<th align="left" valign="top"> Percent Improvement</th>
|
|
<th align="left" valign="top"> Speed Factor</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">home</p></td>
|
|
<td align="left" valign="top"><p class="table">12742</p></td>
|
|
<td align="left" valign="top"><p class="table">6942</p></td>
|
|
<td align="left" valign="top"><p class="table">46%</p></td>
|
|
<td align="left" valign="top"><p class="table">1.8</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">mail</p></td>
|
|
<td align="left" valign="top"><p class="table">2700</p></td>
|
|
<td align="left" valign="top"><p class="table">1563</p></td>
|
|
<td align="left" valign="top"><p class="table">58%</p></td>
|
|
<td align="left" valign="top"><p class="table">1.7</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">projets</p></td>
|
|
<td align="left" valign="top"><p class="table">5022</p></td>
|
|
<td align="left" valign="top"><p class="table">1970</p></td>
|
|
<td align="left" valign="top"><p class="table">61%</p></td>
|
|
<td align="left" valign="top"><p class="table">2.5</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">pdf</p></td>
|
|
<td align="left" valign="top"><p class="table">2164</p></td>
|
|
<td align="left" valign="top"><p class="table">770</p></td>
|
|
<td align="left" valign="top"><p class="table">64%</p></td>
|
|
<td align="left" valign="top"><p class="table">2.8</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">otherhtml</p></td>
|
|
<td align="left" valign="top"><p class="table">5593</p></td>
|
|
<td align="left" valign="top"><p class="table">4014</p></td>
|
|
<td align="left" valign="top"><p class="table">28%</p></td>
|
|
<td align="left" valign="top"><p class="table">1.4</p></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
<div class="tableblock">
|
|
<table rules="all"
|
|
width="70%"
|
|
frame="border"
|
|
cellspacing="0" cellpadding="4">
|
|
<caption class="title">Table 2. Characteristics of the data</caption>
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<col width="20%" />
|
|
<thead>
|
|
<tr>
|
|
<th align="left" valign="top">Area </th>
|
|
<th align="left" valign="top"> Files MB </th>
|
|
<th align="left" valign="top"> Files </th>
|
|
<th align="left" valign="top"> DB MB </th>
|
|
<th align="left" valign="top"> Documents</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">home</p></td>
|
|
<td align="left" valign="top"><p class="table">64106</p></td>
|
|
<td align="left" valign="top"><p class="table">44897</p></td>
|
|
<td align="left" valign="top"><p class="table">1197</p></td>
|
|
<td align="left" valign="top"><p class="table">104797</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">mail</p></td>
|
|
<td align="left" valign="top"><p class="table">813</p></td>
|
|
<td align="left" valign="top"><p class="table">232</p></td>
|
|
<td align="left" valign="top"><p class="table">663</p></td>
|
|
<td align="left" valign="top"><p class="table">47267</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">projets</p></td>
|
|
<td align="left" valign="top"><p class="table">2056</p></td>
|
|
<td align="left" valign="top"><p class="table">34504</p></td>
|
|
<td align="left" valign="top"><p class="table">549</p></td>
|
|
<td align="left" valign="top"><p class="table">40281</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">pdf</p></td>
|
|
<td align="left" valign="top"><p class="table">1123</p></td>
|
|
<td align="left" valign="top"><p class="table">1139</p></td>
|
|
<td align="left" valign="top"><p class="table">111</p></td>
|
|
<td align="left" valign="top"><p class="table">1139</p></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left" valign="top"><p class="table">otherhtml</p></td>
|
|
<td align="left" valign="top"><p class="table">3442</p></td>
|
|
<td align="left" valign="top"><p class="table">223007</p></td>
|
|
<td align="left" valign="top"><p class="table">2080</p></td>
|
|
<td align="left" valign="top"><p class="table">221890</p></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
<div class="paragraph"><p><em>home</em> is my home directory. The high megabyte value is due to a number of
|
|
very big and not indexed <strong>VirtualBox</strong> images. Otherwise, it’s a wide
|
|
mix of source files, email, miscellaneous documents and ebooks.</p></div>
|
|
<div class="paragraph"><p><em>mail</em> is my mail directory, full of <strong>mbox</strong> files.</p></div>
|
|
<div class="paragraph"><p><em>projets</em> mostly holds source files, and a number of documents.</p></div>
|
|
<div class="paragraph"><p><em>pdf</em> holds random <strong>pdf</strong> files harvested on the internets. The performance
|
|
is quite spectacular, because most of the processing time goes to
|
|
converting them to text, and this is done in parallel. Probably could be
|
|
made a bit faster with more cores, until we hit the <strong>Xapian</strong> update speed
|
|
limit.</p></div>
|
|
<div class="paragraph"><p><em>otherhtml</em> holds myriad of small html files, mostly from
|
|
<strong>wikipedia</strong>. The improvement is not great here because a lot of time is
|
|
spent in the single-threaded <strong>Xapian</strong> index update.</p></div>
|
|
<div class="paragraph"><p>The tests were made with queue depths of 2 on all queues, and 4 threads
|
|
working on the file conversion step, 2 on the term generation.</p></div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_a_variation_linear_parallelism">A variation: linear parallelism</h2>
|
|
<div class="sectionbody">
|
|
<div class="paragraph"><p>Once past the assembly-line idea, another possible transformation would be
|
|
to get rid of the two downstream queues, and just create a job for each
|
|
file and let it go to the end (using a mutex to protect accesses to the
|
|
writable <strong>Xapian</strong> database).</p></div>
|
|
<div class="paragraph"><p>With the current <strong>Recoll</strong> code, this can be defined by the following
|
|
parameters (one can also use a deeper front queue, this changes little):</p></div>
|
|
<div class="literalblock">
|
|
<div class="content">
|
|
<pre><code>thrQSizes = 2 -1 -1
|
|
thrTCounts = 4 0 0</code></pre>
|
|
</div></div>
|
|
<div class="paragraph"><p>In practise, the performance is close to the one for the multistage
|
|
version.</p></div>
|
|
<div class="paragraph"><p>If we were to hard-code this approach, this would be a simpler
|
|
modification, necessitating less changes to the code, but it has a slight
|
|
inconvenient: when working on a single big multi-document file, no
|
|
parallelism at all can be obtained. In this situation, the multi-stage
|
|
approach brings us back to the assembly-line behaviour, so the improvements
|
|
are not great, but they do exist.</p></div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<h2 id="_miscellany">Miscellany</h2>
|
|
<div class="sectionbody">
|
|
<div class="sect2">
|
|
<h3 id="_the_big_gotcha_my_stack_dump_staring_days">The big gotcha: my stack dump staring days</h3>
|
|
<div class="paragraph"><p>Overall, debugging the modified program was reasonably
|
|
straightforward. Data access synchronization issues mostly provoke dynamic
|
|
data corruption, which can be beastly to debug. I was lucky enough that
|
|
most crashes occurred in the code that was actually related to the
|
|
corrupted data, not in some randomly located and unrelated dynamic memory
|
|
user, so that the issues were reasonably easy to find.</p></div>
|
|
<div class="paragraph"><p>One issue though kept me working for a few days. The indexing process kept
|
|
crashing randomly at an interval of a few thousands documents, segfaulting
|
|
on a bad pointer. An access to the configuration data structure seemed to
|
|
be involved, but, as each thread was supposed to have its own copy, I was
|
|
out of ideas.</p></div>
|
|
<div class="paragraph"><p>After reviewing all the uses for the configuration data (there are quite a
|
|
few), the problem was finally revealed to lie with the filter process
|
|
cache. Each filter structure stored in the cache stores a pointer to a
|
|
configuration structure. This belonged to the thread which initially
|
|
created the filter. But the filter would often be reused by a different
|
|
thread, with the consequence that the configuration object was now accessed
|
|
and modified by two unsynchronized threads… Resetting the config pointer
|
|
at the time of filter reuse was the
|
|
<a href="https://bitbucket.org/medoc/recoll/commits/943de4b78818079b0eb6ffd0fcbdfdd0746b4a40">ridiculously
|
|
simple (almost)single-line fix</a> to this evasive problem.</p></div>
|
|
<div class="paragraph"><p>Looking at multi-threaded stack dumps is mostly fun for people with several
|
|
heads, which is unfortunately not my case, so I was quite elated when this
|
|
was over.</p></div>
|
|
</div>
|
|
<div class="sect2">
|
|
<h3 id="_fork_performance_issues">Fork performance issues</h3>
|
|
<div class="paragraph"><p>On a quite unrelated note, something that I discovered while evaluating the
|
|
program performance is that forking a big process like <code>recollindex</code> can be
|
|
quite expensive. Even if the memory space of the forked process is not
|
|
copied (it’s Copy On Write, and we write very little before the following
|
|
exec), just duplicating the memory maps can be slow when the process uses a
|
|
few hundred megabytes.</p></div>
|
|
<div class="paragraph"><p>I modified the single-threaded version of <code>recollindex</code> to use <strong>vfork</strong>
|
|
instead of <strong>fork</strong>, but this can’t be used with multiple threads (no
|
|
modification of the process memory space is allowed in the child between
|
|
<strong>vfork</strong> and <strong>exec</strong>, so we’d have to have a way to suspend all the threads
|
|
first).</p></div>
|
|
<div class="paragraph"><p>I did not implement a solution to this issue, and I don’t think
|
|
that a simple one exists. The workaround is to use modest <strong>Xapian</strong> flush
|
|
values to prevent the process from becoming too big.</p></div>
|
|
<div class="paragraph"><p>A longer time solution would be to implement a small slave process to do
|
|
the executing of ephemeral external commands.</p></div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div id="footnotes"><hr /></div>
|
|
<div id="footer">
|
|
<div id="footer-text">
|
|
Last updated 2016-05-08 08:30:29 CEST
|
|
</div>
|
|
</div>
|
|
</body>
|
|
</html>
|