First of all, GNU mifluz is at alpha stage.
The purpose of GNU mifluz is to provide a C++ library to
build and query a full text inverted index. It is dynamically
updatable, scalable (indexes of up to 1 TB), uses a controlled
amount of memory, shares index files and memory cache among
processes or threads and compresses index files to 50% of the
raw data. The structure of the index is configurable at
runtime and allows inclusion of relevance ranking information.
The query functions do not need to load all the occurrences
of a searched term. They consume very few resources and many
searches can be run in parallel.
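As an illustration of a query that does not load every occurrence of a term, here is a minimal C++ sketch. It is not the mifluz API and does not describe how mifluz works internally: PostingCursor and first_matches are hypothetical names, and an in-memory vector of sorted document ids stands in for what a real index would read from disk. The point is only that two sorted lists can be intersected one entry at a time, and that the walk can stop as soon as enough matches are found.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cursor over the sorted document ids of one term.
// The vector is a stand-in for data a real index would fetch lazily.
class PostingCursor {
public:
    explicit PostingCursor(const std::vector<std::uint32_t>& sorted_ids)
        : ids_(sorted_ids), pos_(0) {}

    bool done() const { return pos_ >= ids_.size(); }
    std::uint32_t current() const { return ids_[pos_]; }
    void advance() { ++pos_; }
    std::size_t consumed() const { return pos_; }  // entries stepped past so far

private:
    const std::vector<std::uint32_t>& ids_;  // must outlive the cursor
    std::size_t pos_;
};

// Collect at most `limit` document ids present in both lists.
// The loop stops as soon as `limit` matches are found or one list ends,
// so the tail of each list is never visited.
std::vector<std::uint32_t> first_matches(PostingCursor& a, PostingCursor& b,
                                         std::size_t limit) {
    std::vector<std::uint32_t> out;
    while (out.size() < limit && !a.done() && !b.done()) {
        if (a.current() < b.current()) {
            a.advance();
        } else if (b.current() < a.current()) {
            b.advance();
        } else {
            out.push_back(a.current());
            a.advance();
            b.advance();
        }
    }
    return out;
}
```

Each cursor keeps only a position, so the memory used by a query in this sketch does not grow with the number of occurrences of the searched terms.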
Implementing a library that manages an inverted index is
a very easy task when there is a small number of words and
documents. It becomes a lot harder when dealing with a large
number of words and documents. GNU mifluz has been
designed with the following upper limits in mind: 500 million
documents, 50 billion words, 20 million document updates per day.
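To give a feel for these limits, the following C++ fragment works out two averages implied by the figures above. The constant names are made up for this illustration, and the averages are plain arithmetic on the stated limits, not measurements of mifluz.

```cpp
#include <cstdint>

// Design limits quoted in the text.
constexpr std::uint64_t kDocuments     = 500ULL * 1000 * 1000;        // 500 million documents
constexpr std::uint64_t kWords         = 50ULL * 1000 * 1000 * 1000;  // 50 billion (giga) words
constexpr std::uint64_t kUpdatesPerDay = 20ULL * 1000 * 1000;         // 20 million updates per day

constexpr std::uint64_t kSecondsPerDay = 24ULL * 60 * 60;             // 86400

// Average document size implied by the limits.
constexpr std::uint64_t kWordsPerDocument = kWords / kDocuments;      // 100

// Sustained update rate implied by the limits (integer division, rounded down).
constexpr std::uint64_t kUpdatesPerSecond = kUpdatesPerDay / kSecondsPerDay;  // 231

static_assert(kWordsPerDocument == 100, "50 billion words over 500 million documents");
static_assert(kUpdatesPerSecond == 231, "20 million updates spread over one day");
```

In other words, the limits correspond to documents of about 100 words on average and to a little over 230 document updates per second, around the clock.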
GNU mifluz has two main characteristics: it is
very simple (one might say stupid :-) and uses 50% of the size
of the indexed text for the index. It is simple because it
provides only a few basic functionalities. It does not contain
document parsers (HTML, PDF, etc.). It does not contain a
full text query parser. It does not provide result display
functions or other user-friendly features. It only provides
functions to store word occurrences and retrieve them. The fact
that it uses 50% of the size of the indexed text is rather
atypical. Most well-known full text indexing systems only use
30%. The advantage GNU mifluz has over most full text
indexing systems is that it is fully dynamic (update, delete,
insert), uses only a controlled amount of memory while
resolving a query, has higher upper limits and has a simple
storage scheme. Consuming more disk space allows all this.
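For readers new to the idea, "storing word occurrences and retrieving them" can be pictured with the toy C++ class below. It is a sketch only, not the mifluz API: ToyIndex and Occurrence are hypothetical names, everything lives in memory, and nothing is compressed or shared between processes. An occurrence is assumed here to be a (document id, position) pair; the text says the real index structure is configurable at runtime.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical occurrence: (document id, position of the word in the document).
using Occurrence = std::pair<std::uint32_t, std::uint32_t>;

// Toy in-memory inverted index, for illustration only.
class ToyIndex {
public:
    // Record that `word` appears in document `doc` at `position`.
    void insert(const std::string& word, std::uint32_t doc, std::uint32_t position) {
        postings_[word].insert(Occurrence(doc, position));
    }

    // Forget one occurrence. Returns true if it was present.
    bool remove(const std::string& word, std::uint32_t doc, std::uint32_t position) {
        auto it = postings_.find(word);
        if (it == postings_.end()) return false;
        const bool erased = it->second.erase(Occurrence(doc, position)) > 0;
        if (it->second.empty()) postings_.erase(it);  // drop words with no occurrences left
        return erased;
    }

    // All occurrences of `word`, ordered by document id, then position.
    std::vector<Occurrence> lookup(const std::string& word) const {
        auto it = postings_.find(word);
        if (it == postings_.end()) return {};
        return std::vector<Occurrence>(it->second.begin(), it->second.end());
    }

private:
    std::map<std::string, std::set<Occurrence>> postings_;
};
```

Updating a document in this model means removing its old occurrences and inserting the new ones. Document parsing, query parsing and result display would all sit on top of such a class, which matches the division of labour described above.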