Node:Top, Next:Introduction, Previous:(dir), Up:(dir)
Mifluz
Mifluz is a full text indexing library.
Node:Introduction,
Next:Architecture,
Previous:Top, Up:Top
Introduction
First of all, mifluz is at beta stage.
This program is part of the GNU project, released under the
aegis of GNU.
The purpose of mifluz is to provide a C++ library
to store a full text inverted index. To put it briefly, it allows
storage of occurrences of words in such a way that they can later
be searched. The basic idea of an inverted index is to associate
each unique word with a list of the documents in which it appears.
This list can then be searched to locate the documents containing a
specific word.
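As a rough illustration of the idea (this is not the mifluz API; the
class name ToyInvertedIndex and its methods are made up for this
sketch), an in-memory inverted index can be written with standard
containers:

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy inverted index: each unique word maps to the sorted set of
// identifiers of the documents in which it appears.
// Hypothetical helper for illustration only, unrelated to mifluz classes.
class ToyInvertedIndex {
public:
    // Split `text` on whitespace and record every word for `doc_id`.
    void add_document(int doc_id, const std::string& text) {
        std::istringstream words(text);
        std::string word;
        while (words >> word) {
            postings_[word].insert(doc_id);
        }
    }

    // Return the documents containing `word` (empty set if unknown).
    std::set<int> lookup(const std::string& word) const {
        auto it = postings_.find(word);
        return it == postings_.end() ? std::set<int>() : it->second;
    }

private:
    std::map<std::string, std::set<int>> postings_;
};
```

Everything here lives in memory, which is exactly what stops working
at the scale discussed next.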
Implementing a library that manages an inverted index is a very
easy task when there is a small number of words and documents. It
becomes a lot harder when dealing with a large number of words and
documents. mifluz has been designed with the following
upper limits in mind: 500 million documents, 100 giga words, 18
million document updates per day. In its present state,
mifluz can store 100 giga words using
600 gigabytes. The best average insertion rate observed as of
today is 4000 keys/sec on a 1 gigabyte index.
mifluz has two main characteristics : it is very
simple (one might say stupidly simple :-) and uses 100% of the size
of the indexed text for the index. It is simple because it provides
only a few basic functions. It does not contain document parsers
(HTML, PDF etc...). It does not contain a full text query parser.
It does not provide result display functions or other user friendly
stuff. It only provides functions to store word occurrences and
retrieve them. The fact that it uses 100% of the size of the
indexed text is rather atypical. Most well known full text indexing
systems only use 30%. The advantage mifluz has over
most full text indexing systems is that it is fully dynamic
(update, delete, insert), uses only a controlled amount of memory
while resolving a query, has higher upper limits and has a simple
storage scheme. This is achieved by consuming more disk space.
Node:Architecture,
Next:Constraints, Previous:Introduction, Up:Top
Architecture
In the following figure you can see the place of
mifluz in a hypothetical full text indexing system.
-
Query
- Resolve full text queries. The optimization makes sure the
least frequent terms are scanned first and that redundant query
specifications are merged together.
Mifluz
- Manage efficient storage of the inverted index permanent
data.
Parser Switch
- Transform raw documents into a list of terms.
Indexer
- Call the Parser Switch to get a list of terms and feed it to
mifluz.
Node:Constraints,
Next:Document name
scheme, Previous:Architecture, Up:Top
Constraints
The following list shows all the constraints imposed by
mifluz. It can also be seen as a list of functions
provided by mifluz that is more general than the API
specification.
-
Now Available
-
- In-place dynamic update of the index.
- Uses an in-memory cache to perform heavy index updates without
stressing the disk too much.
- The library can be linked in a C or C++ application,
dynamically or statically.
- The memory usage is completely controlled. The application can
specify the maximum total memory usage. The application can specify
that the memory cache will be shared among processes.
- The library is thread safe.
Future
-
- Transaction logs for backup recovery.
- Index integrity check and repair function.
- Indexing up to 500 million documents and support up to 18
million document updates per 24h. The average size of a document is
4 kilo bytes and contains 200 indexable words.
Constraints and Limitations
-
- No atomic datum is bigger than a size known in advance. This
postulate is essential for disk storage optimization. If an atomic
datum may have a size of 10 MB, it is impossible to guarantee that a
query/indexing process controls the memory it is using.
An atomic datum is something that must be manipulated as whole,
with no possibility of splitting it into smaller parts. For
instance a posting (Word, document identifier and position) is an
atomic datum: to manipulate it in memory it has to reside
completely in memory. By contrast a postings list is not atomic.
Manipulating a postings list can be done without loading all the
postings list in memory.
- The cost of an update is O(log_m(N)), where m is the average
number of entries in a page and N the total number of pages. This
figure has to be considered when the pages are in memory or on
disk.
- The inverted index data is sorted to fit the most typical
search pattern. The structure of the inverted index key can be
defined at run time to fit a usage pattern.
- No lock mechanism is provided beyond an individual word
occurrence. It is assumed that the library is linked in a central
server that serializes all the requests or in a program that
provides its own lock mechanism.
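To get a feel for the logarithmic update cost quoted in the list
above, the following sketch computes how many levels a tree with
fan-out m needs to address N pages. The function name btree_levels is
invented for this illustration, and the model ignores partially filled
pages.

```cpp
#include <cstdint>

// Smallest number of levels d such that m^d >= n, i.e. the ceiling of
// log base m of n, computed with integers to avoid rounding issues.
// m: average number of entries per page, n: total number of pages.
unsigned btree_levels(std::uint64_t m, std::uint64_t n) {
    if (m < 2) return 0;  // degenerate fan-out, no meaningful answer
    unsigned levels = 0;
    std::uint64_t reach = 1;  // pages addressable with `levels` steps
    while (reach < n) {
        // If reach * m would exceed n (or overflow), one more level is enough.
        if (reach > n / m) {
            ++levels;
            break;
        }
        reach *= m;
        ++levels;
    }
    return levels;
}
```

With 100 entries per page, one million pages need only 3 levels, so an
update touches a handful of pages whether they are cached or on disk.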
Node:Document name scheme, Next:Data Storage Spec,
Previous:Constraints,
Up:Top
Document name scheme
In all of the literature dealing with full text indexing a
collection of documents is considered to be a flat set of documents
containing words. Each document has a unique name. The inverted
index associates terms found in the documents with a list of unique
document names.
We found it more interesting to consider that the document names
have a hierarchical structure, just like path names in file
systems. The main difference is that each component of the document
name (think path name in file system) may contain terms.
As shown in the figure above we can consider that the first
component of the document name is the name of a collection, the
second the logical name of a set of documents within the
collection, the third the name of the document, the fourth the name
of a part of the document.
This logical structure may be applied to URLs in the following
way : there is only one collection, it contains servers (document
sets) containing URLs (documents) containing tags such as TITLE
(document parts).
This logical structure may also be applied to databases in
the following way : there is one collection for each database, it
contains tables (document set) containing fields (document)
containing records (document part).
What does this imply for full text indexing ? Instead of having
only one dictionary to map the document name to a numerical
identifier (this is needed to compress the postings for a term), we
must have a dictionary for each level of the hierarchy.
Using the database example again:
- A dictionary for database names
- A dictionary for table names
- A dictionary for field names
- Since records are already identified by a number, no dictionary
is needed.
When coding the document identifier in the postings for a term,
we have to code a list of numerical identifiers instead of a single
numerical identifier. Alternatively one could see the document
identifier as an arbitrary precision number sliced in parts.
The advantages of this document naming scheme are:
- A uniq query operator can be trivially
implemented. This is mostly useful to answer a query such as : I
want URLs matching the word foo but I only want to see one URL for
a given server (avoid the problem of having the first 40 URLs for a
request on the same server).
- The posting lists are traditionally ordered according to the
document number. This is a must to have an efficient query
mechanism. With a hierarchical document name, each level of the
hierarchy is sorted. Therefore the postings are sorted in multiple
ways: sorted by collection first, then document set, then document,
then document part.
- Searching document paths is facilitated by the structure of the
key. For instance: I only want to search TITLEs.
Of course, the suggested hierarchy semantic is not mandatory and
may be redefined according to sorting needs. For instance a
relevance ranking algorithm can lead to a relevance ranking number
being inserted into the hierarchy.
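The uniq operator and the multi-level sort can be sketched as follows.
This is only an illustration under assumed names (DocId and
uniq_by_set are not part of mifluz): a document identifier is modelled
as four numbers, one per level of the hierarchy, and the standard
lexicographic ordering of std::array provides the sort by collection,
then document set, then document, then document part.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical document identifier: one number per hierarchy level
// (collection, document set, document, document part).
using DocId = std::array<std::uint32_t, 4>;

// Sort the postings, then keep only the first posting of each
// (collection, document set) pair, e.g. one URL per server.
std::vector<DocId> uniq_by_set(std::vector<DocId> postings) {
    // std::array compares lexicographically, level by level.
    std::sort(postings.begin(), postings.end());
    std::vector<DocId> out;
    for (const DocId& id : postings) {
        if (out.empty() || out.back()[0] != id[0] || out.back()[1] != id[1]) {
            out.push_back(id);
        }
    }
    return out;
}
```

Because the postings are already sorted by prefix, the uniq pass is a
single linear scan that only compares neighbouring entries.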
The space overhead implied by this name scheme is quite small
for databases and URL pools. The big dictionary for URL pools maps
URL to identifiers. The dictionary for tags (TITLE etc.) holds only
10-50 entries at most. The dictionary for site names (www.domain.com) will
be ~1/100 of the dictionary for URLs, assuming you have 100 URLs
for a given site. For databases the situation is even better: the
big dictionary would be the dictionary mapping rowids to numerical
identifiers. But since rowids are already numerical we don't need
this. We only need the database name, field name and table name
dictionaries and they are small. Since we are able to encode small
numbers using only a few bits in postings, the overhead of
hierarchical names is acceptable.
Node:Data Storage Spec, Next:Cache tuning, Previous:Document name
scheme, Up:Top
Data Storage Spec
Efficient management of the data storage space is an important
issue of the management of inverted indexes. The needs of an
inverted index are very similar to the needs of a regular file
system. We need:
- A cache associated with an LRU list to keep the most frequently
used entries in memory.
- To group postings into pages of fixed size to optimize I/O on
disk.
- A locking mechanism to prevent race conditions between threads
or multiple processes accessing the same data.
- A transaction system to ensure data integrity and atomicity of
logical operations.
- Transparent compression of pages to reduce I/O bottleneck for
large volumes of data and reduce disk usage as a bonus.
- To create indexes using up to 1 terabyte.
All these functionalities are provided by file systems and
kernel services. Since we also wanted the mifluz
library to be portable we chose the Berkeley DB library that
implements all the services above. The transparent compression is
not part of Berkeley DB and is implemented as a patch to Berkeley
DB (version 3.1.14).
Based on these low level services, Berkeley DB also implements a
Btree structure that mifluz uses to store the
postings. Each posting is an entry in the Btree structure. Indexing
100 million words implies creating 100 million entries in the
Btree. When transparent compression is used and assuming we have 6
byte words and a document identifier using 7 * 8 bits, the average
disk size used per entry is 6 bytes.
Unique word statistics are also stored in the inverted index.
For each unique word, an entry is created in a dictionary and
associated with a serial number (the word identifier) and the total
number of occurrences.
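A minimal in-memory sketch of such a dictionary is given here. The
names ToyWordStats and ToyDictionary are hypothetical and do not
reflect how mifluz stores this data on disk; the sketch only shows the
association of a word with a serial number and an occurrence count.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Per-word statistics: serial number and total number of occurrences.
struct ToyWordStats {
    std::uint32_t id;
    std::uint64_t occurrences;
};

// Hypothetical dictionary of unique words (illustration only).
class ToyDictionary {
public:
    // Record one occurrence of `word`; a new word gets the next serial
    // number. Returns the word identifier.
    std::uint32_t record(const std::string& word) {
        auto it = entries_.find(word);
        if (it == entries_.end()) {
            it = entries_.emplace(word, ToyWordStats{next_id_++, 0}).first;
        }
        ++it->second.occurrences;
        return it->second.id;
    }

    // Return the statistics for `word`, or nullptr if it was never seen.
    const ToyWordStats* find(const std::string& word) const {
        auto it = entries_.find(word);
        return it == entries_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, ToyWordStats> entries_;
    std::uint32_t next_id_ = 1;  // serial numbers start at 1 in this sketch
};
```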
Node:Cache tuning,
Next:Key
Specification, Previous:Data Storage Spec, Up:Top
Cache tuning
The cache memory used by mifluz has a tremendous
impact on performance. It is set by the
wordlist_cache_size attribute (see WordList(3) and
mifluz(3)). It holds pages from the inverted index in memory
(uncompressed if the file is compressed) to reduce disk access.
Pages migrate from disk to memory using an LRU list.
Each page in the cache is really a node of the B-Tree used to
store the inverted index entries. The internal pages are
intermediate nodes that mifluz must traverse each time
a key is searched. It is therefore very important to keep them in
memory. Fortunately they only count for 1% of the total size of the
index, at most. The size of the cache must at least include enough
space for the internal pages.
The other factors that must be taken into account in sizing the
cache are highly dependent on the application. A typical case is
insertion of many random words in the index. In this case two
factors are of special importance:
-
distribution of unique words
- When filling an inverted index it is very likely that the
dictionary of unique words occurring in the index is limited. Let's
say you have 1 000 000 unique words in a 100 000 000 occurrences
index. Now assume that 90 000 000 occurrences are only using 20 000
unique words, that is 90% of the index is filled with 2% of the
complete vocabulary. If you are in this situation, the indexing
process will spend 90% of its time updating 20 000 pages. If you
can afford 20 000 * pagesize bytes of cache, you will have the
maximum insertion rate.
The general rule is : estimate or calculate how many unique
words fill 90% of your index. Multiply this number by the pagesize
and increase your cache by that amount. See
wordlist_page_size attribute in WordList(3) or
mifluz(3).
order of numbers following the key
- The cache calculation above is fine as long as the words
inserted are associated with increasing numbers in the key. If the
numbers following the word in the key are random, the cache
efficiency will be reduced. Where possible the application should
therefore make sure that when inserting two identical words, the
first is followed by a number that is lower than the second. In
other words, insert
foo 100
foo 103
rather than
foo 103
foo 100
This hint must not be considered in isolation but with careful
analysis of the distribution of the key components (word and
numbers). For instance it does not matter much if a random key
follows the word as long as the range of values of the number is
small.
The conclusion is that the cache size should be at least 1% of
the total index size (uncompressed) plus a number of bytes that
depends on the usage pattern.
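The sizing rule can be written down as a small calculation. This is a
back-of-the-envelope sketch, not a mifluz function: the name
suggested_cache_bytes is invented, and the usage-dependent part is
reduced to the "unique words filling 90% of the index" rule stated
earlier in this section.

```cpp
#include <cstdint>

// Rough cache size estimate in bytes:
//   1% of the uncompressed index size (room for the internal pages)
// + one page per unique word among those filling 90% of the index.
std::uint64_t suggested_cache_bytes(std::uint64_t uncompressed_index_bytes,
                                    std::uint64_t hot_unique_words,
                                    std::uint64_t page_size) {
    const std::uint64_t internal_pages = uncompressed_index_bytes / 100;
    const std::uint64_t hot_pages = hot_unique_words * page_size;
    return internal_pages + hot_pages;
}
```

For a 1 000 000 000 byte index, 20 000 frequently updated words and an
assumed page size of 4096 bytes, this gives 10 000 000 + 81 920 000 =
91 920 000 bytes.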
Node:Key
Specification, Next:Internals, Previous:Cache tuning, Up:Top
Key Specification
The key structure is what uniquely identifies each word that is
inserted in the inverted index. A key is made of a string (which is
the word being indexed), and a document identifier (which is really
a list of numbers), as discussed above.
The exact structure of the inverted index key must be specified
in the configuration parameter
"wordlist_wordkey_description". See the WordKeyInfo(3)
manual page for more information on the format.
We will focus on three examples that illustrate common
usage.
First example: a very simple inverted index would be to
associate each word occurrence with a URL (coded as a 32 bit
number). The key description would be:
Word 8/URL 32
Second example: if building a full text index of the content of
a database, you need to know in which field, table and record the
word appeared. This makes three numbers for the document id.
Only a few bits are needed to encode the field and table name
(let's say you have a maximum of 16 field names and 16 table names,
4 bits each is enough). The record number uses 24 bits because we
know we won't have more than 16 M records.
The structure of the key would then be:
Word 8/Table 4/Field 4/Record 24
When you have more than one field involved in a key you must
choose the order in which they appear. It is mandatory that the
Word is first. It is the part of the key that has
highest precedence when sorting. The fields that follow have lower
and lower precedence.
Third example: we go back to the first example and imagine we
have a relevance ranking function that calculates a value for each
word occurrence. By inserting this relevance ranking value in the
inverted index key, all the occurrences will be sorted with the
most relevant first.
Word 8/Rank 5/URL 32
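Two properties of these key descriptions can be checked with a short
sketch: how many distinct values a field of a given bit width can
hold, and how field order determines sort precedence. The names
field_max, DbKey and key_less are assumptions made for this
illustration; the sketch does not parse the real
wordlist_wordkey_description format.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Largest value that fits in a key field of `bits` bits.
std::uint64_t field_max(unsigned bits) {
    return bits >= 64 ? ~0ULL : ((1ULL << bits) - 1);
}

// Hypothetical in-memory form of a Word/Table/Field/Record key.
struct DbKey {
    std::string word;
    unsigned table;
    unsigned field;
    std::uint32_t record;
};

// Compare keys field by field: the word has the highest precedence,
// and each following field has lower precedence than the one before.
bool key_less(const DbKey& a, const DbKey& b) {
    return std::tie(a.word, a.table, a.field, a.record) <
           std::tie(b.word, b.table, b.field, b.record);
}
```

A 4 bit field holds values 0 to 15 (16 names) and a 24 bit field
values up to 16 777 215, which matches the "16 M records" estimate of
the second example.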
Node:Internals, Next:Development, Previous:Key Specification,
Up:Top
Internals
Node:Compression,
Previous:Internals, Up:Internals
Compression
Compressing the index reduces disk space consumption and speeds
up the indexing by reducing I/O.
Compressing at the mifluz level would imply
choosing complicated key structures, slowing down and complicating
insert and delete operations. We have chosen to do the compression
within Berkeley DB in the memory pool subsystem. Berkeley DB keeps
fixed size pages in a memory cache; when it is full it writes the
least recently used pages to disk. When a page is needed Berkeley
DB looks for it in memory and retrieves it from disk if it is not in
memory. The compression/uncompression occurs when a page moves
between the memory pool and the disk.
Node:Berkeley DB Compression,
Next:Page
compression in Mifluz, Previous:Compression, Up:Compression
Compression inside Berkeley DB
Berkeley DB uses fixed size pages. Suppose, for example, that our
compression algorithm can compress by a factor of 8 in most cases:
we then use a disk page size that is 1/8 of the memory page size. However
there are exceptions. Some pages won't compress well and therefore
won't fit on one disk page. Extra pages are therefore allocated and
are linked into a chained list. Allocating extra pages implies that
some pages may become free as a result of a better compression.
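The number of disk pages taken by one compressed memory page can be
estimated with a ceiling division. This is a simplified model under
stated assumptions: the function name disk_pages_needed is invented,
and the space taken by the links of the chained list is ignored.

```cpp
#include <cstddef>

// Disk pages needed to store a memory page once compressed.
// A page always occupies at least one disk page in this model; any
// page beyond the first is an "extra" page in the chained list.
std::size_t disk_pages_needed(std::size_t compressed_bytes,
                              std::size_t disk_page_size) {
    if (disk_page_size == 0) return 0;  // invalid configuration
    if (compressed_bytes == 0) return 1;
    return (compressed_bytes + disk_page_size - 1) / disk_page_size;
}
```

With 8192 byte memory pages and 1024 byte disk pages, a page that
compresses to 900 bytes fits in one disk page, while a page that only
compresses to 3000 bytes needs three.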
Node:Page compression in
Mifluz, Previous:Berkeley DB Compression,
Up:Compression
Page compression in Mifluz
The mifluz classes WordDBCompress and
WordBitCompress do the compression/decompression work. From the
list of keys stored in a page they extract several lists of numbers.
Each list of numbers has common statistical properties that allow
good compression.
The WordDBCompress_compress_c and WordDBCompress_uncompress_c
functions are C callbacks that are called by the page
compression code in Berkeley DB. The C callbacks then call the
WordDBCompress compress/uncompress methods. The WordDBCompress
object creates a WordBitCompress object that acts as a buffer holding the
compressed stream.
Compression algorithm.
Most DB pages contain redundant data because mifluz
chose to store one word occurrence per entry. Because of this
choice the pages have a very simple structure.
Here is a real world example of what a page can look like: (key
structure: word identifier + 4 numerical fields)
756 1 4482 1 10b
756 1 4482 1 142
756 1 4484 1 40
756 1 449f 1 11e
756 1 4545 1 11
756 1 45d3 1 545
756 1 45e0 1 7e5
756 1 45e2 1 830
756 1 45e8 1 545
756 1 45fe 1 ec
756 1 4616 1 395
756 1 461a 1 1eb
756 1 4631 1 49
756 1 4634 1 48
.... etc ....
To compress we chose to only code differences between adjacent
entries. A flag is stored for each entry indicating which fields
have changed. When a field is different from the previous one, the
compression stores the difference which is likely to be small since
the entries are sorted.
The basic idea is to build columns of numbers, one for each
field, and then compress them individually. One can see that the
first and second columns will compress very well since all the
values are the same. The third column will also compress well since
the differences between the numbers are small, leading to a small
set of numbers.
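The difference coding with per-entry change flags can be sketched as
follows. This is an illustration of the principle only, not the
WordDBCompress implementation: the types Entry and DeltaEntry and both
functions are invented here, the bit-level packing of the resulting
numbers is left out, and the example values are assumed to be
hexadecimal.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One page entry: word identifier followed by 4 numerical fields.
using Entry = std::array<std::uint32_t, 5>;

// Coded form of an entry relative to the previous one.
struct DeltaEntry {
    std::uint8_t changed;              // bit i set: field i differs
    std::vector<std::int64_t> deltas;  // one difference per changed field
};

// Code each entry as the differences from its predecessor.
// The first entry is coded against an all-zero entry.
std::vector<DeltaEntry> delta_encode(const std::vector<Entry>& entries) {
    std::vector<DeltaEntry> out;
    Entry prev{};
    for (const Entry& e : entries) {
        DeltaEntry d{0, {}};
        for (std::size_t i = 0; i < e.size(); ++i) {
            if (e[i] != prev[i]) {
                d.changed |= static_cast<std::uint8_t>(1u << i);
                d.deltas.push_back(static_cast<std::int64_t>(e[i]) -
                                   static_cast<std::int64_t>(prev[i]));
            }
        }
        out.push_back(d);
        prev = e;
    }
    return out;
}

// Rebuild the original entries from the coded form.
std::vector<Entry> delta_decode(const std::vector<DeltaEntry>& coded) {
    std::vector<Entry> out;
    Entry cur{};
    for (const DeltaEntry& d : coded) {
        std::size_t next = 0;
        for (std::size_t i = 0; i < cur.size(); ++i) {
            if (d.changed & (1u << i)) {
                cur[i] = static_cast<std::uint32_t>(
                    static_cast<std::int64_t>(cur[i]) + d.deltas[next++]);
            }
        }
        out.push_back(cur);
    }
    return out;
}
```

On the second line of the example page only the last field changes, so
the entry reduces to one flag byte and the single difference 0x37. The
last field can decrease when an earlier field increases, which is why
this sketch stores signed differences.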
Node:Development,
Next:Reference, Previous:Internals, Up:Top
Development
The development of mifluz is shared between
Senga (www.senga.org) and the Ht://dig
Group (dev.htdig.org). Part of the distribution comes from the
Ht://dig CVS tree and part from the Senga
CVS tree. The idea is to share efforts between two development
groups that have very similar needs. Since Senga and
Ht://dig are both developed under the GPL licence,
such cooperation occurs naturally.
To compile a program using the mifluz library use
something that looks like the following:
g++ -o word -I/usr/local/include word.cc -L/usr/local/lib -lmifluz
Node:Reference, Next:Concept Index, Previous:Development, Up:Top
Reference
Node:htdb_dump, Next:htdb_stat, Previous:Reference, Up:Reference
htdb_dump
Node:htdb_dump
NAME, Next:htdb_dump
SYNOPSIS, Previous:htdb_dump, Up:htdb_dump
htdb_dump NAME
dump the content of an inverted index in Berkeley DB fashion
Node:htdb_dump SYNOPSIS, Next:htdb_dump
DESCRIPTION, Previous:htdb_dump NAME, Up:htdb_dump
htdb_dump SYNOPSIS
htdb_dump [-klNpWz] [-S pagesize] [-C cachesize] [-d ahr] [-f file] [-h home] [-s subdb] db_file
Node:htdb_dump DESCRIPTION, Next:htdb_dump OPTIONS,
Previous:htdb_dump
SYNOPSIS, Up:htdb_dump
htdb_dump DESCRIPTION
htdb_dump is a slightly modified version of the standard
Berkeley DB db_dump utility.
The htdb_dump utility reads the database file
db_file and writes it to the standard output using
a portable flat-text format understood by the
htdb_load utility. The argument
db_file must be a file produced using the Berkeley
DB library functions.
Node:htdb_dump OPTIONS, Next:htdb_dump
ENVIRONMENT, Previous:htdb_dump DESCRIPTION, Up:htdb_dump
htdb_dump OPTIONS
-
-W
Initialize WordContext(3) before dumping. Combined with the
-z flag, this allows dumping inverted indexes that use the
mifluz(3) specific compression scheme. The MIFLUZ_CONFIG
environment variable must be set to a file containing the mifluz(3)
configuration.
-
-z
The db_file is compressed. If
-W is given the mifluz(3) specific compression
scheme is used. Otherwise the default gzip compression scheme is
used.
-
-d
Dump the specified database in a format helpful for debugging
the Berkeley DB library routines.
- a
Display all information.
- h
Display only page headers.
- r
Do not display the free-list or pages on the free list. This
mode is used by the recovery tests.
The output format of the -d option is not standard
and may change, without notice, between releases of the Berkeley DB
library.
-
-f
Write to the specified file instead of to the
standard output.
-
-h
Specify a home directory for the database. Berkeley DB
versions before 2.0 did not support the concept of a database
home.
-
-k
Dump record numbers from Queue and Recno databases as
keys.
-
-l
List the subdatabases stored in the database.
-
-N
Do not acquire shared region locks while running. Other problems
such as potentially fatal errors in Berkeley DB will be ignored as
well. This option is intended only for debugging errors and should
not be used under any other circumstances.
-
-p
If characters in either the key or data items are printing
characters (as defined by isprint (3)), use
printing characters in file to represent them.
This option permits users to use standard text editors and tools to
modify the contents of databases.
Note that different systems may have different notions as to what
characters are considered printing characters, and
databases dumped in this manner may be less portable to external
systems.
-
-s
Specify a subdatabase to dump. If no subdatabase is specified,
all subdatabases found in the database are dumped.
-
-V
Write the version number to the standard output and exit.
Dumping and reloading Hash databases that use user-defined hash
functions will result in new databases that use the default hash
function. While using the default hash function may not be optimal
for the new database, it will continue to work correctly.
Dumping and reloading Btree databases that use user-defined
prefix or comparison functions will result in new databases that
use the default prefix and comparison functions. In this
case, it is quite likely that the database will be damaged beyond
repair, permitting neither record storage nor retrieval.
The only available workaround for either case is to modify the
sources for the htdb_load utility to load the database
using the correct hash, prefix and comparison functions.
Node:htdb_dump ENVIRONMENT,
Previous:htdb_dump
OPTIONS, Up:htdb_dump
htdb_dump ENVIRONMENT
DB_HOME If the -h option is
not specified and the environment variable DB_HOME is set, it is
used as the path of the database home.
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:htdb_stat, Next:htdb_load, Previous:htdb_dump, Up:Reference
htdb_stat
Node:htdb_stat
NAME, Next:htdb_stat
SYNOPSIS, Previous:htdb_stat, Up:htdb_stat
htdb_stat NAME
displays statistics for Berkeley DB environments.
Node:htdb_stat SYNOPSIS, Next:htdb_stat
DESCRIPTION, Previous:htdb_stat NAME, Up:htdb_stat
htdb_stat SYNOPSIS
htdb_stat [-celmNtzW] [-C Acfhlmo] [-d file [-s file]] [-h home] [-M Ahlm]
Node:htdb_stat DESCRIPTION, Next:htdb_stat OPTIONS,
Previous:htdb_stat
SYNOPSIS, Up:htdb_stat
htdb_stat DESCRIPTION
htdb_stat is a slightly modified version of the standard
Berkeley DB db_stat utility which displays statistics for Berkeley
DB environments.
Node:htdb_stat OPTIONS, Next:htdb_stat
ENVIRONMENT, Previous:htdb_stat DESCRIPTION, Up:htdb_stat
htdb_stat OPTIONS
-
-W
Initialize WordContext(3) before gathering statistics. Combined with
the -z flag, this allows gathering statistics on inverted
indexes generated with the mifluz(3) specific compression scheme.
The MIFLUZ_CONFIG environment variable must be set to a file
containing the mifluz(3) configuration.
-
-z
The file is compressed. If -W
is given the mifluz(3) specific compression scheme is used.
Otherwise the default gzip compression scheme is used.
-
-C
Display internal information about the lock region. (The output
from this option is often both voluminous and meaningless, and is
intended only for debugging.)
-
A
Display all information.
-
c
Display lock conflict matrix.
-
f
Display lock and object free lists.
-
l
Display lockers within hash chains.
-
m
Display region memory information.
-
o
Display objects within hash chains.
-
-c
Display lock region statistics.
-
-d
Display database statistics for the specified database. If the
database contains subdatabases, the statistics are for the database
or subdatabase specified, and not for the database as a
whole.
-
-e
Display current environment statistics.
-
-h
Specify a home directory for the database.
-
-l
Display log region statistics.
-
-M
Display internal information about the shared memory buffer
pool. (The output from this option is often both voluminous and
meaningless, and is intended only for debugging.)
-
A
Display all information.
-
h
Display buffers within hash chains.
-
l
Display buffers within LRU chains.
-
m
Display region memory information.
-
-m
Display shared memory buffer pool statistics.
-
-N
Do not acquire shared region locks while running. Other problems
such as potentially fatal errors in Berkeley DB will be ignored as
well. This option is intended only for debugging errors and should
not be used under any other circumstances.
-
-s
Display database statistics for the specified subdatabase of the
database specified with the -d flag.
-
-t
Display transaction region statistics.
-
-V
Write the version number to the standard output and exit.
Only one set of statistics is displayed for each run, and the
last option specifying a set of statistics takes precedence.
Values smaller than 10 million are generally displayed without
any special notation. Values larger than 10 million are normally
displayed as <number>M.
The htdb_stat utility attaches to one or more of the Berkeley DB
shared memory regions. In order to avoid region corruption, it
should always be given the chance to detach and exit gracefully. To
cause htdb_stat to clean up after itself and exit, send it an
interrupt signal (SIGINT).
Node:htdb_stat ENVIRONMENT,
Previous:htdb_stat
OPTIONS, Up:htdb_stat
htdb_stat ENVIRONMENT
DB_HOME If the -h option is
not specified and the environment variable DB_HOME is set, it is
used as the path of the database home.
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:htdb_load, Next:mifluzdump, Previous:htdb_stat, Up:Reference
htdb_load
Node:htdb_load
NAME, Next:htdb_load
SYNOPSIS, Previous:htdb_load, Up:htdb_load
htdb_load NAME
load the content of an inverted index from a flat-text dump into a
Berkeley DB database.
Node:htdb_load SYNOPSIS, Next:htdb_load
DESCRIPTION, Previous:htdb_load NAME, Up:htdb_load
htdb_load SYNOPSIS
htdb_load [-nTzW] [-c name=value] [-f file] [-h home] [-C cachesize] [-t btree | hash | recno] db_file
Node:htdb_load DESCRIPTION, Next:htdb_load OPTIONS,
Previous:htdb_load
SYNOPSIS, Up:htdb_load
htdb_load DESCRIPTION
The htdb_load utility reads from the standard input and loads it
into the database db_file . The database
db_file is created if it does not already
exist.
The input to htdb_load must be in the output format specified by
the htdb_dump utility, or as specified for the -T
option below.
Node:htdb_load OPTIONS, Next:htdb_load KEYWORDS,
Previous:htdb_load DESCRIPTION, Up:htdb_load
htdb_load OPTIONS
-
-W
Initialize WordContext(3) before loading. Combined with the
-z flag, this allows loading inverted indexes that use the
mifluz(3) specific compression scheme. The MIFLUZ_CONFIG
environment variable must be set to a file containing the mifluz(3)
configuration.
-
-z
The db_file is compressed. If
-W is given the mifluz(3) specific compression
scheme is used. Otherwise the default gzip compression scheme is
used.
-
-c
Specify configuration options for the DB structure, ignoring any
value they may have based on the input. The command-line format is
name=value. See htdb_load KEYWORDS
for a list of supported keywords for the -c
option.
-
-f
Read from the specified input file instead of
from the standard input.
-
-h
Specify a home directory for the database. If a home directory
is specified, the database environment is opened using the
DB_INIT_LOCK, DB_INIT_LOG,
DB_INIT_MPOOL, DB_INIT_TXN and
DB_USE_ENVIRON flags to DBENV->open. This means
that htdb_load can be used to load data into databases while they
are in use by other processes. If the DBENV->open call fails, or
if no home directory is specified, the database is still updated,
but the environment is ignored, e.g., no locking is done.
-
-n
Do not overwrite existing keys in the database when loading into
an already existing database. If a key/data pair cannot be loaded
into the database for this reason, a warning message is displayed
on the standard error output and the key/data pair is
skipped.
-
-T
The -T option allows non-Berkeley DB
applications to easily load text files into databases.
If the database to be created is of type Btree or Hash, or the
keyword keys is specified as set, the input must
be paired lines of text, where the first line of the pair is the
key item, and the second line of the pair is its corresponding data
item. If the database to be created is of type Queue or Recno and
the keyword keys is not set, the input must be
lines of text, where each line is a new data item for the
database.
A simple escape mechanism, where newline and backslash (\)
characters are special, is applied to the text input. Newline
characters are interpreted as record separators. Backslash
characters in the text will be interpreted in one of two ways: if
the backslash character precedes another backslash character, the
pair will be interpreted as a literal backslash. If the backslash
character precedes any other character, the two characters
following the backslash will be interpreted as a hexadecimal
specification of a single character, e.g., \0a is a newline
character in the ASCII character set.
For this reason, any backslash or newline characters that
naturally occur in the text input must be escaped to avoid
misinterpretation by htdb_load.
If the -T option is specified, the underlying
access method type must be specified using the -t
option.
-
-t
Specify the underlying access method. If no -t
option is specified, the database will be loaded into a database of
the same type as was dumped, e.g., a Hash database will be created
if a Hash database was dumped.
Btree and Hash databases may be converted from one to the other.
Queue and Recno databases may be converted from one to the other.
If the -k option was specified on the call to
htdb_dump then Queue and Recno databases may be converted to Btree
or Hash, with the key being the integer record number.
-
-V
Write the version number to the standard output and exit.
The htdb_load utility attaches to one or more of the Berkeley DB
shared memory regions. In order to avoid region corruption, it
should always be given the chance to detach and exit gracefully. To
cause htdb_load to clean up after itself and exit, send it an
interrupt signal (SIGINT).
The htdb_load utility exits 0 on success, 1 if one or more
key/data pairs were not loaded into the database because the key
already existed, and >1 if an error occurs.
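An application preparing text input for the -T option has to apply the
escape rules described above. The following sketch shows one way to do
it. It is written from the description in this manual, not taken from
the htdb_load sources; the function names escape_text and
unescape_text are invented, and a newline is assumed to be coded as
\0a (ASCII).

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Escape a key or data item so that it fits on one input line:
// a backslash becomes two backslashes, a newline becomes \0a.
std::string escape_text(const std::string& raw) {
    std::string out;
    for (char c : raw) {
        if (c == '\\') {
            out += "\\\\";
        } else if (c == '\n') {
            out += "\\0a";
        } else {
            out += c;
        }
    }
    return out;
}

// Reverse the escaping: a doubled backslash is a literal backslash,
// a backslash followed by two hexadecimal digits is a single character.
std::string unescape_text(const std::string& line) {
    std::string out;
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i] != '\\') {
            out += line[i];
            continue;
        }
        if (i + 1 < line.size() && line[i + 1] == '\\') {
            out += '\\';
            ++i;
            continue;
        }
        if (i + 2 >= line.size() ||
            !std::isxdigit(static_cast<unsigned char>(line[i + 1])) ||
            !std::isxdigit(static_cast<unsigned char>(line[i + 2]))) {
            throw std::invalid_argument("malformed escape sequence");
        }
        out += static_cast<char>(std::stoi(line.substr(i + 1, 2), nullptr, 16));
        i += 2;
    }
    return out;
}
```

Escaping every item before writing it keeps each key and each data
item on its own line, which is what the paired-line input format
relies on.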
Node:htdb_load KEYWORDS, Next:htdb_load
ENVIRONMENT, Previous:htdb_load OPTIONS, Up:htdb_load
htdb_load KEYWORDS
The following keywords are supported for the -c
command-line option to the htdb_load utility. See DB->open for
further discussion of these keywords and what values should be
specified.
The parenthetical listing specifies how the value part of the
name=value pair is interpreted. Items listed as
(boolean) expect value to be 1 (set) or
0 (unset). Items listed as (number) convert value
to a number. Items listed as (string) use the string value without
modification.
bt_minkey (number)
- The minimum number of keys per page.
db_lorder (number)
- The byte order for integers in the stored database
metadata.
db_pagesize (number)
- The size of pages used for nodes in the tree, in bytes.
duplicates (boolean)
- The value of the DB_DUP flag.
h_ffactor (number)
- The density within the Hash database.
h_nelem (number)
- The size of the Hash database.
keys (boolean)
- Specify if keys are present for Queue or Recno databases.
re_len (number)
- Specify fixed-length records of the specified length.
re_pad (string)
- Specify the fixed-length record pad character.
recnum (boolean)
- The value of the DB_RECNUM flag.
renumber (boolean)
- The value of the DB_RENUMBER flag.
subdatabase (string)
- The subdatabase to load.
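As an illustration of how the name=value pairs listed above could be
checked before being handed to htdb_load, here is a small sketch. It is
not part of mifluz or Berkeley DB: KeywordKind, KeywordTable and
CheckKeyword are hypothetical names, and the sketch assumes that a
(number) value is a non-negative decimal integer.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

enum class KeywordKind { Boolean, Number, String };

// Keywords and value kinds copied from the list above.
const std::map<std::string, KeywordKind>& KeywordTable() {
    static const std::map<std::string, KeywordKind> table = {
        {"bt_minkey", KeywordKind::Number},
        {"db_lorder", KeywordKind::Number},
        {"db_pagesize", KeywordKind::Number},
        {"duplicates", KeywordKind::Boolean},
        {"h_ffactor", KeywordKind::Number},
        {"h_nelem", KeywordKind::Number},
        {"keys", KeywordKind::Boolean},
        {"re_len", KeywordKind::Number},
        {"re_pad", KeywordKind::String},
        {"recnum", KeywordKind::Boolean},
        {"renumber", KeywordKind::Boolean},
        {"subdatabase", KeywordKind::String},
    };
    return table;
}

// Return true if arg has the form name=value, name is a known keyword
// and value is acceptable for the kind of that keyword.
bool CheckKeyword(const std::string& arg) {
    const std::size_t eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    const std::string name = arg.substr(0, eq);
    const std::string value = arg.substr(eq + 1);
    const auto it = KeywordTable().find(name);
    if (it == KeywordTable().end()) return false;
    switch (it->second) {
    case KeywordKind::Boolean:
        return value == "0" || value == "1";
    case KeywordKind::Number:
        if (value.empty()) return false;
        for (char c : value) {
            if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        }
        return true;
    case KeywordKind::String:
        return true;  // used without modification
    }
    return false;
}
```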
Node:htdb_load ENVIRONMENT,
Previous:htdb_load
KEYWORDS, Up:htdb_load
htdb_load ENVIRONMENT
DB_HOME If the -h option is
not specified and the environment variable DB_HOME is set, it is
used as the path of the database home.
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:mifluzdump, Next:mifluzload, Previous:htdb_load, Up:Reference
mifluzdump
Node:mifluzdump
NAME, Next:mifluzdump SYNOPSIS, Previous:mifluzdump, Up:mifluzdump
mifluzdump NAME
dump the content of an inverted index.
Node:mifluzdump SYNOPSIS, Next:mifluzdump
DESCRIPTION, Previous:mifluzdump NAME, Up:mifluzdump
mifluzdump SYNOPSIS
mifluzdump file
Node:mifluzdump DESCRIPTION, Next:mifluzdump
ENVIRONMENT, Previous:mifluzdump SYNOPSIS, Up:mifluzdump
mifluzdump DESCRIPTION
mifluzdump writes on stdout a complete ASCII
description of the file inverted index using the
WordList::Write method.
Node:mifluzdump ENVIRONMENT,
Previous:mifluzdump DESCRIPTION, Up:mifluzdump
mifluzdump ENVIRONMENT
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:mifluzload, Next:mifluzsearch, Previous:mifluzdump, Up:Reference
mifluzload
Node:mifluzload
NAME, Next:mifluzload SYNOPSIS, Previous:mifluzload, Up:mifluzload
mifluzload NAME
load the content of an inverted index.
Node:mifluzload SYNOPSIS, Next:mifluzload
DESCRIPTION, Previous:mifluzload NAME, Up:mifluzload
mifluzload SYNOPSIS
mifluzload file
Node:mifluzload DESCRIPTION, Next:mifluzload
ENVIRONMENT, Previous:mifluzload SYNOPSIS, Up:mifluzload
mifluzload DESCRIPTION
mifluzload reads from stdin a complete ASCII
description of the file inverted index using the
WordList::Read method.
Node:mifluzload ENVIRONMENT,
Previous:mifluzload DESCRIPTION, Up:mifluzload
mifluzload ENVIRONMENT
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:mifluzsearch,
Next:mifluzdict, Previous:mifluzload, Up:Reference
mifluzsearch
Node:mifluzsearch NAME, Next:mifluzsearch
SYNOPSIS, Previous:mifluzsearch, Up:mifluzsearch
mifluzsearch NAME
search the content of an inverted index.
Node:mifluzsearch SYNOPSIS, Next:mifluzsearch
DESCRIPTION, Previous:mifluzsearch NAME, Up:mifluzsearch
mifluzsearch SYNOPSIS
mifluzsearch -f words [options]
Node:mifluzsearch DESCRIPTION,
Next:mifluzsearch
ENVIRONMENT, Previous:mifluzsearch SYNOPSIS, Up:mifluzsearch
mifluzsearch DESCRIPTION
mifluzsearch searches a mifluz index for documents matching an
Alt*Vista expression (simple syntax).
Interpreting the debugging information: a cursor is opened in the
index for every word, and the cursors are stored in a list. The list of
cursors is always processed in the same order, as a singly linked
list. With -v, each block is an individual action on behalf of the
word shown on the first line. The last line of the block is the
conclusion of the action described in the block. REDO means the
same cursor must be examined again because the conditions have
changed. RESTART means we go back to the first cursor in the list
because it may not match the new conditions anymore. NEXT means the
cursor and all the cursors before it match the conditions and we
may proceed to the next cursor. ATEND means the cursor cannot match
the conditions because it is at the end of the index.
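To make the four outcomes easier to read, here is a toy model of a
multi-cursor walk. It is only a sketch under strong assumptions and not
the mifluzsearch implementation: every word is reduced to a sorted list
of document numbers, all words must match the same document, and the
name IntersectSketch is hypothetical. The sketch reports REDO when the
first cursor moves the condition forward and RESTART when a later
cursor does.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Intersect sorted lists of document numbers, one cursor per list.
// If trace is not null, the outcome of every step is appended to it.
std::vector<unsigned> IntersectSketch(
        const std::vector<std::vector<unsigned> >& lists,
        std::vector<std::string>* trace = nullptr) {
    std::vector<unsigned> matches;
    if (lists.empty()) return matches;
    std::vector<std::size_t> pos(lists.size(), 0);  // cursor positions
    unsigned candidate = 0;   // the condition every cursor must match
    std::size_t i = 0;        // index of the cursor being examined
    for (;;) {
        const std::vector<unsigned>& list = lists[i];
        // Move cursor i forward to its first entry >= candidate.
        while (pos[i] < list.size() && list[pos[i]] < candidate) ++pos[i];
        if (pos[i] == list.size()) {
            if (trace) trace->push_back("ATEND");
            break;
        }
        if (list[pos[i]] > candidate) {
            // The cursor overshot: its entry becomes the new condition
            // and the walk resumes from the first cursor.
            candidate = list[pos[i]];
            if (trace) trace->push_back(i == 0 ? "REDO" : "RESTART");
            i = 0;
            continue;
        }
        if (trace) trace->push_back("NEXT");
        if (i + 1 < lists.size()) {
            ++i;
            continue;
        }
        // Every cursor sits on candidate: record the match.
        matches.push_back(candidate);
        if (candidate == std::numeric_limits<unsigned>::max()) break;
        ++candidate;
        i = 0;
    }
    return matches;
}
```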
Node:mifluzsearch ENVIRONMENT,
Previous:mifluzsearch DESCRIPTION,
Up:mifluzsearch
mifluzsearch ENVIRONMENT
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:mifluzdict, Next:WordContext, Previous:mifluzsearch, Up:Reference
mifluzdict
Node:mifluzdict
NAME, Next:mifluzdict SYNOPSIS, Previous:mifluzdict, Up:mifluzdict
mifluzdict NAME
dump the dictionary of an inverted index.
Node:mifluzdict SYNOPSIS, Next:mifluzdict
DESCRIPTION, Previous:mifluzdict NAME, Up:mifluzdict
mifluzdict SYNOPSIS
mifluzdict file
Node:mifluzdict DESCRIPTION, Next:mifluzdict
ENVIRONMENT, Previous:mifluzdict SYNOPSIS, Up:mifluzdict
mifluzdict DESCRIPTION
mifluzdict writes on stdout a complete ASCII
description of the file inverted index using the
WordList::Write method.
Node:mifluzdict ENVIRONMENT,
Previous:mifluzdict DESCRIPTION, Up:mifluzdict
mifluzdict ENVIRONMENT
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz.
Node:WordContext,
Next:WordList, Previous:mifluzdict, Up:Reference
WordContext
Node:WordContext NAME, Next:WordContext SYNOPSIS,
Previous:WordContext,
Up:WordContext
WordContext NAME
read configuration and set up mifluz context.
Node:WordContext SYNOPSIS, Next:WordContext
DESCRIPTION, Previous:WordContext NAME, Up:WordContext
WordContext SYNOPSIS
#include <mifluz.h>
WordContext context;
Node:WordContext DESCRIPTION,
Next:WordContext
CONFIGURATION, Previous:WordContext SYNOPSIS, Up:WordContext
WordContext DESCRIPTION
The WordContext object must be the first object created. All
other objects (WordList, WordReference, WordKey and WordRecord) are
allocated via the corresponding methods of WordContext (List, Word,
Key and Record respectively).
The WordContext object contains a Configuration
object that holds the configuration parameters used by the
instance. If a configuration parameter is changed, the
ReInitialize method should be called to take the change into
account.
Node:WordContext CONFIGURATION,
Next:WordContext
METHODS, Previous:WordContext DESCRIPTION,
Up:WordContext
WordContext CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_monitor {true|false} (default false)
- If true create a
WordMonitor instance to gather
statistics and build reports.
Node:WordContext METHODS, Next:WordContext
ENVIRONMENT, Previous:WordContext CONFIGURATION,
Up:WordContext
WordContext METHODS
WordContext()
- Constructor. Read the configuration parameters from the
environment. If the environment variable
MIFLUZ_CONFIG is set to a pathname, read it as a
configuration file. If MIFLUZ_CONFIG is not set,
try to read the
~/.mifluz configuration file or
/usr/etc/mifluz.conf. See the mifluz manual page for
a complete list of the configuration attributes.
WordContext(const Configuration &config)
- Constructor. The config argument must contain
all the configuration parameters, no configuration file is loaded
from the environment.
WordContext(const ConfigDefaults *array)
- Constructor. The array argument holds
configuration parameters that will override their equivalent in the
configuration file read from the environment.
void Initialize(const Configuration
&config)
- Initialize the WordContext object. This method is called by
every constructor.
When calling Initialize a second time, one must
ensure that all WordList and WordCursor objects have been
destroyed. WordList and WordCursor internal state depends on the
current WordContext that will be lost by a second call.
For those interested in the internals, the
Initialize function maintains a Berkeley DB
environment (DB_ENV) in the following way:
First invocation:
Initialize -> new DB_ENV (through WordDBInfo)
Second invocation:
Initialize -> delete DB_ENV -> new DB_ENV (through WordDBInfo)
int Initialize(const ConfigDefaults* config_defaults =
0)
- Initialize the WordContext object. Build a
Configuration object from the file pointed to by the
MIFLUZ_CONFIG environment variable or ~/.mifluz or
/usr/etc/mifluz.conf. The config_defaults
argument, if provided, is passed to the Configuration
object using the Defaults method. The
Initialize(const Configuration &) method is
then called with the Configuration object. Return OK
on success, NOTOK otherwise. Refer to the
Configuration description for more information.
int ReInitialize()
- Destroy internal state except the
Configuration
object and rebuild it. May be used when the configuration is
changed to take these changes into account. Return OK on success,
NOTOK otherwise.
const WordType& GetType() const
- Return the WordType data member of the current
object as a const.
WordType& GetType()
- Return the WordType data member of the current
object.
const WordKeyInfo& GetKeyInfo() const
- Return the WordKeyInfo data member of the
current object as a const.
WordKeyInfo& GetKeyInfo()
- Return the WordKeyInfo data member of the
current object.
const WordRecordInfo& GetRecordInfo()
const
- Return the WordRecordInfo data member of the
current object as a const.
WordRecordInfo& GetRecordInfo()
- Return the WordRecordInfo data member of the
current object.
const WordDBInfo& GetDBInfo() const
- Return the WordDBInfo data member of the
current object as a const.
WordDBInfo& GetDBInfo()
- Return the WordDBInfo data member of the
current object.
const WordMonitor* GetMonitor() const
- Return the WordMonitor data member of the
current object as a const. The pointer may be NULL if the
wordlist_monitor attribute is false.
WordMonitor* GetMonitor()
- Return the WordMonitor data member of the
current object. The pointer may be NULL if the wordlist_monitor
attribute is false.
const Configuration& GetConfiguration()
const
- Return the Configuration data member of the
current object as a const.
Configuration& GetConfiguration()
- Return the Configuration data member of the
current object.
WordList* List()
- Return a new WordList object, using the
WordList(WordContext*) constructor. It is the responsibility of the
caller to delete this object before the WordContext object is
deleted. Refer to the wordlist_multi configuration
parameter to know the exact type of the object created.
WordReference* Word()
- Return a new WordReference object, using the
WordReference(WordContext*) constructor. It is the responsibility
of the caller to delete this object before the WordContext object
is deleted.
WordReference* Word(const String& key0, const
String& record0)
- Return a new WordReference object, using the
WordReference(WordContext*, const String&, const String&)
constructor. It is the responsibility of the caller to delete this
object before the WordContext object is deleted.
WordReference* Word(const String& word)
- Return a new WordReference object, using the
WordReference(WordContext*, const String&) constructor. It is
the responsibility of the caller to delete this object before the
WordContext object is deleted.
WordRecord* Record()
- Return a new WordRecord object, using the
WordRecord(WordContext*) constructor. It is the responsibility of
the caller to delete this object before the WordContext object is
deleted.
WordKey* Key()
- Return a new WordKey object, using the
WordKey(WordContext*) constructor. It is the responsibility of the
caller to delete this object before the WordContext object is
deleted.
WordKey* Key(const String& word)
- Return a new WordKey object, using the
WordKey(WordContext*, const String&) constructor. It is the
responsibility of the caller to delete this object before the
WordContext object is deleted.
WordKey* Key(const WordKey& other)
- Return a new WordKey object, using the
WordKey(WordContext*, const WordKey&) constructor. It is the
responsibility of the caller to delete this object before the
WordContext object is deleted.
static String ConfigFile()
- Return the full pathname of the configuration file. The
configuration file lookup first searches for the file pointed to by
the MIFLUZ_CONFIG environment variable, then
~/.mifluz and finally
/usr/etc/mifluz.conf. If no configuration file is
found, return the empty string.
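The lookup order described for ConfigFile can be sketched as follows.
This is an illustration and not the mifluz source: FindConfigFile is a
hypothetical name, the home directory is assumed to come from the HOME
environment variable, and the sketch assumes that a MIFLUZ_CONFIG value
naming a missing file falls through to the next candidate. The file
test is passed in as a predicate so that the logic can be tried without
touching the file system.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Return the first candidate for which exists() is true, or the empty
// string if none is found. mifluz_config and home stand for the values
// of the MIFLUZ_CONFIG and HOME environment variables (null if unset).
std::string FindConfigFile(
        const char* mifluz_config,
        const char* home,
        const std::function<bool(const std::string&)>& exists) {
    // 1. File named by the MIFLUZ_CONFIG environment variable.
    if (mifluz_config != nullptr && *mifluz_config != '\0' &&
        exists(mifluz_config)) {
        return mifluz_config;
    }
    // 2. Per-user file ~/.mifluz.
    if (home != nullptr && *home != '\0') {
        const std::string user_file = std::string(home) + "/.mifluz";
        if (exists(user_file)) return user_file;
    }
    // 3. System-wide file.
    const std::string system_file = "/usr/etc/mifluz.conf";
    if (exists(system_file)) return system_file;
    return std::string();
}
```

A caller would typically pass std::getenv("MIFLUZ_CONFIG"),
std::getenv("HOME") and a predicate built on stat(2) or a similar call.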
Node:WordContext ENVIRONMENT,
Previous:WordContext METHODS, Up:WordContext
WordContext ENVIRONMENT
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz or
/usr/etc/mifluz.conf.
Node:WordList, Next:WordDict, Previous:WordContext, Up:Reference
WordList
Node:WordList
NAME, Next:WordList
SYNOPSIS, Previous:WordList, Up:WordList
WordList NAME
abstract class to manage and use an inverted index file.
Node:WordList SYNOPSIS, Next:WordList DESCRIPTION,
Previous:WordList
NAME, Up:WordList
WordList SYNOPSIS
#include <mifluz.h>
WordContext context;
WordList* words = context.List();
delete words;
Node:WordList DESCRIPTION, Next:WordList
CONFIGURATION, Previous:WordList SYNOPSIS, Up:WordList
WordList DESCRIPTION
WordList is the mifluz equivalent of a database
handler. Each WordList object is bound to an inverted index file
and implements the operations to create it, fill it with word
occurrences and search for an entry matching a given criterion.
WordList is an abstract class and cannot be instantiated. The
List method of the class WordContext will create
an instance using the appropriate derived class, either WordListOne
or WordListMulti. Refer to the corresponding manual pages for more
information on their specific semantics.
When doing bulk insertions, mifluz creates temporary files that
contain the entries to be inserted in the index. Those files are
typically named indexC00000000. The maximum size of
the temporary file is wordlist_cache_size / 2.
When the maximum size of the temporary file is reached, mifluz
creates another temporary file named indexC00000001.
The process continues until mifluz has created 50 temporary files. At
this point it merges all temporary files into one that replaces the
first indexC00000000. Then it continues to create
temporary files again and keeps following this algorithm until the
bulk insertion is finished. When the bulk insertion is finished,
mifluz has one big file named indexC00000000 that
contains all the entries to be inserted in the index. mifluz
inserts all the entries from indexC00000000 into the
index and deletes the temporary file when done. The insertion will
be fast since all the entries in indexC00000000 are
already sorted.
The parameter wordlist_cache_max can be used to
prevent the temporary files from growing indefinitely. If the total
cumulative size of the indexC* files grows beyond this
parameter, they are merged into the main index and deleted. For
instance, setting this parameter value to 500Mb guarantees that the
total size of the indexC* files will not grow above
500Mb.
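The run-and-merge scheme described above can be modelled in memory as
follows. This is only a sketch of the idea and not the mifluz code:
BatchSketch is a hypothetical class, each vector stands for one
temporary indexC* file, run_capacity plays the role of
wordlist_cache_size / 2 and max_runs the role of the limit of 50 files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

class BatchSketch {
public:
    BatchSketch(std::size_t run_capacity, std::size_t max_runs)
        : run_capacity_(run_capacity), max_runs_(max_runs) {
        assert(run_capacity_ > 0 && max_runs_ > 1);
    }

    // Buffer one entry; seal the current run when it is full.
    void Insert(const std::string& entry) {
        current_.push_back(entry);
        if (current_.size() == run_capacity_) SealCurrentRun();
    }

    // End of the batch: return every entry in sorted order, ready to be
    // inserted sequentially in the index.
    std::vector<std::string> Finish() {
        if (!current_.empty()) SealCurrentRun();
        MergeRuns();
        std::vector<std::string> sorted;
        if (!runs_.empty()) sorted.swap(runs_.front());
        runs_.clear();
        return sorted;
    }

    std::size_t RunCount() const { return runs_.size(); }

private:
    // Sort the current run and store it as a new "temporary file".
    void SealCurrentRun() {
        std::sort(current_.begin(), current_.end());
        runs_.push_back(std::move(current_));
        current_.clear();
        if (runs_.size() == max_runs_) MergeRuns();
    }

    // Merge all sorted runs into a single sorted run that replaces them.
    void MergeRuns() {
        if (runs_.size() < 2) return;
        std::vector<std::string> merged;
        for (const std::vector<std::string>& run : runs_) {
            std::vector<std::string> next;
            next.reserve(merged.size() + run.size());
            std::merge(merged.begin(), merged.end(), run.begin(), run.end(),
                       std::back_inserter(next));
            merged.swap(next);
        }
        runs_.clear();
        runs_.push_back(std::move(merged));
    }

    std::size_t run_capacity_;
    std::size_t max_runs_;
    std::vector<std::string> current_;
    std::vector<std::vector<std::string> > runs_;
};
```

Because every run is sorted before it is stored, each merge is a linear
pass, and the final result can be fed to the index in key order.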
Node:WordList CONFIGURATION, Next:WordList METHODS,
Previous:WordList
DESCRIPTION, Up:WordList
WordList CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_extend {true|false} (default false)
- If true maintain reference count of unique
words. The Noccurrence method gives access to this
count.
wordlist_verbose <number> (default 0)
- Set the verbosity level of the WordList class.
1 walk logic
2 walk logic details
3 walk logic lots of details
wordlist_page_size <bytes> (default
8192)
- Berkeley DB page size (see Berkeley DB documentation)
wordlist_cache_size <bytes> (default
500K)
- Berkeley DB cache size (see Berkeley DB documentation). Cache
makes a huge difference in performance. It must be at least 2% of
the expected total data size. Note that if compression is activated
the data size is eight times larger than the actual file size. In
this case the cache must be scaled to 2% of the data size, not 2%
of the file size. See Cache tuning in the mifluz
guide for more hints. See WordList(3) for the rationale behind
cache file handling.
wordlist_cache_max <bytes> (default 0)
- Maximum cumulative size of the cache files generated when doing
bulk insertion with the BatchStart() function.
When this limit is reached, the cache files are all merged into the
inverted index. The value 0 means infinite size allowed. See
WordList(3) for the rationale behind cache file handling.
wordlist_cache_inserts {true|false} (default
false)
- If true all Insert calls are cached in memory.
When the WordList object is closed or a different access method is
called, the cached entries are flushed to the inverted index.
wordlist_compress {true|false} (default
false)
- Activate compression of the index. The resulting index is eight
times smaller than the uncompressed index.
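The sizing rule given for wordlist_cache_size (at least 2% of the data
size, where the data size is eight times the file size when compression
is active) amounts to a short computation. The function name
MinimumCacheBytes is hypothetical and the sketch simply restates the
figures quoted above; it does not query mifluz or Berkeley DB.

```cpp
#include <cassert>
#include <cstdint>

// Smallest cache size, in bytes, satisfying the 2% rule for an index
// file of file_bytes bytes. With compression the data size is taken to
// be eight times the file size. Very large inputs could overflow the
// 64-bit intermediate; the sketch does not guard against that.
std::uint64_t MinimumCacheBytes(std::uint64_t file_bytes, bool compressed) {
    const std::uint64_t data_bytes = compressed ? file_bytes * 8 : file_bytes;
    return (data_bytes * 2 + 99) / 100;  // 2% of the data size, rounded up
}
```

For a 1,000,000,000 byte index file this gives 20,000,000 bytes without
compression and 160,000,000 bytes with compression.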
Node:WordList
METHODS, Previous:WordList CONFIGURATION, Up:WordList
WordList METHODS
inline WordContext* GetContext()
- Return a pointer to the WordContext object used to create this
instance.
inline const WordContext* GetContext() const
- Return a pointer to the WordContext object used to create this
instance as a const.
virtual inline int Override(const WordReference&
wordRef)
- Insert wordRef in index. If the
Key() part of the wordRef exists in
the index, override it. Returns OK on success, NOTOK on
error.
virtual int Exists(const WordReference&
wordRef)
- Returns OK if wordRef exists in the index,
NOTOK otherwise.
inline int Exists(const String& word)
- Returns OK if word exists in the index, NOTOK
otherwise.
virtual int WalkDelete(const WordReference&
wordRef)
- Delete all entries in the index whose key matches the
Key() part of wordRef , using the
Walk method. Returns the number of entries
successfully deleted.
virtual int Delete(const WordReference&
wordRef)
- Delete the entry in the index that exactly matches the
Key() part of wordRef. Returns OK if
deletion is successful, NOTOK otherwise.
virtual int Open(const String& filename, int
mode)
- Open inverted index filename.
mode may be
O_RDONLY or
O_RDWR. If mode is O_RDWR it can be or'ed
with O_TRUNC to reset the content of an existing
inverted index. Return OK on success, NOTOK otherwise.
virtual int Close()
- Close inverted index. Return OK on success, NOTOK
otherwise.
virtual unsigned int Size() const
- Return the size of the index in pages.
virtual int Pagesize() const
- Return the page size.
virtual WordDict *Dict()
- Return a pointer to the inverted index dictionary.
const String& Filename() const
- Return the filename given to the last call to Open.
int Flags() const
- Return the mode given to the last call to Open.
inline List *Find(const WordReference&
wordRef)
- Returns the list of word occurrences exactly matching the
Key() part of wordRef. The
List returned contains pointers to
WordReference objects. It is the responsibility of the
caller to free the list. See List.h header for usage.
inline List *FindWord(const String& word)
- Returns the list of word occurrences exactly matching the
word. The
List returned contains
pointers to WordReference objects. It is the
responsibility of the caller to free the list. See List.h header
for usage.
virtual List *operator [] (const WordReference&
wordRef)
- Alias to the Find method.
inline List *operator [] (const String&
word)
- Alias to the FindWord method.
virtual List *Prefix (const WordReference&
prefix)
- Returns the list of word occurrences matching the
Key() part of prefix. In the
Key() , the string (accessed with
GetWord() ) matches any string that begins with it.
The List returned contains pointers to
WordReference objects. It is the responsibility of the
caller to free the list.
inline List *Prefix (const String&
prefix)
- Returns the list of word occurrences matching the
prefix. In the
Key() , the string
(accessed with GetWord() ) matches any string that
begins with it. The List returned contains pointers to
WordReference objects. It is the responsibility of the
caller to free the list.
virtual List *Words()
- Returns a list of all unique words contained in the inverted
index. The
List returned contains pointers to
String objects. It is the responsibility of the caller
to free the list. See List.h header for usage.
virtual List *WordRefs()
- Returns a list of all entries contained in the inverted index.
The
List returned contains pointers to
WordReference objects. It is the responsibility of the
caller to free the list. See List.h header for usage.
virtual WordCursor *Cursor(wordlist_walk_callback_t
callback, Object *callback_data)
- Create a cursor that searches all the occurrences in the
inverted index and calls callback with
callback_data for every match.
virtual WordCursor *Cursor(const WordKey &searchKey,
int action = HTDIG_WORDLIST_WALKER)
- Create a cursor that searches all the occurrences in the
inverted index that match searchKey. If
action is set to HTDIG_WORDLIST_WALKER, it calls
searchKey.callback with
searchKey.callback_data for every match. If
action is set to HTDIG_WORDLIST_COLLECT, it pushes each
match in the searchKey.collectRes data member as a
WordReference object. It is the responsibility of
the caller to free the searchKey.collectRes
list.
virtual WordCursor *Cursor(const WordKey &searchKey,
wordlist_walk_callback_t callback, Object *
callback_data)
- Create a cursor that searches all the occurrences in the
inverted index that match searchKey and calls
callback with callback_data for
every match.
virtual WordKey Key(const String&
bufferin)
- Create a WordKey object and return it. The
bufferin argument is used to initialize the key,
as in the WordKey::Set method. The first component of
bufferin must be a word that is translated to the
corresponding numerical id using the WordDict::Serial method.
virtual WordReference Word(const String& bufferin,
int exists = 0)
- Create a WordReference object and return it. The
bufferin argument is used to initialize the
structure, as in the WordReference::Set method. The first component
of bufferin must be a word that is translated to
the corresponding numerical id using the WordDict::Serial method.
If the exists argument is set to 1, the method
WordDict::SerialExists is used instead, that is no serial is
assigned to the word if it does not already have one. Before
translation the word is normalized using the WordType::Normalize
method. The word is saved using the WordReference::SetWord
method.
virtual WordReference WordExists(const String&
bufferin)
- Alias for Word(bufferin, 1).
virtual void BatchStart()
- Accelerate bulk insertions in the inverted index. All insertions
done with the Override method are batched instead
of updating the inverted index immediately. No update of the
inverted index file is done before the BatchEnd
method is called.
virtual void BatchEnd()
- Terminate a bulk insertion started with a call to the
BatchStart method. When all insertions are done
the AllRef method is called to restore
statistics.
virtual int Noccurrence(const String& key, unsigned
int& noccurrence) const
- Return in noccurrence the number of
occurrences of the string contained in the
GetWord()
part of key. Returns OK on success, NOTOK
otherwise.
virtual int Write(FILE* f)
- Write on file descriptor f an ASCII
description of the index. Each line of the file contains a
WordReference ASCII description. Return OK on success,
NOTOK otherwise.
virtual int WriteDict(FILE* f)
- Write on file descriptor f the complete
dictionary with statistics. Return OK on success, NOTOK
otherwise.
virtual int Read(FILE* f)
- Read
WordReference ASCII descriptions from
f , returns the number of inserted WordReference
or < 0 if an error occurs. Invalid descriptions are ignored as
well as empty lines.
Node:WordDict, Next:WordListOne, Previous:WordList, Up:Reference
WordDict
Node:WordDict
NAME, Next:WordDict
SYNOPSIS, Previous:WordDict, Up:WordDict
WordDict NAME
manage and use an inverted index dictionary.
Node:WordDict SYNOPSIS, Next:WordDict DESCRIPTION,
Previous:WordDict
NAME, Up:WordDict
WordDict SYNOPSIS
#include <mifluz.h>
WordList* words = ...;
WordDict* dict = words->Dict();
Node:WordDict DESCRIPTION, Next:WordDict METHODS,
Previous:WordDict
SYNOPSIS, Up:WordDict
WordDict DESCRIPTION
WordDict maps strings to unique identifiers and frequency in the
inverted index. Whenever a new word is found, the WordDict class
can be asked to assign it a serial number. When doing so, an entry
is created in the dictionary with a frequency of zero. The
application may then increment or decrement the frequency to
reflect the inverted index content.
The serial numbers range from 1 to 2^32 inclusive.
A WordDict object is automatically created by the WordList
object and should not be created directly by the application.
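The bookkeeping described above can be pictured with a small in-memory
model. It is a sketch only, not the WordDict implementation (which
stores its entries in Berkeley DB): DictSketch and kSerialInvalid are
hypothetical names, and the value 0 chosen for the invalid serial is an
assumption of the sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

const unsigned int kSerialInvalid = 0;  // hypothetical "no serial" value

class DictSketch {
public:
    DictSketch() : next_serial_(1) {}

    // Return the serial of word, creating an entry with a frequency of
    // zero if the word is not known yet.
    unsigned int Serial(const std::string& word) {
        auto it = entries_.find(word);
        if (it == entries_.end()) {
            it = entries_.emplace(word, Entry{next_serial_++, 0}).first;
        }
        return it->second.serial;
    }

    // Return the serial of word, or kSerialInvalid if it has no entry.
    unsigned int SerialExists(const std::string& word) const {
        const auto it = entries_.find(word);
        return it == entries_.end() ? kSerialInvalid : it->second.serial;
    }

    // Add incr to the frequency of word. False if word has no entry.
    bool Incr(const std::string& word, unsigned int incr) {
        const auto it = entries_.find(word);
        if (it == entries_.end()) return false;
        it->second.frequency += incr;
        return true;
    }

    // Subtract decr from the frequency of word. When the frequency
    // reaches zero the entry is removed and the serial is forgotten.
    bool Decr(const std::string& word, unsigned int decr) {
        const auto it = entries_.find(word);
        if (it == entries_.end()) return false;
        if (it->second.frequency <= decr) entries_.erase(it);
        else it->second.frequency -= decr;
        return true;
    }

    // Store the frequency of word in noccurrence. False if unknown.
    bool Noccurrence(const std::string& word, unsigned int& noccurrence) const {
        const auto it = entries_.find(word);
        if (it == entries_.end()) return false;
        noccurrence = it->second.frequency;
        return true;
    }

private:
    struct Entry {
        unsigned int serial;
        unsigned int frequency;
    };
    std::map<std::string, Entry> entries_;
    unsigned int next_serial_;
};
```

Note that a word whose entry has been removed receives a new serial
number the next time Serial is called for it.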
Node:WordDict
METHODS, Previous:WordDict DESCRIPTION, Up:WordDict
WordDict METHODS
WordDict()
- Private constructor.
int Initialize(WordList* words)
- Bind the object to a WordList inverted index. Return OK on
success, NOTOK otherwise.
int Open()
- Open the underlying Berkeley DB sub-database. The enclosing
file is given by the
words data member. Return OK on
success, NOTOK otherwise.
int Remove()
- Destroy the underlying Berkeley DB sub-database. Return OK on
success, NOTOK otherwise.
int Close()
- Close the underlying Berkeley DB sub-database. Return OK on
success, NOTOK otherwise.
int Serial(const String& word, unsigned int&
serial)
- If the word argument exists in the
dictionary, return its serial number in the
serial argument. If it does not already exist,
assign it a serial number, create an entry with a frequency of zero
and return the new serial in the serial argument.
Return OK on success, NOTOK otherwise.
int SerialExists(const String& word, unsigned
int& serial)
- If the word argument exists in the
dictionary, return its serial number in the
serial argument. If it does not exist, set the
serial argument to WORD_DICT_SERIAL_INVALID.
Return OK on success, NOTOK otherwise.
int SerialRef(const String& word, unsigned int&
serial)
- Short hand for Serial() followed by Ref(). Return OK on
success, NOTOK otherwise.
int Noccurrence(const String& word, unsigned int&
noccurrence) const
- Return the frequency of the word argument in
the noccurrence argument. Return OK on success,
NOTOK otherwise.
int Normalize(String& word) const
- Short hand for
words->GetContext()->GetType().Normalize(word). Return OK
on success, NOTOK otherwise.
int Ref(const String& word)
- Short hand for Incr(word, 1)
int Incr(const String& word, unsigned int
incr)
- Add incr to the frequency of the
word . Return OK on success, NOTOK
otherwise.
int Unref(const String& word)
- Short hand for Decr(word, 1)
int Decr(const String& word, unsigned int
decr)
- Subtract decr from the frequency of the
word. If the frequency becomes lower than or equal to
zero, remove the entry from the dictionary and lose the
association between the word and its serial number. Return OK on
success, NOTOK otherwise.
int Put(const String& word, unsigned int
noccurrence)
- Set the frequency of word with the value of
the noccurrence argument.
int Exists(const String& word) const
- Return true if word exists in the dictionary,
false otherwise.
WordList* Words() const
- Return a pointer to the associated WordList object.
WordDictCursor* Cursor() const
- Return a cursor to sequentially walk the dictionary using the
Next method.
int Next(WordDictCursor* cursor, String& word,
WordDictRecord& record)
- Return the next entry in the dictionary. The
cursor argument must have been created using the
Cursor method. The word is returned in the
word argument and the record is returned in the
record argument. On success the function returns
0, at the end of the dictionary it returns DB_NOTFOUND. The
cursor argument is deallocated when the function
hits the end of the dictionary or an error occurs.
WordDictCursor* CursorPrefix(const String& prefix)
const
- Return a cursor to sequentially walk the entries of the
dictionary that start with the prefix argument,
using the NextPrefix method.
int NextPrefix(WordDictCursor* cursor, String& word,
WordDictRecord& record)
- Return the next prefix from the dictionary. The
cursor argument must have been created using the
CursorPrefix method. The word is returned in the
word argument and the record is returned in the
record argument. The word is
guaranteed to start with the prefix specified to the
CursorPrefix method. On success the function
returns 0, at the end of the dictionary it returns DB_NOTFOUND.
The cursor argument is deallocated when the
function hits the end of the dictionary or an error occurs.
int Write(FILE* f)
- Dump the complete dictionary to the file descriptor
f. The format of the dictionary is
word
serial frequency, one per line.
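The prefix walk offered by CursorPrefix and NextPrefix relies on the
words being visited in sorted order. The following sketch shows the
idea with a std::map standing in for the ordered storage; it is an
illustration under that assumption and WordsWithPrefix is a
hypothetical name, not a mifluz function.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Return, in sorted order, every word of dict that starts with prefix.
// The map value stands for whatever record is attached to the word.
std::vector<std::string> WordsWithPrefix(
        const std::map<std::string, unsigned int>& dict,
        const std::string& prefix) {
    std::vector<std::string> found;
    // Jump to the first word that is not smaller than the prefix, then
    // walk forward until a word no longer starts with it.
    for (auto it = dict.lower_bound(prefix); it != dict.end(); ++it) {
        if (it->first.compare(0, prefix.size(), prefix) != 0) break;
        found.push_back(it->first);
    }
    return found;
}
```

Since all words sharing a prefix are adjacent in sorted order, the walk
can stop at the first word that does not match.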
Node:WordListOne,
Next:WordKey, Previous:WordDict, Up:Reference
WordListOne
Node:WordListOne NAME, Next:WordListOne SYNOPSIS,
Previous:WordListOne,
Up:WordListOne
WordListOne NAME
manage and use an inverted index file.
Node:WordListOne SYNOPSIS, Next:WordListOne
DESCRIPTION, Previous:WordListOne NAME, Up:WordListOne
WordListOne SYNOPSIS
#include <mifluz.h>
WordContext context;
WordList* words = context.List();
// or, constructing the derived class directly:
// WordList* words = new WordListOne(&context);
Node:WordListOne DESCRIPTION,
Next:WordListOne
METHODS, Previous:WordListOne SYNOPSIS, Up:WordListOne
WordListOne DESCRIPTION
WordList is the mifluz equivalent of a database
handler. Each WordList object is bound to an inverted index file
and implements the operations to create it, fill it with word
occurrences and search for an entry matching a given criterion.
The general behaviour of WordListOne is described in the
WordList manual page. It is preferred to create a WordListOne
instance by setting the wordlist_multi configuration
parameter to false and calling the
WordContext::List method.
Only the methods that differ from WordList are listed here. All
the methods of WordList are implemented by WordListOne and you
should refer to the manual page for more information.
The Cursor methods all return a WordCursorOne
instance cast to a WordCursor object.
Node:WordListOne METHODS, Previous:WordListOne
DESCRIPTION, Up:WordListOne
WordListOne METHODS
WordListOne(WordContext* ncontext)
- Constructor. Build inverted index handling object using run
time configuration parameters listed in the
CONFIGURATION section of the
WordList manual page.
int DeleteCursor(WordDBCursor& cursor)
- Delete the inverted index entry currently pointed to by the
cursor. Returns 0 on success, Berkeley DB error
code on error. This is mainly useful when implementing a callback
function for a WordCursor.
Node:WordKey, Next:WordKeyInfo, Previous:WordListOne, Up:Reference
WordKey
Node:WordKey NAME,
Next:WordKey SYNOPSIS,
Previous:WordKey, Up:WordKey
WordKey NAME
inverted index key.
Node:WordKey
SYNOPSIS, Next:WordKey DESCRIPTION, Previous:WordKey NAME, Up:WordKey
WordKey SYNOPSIS
#include <WordKey.h>
#define WORD_KEY_DOCID 1
#define WORD_KEY_LOCATION 2
WordList* words = ...;
WordKey key = words->Key("word 100 20");
WordKey searchKey;
words->Dict()->SerialExists("dog", searchKey.Get(WORD_KEY_WORD));
searchKey.Set(WORD_KEY_LOCATION, 5);
WordCursor* cursor = words->Cursor(searchKey);
Node:WordKey DESCRIPTION, Next:WordKey ASCII
FORMAT, Previous:WordKey SYNOPSIS, Up:WordKey
WordKey DESCRIPTION
Describes the key used to store an entry in the inverted index.
Each field in the key has a bit in the set member
that says if it is set or not. This bit makes it possible to mark a
particular field as undefined regardless of the actual
value stored. The methods IsDefined, SetDefined
and Undefined are used to manipulate the
defined status of a field. The Pack
and Unpack methods are used to convert to and from
the disk storage representation of the key.
Although constructors may be used, the preferred way to create a
WordKey object is by using the WordContext::Key
method.
The following constants are defined:
WORD_KEY_WORD
- the index of the word identifier within the key for Set and Get
methods.
WORD_KEY_VALUE_INVALID
- a value that is invalid for any field of the key.
Node:WordKey ASCII FORMAT, Next:WordKey METHODS,
Previous:WordKey
DESCRIPTION, Up:WordKey
WordKey ASCII FORMAT
The ASCII description is a string with fields separated by tabs
or white space.
Example: 200 <UNDEF> 1 4 2
Field 1: The word identifier or <UNDEF> if not defined
Field 2 to the end: numerical value of the field or <UNDEF> if
not defined
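As an illustration of this format, the sketch below splits an ASCII description into fields. It is not the parser used by mifluz: the Field structure and the ParseKeyDescription function are hypothetical, and rejecting anything other than <UNDEF> or an unsigned decimal number is an assumption.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One key field: defined is false for <UNDEF>, in which case value is unused.
struct Field {
    bool defined;
    unsigned long value;
};

// Split desc on tabs or spaces and convert each token to a Field.
// Returns false if a token is neither <UNDEF> nor a decimal number,
// or if the description contains no field at all.
bool ParseKeyDescription(const std::string& desc, std::vector<Field>& fields) {
    fields.clear();
    std::istringstream in(desc);
    std::string token;
    while (in >> token) {  // operator>> skips tabs and spaces alike
        Field field = {false, 0};
        if (token != "<UNDEF>") {
            if (token.find_first_not_of("0123456789") != std::string::npos)
                return false;
            field.defined = true;
            field.value = std::stoul(token);
        }
        fields.push_back(field);
    }
    return !fields.empty();
}
```

Applied to the example above, the description 200 <UNDEF> 1 4 2 yields five fields, the second of which is undefined.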
Node:WordKey
METHODS, Previous:WordKey ASCII FORMAT, Up:WordKey
WordKey METHODS
WordKey(WordContext* ncontext)
- Constructor. Build an empty key. The ncontext
argument must be a pointer to a valid WordContext object.
WordKey(WordContext* ncontext, const String&
desc)
- Constructor. Initialize from an ASCII description of a key. See
ASCII FORMAT section. The ncontext
argument must be a pointer to a valid WordContext object.
void Clear()
- Reset to empty key.
inline int NFields() const
- Convenience function to access the total number of fields in a
key (see
WordKeyInfo(3) ).
inline WordKeyNum MaxValue(int position)
- Convenience function to access the maximum possible value for
the field at position in a key (see
WordKeyInfo(3) ).
inline WordContext* GetContext()
- Return a pointer to the WordContext object used to create this
instance.
inline const WordContext* GetContext() const
- Return a pointer to the WordContext object used to create this
instance as a const.
inline WordKeyNum Get(int position) const
- Return value of numerical field at position as
const.
inline WordKeyNum& Get(int position)
- Return value of numerical field at
position.
inline const WordKeyNum & operator[] (int position)
const
- Return value of numerical field at position as
const.
inline WordKeyNum & operator[] (int
position)
- Return value of numerical field at
position.
inline void Set(int position, WordKeyNum val)
- Set value of numerical field at position to
val.
int IsDefined(int position) const
- Returns true if field at position is
defined, false otherwise.
void SetDefined(int position)
- Value in field position becomes
defined. A bit is set in the bit field describing the
defined/undefined state of the value and the actual value of the
field is not modified.
void Undefined(int position)
- Value in field position becomes
undefined. A bit is set in the bit field describing
the defined/undefined state of the value and the actual value of
the field is not modified.
int Set(const String& bufferin)
- Set the whole structure from ASCII string in
bufferin. See
ASCII FORMAT section.
Return OK if successful, NOTOK otherwise.
int Get(String& bufferout) const
- Convert the whole structure to an ASCII string description in
bufferout. See
ASCII FORMAT section.
Return OK if successful, NOTOK otherwise.
String Get() const
- Convert the whole structure to an ASCII string description and
return it. See
ASCII FORMAT section.
int Unpack(const char* string, int length)
- Set structure from disk storage format as found in
string buffer of length length.
Return OK if successful, NOTOK otherwise.
inline int Unpack(const String& data)
- Set structure from disk storage format as found in
data string. Return OK if successful, NOTOK
otherwise.
int Pack(String& data) const
- Convert object into disk storage format and place
the result in data string. Return OK if
successful, NOTOK otherwise.
int Merge(const WordKey& other)
- Copy each
defined field from other into the
object, if the corresponding field of the object is not defined.
Return OK if successful, NOTOK otherwise.
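The merge rule described for Merge can be sketched on a simplified key model. The Field and Key types and the MergeSketch function are hypothetical and only illustrate the rule; they are not part of mifluz.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Field {
    bool defined;
    unsigned int value;
};
typedef std::vector<Field> Key;

// Copy each defined field of other into key, but only where key
// has no defined value yet. Fields already defined in key win.
void MergeSketch(Key& key, const Key& other) {
    assert(key.size() == other.size());  // both keys share one layout
    for (std::size_t i = 0; i < key.size(); ++i) {
        if (!key[i].defined && other[i].defined) key[i] = other[i];
    }
}
```

With a key whose first field only is defined, merging fills in the remaining fields that other defines and leaves the first field as it was.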
int PrefixOnly()
- Undefine all fields found after the first undefined field. The
resulting key has a set of defined fields followed by undefined
fields. Returns NOTOK if the word is not defined because the
resulting key would be empty and this is considered an error.
Returns OK on success.
int SetToFollowing(int position =
WORD_FOLLOWING_MAX)
- Implement ++ on a key.
It behaves like arithmetic but follows these rules:
. Increment starts at field position
. If a field value overflows, increment field position - 1
. Undefined fields are ignored and their value untouched
. When a field is incremented all fields to the right are set to 0
If position is not specified it is equivalent to NFields() - 1. It
returns OK if successful, NOTOK if position is out
of range or WORD_FOLLOWING_ATEND if the maximum possible value was
reached.
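The increment rules of SetToFollowing can be modelled as below. This is a sketch on a simplified key, not the mifluz code: the Field structure, the Result values and SetToFollowingSketch are hypothetical, and it assumes ordinary carry semantics in which the defined fields located after the incremented field are reset to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Field {
    bool defined;
    unsigned int value;
    unsigned int max;  // largest value the field can hold
};

// Stand-ins for OK, NOTOK and WORD_FOLLOWING_ATEND.
enum Result { kOk, kNotOk, kAtEnd };

Result SetToFollowingSketch(std::vector<Field>& key, int position) {
    if (position < 0 || position >= static_cast<int>(key.size())) return kNotOk;

    // Walk towards field 0 until a defined field that can still grow is found.
    int i = position;
    while (i >= 0 && (!key[i].defined || key[i].value == key[i].max)) --i;
    if (i < 0) return kAtEnd;  // every defined field is already at its maximum

    assert(key[i].value < key[i].max);
    ++key[i].value;

    // Carry: defined fields after the incremented one restart at 0.
    // Undefined fields keep whatever value they hold.
    for (std::size_t j = static_cast<std::size_t>(i) + 1; j < key.size(); ++j) {
        if (key[j].defined) key[j].value = 0;
    }
    return kOk;
}
```

A caller that wants the default behaviour would pass the index of the last field, which corresponds to NFields() - 1 in the description above.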
int Filled() const
- Return true if all the fields are
defined, false
otherwise.
int Empty() const
- Return true if no fields are
defined, false
otherwise.
int Equal(const WordKey& other) const
- Return true if the object and other are equal.
Only fields defined in both keys are compared.
int ExactEqual(const WordKey& other)
const
- Return true if the object and other are equal.
All fields are compared. If a field is defined in one
key and not defined in the other, the keys are
not considered equal.
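The difference between Equal and ExactEqual can be sketched on a simplified key model. The names below are hypothetical. How the stored value of a field that is undefined in both keys is treated is not specified here, so this sketch ignores it, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Field {
    bool defined;
    unsigned int value;
};
typedef std::vector<Field> Key;

// Equal: only fields defined in both keys take part in the comparison.
bool EqualSketch(const Key& a, const Key& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i].defined && b[i].defined && a[i].value != b[i].value) return false;
    }
    return true;
}

// ExactEqual: the defined status must match as well as the values.
bool ExactEqualSketch(const Key& a, const Key& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i].defined != b[i].defined) return false;
        if (a[i].defined && a[i].value != b[i].value) return false;
    }
    return true;
}
```

Two keys that agree on the word but where only one defines the second field are therefore equal under the first function and different under the second.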
int Cmp(const WordKey& other) const
- Compare object and other as
in strcmp. Undefined fields are ignored. Returns a positive number
if object is greater than other,
zero if they are equal, a negative number if
object is lower than other.
int PackEqual(const WordKey& other) const
- Return true if the object and other are equal.
The packed strings are compared. An
undefined numerical
field will be 0 and therefore indistinguishable from a
defined field whose value is 0.
int Outbound(int position, int increment)
- Return true if adding increment in field at
position makes it overflow or underflow, false if
it fits.
int Overflow(int position, int increment)
- Return true if adding positive increment to
field at position makes it overflow, false if it
fits.
int Underflow(int position, int increment)
- Return true if subtracting positive increment
from field at position makes it underflow, false if
it fits.
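The three range checks can be illustrated with plain integer arithmetic. The functions below are hypothetical and take the field value and its width in bits directly instead of a position in a key. The 32 bit upper limit is an assumption based on fields being stored as unsigned int.

```cpp
#include <cassert>

// Largest value that fits in an unsigned field of the given width (1 to 32 bits).
unsigned long long MaxValueSketch(int bits) {
    assert(bits >= 1 && bits <= 32);
    return (1ULL << bits) - 1;
}

// True if value + increment no longer fits in the field.
bool OverflowSketch(unsigned long long value, unsigned long long increment, int bits) {
    assert(value <= MaxValueSketch(bits));
    return increment > MaxValueSketch(bits) - value;
}

// True if value - increment would drop below zero.
bool UnderflowSketch(unsigned long long value, unsigned long long increment) {
    return increment > value;
}

// Signed variant: a negative increment is checked for underflow,
// a positive one for overflow.
bool OutboundSketch(unsigned long long value, long long increment, int bits) {
    return increment >= 0
        ? OverflowSketch(value, static_cast<unsigned long long>(increment), bits)
        : UnderflowSketch(value, static_cast<unsigned long long>(-increment));
}
```

Comparing the increment with the remaining headroom, rather than adding first, avoids wrapping around inside the check itself.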
int Prefix() const
- Return OK if the key may be used as a prefix for search. In
other words return OK if the fields set in the key are all
contiguous, starting from the first field. Otherwise returns
NOTOK.
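The prefix test, together with the truncation done by PrefixOnly, can be sketched on a vector holding only the defined status of each field. Both functions are hypothetical models. The sketch assumes that a key whose first field is undefined is never a usable prefix, in line with the PrefixOnly description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when the defined fields form an unbroken run starting at field 0.
bool PrefixSketch(const std::vector<bool>& defined) {
    if (defined.empty() || !defined[0]) return false;
    bool gap = false;
    for (std::size_t i = 0; i < defined.size(); ++i) {
        if (!defined[i]) gap = true;
        else if (gap) return false;  // a defined field after an undefined one
    }
    return true;
}

// Undefine every field located after the first undefined field.
// Returns false if the first field (the word) is not defined.
bool PrefixOnlySketch(std::vector<bool>& defined) {
    assert(!defined.empty());
    if (!defined[0]) return false;
    bool gap = false;
    for (std::size_t i = 0; i < defined.size(); ++i) {
        if (!defined[i]) gap = true;
        if (gap) defined[i] = false;
    }
    return true;
}
```

A key that fails the prefix test because of a gap passes it once PrefixOnlySketch has been applied.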
static int Compare(WordContext* context, const
String& a, const String& b)
- Compare a and b in the
Berkeley DB fashion. a and b are
packed keys. The semantics of the returned int is as of strcmp and
is driven by the key description found in
WordKeyInfo.
Returns a positive number if a is greater than
b, zero if they are equal, a negative number if
a is lower than b.
static int Compare(WordContext* context, const unsigned
char *a, int a_length, const unsigned char *b, int
b_length)
- Compare a and b in the
Berkeley DB fashion. a and b are
packed keys. The semantics of the returned int is as of strcmp and
is driven by the key description found in
WordKeyInfo.
Returns a positive number if a is greater than
b, zero if they are equal, a negative number if
a is lower than b.
int Diff(const WordKey& other, int& position,
int& lower)
- Compare object defined fields with other key
defined fields only, ignore fields that are not defined in object
or other. Return 1 if different, 0 if equal. If
different, position is set to the number of the field
that differs, lower is set to 1 if Get(position)
is lower than other.Get(position),
otherwise lower is set to 0.
int Write(FILE* f) const
- Print object in ASCII form on f (uses
Get method). See ASCII FORMAT
section.
void Print() const
- Print object in ASCII form on stdout (uses
Get method). See ASCII FORMAT
section.
Node:WordKeyInfo,
Next:WordType, Previous:WordKey, Up:Reference
WordKeyInfo
Node:WordKeyInfo NAME, Next:WordKeyInfo SYNOPSIS,
Previous:WordKeyInfo,
Up:WordKeyInfo
WordKeyInfo NAME
information on the key structure of the inverted index.
Node:WordKeyInfo SYNOPSIS, Next:WordKeyInfo
DESCRIPTION, Previous:WordKeyInfo NAME, Up:WordKeyInfo
WordKeyInfo SYNOPSIS
Helper for the WordKey class.
Node:WordKeyInfo DESCRIPTION,
Next:WordKeyInfo
CONFIGURATION, Previous:WordKeyInfo SYNOPSIS, Up:WordKeyInfo
WordKeyInfo DESCRIPTION
Describe the structure of the index key ( WordKey
). The description includes the layout of the packed version stored
on disk.
Node:WordKeyInfo CONFIGURATION,
Previous:WordKeyInfo DESCRIPTION,
Up:WordKeyInfo
WordKeyInfo CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_wordkey_description <desc> (no
default)
- Describe the structure of the inverted index key. In the
following explanation of the
<desc> format,
Word is a mandatory keyword, while name and bits are
placeholders for values that must be supplied.
Word bits/name bits [/...]
The name is an alphanumerical symbolic name for the
key field. The bits is the number of bits required to
store this field. Note that all values are stored in unsigned
integers (unsigned int). Example:
Word 8/Document 16/Location 8
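To make the format concrete, the sketch below splits such a description into named fields and derives the largest value each field can hold. It is not the mifluz parser: FieldInfo, MaxValueForBits and ParseKeyLayout are hypothetical names, and limiting a field to 32 bits is an assumption based on the unsigned int storage mentioned above.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct FieldInfo {
    std::string name;
    int bits;
    unsigned long long max_value;  // 2^bits - 1
};

unsigned long long MaxValueForBits(int bits) {
    assert(bits >= 1 && bits <= 32);
    return (1ULL << bits) - 1;
}

// Parse "Name bits/Name bits/..." into field descriptions.
// Returns false on malformed input or a bit count outside 1..32.
bool ParseKeyLayout(const std::string& desc, std::vector<FieldInfo>& fields) {
    fields.clear();
    std::istringstream in(desc);
    std::string part;
    while (std::getline(in, part, '/')) {
        std::istringstream field(part);
        FieldInfo info;
        if (!(field >> info.name >> info.bits)) return false;
        if (info.bits < 1 || info.bits > 32) return false;
        info.max_value = MaxValueForBits(info.bits);
        fields.push_back(info);
    }
    return !fields.empty();
}
```

For the example above this gives three fields, with a Document field able to hold values up to 65535.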
Node:WordType, Next:WordDBInfo, Previous:WordKeyInfo, Up:Reference
WordType
Node:WordType
NAME, Next:WordType
SYNOPSIS, Previous:WordType, Up:WordType
WordType NAME
defines a word in terms of allowed characters, length etc.
Node:WordType SYNOPSIS, Next:WordType DESCRIPTION,
Previous:WordType
NAME, Up:WordType
WordType SYNOPSIS
Only called thru WordContext::Initialize()
Node:WordType DESCRIPTION, Next:WordType
CONFIGURATION, Previous:WordType SYNOPSIS, Up:WordType
WordType DESCRIPTION
WordType defines an indexed word and operations to validate a
word to be indexed. All words inserted into the mifluz
index are normalized (see the Normalize method) before insertion. The
configuration options give some control over the definition of a
word.
Node:WordType CONFIGURATION, Next:WordType METHODS,
Previous:WordType
DESCRIPTION, Up:WordType
WordType CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_locale <locale> (default C)
- Set the locale of the program to locale . See
setlocale(3) for more information.
wordlist_allow_numbers {true|false}
(default false)
- A digit is considered a valid character within a word if this
configuration parameter is set to
true otherwise it is
an error to insert a word containing digits. See the
Normalize method for more information.
wordlist_minimum_word_length <number> (default
3)
- The minimum length of a word. See the
Normalize method for more information.
wordlist_maximum_word_length <number> (default
25)
- The maximum length of a word. See the
Normalize method for more information.
wordlist_truncate {true|false} (default
true)
- If a word is too long according to the
wordlist_maximum_word_length it is truncated if this
configuration parameter is true otherwise it is
considered an invalid word.
wordlist_lowercase {true|false} (default
true)
- If a word contains upper case letters it is converted to
lowercase if this configuration parameter is true, otherwise it is
left untouched.
wordlist_valid_punctuation [characters] (default
none)
- A list of punctuation characters that may appear in a word.
These characters will be removed from the word before insertion in
the index.
Node:WordType
METHODS, Previous:WordType CONFIGURATION, Up:WordType
WordType METHODS
int Normalize(String &s) const
- Normalize a word according to configuration specifications and
builtin transformations. Every word inserted in
the inverted index goes through this function. If a word is rejected
(return value has WORD_NORMALIZE_NOTOK bit set) it will not be
inserted in the index. If a word is accepted (return value has
WORD_NORMALIZE_OK bit set) it will be inserted in the index. In
addition to these two bits, informational values are stored that
give information on the processing done on the word. The bit field
values and their meanings are as follows:
WORD_NORMALIZE_TOOLONG
- the word length exceeds the value of the
wordlist_maximum_word_length configuration
parameter.
WORD_NORMALIZE_TOOSHORT
- the word length is smaller than the value of the
wordlist_minimum_word_length configuration
parameter.
WORD_NORMALIZE_CAPITAL
- the word contained capital letters and has been converted to
lowercase. This bit is only set if the
wordlist_lowercase configuration parameter is
true.
WORD_NORMALIZE_NUMBER
- the word contains digits and the configuration parameter
wordlist_allow_numbers is set to false.
WORD_NORMALIZE_CONTROL
- the word contains control characters.
WORD_NORMALIZE_BAD
- the word is listed in the file pointed to by the
wordlist_bad_word_list configuration parameter.
WORD_NORMALIZE_NULL
- the word is a zero length string.
WORD_NORMALIZE_PUNCTUATION
- at least one character listed in the
wordlist_valid_punctuation attribute was removed from
the word.
WORD_NORMALIZE_NOALPHA
- the word does not contain any alphanumerical character.
static String NormalizeStatus(int flags)
- Returns a string explaining the return flags of the Normalize
method.
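The way Normalize combines an accept or reject bit with informational bits can be sketched as follows. This is only an illustration: the flag values, the WordRules structure, the NormalizeSketch function and the order of the checks are all assumptions, and only a subset of the flags listed above is modelled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical flag values; the real constants are defined by mifluz.
enum {
    kOk = 1 << 0, kNotOk = 1 << 1, kTooLong = 1 << 2, kTooShort = 1 << 3,
    kCapital = 1 << 4, kNumber = 1 << 5, kNull = 1 << 6
};

struct WordRules {
    std::size_t min_length = 3;   // wordlist_minimum_word_length
    std::size_t max_length = 25;  // wordlist_maximum_word_length
    bool allow_numbers = false;   // wordlist_allow_numbers
    bool truncate = true;         // wordlist_truncate
    bool lowercase = true;        // wordlist_lowercase
};

// Normalize word in place and return a bit field describing what happened.
int NormalizeSketch(std::string& word, const WordRules& rules) {
    assert(rules.min_length <= rules.max_length);
    if (word.empty()) return kNull | kNotOk;
    int flags = 0;
    if (rules.lowercase) {
        for (char& c : word) {
            unsigned char u = static_cast<unsigned char>(c);
            if (std::isupper(u)) {
                c = static_cast<char>(std::tolower(u));
                flags |= kCapital;
            }
        }
    }
    if (word.size() > rules.max_length) {
        flags |= kTooLong;
        if (!rules.truncate) return flags | kNotOk;
        word.resize(rules.max_length);
    }
    if (word.size() < rules.min_length) return flags | kTooShort | kNotOk;
    if (!rules.allow_numbers) {
        for (char c : word) {
            if (std::isdigit(static_cast<unsigned char>(c))) return flags | kNumber | kNotOk;
        }
    }
    return flags | kOk;
}
```

A caller would first test the accept or reject bit and then inspect the informational bits, for instance to report that a word was truncated.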
Node:WordDBInfo, Next:WordRecordInfo, Previous:WordType, Up:Reference
WordDBInfo
Node:WordDBInfo
NAME, Next:WordDBInfo SYNOPSIS, Previous:WordDBInfo, Up:WordDBInfo
WordDBInfo NAME
inverted index usage environment.
Node:WordDBInfo SYNOPSIS, Next:WordDBInfo
DESCRIPTION, Previous:WordDBInfo NAME, Up:WordDBInfo
WordDBInfo SYNOPSIS
Only called thru WordContext::Initialize()
Node:WordDBInfo DESCRIPTION, Next:WordDBInfo
CONFIGURATION, Previous:WordDBInfo SYNOPSIS, Up:WordDBInfo
WordDBInfo DESCRIPTION
The inverted indexes may be shared among processes/threads and
provide the appropriate locking to prevent mistakes. In addition
the memory cache used by WordList objects may be
shared by processes/threads, greatly reducing the memory needs in
multi-process applications. For more information about the shared
environment, check the Berkeley DB documentation.
Node:WordDBInfo CONFIGURATION,
Previous:WordDBInfo DESCRIPTION, Up:WordDBInfo
WordDBInfo CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_env_skip {true,false} (default
false)
- If true no environment is created at all. This must never be
used if a
WordList object is created. It may be useful
if only WordKey objects are used, for instance.
wordlist_env_share {true,false} (default
false)
- If true a sharable environment is opened, or created if none
exists.
wordlist_env_dir <directory> (default
.)
- Only valid if
wordlist_env_share is set to
true. Specify the directory in which the sharable
environment will be created. All inverted indexes specified with a
non-absolute pathname will be created relative to this
directory.
Node:WordRecordInfo, Next:WordRecord, Previous:WordDBInfo, Up:Reference
WordRecordInfo
Node:WordRecordInfo NAME, Next:WordRecordInfo
SYNOPSIS, Previous:WordRecordInfo, Up:WordRecordInfo
WordRecordInfo NAME
information on the record structure of the inverted index.
Node:WordRecordInfo SYNOPSIS,
Next:WordRecordInfo
DESCRIPTION, Previous:WordRecordInfo NAME, Up:WordRecordInfo
WordRecordInfo SYNOPSIS
Only called thru WordContext::Initialize()
Node:WordRecordInfo DESCRIPTION,
Next:WordRecordInfo
CONFIGURATION, Previous:WordRecordInfo SYNOPSIS,
Up:WordRecordInfo
WordRecordInfo DESCRIPTION
The structure of a record is very limited. It can contain a
single integer value or a string.
Node:WordRecordInfo
CONFIGURATION, Previous:WordRecordInfo
DESCRIPTION, Up:WordRecordInfo
WordRecordInfo CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_wordrecord_description {NONE|DATA|STR} (no
default)
- NONE: the record is empty
DATA: the record contains an integer (unsigned int)
STR: the record contains a string (String)
Node:WordRecord, Next:WordReference, Previous:WordRecordInfo, Up:Reference
WordRecord
Node:WordRecord
NAME, Next:WordRecord SYNOPSIS, Previous:WordRecord, Up:WordRecord
WordRecord NAME
inverted index record.
Node:WordRecord SYNOPSIS, Next:WordRecord
DESCRIPTION, Previous:WordRecord NAME, Up:WordRecord
WordRecord SYNOPSIS
#include <WordRecord.h>
WordContext* context;
WordRecord* record = context->Record();
if(record->DefaultType() == WORD_RECORD_DATA) {
record->info.data = 120;
} else if(record->DefaultType() == WORD_RECORD_STR) {
record->info.str = "foobar";
}
delete record;
Node:WordRecord DESCRIPTION, Next:WordRecord ASCII
FORMAT, Previous:WordRecord SYNOPSIS, Up:WordRecord
WordRecord DESCRIPTION
The record can contain an integer, if the default record type
(see CONFIGURATION in WordRecordInfo ) is set to
DATA, or a string if set to STR. If the
type is set to NONE the record does not contain any
usable information.
Although constructors may be used, the preferred way to create a
WordRecord object is by using the
WordContext::Record method.
Node:WordRecord ASCII FORMAT,
Next:WordRecord
METHODS, Previous:WordRecord DESCRIPTION, Up:WordRecord
WordRecord ASCII FORMAT
If default type is DATA it is the decimal
representation of an integer. If default type is NONE
it is the empty string.
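The two ASCII forms described here can be sketched as a pair of conversion functions. The RecordType enumeration and both functions are hypothetical. The STR type is left out because its ASCII form is not described in this section.

```cpp
#include <cassert>
#include <string>

// Stand-ins for the NONE and DATA record types.
enum RecordType { kNone, kData };

// ASCII form of a record: decimal digits for DATA, empty string for NONE.
std::string RecordToAscii(RecordType type, unsigned int data) {
    if (type == kNone) return std::string();
    assert(type == kData);
    return std::to_string(data);
}

// Inverse conversion; returns false if the text does not fit the type.
bool RecordFromAscii(RecordType type, const std::string& text, unsigned int& data) {
    if (type == kNone) return text.empty();
    if (text.empty() || text.find_first_not_of("0123456789") != std::string::npos)
        return false;
    data = static_cast<unsigned int>(std::stoul(text));
    return true;
}
```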
Node:WordRecord METHODS, Previous:WordRecord ASCII
FORMAT, Up:WordRecord
WordRecord METHODS
inline WordRecord(WordContext* ncontext)
- Constructor. Build an empty record. The
ncontext argument must be a pointer to a valid
WordContext object.
inline void Clear()
- Reset to empty and set the type to the default specified in the
configuration.
inline int DefaultType()
- Return the default type WORD_RECORD_{DATA,STR,NONE}
inline int Pack(String& packed) const
- Convert the object to a representation for disk storage written
in the packed string. Return OK on success, NOTOK
otherwise.
inline int Unpack(const char* string, int
length)
- Alias for Unpack(String(string, length))
inline int Unpack(const String& packed)
- Read the object from a representation for disk storage
contained in the packed argument. Return OK on
success, NOTOK otherwise.
int Set(const String& bufferin)
- Set the whole structure from ASCII string description stored in
the bufferin argument. Return OK on success, NOTOK
otherwise.
int Get(String& bufferout) const
- Convert the whole structure to an ASCII string description and
return it in the bufferout argument. Return OK on
success, NOTOK otherwise.
String Get() const
- Convert the whole structure to an ASCII string description and
return it.
inline WordContext* GetContext()
- Return a pointer to the WordContext object used to create this
instance.
inline const WordContext* GetContext() const
- Return a pointer to the WordContext object used to create this
instance as a const.
int Write(FILE* f) const
- Print object in ASCII form on descriptor f
using the Get method.
Node:WordReference,
Next:WordCursor, Previous:WordRecord, Up:Reference
WordReference
Node:WordReference NAME, Next:WordReference
SYNOPSIS, Previous:WordReference, Up:WordReference
WordReference NAME
inverted index occurrence.
Node:WordReference SYNOPSIS, Next:WordReference
DESCRIPTION, Previous:WordReference NAME, Up:WordReference
WordReference SYNOPSIS
#include <WordReference.h>
WordContext* context;
WordReference* word = context->Word("word");
WordReference* word = context->Word();
WordReference* word = context->Word(WordKey("key 1 2"), WordRecord());
WordKey key = word->Key();
WordRecord record = word->Record();
word->Clear();
delete word;
Node:WordReference DESCRIPTION,
Next:WordReference ASCII
FORMAT, Previous:WordReference SYNOPSIS, Up:WordReference
WordReference DESCRIPTION
A WordReference object is an aggregate of a
WordKey object and a WordRecord
object.
Although constructors may be used, the preferred way to create a
WordReference object is by using the
WordContext::Word method.
Node:WordReference ASCII
FORMAT, Next:WordReference METHODS,
Previous:WordReference DESCRIPTION,
Up:WordReference
WordReference ASCII FORMAT
The ASCII description is a string with fields separated by tabs
or white space. It is made of the ASCII description of a
WordKey object immediately followed by the ASCII
description of a WordRecord object. See the
corresponding manual pages for more information.
Node:WordReference METHODS,
Previous:WordReference ASCII
FORMAT, Up:WordReference
WordReference METHODS
WordReference(WordContext* ncontext)
- Constructor. Build an object with empty key and empty record.
The ncontext argument must be a pointer to a valid
WordContext object.
WordReference(WordContext* ncontext, const String&
key0, const String& record0)
- Constructor. Build an object from disk representation of
key and record. The
ncontext argument must be a pointer to a valid
WordContext object.
WordReference(WordContext* ncontext, const String&
word)
- Constructor. Build an object with key word set to
word and otherwise empty and empty record. The
ncontext argument must be a pointer to a valid
WordContext object.
void Clear()
- Reset to empty key and record
inline WordContext* GetContext()
- Return a pointer to the WordContext object used to create this
instance.
inline const WordContext* GetContext() const
- Return a pointer to the WordContext object used to create this
instance as a const.
inline String& GetWord()
- Return the word data member.
inline const String& GetWord() const
- Return the word data member as a const.
inline void SetWord(const String& nword)
- Set the word data member from the
nword argument.
WordKey& Key()
- Return the key object.
const WordKey& Key() const
- Return the key object as const.
WordRecord& Record()
- Return the record object.
const WordRecord& Record() const
- Return the record object as const.
void Key(const WordKey& arg)
- Copy arg in the key part of the object.
int KeyUnpack(const String& packed)
- Set key structure from disk storage format as found in
packed string. Return OK if successful, NOTOK
otherwise.
String KeyPack() const
- Convert key object into disk storage format and return
the resulting string.
int KeyPack(String& packed) const
- Convert key object into disk storage format and
place the result in packed string. Return OK if
successful, NOTOK otherwise.
void Record(const WordRecord& arg)
- Copy arg in the record part of the
object.
int RecordUnpack(const String& packed)
- Set record structure from disk storage format as found in
packed string. Return OK if successful, NOTOK
otherwise.
String RecordPack() const
- Convert record object into disk storage format and
return the resulting string.
int RecordPack(String& packed) const
- Convert record object into disk storage format and
place the result in packed string. Return OK if
successful, NOTOK otherwise.
inline int Pack(String& ckey, String& crecord)
const
- Short hand for KeyPack( ckey ) RecordPack(
crecord ).
int Unpack(const String& ckey, const String&
crecord)
- Short hand for KeyUnpack( ckey ) RecordUnpack(
crecord ).
int Merge(const WordReference& other)
- Merge key with other.Key() using the
WordKey::Merge method: key.Merge(other.Key()). See the
corresponding manual page for details. Copy other.record into the
record part of the object.
static WordReference Merge(const WordReference&
master, const WordReference& slave)
- Copy master, merge slave into the copy
(copy.Merge(slave)) and return
the copy. Prevents alteration of master.
int Set(const String& bufferin)
- Set the whole structure from ASCII string in
bufferin. See
ASCII FORMAT section.
Return OK if successful, NOTOK otherwise.
int Get(String& bufferout) const
- Convert the whole structure to an ASCII string description in
bufferout. See
ASCII FORMAT section.
Return OK if successful, NOTOK otherwise.
String Get() const
- Convert the whole structure to an ASCII string description and
return it. See
ASCII FORMAT section.
int Write(FILE* f) const
- Print object in ASCII form on f (uses
Get method). See ASCII FORMAT
section.
void Print() const
- Print object in ASCII form on stdout (uses
Get method). See ASCII FORMAT
section.
Node:WordCursor, Next:WordCursorOne, Previous:WordReference, Up:Reference
WordCursor
Node:WordCursor
NAME, Next:WordCursor SYNOPSIS, Previous:WordCursor, Up:WordCursor
WordCursor NAME
abstract class to search and retrieve entries in a WordList
object.
Node:WordCursor SYNOPSIS, Next:WordCursor
DESCRIPTION, Previous:WordCursor NAME, Up:WordCursor
WordCursor SYNOPSIS
#include <WordList.h>
int callback(WordList *, WordDBCursor& , const WordReference *, Object &)
{
...
}
Object* data = ...
WordList *words = ...;
WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"), HTDIG_WORDLIST_COLLECTOR);
if(search->Walk() == NOTOK) bark;
List* results = search->GetResults();
WordCursor *search = words->Cursor(callback, data);
WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"));
WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"), callback, data);
WordCursor *search = words->Cursor(WordKey());
search->WalkInit();
if(search->WalkNext() == OK)
dosomething(search->GetFound());
search->WalkFinish();
Node:WordCursor DESCRIPTION, Next:WordCursor METHODS,
Previous:WordCursor
SYNOPSIS, Up:WordCursor
WordCursor DESCRIPTION
WordCursor is an iterator on an inverted index. It is created by
calling the Cursor method of a WordList object.
There is no other way to create a WordCursor object. When the
Walk* methods return, the WordCursor object contains
the result of the search and status information that indicates if
it reached the end of the list (IsAtEnd() method).
The callback function that is called each time
a match is found takes the following arguments:
WordList* words pointer to the inverted index handle.
WordDBCursor& cursor to call Del() and delete the current match
WordReference* wordRef is the match
Object& data is the user data provided by the caller when
search began.
The WordKey object that specifies the search
criterion may be used as follows (assuming word is followed by
DOCID and LOCATION):
Ex1: WordKey() walk the entire list of
occurrences.
Ex2: WordKey("word <UNDEF>
<UNDEF>") find all occurrences of word.
Ex3: WordKey("meet <UNDEF> 1") find all
occurrences of meet that occur at LOCATION 1 in any
DOCID. This can be inefficient since the search has to scan all
occurrences of meet to find the ones that occur at
LOCATION 1.
Ex4: WordKey("meet 2 <UNDEF>") find all
occurrences of meet that occur in DOCID 2, at any
location.
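The matching rule behind these examples can be sketched on a simplified key made of a word identifier, a DOCID and a LOCATION. The Field and Key types and the Matches function are hypothetical, and the word identifier 7 used in the comments is an arbitrary value chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Field {
    bool defined;
    unsigned int value;
};
typedef std::vector<Field> Key;  // word identifier, DOCID, LOCATION

// An entry matches when it agrees with every defined field of the
// search criterion; undefined fields act as wildcards.
bool Matches(const Key& search, const Key& entry) {
    assert(search.size() == entry.size());
    for (std::size_t i = 0; i < search.size(); ++i) {
        if (search[i].defined && search[i].value != entry[i].value) return false;
    }
    return true;
}

// Count matching entries by inspecting each one in turn. With a
// criterion such as Ex3 (identifier 7, any DOCID, LOCATION 1) every
// occurrence of the word has to be inspected this way.
std::size_t CountMatches(const std::vector<Key>& index, const Key& search) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        if (Matches(search, index[i])) ++count;
    }
    return count;
}
```

A criterion with no defined field, as in Ex1, matches every entry.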
WordCursor is an abstract class and cannot be instantiated. See
the WordCursorOne manual page for an actual implementation of a
WordCursor object.
Node:WordCursor METHODS, Previous:WordCursor
DESCRIPTION, Up:WordCursor
WordCursor METHODS
virtual void Clear() = 0
- Clear all data in object, set GetResults() data
to NULL but do not delete it (the application is responsible for
that).
virtual inline int IsA() const
- Returns the type of the object. May be overloaded by derived
classes to differentiate them at runtime. Returns
WORD_CURSOR.
virtual inline int Optimize()
- Optimize the cursor before starting a Walk. Returns OK on
success, NOTOK otherwise.
virtual int ContextSave(String& buffer) const =
0
- Save in buffer all the information necessary
to resume the walk at the point it left. The ASCII representation
of the last key found (GetFound()) is written in
buffer using the WordKey::Get method.
virtual int ContextRestore(const String& buffer) =
0
- Restore from buffer all the information necessary to resume the
walk at the point it left. The buffer is expected
to contain an ASCII representation of a WordKey (see WordKey::Set
method). A Seek is done on the key and the object
is prepared to jump to the next occurrence when
WalkNext is called (the cursor_get_flags is set to
DB_NEXT).
virtual int Walk() = 0
- Walk and collect data from the index. Returns OK on success,
NOTOK otherwise.
virtual int WalkInit() = 0
- Must be called before other Walk methods are used. Fill
internal state according to input parameters and move before the
first matching entry. Returns OK on success, NOTOK otherwise.
virtual int WalkRewind() = 0
- Move before the first index matching entry. Returns OK on
success, NOTOK otherwise.
virtual int WalkNext() = 0
- Move to the next matching entry. At end of list,
WORD_WALK_ATEND is returned. Returns OK on success, NOTOK
otherwise. When OK is returned, the GetFound() method returns the
matched entry. When WORD_WALK_ATEND is returned, the GetFound()
method returns an empty object if the end of the index was reached
or the match that was found and that is greater than the specified
search criterion.
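The calling sequence WalkInit, WalkNext, WalkFinish can be illustrated with a small in-memory stand-in. MockCursor and CollectAll are hypothetical and do not use mifluz or Berkeley DB. The numeric values chosen for the status codes are assumptions as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for OK, NOTOK and WORD_WALK_ATEND.
enum { kOk = 0, kNotOk = -1, kAtEnd = 1 };

// Hypothetical cursor over a list of integers that follows the
// WalkInit / WalkNext / WalkFinish protocol.
class MockCursor {
public:
    explicit MockCursor(const std::vector<int>& entries) : entries_(entries) {}
    int WalkInit() { next_ = 0; started_ = true; return kOk; }
    int WalkNext() {
        if (!started_) return kNotOk;  // WalkInit must be called first
        if (next_ >= entries_.size()) return kAtEnd;
        found_ = entries_[next_++];
        return kOk;
    }
    int WalkFinish() { started_ = false; return kOk; }
    // Only meaningful after a WalkNext call that returned kOk.
    int GetFound() const {
        assert(next_ > 0);
        return found_;
    }
private:
    std::vector<int> entries_;
    std::size_t next_ = 0;
    bool started_ = false;
    int found_ = 0;
};

// Visit every entry; returns kOk once the end is reached, kNotOk on error.
int CollectAll(MockCursor& cursor, std::vector<int>& out) {
    if (cursor.WalkInit() != kOk) return kNotOk;
    int ret;
    while ((ret = cursor.WalkNext()) == kOk) out.push_back(cursor.GetFound());
    cursor.WalkFinish();
    return ret == kAtEnd ? kOk : kNotOk;
}
```

The loop distinguishes the normal end of the walk from an error by checking which status ended it, which is the role WORD_WALK_ATEND plays in the description above.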
virtual int WalkNextStep() = 0
- Advance the cursor one step. The entry pointed to by the cursor
may or may not match the requirements. Returns OK if entry pointed
by cursor matches requirements. Returns NOTOK on failure. Returns
WORD_WALK_NOMATCH_FAILED if the current entry does not match
requirements; it is safe to call WalkNextStep again until either OK
or NOTOK is returned.
virtual int WalkNextExclude(const WordKey&
key)
- Return 0 if this key must not be returned by WalkNext as a
valid match. The WalkNextStep method calls this virtual method
immediately after jumping to the next entry in the database. This
may be used, for instance, to skip entries that were selected by a
previous search.
virtual int WalkFinish() = 0
- Terminate Walk, free allocated resources. Returns OK on
success, NOTOK otherwise.
virtual int Seek(const WordKey& patch) =
0
- Move before the inverted index position specified in
patch. May only be called after a successful call
to the
WalkNext or WalkNextStep method.
Copy defined fields from patch into a copy of the
found data member and initialize internal state so
that WalkNext jumps to this key next time it's called
(cursor_get_flag set to DB_SET_RANGE). Returns OK if successful,
NOTOK otherwise.
virtual inline int IsAtEnd() const
- Returns true if cursor is positioned after the last possible
match, false otherwise.
virtual inline int IsNoMatch() const
- Returns true if cursor hit a value that does not match search
criterion.
inline WordKey& GetSearch()
- Returns the search criterion.
inline int GetAction() const
- Returns the type of action when a matching entry is
found.
inline List *GetResults()
- Returns the list of WordReference found. The application is
responsible for deallocation of the list. If the
action input flag bit HTDIG_WORDLIST_COLLECTOR is
not set, return a NULL pointer.
inline List *GetTraces()
- For debugging purposes. Returns the list of WordReference hit
during the search process. Some of them match the searched key,
some don't. The application is responsible for deallocation of the
list.
inline void SetTraces(List* traceRes_arg)
- For debugging purposes. Set the list of WordReference hit
during the search process.
inline const WordReference& GetFound()
- Returns the last entry hit by the search. Only contains a valid
value if the last
WalkNext or
WalkNextStep call was successful (i.e. returned
OK).
inline int GetStatus() const
- Returns the status of the cursor which may be OK or
WORD_WALK_ATEND.
virtual int Get(String& bufferout) const =
0
- Convert the whole structure to an ASCII string description.
Returns OK if successful, NOTOK otherwise.
inline String Get() const
- Convert the whole structure to an ASCII string description and
return it.
virtual int Initialize(WordList *nwords, const WordKey
&nsearchKey, wordlist_walk_callback_t ncallback, Object *
ncallback_data, int naction) = 0
- Protected method. Derived classes should use this function to
initialize the object if they do not call a WordCursor constructor
in their own constructor. Initialization may occur after the
object is created and must occur before a Walk*
method is called. See the DESCRIPTION section for the semantics of
the arguments. Return OK on success, NOTOK on error.
WordKey searchKey
- Input data. The key to be searched, see DESCRIPTION for more
information.
WordReference found
- Output data. Last match found. Use GetFound() to retrieve
it.
int status
- Output data. WORD_WALK_ATEND if cursor is past last match, OK
otherwise. Use GetStatus() to retrieve it.
WordList *words
- The inverted index used by this cursor.
Node:WordCursorOne,
Next:WordMonitor, Previous:WordCursor, Up:Reference
WordCursorOne
Node:WordCursorOne NAME, Next:WordCursorOne
SYNOPSIS, Previous:WordCursorOne, Up:WordCursorOne
WordCursorOne NAME
search and retrieve entries in a WordListOne object.
Node:WordCursorOne SYNOPSIS, Next:WordCursorOne
DESCRIPTION, Previous:WordCursorOne NAME, Up:WordCursorOne
WordCursorOne SYNOPSIS
#include <WordList.h>
int callback(WordList *, WordDBCursor& , const WordReference *, Object &)
{
...
}
Object* data = ...
WordList *words = ...;
WordCursor *search = words->Cursor(callback, data);
WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"));
WordCursor *search = words->Cursor(WordKey("word <UNDEF> <UNDEF>"), callback, data);
WordCursor *search = words->Cursor(WordKey());
...
if(search->Walk() == NOTOK) bark;
List* results = search->GetResults();
search->WalkInit();
if(search->WalkNext() == OK)
dosomething(search->GetFound());
search->WalkFinish();
Node:WordCursorOne DESCRIPTION,
Next:WordCursorOne
METHODS, Previous:WordCursorOne SYNOPSIS, Up:WordCursorOne
WordCursorOne DESCRIPTION
WordCursorOne is a WordCursor derived class that implements
search in a WordListOne object. It currently is the only derived
class of the WordCursor object. Most of its behaviour is described
in the WordCursor manual page, only the behaviour specific to
WordCursorOne is documented here.
Node:WordCursorOne METHODS,
Previous:WordCursorOne DESCRIPTION,
Up:WordCursorOne
WordCursorOne METHODS
WordCursorOne(WordList *words)
- Private constructor. Creator of the object must then call
Initialize() prior to using any other methods.
WordCursorOne(WordList *words, wordlist_walk_callback_t
callback, Object * callback_data)
- Private constructor. See WordList::Cursor method with same
prototype for description.
WordCursorOne(WordList *words, const WordKey
&searchKey, int action = HTDIG_WORDLIST_WALKER)
- Private constructor. See WordList::Cursor method with same
prototype for description.
WordCursorOne(WordList *words, const WordKey
&searchKey, wordlist_walk_callback_t callback, Object *
callback_data)
- Private constructor. See WordList::Cursor method with same
prototype for description.
Node:WordMonitor,
Next:Configuration,
Previous:WordCursorOne,
Up:Reference
WordMonitor
Node:WordMonitor NAME, Next:WordMonitor SYNOPSIS,
Previous:WordMonitor,
Up:WordMonitor
WordMonitor NAME
monitoring classes activity.
Node:WordMonitor SYNOPSIS, Next:WordMonitor
DESCRIPTION, Previous:WordMonitor NAME, Up:WordMonitor
WordMonitor SYNOPSIS
Only called through WordContext::Initialize()
Node:WordMonitor DESCRIPTION,
Next:WordMonitor
CONFIGURATION, Previous:WordMonitor SYNOPSIS, Up:WordMonitor
WordMonitor DESCRIPTION
The test directory contains a benchmark-report
script used to generate and archive graphs from the output of
WordMonitor .
Node:WordMonitor CONFIGURATION,
Previous:WordMonitor DESCRIPTION,
Up:WordMonitor
WordMonitor CONFIGURATION
For more information on the configuration attributes and a
complete list of attributes, see the mifluz(3) manual page.
wordlist_monitor_period <sec> (default
0)
- If the value sec is a positive integer, set a
timer to print reports every sec seconds. The
timer is set using the ALRM signal and will fail if the calling
application already has a handler on that signal.
wordlist_monitor_output <file>[,{rrd,readable}]
(default stderr)
- Print reports on file instead of the default
stderr . If type is set to
rrd the output is fit for the
benchmark-report script. Otherwise it is a (hardly :-)
readable string.
Node:Configuration,
Next:mifluz, Previous:WordMonitor, Up:Reference
Configuration
Node:Configuration NAME, Next:Configuration
SYNOPSIS, Previous:Configuration, Up:Configuration
Configuration NAME
reads the configuration file and manages it in memory.
Node:Configuration SYNOPSIS, Next:Configuration
DESCRIPTION, Previous:Configuration NAME, Up:Configuration
Configuration SYNOPSIS
#include <Configuration.h>
Configuration config;
ConfigDefaults config_defaults[] = {
{ "verbose", "true" },
{ 0, 0 }
};
config.Defaults(config_defaults);
config.Read("/spare2/myconfig") ;
config.Add("sync", "false");
if(config["sync"]) ...
if(config.Value("rate") < 50) ...
if(config.Boolean("sync")) ...
Node:Configuration DESCRIPTION,
Next:Configuration FILE
FORMAT, Previous:Configuration SYNOPSIS, Up:Configuration
Configuration DESCRIPTION
The primary purpose of the Configuration class
is to parse a configuration file and allow the application to
modify the internal data structure produced. All values are strings
and are converted by the appropriate accessors. For instance the
Boolean method will return numerical true (not
zero) if the string either contains a number that is different from
zero or the string true .
The ConfigDefaults type is a structure of two char
pointers: the name of the configuration attribute and its value.
The end of the array is the first entry that contains a null
pointer instead of the attribute name. Numerical values must be in
strings. For instance:
ConfigDefaults config_defaults[] = {
{ "wordlist_compress", "true" },
{ "wordlist_page_size", "8192" },
{ 0, 0 }
};
The additional fields of the ConfigDefaults structure are
purely informative.
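The null-terminated array convention described above can be illustrated with a small standalone sketch. It does not use the mifluz headers: the DefaultEntry structure and the load_defaults function are hypothetical stand-ins for the two documented fields and for the way an array of defaults could be walked.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the two documented fields of a defaults entry.
struct DefaultEntry {
    const char* name;   // attribute name; a null pointer ends the array
    const char* value;  // attribute value, always given as a string
};

// Copy every entry into a map, stopping at the first null name.
std::map<std::string, std::string> load_defaults(const DefaultEntry* array) {
    std::map<std::string, std::string> result;
    for (const DefaultEntry* entry = array; entry->name != nullptr; ++entry) {
        // Assumption: a null value is stored as an empty string.
        result[entry->name] = entry->value ? entry->value : "";
    }
    return result;
}
```

The terminating { 0, 0 } entry is what stops the loop, which is why it must never be omitted from a defaults array.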
Node:Configuration FILE FORMAT,
Next:Configuration
METHODS, Previous:Configuration DESCRIPTION,
Up:Configuration
Configuration FILE FORMAT
The configuration file is a plain ASCII text file. Each line in
the file is either a comment or an attribute. Comment lines are
blank lines or lines that start with a '#'. Attributes consist of a
variable name and an associated value:
<name>:<whitespace><value><newline>
The <name> contains any alphanumeric character or
underscore (_). The <value> can include any character except
newline. It also cannot start with spaces or tabs since those are
considered part of the whitespace after the colon. It is important
to keep in mind that any trailing spaces or tabs will be
included.
It is possible to split the <value> across several lines
of the configuration file by ending each line with a backslash (\).
The effect on the value is that a space is added where the line
split occurs.
A configuration file can include another file, by using the
special <name>, include . The <value> is
taken as the file name of another configuration file to be read in
at this point. If the given file name is not fully qualified, it is
taken relative to the directory in which the current configuration
file is found. Variable expansion is permitted in the file name.
Multiple include statements, and nested includes are also
permitted.
include: common.conf
Node:Configuration METHODS,
Previous:Configuration FILE
FORMAT, Up:Configuration
Configuration METHODS
Configuration()
- Constructor
~Configuration()
- Destructor
void Add(const String& str)
- Add configuration item str to the
configuration. The value associated with it is undefined.
void Add(const String& name, const String&
value)
- Add configuration item name to the
configuration and associate it with value .
int Remove(const String& name)
- Remove the name from the configuration.
void NameValueSeparators(const String& s)
- Let the Configuration know how to parse name value pairs. Each
character of string s is a valid separator between
the
name and the value.
virtual int Read(const String& filename)
- Read name/value configuration pairs from the file
filename .
const String Find(const String& name)
const
- Return the value of configuration attribute
name as a
String .
const String operator[](const String& name)
const
- Alias to the Find method.
int Value(const String& name, int default_value = 0)
const
- Return the value associated with the configuration attribute
name , converted to integer using the atoi(3)
function. If the attribute is not found in the configuration and a
default_value is provided, return it.
double Double(const String& name, double
default_value = 0) const
- Return the value associated with the configuration attribute
name , converted to double using the atof(3)
function. If the attribute is not found in the configuration and a
default_value is provided, return it.
int Boolean(const String& name, int default_value =
0) const
- Return 1 if the value associated to name is
either 1, yes or true . Return 0
if the value associated to name is either
0, no or false .
void Defaults(const ConfigDefaults *array)
- Load configuration attributes from the
name and
value members of the array
argument.
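Since all values are stored as strings, the Value and Boolean accessors amount to a lookup followed by a conversion. The sketch below mimics that behaviour on a plain std::map so that it can be compiled without mifluz. The names value_of and boolean_of are hypothetical. Returning the default for an unrecognized boolean string and comparing case-sensitively are assumptions, since the text above does not specify either.

```cpp
#include <cstdlib>
#include <map>
#include <string>

using Attributes = std::map<std::string, std::string>;

// Hypothetical equivalent of Value(): atoi(3) conversion with a default.
int value_of(const Attributes& attrs, const std::string& name,
             int default_value = 0) {
    auto it = attrs.find(name);
    if (it == attrs.end()) return default_value;
    return std::atoi(it->second.c_str());
}

// Hypothetical equivalent of Boolean(): 1/yes/true and 0/no/false.
int boolean_of(const Attributes& attrs, const std::string& name,
               int default_value = 0) {
    auto it = attrs.find(name);
    if (it == attrs.end()) return default_value;
    const std::string& v = it->second;
    if (v == "1" || v == "yes" || v == "true") return 1;
    if (v == "0" || v == "no" || v == "false") return 0;
    return default_value;  // assumption: unknown strings use the default
}
```

Because atoi(3) returns 0 for a string that is not a number, a misspelled numeric value cannot be told apart from a real 0 with this kind of accessor.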
Node:mifluz, Previous:Configuration, Up:Reference
mifluz
Node:mifluz NAME,
Next:mifluz SYNOPSIS,
Previous:mifluz, Up:mifluz
mifluz NAME
C++ library to use and manage inverted indexes
Node:mifluz
SYNOPSIS, Next:mifluz DESCRIPTION, Previous:mifluz NAME, Up:mifluz
mifluz SYNOPSIS
#include <mifluz.h>
int main()
{
Configuration* config = WordContext::Initialize();
WordList* words = new WordList(*config);
...
delete words;
WordContext::Finish();
}
Node:mifluz DESCRIPTION, Next:mifluz CLASSES
AND COMMANDS, Previous:mifluz SYNOPSIS, Up:mifluz
mifluz DESCRIPTION
The purpose of mifluz is to provide a C++ library
to build and query a full text inverted index. It is dynamically
updatable, scalable (up to 1Tb indexes), uses a controlled amount
of memory, shares index files and memory cache among processes or
threads and compresses index files to 50% of the raw data. The
structure of the index is configurable at runtime and allows
inclusion of relevance ranking information. The query functions do
not require loading all the occurrences of a searched term. They
consume very few resources and many searches can be run in
parallel.
The file management library used in mifluz is a modified
Berkeley DB (www.sleepycat.com) version 3.1.14.
Node:mifluz CLASSES AND
COMMANDS, Next:mifluz CONFIGURATION,
Previous:mifluz
DESCRIPTION, Up:mifluz
mifluz CLASSES AND COMMANDS
Configuration
-
reads the configuration file and manages it in memory.
WordContext
-
read configuration and setup mifluz context.
WordCursor
-
abstract class to search and retrieve entries in a WordList
object.
WordCursorOne
-
search and retrieve entries in a WordListOne object.
WordDBInfo
- inverted index usage environment.
WordDict
-
manage and use an inverted index dictionary.
WordKey
- inverted index key.
WordKeyInfo
- information on the key structure of the inverted index.
WordList
-
abstract class to manage and use an inverted index file.
WordListOne
-
manage and use an inverted index file.
WordMonitor
- monitoring classes activity.
WordRecord
- inverted index record.
WordRecordInfo
- information on the record structure of the inverted
index.
WordReference
- inverted index occurrence.
WordType
- defines a word in terms of allowed characters, length etc.
htdb_dump
-
dump the content of an inverted index in Berkeley DB
fashion
htdb_load
-
load the content of an inverted index in Berkeley DB fashion
htdb_stat
-
displays statistics for Berkeley DB environments.
mifluzdict
-
dump the dictionary of an inverted index.
mifluzdump
-
dump the content of an inverted index.
mifluzload
-
load the content of an inverted index.
mifluzsearch
- search the content of an inverted index.
Node:mifluz CONFIGURATION, Next:mifluz ENVIRONMENT,
Previous:mifluz CLASSES AND
COMMANDS, Up:mifluz
mifluz CONFIGURATION
The format of the configuration file read by
WordContext::Initialize is:
keyword: value
Comments may be added on lines starting with a #. The default
configuration file is read from the file pointed to by the
MIFLUZ_CONFIG environment variable or
~/.mifluz or /etc/mifluz.conf in
this order. If no configuration file is available, builtin defaults
are used. Here is an example configuration file:
wordlist_extend: true
wordlist_cache_size: 10485760
wordlist_page_size: 32768
wordlist_compress: 1
wordlist_wordrecord_description: NONE
wordlist_wordkey_description: Word/DocID 32/Flags 8/Location 16
wordlist_monitor: true
wordlist_monitor_period: 30
wordlist_monitor_output: monitor.out,rrd
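The lookup order of the configuration file described above can be sketched as a function that lists the candidate paths in the order they would be tried. This is an illustration only: config_candidates is a hypothetical name, and resolving ~ through the HOME environment variable is an assumption.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: candidate configuration files, first match wins.
std::vector<std::string> config_candidates() {
    std::vector<std::string> paths;
    // 1. The file named by the MIFLUZ_CONFIG environment variable.
    if (const char* explicit_path = std::getenv("MIFLUZ_CONFIG")) {
        paths.push_back(explicit_path);
    }
    // 2. ~/.mifluz (assumption: ~ is taken from HOME).
    if (const char* home = std::getenv("HOME")) {
        paths.push_back(std::string(home) + "/.mifluz");
    }
    // 3. The system wide file.
    paths.push_back("/etc/mifluz.conf");
    return paths;
}
```

An application would try to open each path in turn and fall back to the builtin defaults if none of them can be read.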
wordlist_allow_numbers {true|false} <number>
(default false)
- A digit is considered a valid character within a word if this
configuration parameter is set to
true otherwise it is
an error to insert a word containing digits. See the
Normalize method for more information.
wordlist_cache_inserts {true|false} (default
false)
- If true all Insert calls are cached in memory.
When the WordList object is closed or a different access method is
called the cached entries are flushed in the inverted index.
wordlist_cache_max <bytes> (default 0)
- Maximum size of the cumulated cache files generated when doing
bulk insertion with the BatchStart() function.
When this limit is reached, the cache files are all merged into the
inverted index. The value 0 means infinite size allowed. See
WordList(3) for the rationale behind cache file handling.
wordlist_cache_size <bytes> (default
500K)
- Berkeley DB cache size (see Berkeley DB documentation). Cache
makes a huge difference in performance. It must be at least 2% of
the expected total data size. Note that if compression is activated
the data size is eight times larger than the actual file size. In
this case the cache must be scaled to 2% of the data size, not 2%
of the file size. See Cache tuning in the mifluz
guide for more hints. See WordList(3) for the rationale behind
cache file handling.
wordlist_compress {true|false} (default
false)
- Activate compression of the index. The resulting index is eight
times smaller than the uncompressed index.
wordlist_env_dir <directory> (default
.)
- Only valid if
wordlist_env_share set to
true. Specify the directory in which the sharable
environment will be created. All inverted indexes specified with a
non-absolute pathname will be created relative to this
directory.
wordlist_env_share {true|false} (default
false)
- If true a sharable environment is opened, or created if none
exists.
wordlist_env_skip {true|false} (default
false)
- If true no environment is created at all. This must never be
used if a
WordList object is created. It may be useful
if only WordKey objects are used, for instance.
wordlist_extend {true|false} (default false)
- If true maintain reference count of unique
words. The Noccurrence method gives access to this
count.
wordlist_locale <locale> (default C)
- Set the locale of the program to locale . See
setlocale(3) for more information.
wordlist_lowercase {true|false} <number> (default
true)
- If a word contains upper case letters it is converted to
lowercase if this configuration parameter is true, otherwise it is
left untouched.
wordlist_maximum_word_length <number> (default
25)
- The maximum length of a word. See the
Normalize method for more information.
wordlist_minimum_word_length <number> (default
3)
- The minimum length of a word. See the
Normalize method for more information.
wordlist_monitor {true|false} (default false)
- If true create a
WordMonitor instance to gather
statistics and build reports.
wordlist_monitor_output <file>[,{rrd,readable}]
(default stderr)
- Print reports on file instead of the default
stderr . If type is set to
rrd the output is fit for the
benchmark-report script. Otherwise it is a (hardly :-)
readable string.
wordlist_monitor_period <sec> (default
0)
- If the value sec is a positive integer, set a
timer to print reports every sec seconds. The
timer is set using the ALRM signal and will fail if the calling
application already has a handler on that signal.
wordlist_page_size <bytes> (default
8192)
- Berkeley DB page size (see Berkeley DB documentation)
wordlist_truncate {true|false} <number> (default
true)
- If a word is too long according to the
wordlist_maximum_word_length it is truncated if this
configuration parameter is true otherwise it is
considered an invalid word.
wordlist_valid_punctuation [characters] (default
none)
- A list of punctuation characters that may appear in a word.
These characters will be removed from the word before insertion in
the index.
wordlist_verbose <number> (default 0)
- Set the verbosity level of the WordList class.
1 walk logic
2 walk logic details
3 walk logic lots of details
wordlist_wordkey_description <desc> (no
default)
- Describe the structure of the inverted index key. In the
following explanation of the
<desc> format,
mandatory words are in bold and values that must be replaced in
italic.
Word bits/name bits [/...]
The name is an alphanumerical symbolic name for the
key field. The bits is the number of bits required to
store this field. Note that all values are stored in unsigned
integers (unsigned int). Example:
Word 8/Document 16/Location 8
wordlist_wordkey_document [field ...] (default
none)
- A white space separated list of field numbers that define a
document. The field number list must not contain gaps. For instance
1 2 3 is valid but 1 3 4 is not valid. This configuration parameter
is not used by the mifluz library but may be used by a query
application to define the semantics of a document. In response to a
query, the application will return a list of results in which only
distinct documents will be shown.
wordlist_wordkey_location field (default
none)
- A single field number that contains the position of a word in a
given document. This configuration parameter is not used by the
mifluz library but may be used by a query application.
wordlist_wordrecord_description {NONE|DATA|STR} (no
default)
- NONE: the record is empty
DATA: the record contains an integer (unsigned int)
STR: the record contains a string (String)
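To make the wordlist_wordkey_description format more concrete, the following standalone sketch splits a description such as Word 8/Document 16/Location 8 into its fields. It is not the parser used by WordKeyInfo: the KeyField structure and both function names are hypothetical, and recording 0 bits for a field given without a bit count (as Word is in the example configuration file) is an assumption.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation of one key field.
struct KeyField {
    std::string name;  // symbolic name of the field
    int bits;          // number of bits, 0 if none was given
};

// Split "name bits/name bits/..." into a list of fields.
std::vector<KeyField> parse_key_description(const std::string& desc) {
    std::vector<KeyField> fields;
    std::istringstream parts(desc);
    std::string part;
    while (std::getline(parts, part, '/')) {
        std::istringstream tokens(part);
        KeyField field{"", 0};
        if (!(tokens >> field.name)) continue;  // skip empty pieces
        if (!(tokens >> field.bits)) field.bits = 0;
        fields.push_back(field);
    }
    return fields;
}

// Sum of the bit counts of all fields.
int total_bits(const std::vector<KeyField>& fields) {
    int total = 0;
    for (const KeyField& field : fields) total += field.bits;
    return total;
}
```

With such a list, the field numbers used by wordlist_wordkey_document and wordlist_wordkey_location can be read as positions in the parsed description.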
Node:mifluz ENVIRONMENT, Previous:mifluz
CONFIGURATION, Up:mifluz
mifluz ENVIRONMENT
MIFLUZ_CONFIG file name of configuration file
read by WordContext(3). Defaults to ~/.mifluz or
/usr/etc/mifluz.conf
Node:Concept
Index, Previous:Reference, Up:Top
Index of Concepts