5.2.3 Old Database Format

The old database format is used by Unix locate and find programs and earlier releases of the GNU ones. updatedb produces this format if given the --old-format option.

updatedb runs programs called bigram and code to produce old-format databases. The old format differs from the new one in the following ways. Instead of each entry starting with an offset-differential count byte and ending with a null, byte values from 0 through 28 indicate offset-differential counts from -14 through 14. The byte value indicating that a long offset-differential count follows is 0x1e (30), not 0x80. Long counts are stored in host byte order, which is not necessarily network byte order, and in the host integer word size, which is usually 4 bytes. Like the short counts, they represent a count 14 less than their stored value. Database entries have no termination byte; the start of the next entry is indicated by its first byte having a value <= 30.
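The count encoding described above can be illustrated with a minimal Python sketch. It is not the code used by locate. The function name read_count and its parameters are hypothetical, the long count is assumed to be a signed integer, and byte values the text does not define as counts (such as 29) are simply rejected here.

```python
import struct

LONG_COUNT_MARKER = 30  # 0x1e: a long offset-differential count follows
COUNT_BIAS = 14         # stored value minus 14 gives the actual count


def read_count(data, pos, word_size=4, byte_order="="):
    """Return (count, next_pos) for the count that starts at data[pos].

    byte_order "=" means the byte order of the machine running this
    sketch; pass "<" or ">" to read a database written on another host.
    """
    first = data[pos]
    if first == LONG_COUNT_MARKER:
        # Assumption: the long count is a signed integer of word_size bytes.
        fmt = byte_order + {2: "h", 4: "i", 8: "q"}[word_size]
        (value,) = struct.unpack_from(fmt, data, pos + 1)
        return value - COUNT_BIAS, pos + 1 + word_size
    if first <= 28:
        # Short form: byte values 0..28 map to counts -14..14.
        return first - COUNT_BIAS, pos + 1
    raise ValueError("byte %d does not start a count" % first)
```

For example, a single byte with value 0 decodes to a count of -14, and the marker byte 30 followed by a 4-byte word holding 114 decodes to a count of 100.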

In addition, instead of starting with a dummy entry, the old database format starts with a 256-byte table containing the 128 most common bigrams in the file list. A bigram is a pair of adjacent bytes. Bytes in the database that have the high bit set are indexes (with the high bit cleared) into the bigram table. The bigram and offset-differential count coding makes these databases 20-25% smaller than the new format, but makes them not 8-bit clean. Any byte in a file name that is in the ranges used for the special codes is replaced in the database by a question mark, which, not coincidentally, is the shell wildcard to match a single character.
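The bigram coding can be sketched in Python as follows. This is an illustration, not the implementation in locate or code. The names expand_bigrams and sanitize are hypothetical, the table is assumed to hold its 128 two-byte pairs consecutively, and the byte ranges treated as special (values up to 30, and values with the high bit set) are one reading of the description above.

```python
def expand_bigrams(entry_bytes, bigram_table):
    """Expand high-bit bytes of an entry using the 256-byte bigram table."""
    out = bytearray()
    for b in entry_bytes:
        if b & 0x80:
            # Clear the high bit to get the bigram index (0..127).
            index = b & 0x7F
            # Assumption: pair i occupies table bytes 2*i and 2*i + 1.
            out += bigram_table[2 * index:2 * index + 2]
        else:
            out.append(b)
    return bytes(out)


def sanitize(name):
    """Replace bytes that collide with the special codes by '?'."""
    # Assumption: the special ranges are 0..30 and 128..255.
    return bytes(0x3F if (b <= 30 or b >= 0x80) else b for b in name)
```

With a table whose first two pairs are "/u" and "sr", the bytes 0x80 0x81 followed by "/" expand to "/usr/". The sanitize step shows why the format is not 8-bit clean: every byte of a UTF-8 encoded non-ASCII character has the high bit set, so each one is stored as a question mark.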

The old format therefore cannot faithfully store entries with non-ASCII characters, and should not be used in internationalized environments.

For old-format databases, the output of locate --statistics gives an incorrect count of the number of file names containing newlines or high-bit characters.