The Quest for a Perfect Dictionary Format
The Perfect Hash
by John Kormylo
E.L.F. Software
I was never entirely happy with the minimum RAM dictionary
format. I would have preferred one made entirely with hash
tables, but the results were rather large compared to the source
text. However, I eventually hit on a viable solution.
The problem with a pure hash solution is that the hash table
takes up to 12 bytes/node (using longs), and almost every
character at the end of words requires a new node.
The key to the solution is a "suffix" hash. If you remove from
the start of every word all the characters it shares with any
other word (plus one more), you are left with what can be
described as "suffix" strings. For example, "quackery" and
"quacking" share the prefix "quack"; removing it plus one more
character leaves "ry" and "ng". If you reverse the strings
("yr" and "gn") and hash them, you have a "suffix" hash.
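The stripping step can be sketched as follows (a minimal sketch;
the helper names are mine, not the author's):

```c
#include <string.h>

/* Length of the common prefix of two words. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] && a[n] == b[n])
        n++;
    return n;
}

/* Copy the reversed suffix of `word` into `out`: everything past
 * the shared prefix plus one more character, back to front. */
static char *reversed_suffix(const char *word, size_t shared, char *out)
{
    size_t len = strlen(word);
    size_t start = shared + 1;        /* drop prefix plus one char */
    size_t i, j = 0;
    for (i = len; i > start; i--)     /* copy back to front */
        out[j++] = word[i - 1];
    out[j] = '\0';
    return out;
}
```

For "quackery" and "quacking" the shared prefix is "quack" (5
characters), so the reversed suffixes come out as "yr" and "gn".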
Applying this to the 1,628,241-byte source file used previously
produced 213,573 bytes of suffix text (not counting NULL
terminators), which was hashed using 17,078 nodes.
The second clever part was implementing the suffix hash using
only 2 bytes/node. One byte holds the character and the other
holds an unsigned relative parent index (0 means end of word).
This means that every child node must be within 255 array
locations of its parent. As it happens, this occurs most of the
time, but occasionally some nodes will have to appear more than
once to handle all of their children.
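The 2-byte node might look like this (field names are my own
guesses based on the description, not the author's code):

```c
#include <stdint.h>

/* One suffix-hash node packed into 2 bytes: one byte for the
 * character, one byte for an unsigned relative parent index.
 * A parent value of 0 marks the end of a word, and every child
 * must sit within 255 array locations of its parent. */
typedef struct {
    uint8_t ch;      /* the character stored at this node */
    uint8_t parent;  /* relative index back to the parent */
} SuffixNode;

/* Walking toward the word end: the parent of node[i] is
 * node[i - node[i].parent] whenever node[i].parent != 0. */
```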
Nodes without parents can go anywhere, so you always start with
one of them. Then always place the child node (of a parent
within 255 array locations) with the fewest descendants
(children, grandchildren, etc.). That way, if you have to
restart, you will have eliminated as many nodes as possible
beforehand.
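The descendant count that drives this heuristic is a simple
recursive tally. Here is a sketch over an illustrative child-list
tree (this in-memory layout is mine, not the author's):

```c
#define MAXCHILD 8   /* hypothetical fan-out for the sketch */

typedef struct TrieNode {
    struct TrieNode *child[MAXCHILD];
    int nchild;
} TrieNode;

/* Children, grandchildren, etc. of n; n itself is not counted.
 * The node with the smallest tally is placed first. */
static int descendants(const TrieNode *n)
{
    int total = 0;
    for (int i = 0; i < n->nchild; i++)
        total += 1 + descendants(n->child[i]);
    return total;
}
```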
In the above example, the hash table was implemented using an
array of 17,121 nodes, restarting 34 times with a total of 43
duplicate nodes (0.25% waste). More to the point, it represented
213,573 bytes of text using only 34,242 bytes of hash, or 84%
compression.
Hashing the rest of the text was done using two different types
of hash tables, one using longs and the other using shorts. The
characters were mapped to values between 1 and N (no
optimization, just bypassing some punctuation).
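A mapping like the one described might look like this; the exact
table (which punctuation is bypassed, whether case is folded) is
my assumption, not the author's:

```c
/* Map a character to 1..N; 0 means the character is bypassed.
 * This particular table is a guess at "no optimization, just
 * bypassing some punctuation". */
static int map_char(int c)
{
    if (c >= 'a' && c <= 'z') return c - 'a' + 1;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 1;
    if (c == '\'') return 27;   /* keep apostrophes, e.g. "don't" */
    return 0;                   /* bypassed (punctuation etc.)    */
}
```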
Each node of the "long hash" is implemented as an array of longs.
The first entry holds the count and the remaining N entries hold
indexes for the next node for the corresponding mapped characters
(1...N). For a table consisting of M nodes, values above M are
indexes into the "short hash" array (the stored value minus M).
The long hash table is used until the count drops below 32K (and
the number of nodes needed is less than 64K).
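The long-hash node and the index convention can be sketched as a
struct plus a decoder; N here is a hypothetical alphabet size and
the names are mine:

```c
#define N 26   /* hypothetical mapped-alphabet size */

/* One "long hash" node: a count followed by N next-node indexes,
 * all longs, indexed by mapped character values 1..N. */
typedef struct {
    long count;     /* words passing through this node          */
    long next[N];   /* next[c-1] for mapped character c         */
} LongNode;

/* Values at or below M stay in the long hash; larger values are
 * (value - M) in the "short hash" array.  Returns -1 when the
 * value is still a long-hash index. */
static long short_hash_index(long value, long M)
{
    return value > M ? value - M : -1;
}
```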
As it turned out, only one long node was needed. Fortunately,
there is no real cost for using the more general formulation.
The "short hash" is a packed hash using unsigned shorts for the
parent and base, and a signed short for the count. These indexes
are relative to the starting entry (whose absolute index comes
from the "long hash"). Each hash table is limited to 64K nodes,
but the total number of nodes can (and will) be much larger. The
sign bit of the count is used to indicate that this is a valid
word termination (terminating NULLs are not hashed). A count of
1 (or 0x8002) indicates that the base value is an index into the
suffix hash.
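The packed short-hash node and its sign-bit convention might be
declared like this (field and helper names are assumptions based
on the description; the exact count encoding is the author's):

```c
#include <stdint.h>

/* One "short hash" node: 6 bytes.  parent and base are relative
 * to the table's starting entry; the count's sign bit flags a
 * valid word termination. */
typedef struct {
    uint16_t parent;  /* relative parent index                  */
    uint16_t base;    /* child base, or suffix-hash index       */
    int16_t  count;   /* sign bit set => valid word ending      */
} ShortNode;

/* Helpers for the sign-bit convention. */
static int is_word_end(const ShortNode *n) { return n->count < 0; }

static int abs_count(const ShortNode *n)
{
    return n->count < 0 ? (int)(n->count & 0x7FFF) : n->count;
}
```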
The example dictionary used 52 short hash tables, for a total of
229,656 nodes (4416 nodes/table average). Every hash table
produced a perfect hash (no wasted nodes). The final dictionary
used a total of 1,412,480 bytes. While this is much larger than
the minimum RAM solution, it is still smaller than the original
source text. Nor does it put much of a dent in the RAM of a
modern computer.
There is no faster method for looking up words than a perfect
hash. (Computing indexes is a little slower.) The only
important size limitation on this format is that the suffix hash
must use less than 64K nodes.