Re: Benchmarks

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c,comp.lang.c++

Date:

Fri, 7 Nov 2008 01:56:19 -0800 (PST)

Message-ID:

<0b262458-4461-4d51-83a6-d15f7cdb0f28@g17g2000prg.googlegroups.com>

On Nov 7, 8:00 am, Jerry Coffin <jcof...@taeus.com> wrote:

In article <49135fbe.322673...@news.sbtc.net>, c...@tiac.net says...

[ ... ]

Just as a note, the common claim that hash table are O(1)
per access whereas search trees are O(log n) per access is
misleading, and is arrived at by an apples and oranges
comparison. The number of probes in a hash table is O(1)
whereas the number of probes in a search tree is O(log n).
However the cost of computing the hash code for independent
keys is O(m) where m is the average key length, which is
necessarily greater than log2(n). In comparison based trees
the cost of the comparison is O(m), i.e., the probe cost is
O(m). In radix based trees the probe cost is O(1). If O(m)
= O(n) the execution costs per probe (ignoring access
issues) are:

Keeping in mind, however, that m and n will usually be quite a
bit difference, so any similarity between O(m) and O(n) is
purely coincidental. In reality, the performance of hash
tables tends to be rather different in general than the
performance of tree-based systems (radix or comparison).

hash table - O(log n)
comparison tree - O((log n)^2)
radix trees - O(log n)

That is, hash tables and radix trees have the same order of
performance with comparison trees a distant third.

Funny that no one has mentionned it, but the performance of a
hash table is very, very dependent on the quality of the hashing
function. With a perfect hashing function, the access time is
O(1), but with a poor hashing function, it can be significantly
worse---even to the point of O(n). Unless you're sure that you
can define a good hashing function, you're probably better off
with a balanced tree. (The radix tree is an interesting
suggestion, often ignored. Given the sparseness of the entries,
it may result in a significant lack of locality and a too many
cache misses. But given typical word length, this isn't
certain, and it has other advantages, like not requiring the
word to be actually assembled in memory.)

Note, however, that there are factors that don't show up in a
big-O comparison that can still be quite substantial,
especially for collectiosn of reasonble sizes. For one obvious
one, consider the common case (like this one) where the keys
are strings. Hashing the string requires looking at the whole
string, but comparing two strings will often look at _far_
fewer -- for the first few probes, it'll typically only
involve looking at one or two characters.

From a theoretical viewpoint, this effect is negligible -- but
given the number of words in a typical vocabulary, it can be
quite substantial.

I've done some actual measures (some time back), see
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. In
this case, I only measured access, not insertion, but for the
list of words in the ispell French dictionary, I found that a
hash table with a good hashing function (FNV or my Mersenne
prime algorithm) was better than three times faster; using a
list of actual program symbols, the difference was still better
than 2.5 (the French word list contained 94883 different words,
the program symbols 15047).

I've since added some additional hashing algorithms, and tested
on differernt machines (which does make a difference), but I've
not gotten the results written up. I think I'll add an
implementation using a radix tree, just for good measure.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orient=E9e objet/
Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34