Re: Hash Code Compression
j1mb0jay wrote:
Daniel Pitts wrote:
On Jan 11, 3:05 pm, j1mb0jay <n...@none.com> wrote:
Eric Sosman wrote:
j1mb0jay wrote:
I am currently working on a dictionary-populating program. I currently
have a socket connection to my local news server and am trawling through
all of the articles looking for new words. Java's String class has a
method that hashes strings. I was wondering if I should still be using
these even though I have over two million words in the hash table.
Although the hash table is currently Big O(4).
This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
by definition. What do you really mean?
I am using the Multiply, Add, and Divide (MAD) method for the
compression of the hash code. Does Java have any built-in
functions (methods) that will do this for me, or does anyone know of a
more efficient way?
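(For what it's worth, java.util.HashSet and java.util.HashMap already do
this hashCode-to-bucket mapping internally, so one built-in option is to
store the dictionary in a HashSet and skip the hand-rolled compression
entirely. A minimal sketch, with illustrative names:)

    import java.util.HashSet;
    import java.util.Set;

    public class Dictionary {
        // The Set does the hashCode-to-bucket work (and resizing) internally.
        private final Set<String> words = new HashSet<String>();

        // add() returns true only if the word was not already stored,
        // so each word ends up in memory exactly once.
        public boolean addWord(String word) {
            return words.add(word);
        }

        public int size() {
            return words.size();
        }
    }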
The value delivered by hashCode -- for any class, not
just for String -- is a Java int, 32 bits wide. How (and why)
are you "compressing" this value?
My hash table is made up of an array of n LinkedLists (where n is a
prime number that is roughly double the number of words in the
dictionary).
I first use the String.hashCode() method on a given word. I then
compress this number so that I can use it as an index into the array of
LinkedLists, as this 32-bit number is often far too large.
Aha! Your "compression" is the computation of a bucket
index from the hash code. Here's a hint about one reasonable
way to do it: Ask yourself why you chose a prime number as the
bucket count. Was it because prime numbers swirl around hash
tables like a mystical numerological fog, or was there a concrete
reason? If the latter, what was it?
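(A minimal sketch of the kind of reduction being hinted at; m below is
the prime bucket count, and the MAD constants a, b, and p are purely
illustrative values, not anything from the original posts:)

    class BucketIndex {
        // Plain reduction: strip the sign bit (hashCode() can be negative),
        // then take the remainder modulo the prime bucket count m.
        static int simpleIndex(String word, int m) {
            return (word.hashCode() & 0x7fffffff) % m;
        }

        // The MAD ("multiply, add, divide") variant mentioned earlier.
        static int madIndex(String word, int m) {
            final long p = 4294967311L;              // a prime larger than 2^32
            final long a = 1234567891L;              // 0 < a < p, arbitrary here
            final long b = 987654321L;               // 0 <= b < p, arbitrary here
            long h = word.hashCode() & 0xffffffffL;  // hash code treated as unsigned
            return (int) (((a * h + b) % p) % m);
        }
    }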
Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against
the dictionary to see if we already have it (I don't want to store each
word more than once). My current methodology means I only have to look
at a maximum of 4 words (out of 2.5 million) to see if I already have
this word stored in memory. Does this still mean my retrieval method is
Big O(N squared)?
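(Presumably the lookup being described walks just one bucket's chain,
along these lines; a sketch only, with illustrative names, not the
poster's actual code:)

    import java.util.LinkedList;

    public class ChainedDictionary {
        // table.length is the prime bucket count described above.
        private final LinkedList<String>[] table;

        @SuppressWarnings("unchecked")
        public ChainedDictionary(int primeBucketCount) {
            table = new LinkedList[primeBucketCount];
            for (int i = 0; i < table.length; i++) {
                table[i] = new LinkedList<String>();
            }
        }

        // Only the one chosen bucket is searched, so a lookup touches a
        // handful of stored words at most, never all 2.5 million of them.
        public boolean addIfAbsent(String word) {
            int index = (word.hashCode() & 0x7fffffff) % table.length;
            LinkedList<String> bucket = table[index];
            if (bucket.contains(word)) {
                return false;          // already stored once
            }
            bucket.add(word);
            return true;
        }
    }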
It depends what you're counting, and how you're counting it.
If you're counting the number of String comparisons involved
in looking up one String value in a table that already holds N
values, the effort is O(N). You choose one of M lists, which holds
(on the average) N/M items, and you search the list. The search
effort is proportional to the list length, hence (since M is a
constant and since Big-Oh swallows proportionality constants) the
search effort is O(N). With a very small multiplier, to be sure,
but still O(N).
But maybe M isn't constant after all; you said you chose it
to be roughly twice the size of the expected N. If you regard M
as a function of N rather than a constant, then the average list
length N/M = N/(2*N) = 0.5 is itself a constant, and the search
effort is O(1). Different ground rules, different outcomes.
Or maybe you're looking at the total effort to insert all N
items into an initially empty table. Then under the first model
the effort is proportional to 0/M + 1/M + ... + (N-1)/M, which
is N*(N-1)/(2*M) or O(N*N). Under the second model, the effort
is 0.5 + 0.5 + ... + 0.5 = N/2 which is O(N).
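(Plugging in rough numbers makes the difference between the two
accountings concrete; a back-of-the-envelope sketch, assuming about
2,500,000 words and about 5,000,000 buckets, neither figure exact:)

    public class EffortEstimate {
        public static void main(String[] args) {
            double n = 2500000;    // words inserted (illustrative)
            double m = 5000000;    // bucket count, roughly 2*N (illustrative)

            // Model 1: m treated as a constant.
            double avgChain  = n / m;                  // about 0.5 links per bucket
            double totalIns1 = n * (n - 1) / (2 * m);  // about 625,000 comparisons

            // Model 2: m always about 2*N, so each chain averages 0.5 links.
            double totalIns2 = n / 2;                  // about 1,250,000 comparisons

            System.out.println("average chain length:      " + avgChain);
            System.out.println("model 1 total comparisons: " + totalIns1);
            System.out.println("model 2 total comparisons: " + totalIns2);
        }
    }

Either way the absolute counts are tiny next to N*N; the Big-Oh labels
differ only because the two models disagree about what is held constant.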
Or maybe you're noting that each word is inserted only once
but looked up many times, in which case things get much more
complex and you need to start worrying about effects like "common"
words entering into the table pretty early in the game while the
table is fairly small, and "rare" words appearing later on when
the table has grown large. Also, searches for "common" words are
almost always successful, while those for "rare" words have more
chance of being unsuccessful. A thorny thicket, to be sure.
Or maybe you're also trying to account for the time spent in
computing the hash codes, in which case the lengths of the Strings
enter into the picture, and not just their quantity.
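(Concretely, String.hashCode() visits every character of the string the
first time it is called (the documented value is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], cached afterwards), so its
cost grows with the word's length. An equivalent loop, as a sketch:)

    // Computes the same value as String.hashCode() for a fresh string;
    // the work is proportional to the string's length.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }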
Big-Oh exists for the purpose of sweeping unimportant details
under the rug, deliberately discarding precision. But you can only
use Big-Oh well if you're precise about what you're approximating,
about what "N" means, about what things you model as constants and
what you consider variable, and so on. You can only throw away
precision if you're precise about throwing it away!
--
Eric Sosman
esosman@ieee-dot-org.invalid