Re: Hash Code
j1mb0jay wrote:
> On Tue, 20 May 2008 18:14:11 +0000, Kenneth P. Turvey wrote:
>> On Tue, 20 May 2008 18:48:15 +0100, j1mb0jay wrote:
>>> Would this mean there would be no collisions, thus keeping the time
>>> complexity of the search method down to O(1)? In turn meaning the
>>> hash table could be an array of key/value pairs rather than an
>>> array of linked lists.
>> No. There could still be collisions. In addition, the time it would
>> take to calculate a hash would be significant, meaning the performance
>> of the algorithm would degrade.
>> Just to be clear, if you are trying to map something with
>> 2,000,000,000,000 possible values to a hash table with 200 entries,
>> there is simply no way to avoid collisions. Sometimes the data will
>> simply collide.
> I understand that this would be the case, but I have to say that I was
> told as part of the module that re-hashing the whole table into a new,
> larger table is the best option. [...]
It depends on how you define "best." Think of it as an
ordinary purchase-and-sale transaction: Expanding the table
is an investment of CPU time today in hopes of recouping it
and more tomorrow. For the same table with the same sequence
of insertions, deletions, and searches, the investment could
be "good" or "bad" depending on other circumstances.
Even the above is an oversimplification. For example, it
might be important to guarantee that no search takes more
than 7 milliquavers, even at the cost of blowing 1000000 mq
on every seventy-third insertion.
Finally, rehash-or-not needn't be a binary decision. Look
up "incremental hashing" techniques that expand and contract
a hash table gradually instead of in one great cathartic spasm.
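Just to make "gradually" concrete, here is a rough, untested sketch of
the idea. The class name, the 3/4 load factor, and the two-buckets-per-
put pacing are arbitrary choices of mine, nothing standard: once a
resize is triggered, the old bucket array is kept around and a couple
of its buckets are migrated into the new array on each later insertion,
so no single operation pays for rehashing the whole table.

import java.util.ArrayList;
import java.util.List;

// Sketch only: a chained hash table that resizes a little at a time.
class IncrementalHashTable<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K k, V v) { key = k; value = v; }
    }

    private List<Entry<K, V>>[] oldBuckets;  // non-null only while migrating
    private List<Entry<K, V>>[] buckets;
    private int migrateFrom;                 // next old bucket to move
    private int size;

    IncrementalHashTable() { buckets = newArray(8); }

    public V get(K key) {
        Entry<K, V> e = find(buckets, key);
        if (e == null && oldBuckets != null) e = find(oldBuckets, key);
        return e == null ? null : e.value;
    }

    public void put(K key, V value) {
        moveAFewBuckets(2);                  // amortize the rehashing cost
        Entry<K, V> e = find(buckets, key);
        if (e == null && oldBuckets != null) e = find(oldBuckets, key);
        if (e != null) { e.value = value; return; }
        insert(buckets, new Entry<K, V>(key, value));
        if (++size > buckets.length * 3 / 4 && oldBuckets == null) {
            oldBuckets = buckets;            // start migrating to a bigger table
            buckets = newArray(oldBuckets.length * 2);
            migrateFrom = 0;
        }
    }

    private void moveAFewBuckets(int n) {
        while (oldBuckets != null && n-- > 0) {
            List<Entry<K, V>> b = oldBuckets[migrateFrom++];
            if (b != null) for (Entry<K, V> e : b) insert(buckets, e);
            if (migrateFrom == oldBuckets.length) oldBuckets = null;  // done
        }
    }

    private void insert(List<Entry<K, V>>[] table, Entry<K, V> e) {
        int i = index(e.key, table.length);
        if (table[i] == null) table[i] = new ArrayList<Entry<K, V>>();
        table[i].add(e);
    }

    private Entry<K, V> find(List<Entry<K, V>>[] table, K key) {
        List<Entry<K, V>> b = table[index(key, table.length)];
        if (b != null) for (Entry<K, V> e : b) if (e.key.equals(key)) return e;
        return null;
    }

    private int index(Object key, int length) {
        return (key.hashCode() & 0x7fffffff) % length;  // non-negative index
    }

    @SuppressWarnings("unchecked")
    private List<Entry<K, V>>[] newArray(int n) { return new List[n]; }
}

It's deliberately minimal (no removal, no shrinking, no concurrency),
but the point is that each put() does a small, bounded slice of the
rehash instead of one great cathartic spasm.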
> Yes, I understand that the hash itself would also be intensive if using
> SHA-512, but this number is 16 times bigger than the output of a
> String.hashCode() function.
Fine, but what do you do with that great big number? You
can't use it as an array index until you've derived a much
smaller number from it -- you can't even use String's hashCode()
until you've reduced its range. And once you've reduced the
range to 0-198 there are only 199 places the items can go. If
you've got 15 items to scatter among 199 places, there's about
a 42% chance you'll have at least one collision. With 200 or
more items, obviously, the probability grows to 100%.
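If it helps, here's a throwaway illustration of both points (my own
code, not from any course material): whatever the width of the original
hash -- 32-bit String.hashCode() or 512-bit SHA-512 -- it gets squeezed
into a bucket index, and the collision odds are then governed by the
number of buckets, not the width of the hash.

public class RangeReduction {
    public static void main(String[] args) throws Exception {
        int tableSize = 199;
        String key = "j1mb0jay";

        // 32-bit hashCode() reduced to the range 0..198
        int h = key.hashCode();
        System.out.println(h + " -> bucket " + ((h & 0x7fffffff) % tableSize));

        // 512-bit SHA-512 reduced to the very same range 0..198
        byte[] digest = java.security.MessageDigest.getInstance("SHA-512")
                .digest(key.getBytes("UTF-8"));
        int index = new java.math.BigInteger(1, digest)
                .mod(java.math.BigInteger.valueOf(tableSize)).intValue();
        System.out.println("SHA-512 -> bucket " + index);

        // Birthday computation: chance that 15 randomly placed items
        // all land in different buckets out of 199.
        double pNoCollision = 1.0;
        for (int i = 0; i < 15; i++)
            pNoCollision *= (tableSize - i) / (double) tableSize;
        System.out.printf("P(at least one collision) = %.2f%n",
                1 - pNoCollision);
        // Prints roughly 0.42 for 15 items in 199 buckets.
    }
}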
Let's see: If you start with a bunch of empty slots and
start filling them as long as there are no collisions, and then
when a collision occurs you ~double the table size and re-hash,
and keep doing so until there are no more collisions ... Well,
that's more math than I want to attempt right at the moment,
but it's clear that you'll use an ENORMOUS amount of space.
Note that the ~doubling might not eliminate the collision (and
might even produce collisions among the existing items that
didn't collide in the smaller table), so ~quadrupling and so
on might be necessary ...
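If you want a feel for the numbers without doing the math, a
five-minute simulation will show it. Everything here -- the item count,
the seed, the starting size -- is an arbitrary choice of mine:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Throwaway simulation of the grow-until-nothing-collides scheme.
public class DoubleUntilCollisionFree {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        Set<Integer> keys = new HashSet<Integer>();
        while (keys.size() < 1000) keys.add(rnd.nextInt());  // 1000 distinct keys

        int tableSize = 16;
        while (true) {
            Set<Integer> usedBuckets = new HashSet<Integer>();
            boolean collided = false;
            for (int key : keys) {
                int index = (key & 0x7fffffff) % tableSize;
                if (!usedBuckets.add(index)) { collided = true; break; }
            }
            if (!collided) break;
            tableSize *= 2;      // collision: rehash everything, twice the slots
        }
        System.out.println(keys.size() + " items needed a table of "
                + tableSize + " slots before nothing collided.");
    }
}

For a thousand items the doubling typically has to reach somewhere
between a few hundred thousand and a couple of million slots before
every key has a bucket to itself; the exact figure depends on the keys,
but the order of magnitude doesn't.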
Here's a suggestion: Try implementing your scheme as you've
outlined it, perhaps cutting a few corners here and there for
simplicity's sake. Load your implementation with detailed
logging of what's going on: Each insertion, each deletion,
each lookup, each table expansion or contraction, and so on.
Then run a few test cases and study the logs.
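One cheap way to do that bookkeeping (again, just a sketch, with names
of my own invention) is to keep a few counters next to the table and
dump them whenever it resizes:

// Counters to hang off the hash table; print a line on every resize.
class HashTableStats {
    long insertions, lookups, collisions, resizes, entriesRehashed;

    void onInsert(boolean collided) {
        insertions++;
        if (collided) collisions++;
    }

    void onLookup() { lookups++; }

    void onResize(int oldSize, int newSize, int entriesMoved) {
        resizes++;
        entriesRehashed += entriesMoved;
        System.out.println("resize #" + resizes + ": " + oldSize + " -> "
                + newSize + " slots, " + entriesMoved + " entries rehashed"
                + " (after " + insertions + " insertions, "
                + collisions + " collisions, " + lookups + " lookups)");
    }
}

Hook onInsert()/onLookup()/onResize() into the corresponding spots in
your implementation and the log will tell you how much each expansion
cost and how often it happened.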
--
Eric.Sosman@sun.com