Re: hashCode() for Custom classes

From: Lew <lew@lewscanon.com>
Newsgroups: comp.lang.java.programmer
Date: Sat, 19 Apr 2008 20:54:55 -0400
Message-ID: <P6ydnXcmXZXCCZfVnZ2dnUVZ_sWdnZ2d@comcast.com>

Logan Shaw wrote:

Well yeah, it's very true that if you don't have any knowledge about
what parts may be expected to remain the same across a lot of objects
(or what parts might together be expected to be different), then you
shouldn't leave anything out. But if you do have that knowledge, it
could be worth it. Of course, you should be judicious about it.

One example is this: let's suppose I'm writing a TCP server and I
get a lot of connections from various clients. For a connection
object, I might have username, password, source IP address, and
source port number as fields. In this case, it might be totally
sufficient to consider only the IP address and port number when
implementing hashCode() for the connection object. They should be
pretty close to unique by themselves[1].
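A minimal sketch of the connection example, for illustration only. The class name `Connection`, the `String` representation of the IP address, and the constructor are assumptions, not anything from the original post. equals() still compares every field, while hashCode() looks only at the address and port, which is legal because equal objects still produce equal hash codes.

```java
import java.util.Objects;

// Hypothetical connection class; only the four field names come from the post.
final class Connection {
    private final String username;
    private final String password;
    private final String sourceAddress; // source IP kept as text for simplicity (assumption)
    private final int sourcePort;

    Connection(String username, String password, String sourceAddress, int sourcePort) {
        this.username = username;
        this.password = password;
        this.sourceAddress = sourceAddress;
        this.sourcePort = sourcePort;
    }

    // Equality considers every field.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Connection)) return false;
        Connection other = (Connection) o;
        return sourcePort == other.sourcePort
                && Objects.equals(sourceAddress, other.sourceAddress)
                && Objects.equals(username, other.username)
                && Objects.equals(password, other.password);
    }

    // The hash uses only address and port; username and password are
    // deliberately left out on the assumption that they add little selectivity.
    @Override
    public int hashCode() {
        return 31 * Objects.hashCode(sourceAddress) + sourcePort;
    }
}
```

Two connections that differ only in username or password will collide in a hash container, but equals() still tells them apart, so the cost is an extra comparison rather than a wrong answer.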

...

Another example: suppose I have a PostalAddress object whose
members are name, streetAddress, city, state, and zipCode. If
I have name, streetAddress, and zipCode (and if they're mandatory
so that I always have values for them), there is probably not any
benefit from including city and state in hashCode(), because what
are the chances that for any two objects, name, streetAddress, and
zipCode will be the same but city or state will be different?
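One way the PostalAddress case might look, as a sketch under stated assumptions: the field names come from the post, but the constructor, the use of `String` for every field, and the choice of `java.util.Objects.hash` (Java 7 and later) are illustrative guesses.

```java
import java.util.Objects;

// Hypothetical PostalAddress; hashCode() skips city and state on purpose.
final class PostalAddress {
    private final String name;
    private final String streetAddress;
    private final String city;
    private final String state;
    private final String zipCode;

    PostalAddress(String name, String streetAddress, String city,
                  String state, String zipCode) {
        this.name = name;
        this.streetAddress = streetAddress;
        this.city = city;
        this.state = state;
        this.zipCode = zipCode;
    }

    // Equality still checks all five fields.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PostalAddress)) return false;
        PostalAddress other = (PostalAddress) o;
        return Objects.equals(name, other.name)
                && Objects.equals(streetAddress, other.streetAddress)
                && Objects.equals(city, other.city)
                && Objects.equals(state, other.state)
                && Objects.equals(zipCode, other.zipCode);
    }

    // Only the three fields assumed to be mandatory and selective are hashed.
    @Override
    public int hashCode() {
        return Objects.hash(name, streetAddress, zipCode);
    }
}
```

If two such objects ever do differ only in city or state, a HashSet keeps both: they land in the same bucket and equals() separates them.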

Of course, you can argue that I'm assuming my objects model
real-world things and relying on those real-world semantics. In
some context, you might object, I could have a whole bunch of
PostalAddress objects that are intended to model bogus addresses,
and then I might have bad performance. Maybe I'm making a class
called BadAddressCollector to contain and report on all the unique
bad addresses in a database. But I would argue that in such a
case, the number of collisions is still probably not that high,
and when it comes to performance, you should place emphasis on
the common case.

Hash codes are all about probability. Optimizing for the likely case enhances
the utility of the hash. Your suggestions involve using the most
significant portions of the primary key for a lookup and accepting the
chance of collision on the least significant items as a matter of efficiency.
That's just good engineering. One thing to watch out for is whether the chosen
fields have enough selectivity to maintain good distribution over various
capacities of hash containers. For U.S. street addresses, a hash on just
the state probably isn't good enough, but one on the ZIP+4 (itself already a
hash code of sorts) might suffice.
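To make the selectivity point concrete, here is a rough sketch that counts how many distinct buckets a set of keys would occupy. The class `BucketSpread` and its method are hypothetical, and the bucket index is a plain modulus of hashCode(), a simplification of what real containers such as java.util.HashMap do internally.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for eyeballing how well a key choice spreads out.
final class BucketSpread {
    private BucketSpread() {}

    // Counts the distinct bucket indexes the keys map to, using
    // floorMod(hashCode, capacity) as a stand-in for a real table's indexing.
    static int distinctBuckets(List<?> keys, int capacity) {
        Set<Integer> used = new HashSet<Integer>();
        for (Object key : keys) {
            used.add(Math.floorMod(key.hashCode(), capacity));
        }
        return used.size();
    }
}
```

Fed a list of state abbreviations, the result can never exceed the number of distinct states (about fifty for the U.S.), however large the table. Fed ZIP+4 values, it can keep growing with the capacity.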

On the other hand, there's nothing wrong with being conservative
and avoiding assumptions even if they are only very slightly
questionable. But the subject is how best to implement hashCode(),
and it seems like this type of implementation could be appropriate,
sometimes.

Aye.

--
Lew