Re: find words that contains some specific letters

From:

Lew <noone@lewscanon.com>

Newsgroups:

comp.lang.java.programmer

Date:

Sun, 07 Jun 2009 00:09:56 -0400

Message-ID:

<h0feil$qt1$1@news.albasani.net>

Tom Anderson wrote:
> actual hard data!

I admire your patience to give us hard data.

...
Standard string hash, stirred, load factor 0.45:

0: 334883 (63.873863220214844)
1: 150294 (28.666305541992188)
2: 33387 (6.368064880371094)
3: 5077 (0.9683609008789062)
4: 600 (0.11444091796875)
5: 44 (0.008392333984375)
6: 3 (5.7220458984375E-4)

Standard string hash, stirred, load factor 0.90:

0: 106949 (40.79780578613281)
1: 95944 (36.5997314453125)
2: 42971 (16.392135620117188)
3: 12775 (4.8732757568359375)
4: 2903 (1.1074066162109375)
5: 510 (0.194549560546875)
6: 82 (0.031280517578125)
7: 8 (0.0030517578125)
8: 2 (7.62939453125E-4)
...

We might care about maximum bucket list length and the skew toward shorter or
longer bucket list lengths.

For some uses, say, solving a jumbled-word problem, it may be that (nearly)
all map get()s will succeed, with (nearly) never a miss. The bucket list at
the hash index will (nearly) never be empty. So performance will (mostly) not
be affected by the empty-list test. The ratio of short-list buckets to those
with any entries at all will affect performance.

From Tom Anderson's results, the map at 0.45f load used almost 64% of
capacity for (nearly) never-reached empty bucket lists. Of the rest,
one-entry buckets made up over 79%, and buckets with two or fewer entries
about 97%.

The map at .90f load used almost 41% of capacity for (nearly) never-reached
empty bucket lists. Of the rest, one-entry buckets made up less than 62%, and
two-or-unders about 89.5%.

From upthread:

private static final Map <String, Set <String>> jungles;

For the jumbled-word application all key Strings at a particular bucket will
be the same length. So will all the Strings in the value Set once the right
entry is found. The length check in String#equals() will not help performance.

To optimize searches, intern the Strings. This benefits both the Map and the
Set, one-entry buckets the most and halving value comparisons in two-entry
buckets. It also reduces the impact of load factor on expected search times.

The read-only Sets, once retrieved, are likely meant for iteration, perhaps a
lot of it. If so, LinkedHashSet could be a good implementation:

<http://java.sun.com/javase/6/docs/api/java/util/LinkedHashSet.html>

Iteration over a LinkedHashSet requires time proportional to the size
of the set, regardless of its capacity. Iteration over a HashSet is
likely to be more expensive, requiring time proportional to its capacity.

<http://java.sun.com/javase/6/docs/api/java/util/HashMap.html>

As a general rule, the default load factor (.75)
offers a good tradeoff between time and space costs.

I've lost any will to deviate from the default .75f load factor.

--
Lew