Re: find words that contains some specific letters
On Sat, 6 Jun 2009, Lew wrote:
Giovanni Azua wrote:
I would expect in a real life dataset e.g. Oxford dictionary 350k words
and phrases
350K words and phrases (I didn't know you meant phrases, too) seems low.
Let's say there are about 512 KiPhrases of interest, of an average length for
English of sixteen characters.
to have way more collisions than this, not only from the lack of
A regular HashMap for those data at the default .75f load factor will most
likely be about 475 KiBuckets long, have about a quarter of those empty,
nearly all of the rest will contain one entry, and a few will have two,
three, or much less likely, more.
I thought i'd put my compiler where Lew's mouth is. I used this code (with
the relevant bits commented out or edited for each test):
http://urchin.earth.li/~twic/tmp/Collisions.java
And the /usr/share/dict/words file on my machine, which contains 234937
words and 2486825 characters including carriage returns, which makes for
an average of 9.6 characters per word.
I used nominal load factors of 0.75 and 1.00. My code follows HashMap in
rounding up to the nearest power of two; this gave true load factors of
0.45 and 0.90 respectively, with tables containing 524288 and 262144
buckets.
My code also follows HashMap in that it stirs the hash up before using it.
For completeness, i give results with and without stirring, and also using
the Ada hash function John posted.
The following tables give a number of collided hashes, followed by the
count of buckets with that number of hashes in them, followed by that
count as a percentage of the total number of buckets.
Standard String hash, no stirring, load factor 0.45:
0: 334747 (63.847923278808594)
1: 150391 (28.68480682373047)
2: 33573 (6.403541564941406)
3: 4958 (0.9456634521484375)
4: 571 (0.10890960693359375)
5: 46 (0.0087738037109375)
6: 2 (3.814697265625E-4)
Standard String hash, no stirring, load factor 0.90:
0: 106807 (40.74363708496094)
1: 96064 (36.6455078125)
2: 43199 (16.479110717773438)
3: 12535 (4.7817230224609375)
4: 2940 (1.12152099609375)
5: 494 (0.188446044921875)
6: 96 (0.03662109375)
7: 8 (0.0030517578125)
8: 1 (3.814697265625E-4)
Standard string hash, stirred, load factor 0.45:
0: 334883 (63.873863220214844)
1: 150294 (28.666305541992188)
2: 33387 (6.368064880371094)
3: 5077 (0.9683609008789062)
4: 600 (0.11444091796875)
5: 44 (0.008392333984375)
6: 3 (5.7220458984375E-4)
Standard string hash, stirred, load factor 0.90:
0: 106949 (40.79780578613281)
1: 95944 (36.5997314453125)
2: 42971 (16.392135620117188)
3: 12775 (4.8732757568359375)
4: 2903 (1.1074066162109375)
5: 510 (0.194549560546875)
6: 82 (0.031280517578125)
7: 8 (0.0030517578125)
8: 2 (7.62939453125E-4)
Ada string hash, no stirring, load factor 0.45:
0: 391319 (74.63817596435547)
1: 88824 (16.94183349609375)
2: 24581 (4.688453674316406)
3: 8784 (1.6754150390625)
4: 4115 (0.7848739624023438)
5: 2183 (0.41637420654296875)
6: 1296 (0.2471923828125)
7: 822 (0.1567840576171875)
8: 562 (0.1071929931640625)
9: 374 (0.0713348388671875)
10: 278 (0.0530242919921875)
11: 202 (0.0385284423828125)
12: 161 (0.03070831298828125)
13: 140 (0.026702880859375)
14: 113 (0.02155303955078125)
15: 81 (0.01544952392578125)
16: 57 (0.01087188720703125)
17: 62 (0.0118255615234375)
18: 51 (0.00972747802734375)
19: 42 (0.0080108642578125)
20: 24 (0.00457763671875)
21: 32 (0.006103515625)
22: 27 (0.00514984130859375)
23: 30 (0.0057220458984375)
24: 11 (0.00209808349609375)
25: 9 (0.00171661376953125)
26: 12 (0.002288818359375)
27: 10 (0.0019073486328125)
28: 13 (0.00247955322265625)
29: 12 (0.002288818359375)
30: 5 (9.5367431640625E-4)
31: 5 (9.5367431640625E-4)
32: 6 (0.0011444091796875)
33: 5 (9.5367431640625E-4)
34: 5 (9.5367431640625E-4)
35: 3 (5.7220458984375E-4)
36: 1 (1.9073486328125E-4)
37: 5 (9.5367431640625E-4)
38: 2 (3.814697265625E-4)
39: 2 (3.814697265625E-4)
40: 1 (1.9073486328125E-4)
41: 3 (5.7220458984375E-4)
42: 2 (3.814697265625E-4)
43: 2 (3.814697265625E-4)
44: 1 (1.9073486328125E-4)
45: 2 (3.814697265625E-4)
49: 3 (5.7220458984375E-4)
51: 1 (1.9073486328125E-4)
52: 2 (3.814697265625E-4)
54: 1 (1.9073486328125E-4)
55: 2 (3.814697265625E-4)
62: 1 (1.9073486328125E-4)
73: 1 (1.9073486328125E-4)
Ada string hash, no stirring, load factor 0.90:
0: 156382 (59.654998779296875)
1: 59970 (22.876739501953125)
2: 22198 (8.467864990234375)
3: 9565 (3.6487579345703125)
4: 4742 (1.808929443359375)
5: 2682 (1.023101806640625)
6: 1775 (0.6771087646484375)
7: 1147 (0.4375457763671875)
8: 791 (0.3017425537109375)
9: 547 (0.2086639404296875)
10: 403 (0.1537322998046875)
11: 336 (0.128173828125)
12: 231 (0.0881195068359375)
13: 203 (0.0774383544921875)
14: 156 (0.05950927734375)
15: 132 (0.05035400390625)
16: 118 (0.045013427734375)
17: 90 (0.034332275390625)
18: 80 (0.030517578125)
19: 63 (0.0240325927734375)
20: 64 (0.0244140625)
21: 42 (0.016021728515625)
22: 48 (0.018310546875)
23: 36 (0.01373291015625)
24: 29 (0.0110626220703125)
25: 28 (0.01068115234375)
26: 24 (0.0091552734375)
27: 33 (0.0125885009765625)
28: 20 (0.00762939453125)
29: 16 (0.006103515625)
30: 10 (0.003814697265625)
31: 14 (0.005340576171875)
32: 12 (0.00457763671875)
33: 10 (0.003814697265625)
34: 16 (0.006103515625)
35: 12 (0.00457763671875)
36: 10 (0.003814697265625)
37: 10 (0.003814697265625)
38: 8 (0.0030517578125)
39: 7 (0.0026702880859375)
40: 6 (0.002288818359375)
41: 5 (0.0019073486328125)
42: 8 (0.0030517578125)
43: 3 (0.0011444091796875)
44: 8 (0.0030517578125)
45: 3 (0.0011444091796875)
46: 1 (3.814697265625E-4)
47: 7 (0.0026702880859375)
48: 4 (0.00152587890625)
49: 3 (0.0011444091796875)
50: 2 (7.62939453125E-4)
51: 2 (7.62939453125E-4)
52: 4 (0.00152587890625)
53: 2 (7.62939453125E-4)
54: 2 (7.62939453125E-4)
55: 1 (3.814697265625E-4)
56: 2 (7.62939453125E-4)
57: 1 (3.814697265625E-4)
58: 3 (0.0011444091796875)
61: 1 (3.814697265625E-4)
64: 1 (3.814697265625E-4)
65: 1 (3.814697265625E-4)
66: 1 (3.814697265625E-4)
67: 1 (3.814697265625E-4)
69: 1 (3.814697265625E-4)
70: 1 (3.814697265625E-4)
73: 1 (3.814697265625E-4)
74: 1 (3.814697265625E-4)
79: 1 (3.814697265625E-4)
80: 1 (3.814697265625E-4)
83: 1 (3.814697265625E-4)
87: 1 (3.814697265625E-4)
93: 1 (3.814697265625E-4)
110: 1 (3.814697265625E-4)
116: 1 (3.814697265625E-4)
122: 1 (3.814697265625E-4)
Ada string hash, stirred, load factor 0.45:
0: 336927 (64.26372528076172)
1: 147039 (28.04546356201172)
2: 33984 (6.48193359375)
3: 5527 (1.0541915893554688)
4: 719 (0.13713836669921875)
5: 83 (0.01583099365234375)
6: 7 (0.00133514404296875)
8: 2 (3.814697265625E-4)
Ada string hash, stirred, load factor 0.90:
0: 108112 (41.241455078125)
1: 94683 (36.11869812011719)
2: 42466 (16.199493408203125)
3: 13109 (5.0006866455078125)
4: 3023 (1.1531829833984375)
5: 625 (0.2384185791015625)
6: 110 (0.041961669921875)
7: 12 (0.00457763671875)
8: 2 (7.62939453125E-4)
9: 2 (7.62939453125E-4)
So, Lew's description was rather optimistic; there are more empty buckets
than he predicted, and also more collided buckets. Still, in what we can
consider the normal case - standard String hash, stirred, 0.45 load factor
- 64% of the entries were in a bucket of their own.
Also, the ada hash function is indeed complete crap. And HashMap's
hash-stirring function is remarkably effective at saving the day!
tom
--
Ten years on, and there is still nothing like this bizarre tale of
biomechanical space madness.