Re: HashMap vs linear table lookup
none wrote:
HashSet have to compute the hashCode of all Strings on the whole
content, but hashCode is compute only once per String.
String.equals first use length and then each caracter in order to return
result. You have to do it each time in linear comparison.
The HashSet has to compute the hash exactly once for each entry when it's
inserted, as you seem to say.
if n is the total number of strings
p is the avg length,
hashset [sic] is O(n*p) to O(pn), worse case if hash function is poor.
The String hashCode() computation should be reasonable for most purposes. But
even if the hash were worst-case (e.g., returned 17 for every String), there'd
still only be 'n' comparisons when doing a lookup.
And in big-O notation the 'p' factor goes away.
linear is O(pn) to O(n), best case if strings are very different
lengths or beginings.
How are you getting n squared? A linear lookup only needs to make n
comparisons to find a single input.
Hence you have to profile your dataset (avg length, number of unique,
similarity, etc.) to know the break point.
On lookup, i.e., to compare if a given input is already in the Set, you only
compute the hash for the input, not for the elements in the Set already. The
range for HashSet lookups should be O(1) to O(n) depending on the hash
quality. (O(n) occurring when all entries hash to the same value, which
devolves the HashSet into the linear lookup.)
--
Lew