Hash collisions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Hash collisions

Rex Kerr-2
The 2.8.1 algorithm for calculating tuple hashes is, in pseudocode,
  (...(nameHash + _1.hashCode) * smallPrime^(n-1) + _2.hashCode * smallPrime^(n-2) + ...

Fortunately, the small prime is a _different_ small prime than the one used for strings, so there's no problem with collisions there.  But tuples and tuples use the same prime, so unfortunately

((0,1), (0,0)).hashCode == ((0,0), (1,0)).hashCode

((0,0,1), (0,0,0), (0,0,0)).hashCode ===
((0,0,0), (0,1,0), (0,0,0)).hashCode ===
((0,0,0), (0,0,0), (1,0,0)).hashCode

That is, there are positions in tuples of tuples that are entirely equivalent.

The same thing goes for lists of lists.  (And tuples of lists and lists of tuples, etc.)

Arrays have the opposite problem due to their Java heritage: they have different hash codes even when they're the same.

These collisions could be avoided by using a bit-mixing scheme (e.g. FNV) so that the second-item-in-first-tuple doesn't affect the bits the same way as the first-item-in-second-tuple.

I'm happy to submit a patch against the library source if there's interest in fixing this, but perhaps it is more important to have stable non-ideal hash values than to change them to fix them?

  --Rex

Reply | Threaded
Open this post in threaded view
|

Re: Hash collisions

Paul Phillips-3
On Sat, Jan 01, 2011 at 12:33:44AM -0500, Rex Kerr wrote:
> I'm happy to submit a patch against the library source if there's
> interest in fixing this, but perhaps it is more important to have
> stable non-ideal hash values than to change them to fix them?

First you have to read this:

  https://lampsvn.epfl.ch/trac/scala/ticket/2537

The issue is not stability of hash values across versions: if anyone is
relying on that they're going to be disappointed.  The issue so far is
that everything I've tried ends up a net negative by consuming more time
to hash than is recovered via better distribution.  I think the
submitter of 2537 has the right idea toward the end there, but I have no
chance of getting back to it anytime soon.

If you want to submit a patch, please do: it should be accompanied by
benchmarks showing that at the very least scalac doesn't get any slower
compiling itself.

--
Paul Phillips      | Those who can make you believe absurdities
Protagonist        | can make you commit atrocities.
Empiricist         |     -- Voltaire
pp: i haul pills   |----------* http://www.improving.org/paulp/ *----------