Hash Functions: the modulo prime myth 2

Category: Algorithm   Tags: , , ,

I felt the need to write about this topic again. In my previous article, I received no less than 12 comments, all in a misguided effort to try to discredit the research showing that doing hash % prime does not result in a better distribution. (I also thank them in that they help make the analysis more rigorous)

JM: Assume by some stroke of bad luck that your primary hash function does not yield evenly distributed hashes, then computing the index by taking hash % bucket-count with bucket-count prime eliminates the most common cases of collisions.

The problem with this often cited argument is that somehow, using a prime will magically solve the most common cases. What exactly are these common cases? When did it become good science to automatically assume that it is common to output numbers that share a factor with the modulus?

Jason: Using a uniform distribution is the problem. A uniformly distributed set is going to hash nicely regardless of how many buckets you’re hashing into, assuming a decent hash function. You use a prime number of buckets to deal with the general case of an unknown distribution.

This is another often cited argument. The problem with this one is that hash functions are built specifically to have a uniformly distributed output. The test was not about the hash function itself. It was about the modulus at the very end.

  1. I = low entropy data (low randomness)
  2. H = Perfect Hash Function
  3. H(I) = high entropy data (high randomness) equal to random numbers
  4. bucket = (H(I) or random number) % m  this is where the experiment is!

Appeasing the critics

Many people wanted me to do this:

  1. I = low entropy data
  2. H = Bad Hash Function
  3. H(I) = unknown entropy
  4. bucket = H(I) % m

Their reasoning was that most people make bad hash functions that some how end up as multiples of the modulus (not repeating, because any modulus would not matter).

Attacking the problem

Now despite the fact that Paul Hsieh had already shown (allegedly) that a prime modulus could yield bad collisions, how do we find out for ourselves?

There are some problems with this.

  • How do you specifically craft low entropy data without bias that would test prime modulus’ in general?

I believe this is impossible, so I decided to go with the English dictionary, since its a common task to hash words. (I used the wamerican dictionary that comes with Ubuntu, link below).

  • How do you specifically make a mediocre hash function without bias that would test hash functions in general?

I believe this is impossible too, so I decided to go with Knuth’s additive hash as described in TAOCP Chapter 3. It is probably widely used by beginners for its speed and simplicity.

ub4 additive( char *key, size_t len, ub4 prime)
{
  ub4    hash;
  size_t i;
  for (hash=(ub4)len, i=0; i<len; ++i) hash = hash+key[i];
  return (hash % prime);
}

Here are the results of the primes vs composites using the deviation from perfect hash equation I detailed in the previous article. I tested primes up to 1021 which is approximately 2^10 and about half of the maximum hash in this dataset.

Primes from 2 to 1021 vs Composites p+1
prime better: 75
composite better: 97

Primes from 5 to 1021 vs Composites p-1
prime better: 85
composite better: 85

Update

I just thought to myself, why not test all the numbers from 1-1024?

deviation1

Average Deviation using Prime modulus
62.32363

Average Deviation using Composite modulus (excluding 1, which is neither prime nor composite and gets a perfect score anyway)
60.35716

Maximum Deviation using Prime modulus
360.2399

Maximum Deviation using Composite modulus
363.5846

The results of this test shows that overall, in the average case, using a composite modulus is actually better by 3.2% than using a prime! In the case of maximum deviation, the prime wins out, but barely by a ratio of 0.92% (note this is less than 1%, not 92%)

My original goal has been achieved. I have proven that for a common case with a reasonable (according  to the critics) dataset, that a prime modulus does not mean better.

You can test even more numbers with the script I have included below, but I doubt there will be more of a difference.

Prime vs Composite Code

Ubuntu wAmerican dictionary: 98,569 English Words

Additional Material

In response to Jason’s comment that the distribution of the word list after hashing was too normal.

worddist

This looks different from a normal distribution of a perfectly random dataset. First of all, the peak is off center. Second, the right is longer and shorter than the left.

As for the spikes, you all wanted a bad hash function with low entropy data did you not? The deviation equation will yield the same score for any 2 hashes that have the same data.

  • Reddit
  • Facebook
  • Google Bookmarks
  • RSS

Related posts:

  1. Hash Functions: the modulo prime myth?
  2. A Better Python Reload
  3. Morton Codes
  4. PHP Sucks: No stable sort
  5. Memory Size of Python Objects

19 Comments  »

  1. JM says:

    “When did it become good science to automatically assume that it is common to output numbers that share a factor with the modulus?”

    One case, quite common at that? Use pointers/references as hashes, straight up. All even/multiples of four on most systems, and other regularities may abound as well (equal differences due to equal allocation sizes). To head off the obvious objection: yes, making the hash less naive is a better idea, and no, this isn’t always easy. In some languages it’s impossible or contrived (because references are opaque, so you’d have to invent your own number scheme just to get uniqueness). Using a prime number of buckets may cost less performance than improving the hash. People want their hashing, even if theory says they can’t really have it.

    It would be nice to have actual performance data of various hashtable implementations used in various implementations, but I have none available. The dictionary example is of course but one. Hashing English words is probably uncommon, though — hashing many strings with near-identical prefixes or postfixes would be more common. I have no idea what difference (if any) this has on overall results.

    (And this is *really* the last thing I’ll say on the subject, it’s eating up work time. :-)

    • admin says:

      —Edit—
      There is an easy solution to this! Just use an odd modulus size!
      —End Edit—

      “Use pointers/references as hashes, straight up.”

      That’s not a hash of any kind though :P There’s no mixing.

      “Hashing English words is probably uncommon, though — hashing many strings with near-identical prefixes or postfixes would be more common.”

      Many English words do have common prefixes and postfixes. A common use of a hash table would be in lookups for keywords in compilers. A quick glance at the C and C++ keywords seems to indicate that there are no near-identical prefixes and postfixes.

  2. Jason says:

    Prior to taking your modulus, your hash function is just the sum of the length of the word and each character’s ordinal value. This is fine, except that you’re just using a list of regular dictionary words, and you’re going to end up with a narrow normal distribution of values, meaning lots of duplicates. Dupes will always hash to same location regardless of your modulus, meaning that the vast majority of your collisions aren’t determined by prime vs. composite number of buckets. You might argue that this is a wash since both prime and composite are seeing the same normally distributed inputs. That’s not true, but the main point is something else, that your input set is much smaller than you think: your dictionary has 98,568 words in it, but you’re really looking at only 1,796 different things.

    Those 1,796 different things happen to also be uniformly distributed between 66 and 2433. Using a uniform input distribution was the main criticism of your original post.

    Going back to the ‘that’s not true’ statement I made in the first paragraph: the metric that you’re using is sensitive to a ‘piling on’ effect, in that if you do find two values that collide in the sense we care about, and then throw in the fact that there are lots of dupes of those values, your metric is going to be thrown way off. In your smaller prime tests, with your current code, it looks like there’s a huge gap in performance between the two methods, with composite more frequently coming out the big winner (though it does swing dramatically both ways). Do you notice that once you reach a high enough number of buckets, though, your ‘winner’ is just barely ahead of your ‘loser’? That’s because having more buckets reduces the impact of that piling on effect. Correct your test to eliminate the dupes (and remove the piling on effect), and you’ll see that the winner is always the larger of the prime or the composite, exactly what we’d expect from a normal distribution.

    Again, the point of using primes is to mitigate the ill effects of hash functions that might accidentally create patterned hash values (for instance, hash values that are always even). Nobody would claim that they help anything for either of the test cases you’ve provided.

    • Jason says:

      Sorry:

      “Correct your test to eliminate the dupes (and remove the piling on effect), and you’ll see that the winner is always the larger of the prime or the composite, exactly what we’d expect from a normal distribution.”

      Should have said:

      “exactly what we’d expect from a uniform distribution.”

    • admin says:

      Thanks for the analysis Jason. I didn’t know for sure if the output was as you described prior to modulus so I graphed it. The results show that the word lists aren’t really close to a normal distribution.

      As for the spikes due to duplicates: I don’t think I am going to change it because it is always possible to choose a pathologically bad dataset where one hash fails and another succeeds.

      “Again, the point of using primes is to mitigate the ill effects of hash functions that might accidentally create patterned hash values”

      Yes, and in this case, the primes did not do better in the case of the large patterned value of spikes in the dictionary.

      • Jason says:

        You’re right, I should have been more careful about claiming it was a normal distribution, I didn’t actually check that. That really isn’t important, though, because however the dictionary itself is distributed, you’re still getting the duplication issue, and the underlying distribution of the 1,796 ‘real’ things that you’re hashing is still uniform.

        “As for the spikes due to duplicates: I don’t think I am going to change it because it is always possible to choose a pathologically bad dataset where one hash fails and another succeeds.”

        Take what you did to an extreme: if you mapped each of a million words to one of only two integers, and in one table, they went to the same bucket, and in another they went to different buckets, does that tell you anything at all about the general suitability of the modulus that you picked for each table? Do you get any points for having used a million words? Of course not.

        “Yes, and in this case, the primes did not do better in the case of the large patterned value of spikes in the dictionary.”

        But they didn’t do worse, either, once you correct your methodology, and you’re still ignoring the main point, that nobody expects prime or non-prime to make a difference in the face of a uniform distribution. Primes simply make it harder (not impossible, just harder) to accidentally screw things up with a particular class of bad hash functions, nothing more.

        Note that I’ve never disputed the idea that the net benefit of using a prime modulus might not be enough to actually do so in your hashtable implementation. I’m only disputing the idea that’s it’s a myth there’s any benefit at all.

        • admin says:

          “…does that tell you anything at all about the general suitability of the modulus that you picked for each table?”

          Having a piling due to duplicate hashes is not a weakness of the deviation function. Think about it this way:

          If knuth(word) = 5 for all words.

          Yes there is a huge piling. But the deviation equation correct shows that there is a huge deviation from a perfect hash. And when you compare

          5 % prime vs 5 % composite, the deviation equations will yield equivalent scores!

          • Jason says:

            Look at your metric again. ideal_collisions is sensitive to the number of bins, so the cost per item of missing the ideal is different between two different-sized hash tables. The piling just amplifies this basic problem. Note that this difference gets smaller as your hashtables get bigger, then notice that the difference between your winner and loser gets smaller as your hashtables get bigger, too. Hmm…

            You can’t point to cases where the smaller hashtable ‘wins’ as evidence against this, either. That argument would only stand if you could point to one pile of duplicates as being worth the same as another pile (when you use elements as your numerator, you’re basically making the assertion that each element costs the same). But your uncorrected input data has that humped distribution, meaning some elements are duplicated far more often than others, meaning some piles are much more expensive than others.

          • admin says:

            “Jason says:Look at your metric again. ideal_collisions is sensitive to the number of bins, so the cost per item of missing the ideal is different between two different-sized hash tables.”

            Actually its normalized again because the total sum is divided by the modulus.

  3. Cliff says:

    You should really do “Hash Tables: The Myth of Collisions Mattering More Than Anything Else.”

    For lookups the time it takes to compute the bucket (hash + mod) tends to dominate when using keys of significant size.

    For strings try comparing any “low” collision rate hash and prime sized table with using zlib’s crc32 and a power of two sized table and see which wins for near sized tables. Say 1031 sized table with a “low” collision rate hash like what the academics suggest and a 1024 sized table with crc32 from zlib. Then have a look at lookup performance vs key size. Last time I did this crc32 had twice as many collisions but thoroughly outperformed the academic solutions because the hashing time was much faster and that was what was actually taking all the time.

  4. Bernard says:

    As far as I know one would use a prime mod to counter regularity in his data, for example if the hash function returns all even numbers using an even mod would fill only even buckets, while using a prime mod would get a more uniform distribution.

  5. Vertigo says:

    Given that “x MOD y” returns the difference between x and the closest multiple of y that is less than x, we can reason out the value (or lack thereof) of choosing y such that y is prime.

    Do prime numbers have more distant multiples? No.

    Do larger numbers have more distant multiple? Yes.

    Clearly, the significant factor is choosing larger numbers.

    All that said, I’m a bit confused by your explanation of where “modulus prime” fits into the hashing sequence. In all of my coursework (and I just popped open a few texts to be sure), using modulus was one of the suggested hash functions, not some sort of “next step” to follow the hash.

    • admin says:

      I copied the hash function in this article from Donald Knuth’s TAOCP which is as authoritative as it gets.

      In this case and many other algorithms I have seen, many people use mod prime as the “last step” to mix the final hash number within a bounds.

      • Vertigo says:

        And in the code provide, ub4 additive( char*, size_t, ub4) is the hash function. Certainly, a modulus operation is used as part of the hash, but it isn’t following the hash, nor is a modulus operation a required part of a hash function.

        • admin says:

          Of course its not required. That was the whole point of this article: That it is not required, and it sometimes doesn’t even help despite the fact that people keep doing it.

          • Vertigo says:

            If you wanted to show that modulus operations are not necessary, your test case should have excluded any modulus operation, not utilized a modulus operation with a non-prime operand.

          • admin says:

            Vertigo: “If you wanted to show that modulus operations are not necessary, your test case should have excluded any modulus operation, not utilized a modulus operation with a non-prime operand.”

            No you misunderstand. What I showed was that it is not necessary to conform to a prime. Your hash output must be bounded (unless you have unlimited memory).

            Neglecting the modulus at the end is equivalent to overflow on a bounded integer. You will find no difference between an unbounded integer % num vs a bounded integer without a mod.

            Example: hash % (2^32-1) is equivalent to emulating a native 32bit integer.

  6. Lewisc says:

    from what I understand(which is to say, not very much), the hashing algorithm you are using actually has the mod value as a part of the algorithm.

    e.g. you are doing a hash of (N1+N2+N3…Nk mod n).

    where N1…Nk are elements of the natural numbers leading up to n.(Since N1 mod n + N2 mod n = N1 + N2 mod n)
    That forms a group.

    If you have a hash using multiplication(for whatever reason) wouldn’t you have to use a prime number to get group? And if it isn’t a group, does that cause problems?

RSS feed for comments on this post, TrackBack URI

Leave a Comment