Account

Log in (OpenID enabled)

Hash Functions: the modulo prime myth?

Category: Algorithm, Math   Tags: , , ,

Update: new test data and rebuttal of critics here

Hash functions are used all over programming. From cryptography, hash tables, to error checking.

Many hash functions have the following form.

  1. Generate a pseudo random number from data using your magical function.
  2. Force this number into a narrow boundary with modulus. (ie. mod 8 forces all numbers to be from 0 to 7)

One of practices that bothered me was the use of prime numbers to limit the bucket size.

Why should we round down our buckets to the nearest prime? Doesn’t less buckets mean more collisions?

I searched for the answer. But no one knew for sure. Over and over again, they essentially repeated: “because Knuth said so“, or “because primes are naturally more randomy“. So I decided to look again for a more conclusive answer.

Search

I started out by searching for websites and academic papers to explain. I could not find any papers on this (update: found one thanks to ShreevatsaR). For websites, I found a few (emphasis mine):

http://www.partow.net/programming/hashfunctions/#HashAndPrimes

In simple terms when you multiply a set of random numbers by a prime number the resulting numbers when analyzed at their bit levels should show no bias towards being one state or another ie: Pr(Bi = 1) ~= 0.5. There is no concrete proof that this is the case or that it only happens with prime numbers, it just seems to be an ongoing self-proclaimed intuition that some professionals in the field seem obligied to follow.

http://www.azillionmonkeys.com/qed/hash.html

My boss advocated simply performing a modulo by prime operation and cited Knuth’s 30 years old “the Art of Computer Programming”. I showed him examples where modulo by prime had extremely bad collisions, but this did not seem to sway him at all. It seems Knuth’s rewrites have come too late.

http://burtleburtle.net/bob/hash/doobs.html

The best hash table sizes are powers of 2.  There is no need to do mod a prime (mod is sooo slow!).

http://www.zlib.net/maxino06_fletcher-adler.pdf

Quote from paper by Theresa Maxino of CMU: “The better mixing of bits that the Adler checksum provides due to its prime modulus has been claimed to provide better error detection capabilities than the Fletcher checksum. We have found that this is often not the case. The issue is that while the prime modulus in the Adler checksum results in better mixing, there are fewer “bins” (i.e., valid FCS values) available for code words. In most cases, this reduction in bins outweighs the gains made by better mixing.”

Investigation

I then decided to investigate for myself. I decided to look at the case of Adler-32 (uses the prime mod 65521 ) vs Fletcher (uses 2^16-1=65535).

Here is a graph where I generated 10 million cryptographically secure 32-bit integers, and then took the modulus and recorded how many elements were put in the same bucket.

hash

As you can see here. They look pretty much the same.

Here are the equations I used to check for deviation from a perfect hash.

  • ideal number of collisions per bucket = elements inserted / modulus
  • deviation

I ran these equations through 100 million random integers. It appears that they are both very similar, but using the larger composite bucket size yielded a slightly smaller deviation. 31.2388930349 (prime) vs 31.1857721094 (composite).

Counterargument

The only argument for using prime numbers that almost makes sense is like this:

If you have mod 15, then the following are equivalent:

  • 0
  • 5+5+5
  • 5+5+5+5+5+5

The problem with this argument is that mod 13 is also susceptible in a way:

  • 0
  • 3+5+5
  • 3+5+5+3+5+5

Counterargument 2

A number coprime to the prime, multiplied together, will always yield a unique number.

Ok this is true, but it will result in big numbers which you will then need to reduce to a smaller number via modulus, and then you lose your uniqueness again.

Counterargument 3

Ok so this came up in the comments many times. It is useful for linear probing in Hash Tables. Note that signature functions (MD5) and error correction (Adler32) do not deal with collisions.

Linear probing also does not scale well and is probably not the best way to deal with collisions.

Conclusions

The main source of collisions depends on your data and hash function (before modulus).

Just use a bigger modulus unless you see many collisions.

Clarification

I  need to explain why I used random numbers because many people mistakenly believe this is a flaw in the experiment.

The hash functions in questions look like this:

  1. I = low entropy data (low randomness input)
  2. H = Perfect Hash Function
  3. H(I) = high entropy data (high randomness) equal to random numbers
  4. bucket = (H(I) or random number) % m  this is where the experiment is!

For some reason, people think my test sits at step 2, the actual meat of the hash function! This is not the case. My test sits at step 4, the last step where the modulus is taken.

In fact, it is a bigger mistake to feed in low-entropy data into the modulus because hash functions are designed to transform low entropy to high entropy!

My test file written in Python

  • Reddit
  • HackerNews
  • Twitter
  • DZone
  • del.icio.us
  • FriendFeed
  • StumbleUpon
  • RSS

Related posts:

  1. Hash Functions: the modulo prime myth 2
  2. Morton Codes
  3. PHP Sucks: No stable sort
  4. A Better Python Reload
  5. Finding the Current Address in a C Program

27 Comments  »

  1. Somebody says:

    Correction to the previous post (making it clearer between bucket and container):

    The use of a prime bucket size has all to do with number theory and co-primeness.

    Also I don’t see how the 3+5+5 example works. Where does the 3 come from?

    With smaller container sizes, having the container size not be prime exposes it to the mod function looping over and over onto the same slots. Say you have a container size of 10, and your hash function returns 5. This will mean that your next jumps will always occur on the same slots (5 and 0) you’ve essentially limited your total container size to 2.

    The container size need not be prime, only that it be co-prime with the hash function output for the effective container size to be the whole container size. But that would mean limiting the output domain of the hash function (so as to maintain co-primeness). The simple solution is then to make it of prime size, thus guaranteeing that any number the hash function generates will jump potentially to all possible slots in the container.

    I suspect that your 65521 vs 66535 example yields an asymptotic behaviour (where co-primeness washes out). Why don’t you try with buckets of size 12 and 13 instead and see what happens.

    • admin says:

      Hi somebody, thanks for the reply.

      About the 5+5+5 example, it wasn’t my idea. It was some explanation I found online. Also the 3 comes from the fact that its also likely that you have somewhat repeating data vs 5 5 5 5 5 5 over and over again.

      Also I don’t disagree that prime number modulus yields better mixing, but as the Maxino paper from Carnegie Mellon and my findings show that a bigger bucket size outweighs the benefit of a prime modulus.

      How often are you going to have a bucket size of 12 and 13?

  2. Nathan says:

    Heh, no.

    You started with an incorrect problem definition, which seems to have resulted in your research spinning in circles and winding up in a nonsensical conclusion.

    Since you seem smart enough to benefit from working out the details of the answer on your own, I’ll give you a few hints.

    The trivial form of the problem allows the following assumptions:

    1. The output of the hash function effectively turns a sequence of inputs into a sequence of pseudo-random numbers.

    2. Given enough inputs, the sequence of outputs from a perfect hash function will approach a uniform distribution.

    To see why a prime number of buckets is ideal, start by thinking of a way to feed a modulus an input distribution (uniform, in this case) and algebraically turn it into a density function (effectively allowing you to predict a min and max number of collisions).

    Then, investigate how that transformation behaves when the modulus is, or is not, prime.

    p.s. The conclusion is nonsensical because your proposed method suggests that the developer will have access to the data when they are evaluating the number of buckets. If this were the case, then you are not addressing the issue of “generally, a prime number of buckets are better” *at all*. You are instead looking at a problem with zero unknowns, so simple empirical testing lets you fit the modulus, hash, and data together in an near-optimal way. This is not related to the problem discussed by Knuth, or in your standard data structures class.

    • admin says:

      Hello Nathan, thanks for the reply. Here is my response:

      1. I already did this by using a cryptographically secure random number source. Namely the urandom function in Python which uses entropy from your computer.

      2. I used 100 million random numbers which as you have seen, results in a mostly uniform distribution. There are slight deviations which you can see in the graph after taking the modulus.

      3. I did investigate a prime vs composite modulus, and found that the deviations from a perfect hash was nearly the same, with a slight edge for the larger composite one.

      p.s. My goal wasn’t trying to prove that all composite mods were better than prime mods. I only need to show one case where a composite modulus yielded similar or better collision density in order to prove that having a prime modulus isn’t always better.

      • Jason says:

        Using a uniform distribution is the problem. A uniformly distributed set is going to hash nicely regardless of how many buckets you’re hashing into, assuming a decent hash function. You use a prime number of buckets to deal with the general case of an unknown distribution.

        • admin says:

          I did try to consider that the data you hash might not have high entropy. But, the problem is that there’s no acceptable dataset for testing.

          “You use a prime number of buckets to deal with the general case of an unknown distribution” has 2 assumptions.

          1. That mod prime numbers are more randomy.
          2. That the entropy in your data is low.

          If you use compression on your data to increase the entropy as high as possible, it is practically equivalent to random numbers.

          Maybe in the future, I will try testing it on an English dictionary.

          • Jason says:

            1. That mod prime numbers are more randomy.

            They are more likely to be uniformly distributed for an arbitrary input set, which sounds ‘more randomy’ to me.

            2. That the entropy in your data is low.

            Well, yes, but you’ve just shown that with uniformly distributed data sets, there’s no advantage for either approach. So if you use primes and win in the cases where the entropy in your data is low, you win period, because then you don’t have to know that your data set is uniformly distributed to get decent mixing. We are talking about general hashing operations, after all, so no peeking allowed.

  3. medicinekeeper says:

    The motivation for using primes with the modulo operator is number theoretic in nature, and you very nearly hit it with the 5 + 5 + 5 problem.

    Part of what makes arithmetic work is that if you ever have a * b = 0, then either a = 0 or b = 0 (possibly both). However, if we’re working in a world where everything is done modulo 15, then we get 3 * 5 = 0. We call 3 and 5 zero divisors, and we stay the hell away from them.

    Of course, for a lot of algorithmic analysis, we want multiplication to work with the modulus. If we choose a prime, then all these evil problems about coprimality vanish and we can multiply just fine.

    It’s not a big problem, though – at least, the math extends pretty simply to composite numbers, particularly prime powers. The analysis just goes easier starting from the simple case and extending.

  4. Ojotoxy says:

    I made a graph in AP of the collisions vs. table size. There were 2 apparent branches to this graph, one with composities and the other were outlying primes. They eventually converged as the table grew. You’re testing a big table.

  5. I think to say we are going out on a limb by drawing any conclusions from investigating only two divisors (65521 and 65535), is an understatement.

  6. thom says:

    the mod p thing is only for hashtables. the mod p thing is just a method to narrow your 32-bit hash value down to the number of buckets in your hashtable. but that is specific to hashtables.

    it doesn’t apply to hash functions generally. most hash functions work with 32-bit (say) numbers the whole way through the computation, and don’t need to narrow the number down at any point, not even at the end.

  7. JM says:

    Like many other people have said, this is specific to hashtables, and even then hashtables where you cannot make the hash function distribute evenly. Prime numbers make sense for secondary hash functions, the ones that map hashes to array indexes, assuming that the primary hash function maps objects of some sort to the full range of integers and is someone else’s concern.

    With a uniform distribution as shown here it doesn’t matter how you stuff things in buckets. The problem is that a hashtable in general doesn’t deal with uniformly distributed hashes, or at least it can’t usually assume it does. If you control the hashing process front to back (and you have a good idea of the distribution of your inputs) this is not an issue. This bears repeating: with a uniform hash and non-perverse input (of the kind that you have here, and the kind you ideally want to see in any scenario) this is not an issue, and focusing on getting this right is very important, but sometimes getting it wrong is an unavoidable consequence of separating concerns.

    Assume by some stroke of bad luck that your primary hash function does not yield evenly distributed hashes, then computing the index by taking hash % bucket-count with bucket-count prime eliminates the most common cases of collisions. Taking the extreme case, if your bucket count is even (like, say, a power of 2), then every even hash will end up on an even index in the array, and every odd hash will end up on an odd index. A good primary hash function will have an even distribution in all bits, in particular bit 0, so that this isn’t a problem, but a bad hash function where bit 0 is constant effectively cuts the capacity of your table in half.

    Of course, even using a prime number makes you susceptible to a bad hash function which returns hashes that are all multiples of that number… but this is far less likely (pragmatically speaking, not statistically). Likewise, this offers no protection against pathological hash functions like those returning a constant, but then, nothing will. It’s a trade-off. A good one, in my opinion, since to many programmers hashing is still very much like magic, and they tend to produce suboptimal hash functions without knowing it, and they wouldn’t know how to improve them if they had to. A small thing like making your bucket count prime can make quite a difference here, and in a general library it’s quite worthwhile.

    Linear probing vs. chaining is a different discussion, that only comes into it when you have collisions in the first place, which is the thing we’re trying to avoid here.

    • admin says:

      ” The problem is that a hashtable in general doesn’t deal with uniformly distributed hashes, or at least it can’t usually assume it does.”

      I didn’t assume my input was uniformly distributed. I assumed the output of the hashing function (right before modulus) was uniformly distributed. This is a fair assumption due to the definition of a perfect hash.

      “Likewise, this offers no protection against pathological hash functions like those returning a constant, but then, nothing will. It’s a trade-off.”

      The point of my article was to conclude that the hash function (prior to modulus) is far more important than the modulus itself.

      An analogy for this would be like saying “You should reinforce your car with armor plating (prime modulus) to avoid collisions, especially if you don’t have brakes (a good hash function)”. My answer is this: fix the hash function.

      • JM says:

        “I didn’t assume my input was uniformly distributed. I assumed the output of the hashing function (right before modulus) was uniformly distributed.”
        You are right, as I hopefully pointed out in my post: if your hash function produces uniformly distributed output (like it should), then the whole prime modulo business is unnecessary. I was answering the question of why it still pops up.

        “This is a fair assumption due to the definition of a perfect hash.”
        A *perfect* hash has no collisions, by definition, and that’s a rather special case. We can restrict the discussion to *good* hashes here, that is, hashes with uniform distribution.

        “The point of my article was to conclude that the hash function (prior to modulus) is far more important than the modulus itself.”
        This is also undisputed.

        “My answer is this: fix the hash function.”
        But you’re a library writer implementing the Hashtable class, and you can’t say anything to the guy who produced the bad hash function. Your choice is either to go the extra mile and make the bucket size prime to avoid a very common performance pitfall, or ignore the scenario and expect the client to provide a better hash function in the first place.

        From a design perspective, the last solution is simpler and more elegant. But from a pragmatic perspective, it’s very likely that going the extra mile is a cheap way of getting a performance win for a great many applications using your framework. Now, can you guess which framework will be more popular — the one going for the easier, cleaner implementation (which, let’s not overstate this, just omits a single “getPrime()” call when the table needs to expand) or the one with the cheap performance fix?

        Keep in mind that a *perfect* hash function (in the theoretical sense) is only possible in the first place if you know your input in advance, and scenarios like those don’t need to involve a generic hashtable at all. If you know your input, use a minimal perfect hash generator so you don’t have to think about it in the first place *and* get great performance. Compiler writers: use this for your keyword lookups! You’ll be the coolest kid on the block.

        • admin says:

          “Now, can you guess which framework will be more popular — the one going for the easier, cleaner implementation (which, let’s not overstate this, just omits a single “getPrime()” call when the table needs to expand) or the one with the cheap performance fix?”

          I’d have to say the simpler one wins in real life. The C standard library went with the idea that simpler implementations were better than more complicated ones. Now its one of the longest lasting popular language in existence.

          Embedding a big fat prime table (the real cost of getPrime plus a binary search) does not seem like a good idea especially in the embedded world.

          • JM says:

            “The C standard library”
            …has no hashtable implementation (or anything related to hashing, for that matter) and is not, in any case, a framework of the kind I’ve been discussing here.

            The argument that simple implementations win because C was successful is a non sequitur. I can argue just as easily that better performing implementations win because C was successful (it held and in many cases still holds the performance crown). I’m aware of the “less is more” philsophy, but you cannot maintain that that extends all the way down to a single decision in the implementation which does not affect the interface or use in the slightest (which is what that principle is all about). Also, I’d challenge you to take a look at the implementations of these “simple” C functions nowadays — they get quite complicated, to make maximum use of modern hardware.

            If we extend the “simpler implementation is better” principle to its logical conclusion, you should not be using a hashtable someone else wrote in the first place, nor should you be writing a general one yourself, since that will always involve two hash functions while you could do with one optimized one. And we’ve already established that in that scenario, the prime bucket approach is irrelevant.

            “especially in the embedded world”
            The elevator controller argument? Now you’re just moving the goalposts. When you move to embedded, you’re working with a different set of constraints and a whole lot of things change, certainly not in the least the fact that you would not use a general hashtable implementation. You’d want (or be forced to use) custom memory allocation and your application would probably be very limited in scope, so you can (and should) make the hashing much more specific.

            If you’re on an embedded system, then yes, again, the whole prime modulo situation becomes irrelevant, and I doubt anybody would argue otherwise. But having said that:

            “Embedding a big fat prime table (the real cost of getPrime plus a binary search)”
            You have obviously not even looked at implementations that take this approach. I have (one C#, one Java, one C++, quite clearly not based on each other). None of them used tables that could even remotely be called big, and none of them used binary search on these small tables. You’re attacking a naive strawman.

            All I’ve been trying to do is explain why people do this in practice — it’s *not* based on superstition or premature optimization, at least not *this* particular use of a prime number of buckets. It’s very much worth mentioning that this is by no means the only way to guard against poor primary hash functions. There are others — http://weblogs.java.net/blog/tchangu/archive/2005/06/hashmap_impleme.html discusses one, and it even references Knuth about power-of-two bucket sizes!

            The central point is this: hashtable implementers guard against poor hash functions because this pays off. Using a prime number of buckets is one way of doing this, and it does make sense. Using a prime number of buckets does not *always* make sense, and it’s good to point this out too (there’s plenty of superstition on prime numbers), but it’s not true that it *never* makes sense.

            That’s all, folks! I don’t think I can say any more on the subject.

  8. Gijsbert says:

    You assumption is that often a hash function is used that returns pseudo random numbers.
    But as Thom points out the mod p is really about hash tables. What is often stored in hash tables? Objects, meaning pointer values. The hash function used is simply the pointer value, which is stored at a location mod p. Pointer values are often a multiple of 4. Using mod 4 or mod 8 or any multiple of 2 makes no sense here, because you get too many collisions. Prime numbers give a reasonably low number of collisions in this case. Hence the use of prime numbers. But there are no pseudo-random numbers anywhere is this story.

    • Gijsbert says:

      I see that JM submitted the same point, as I was typing my message.

    • admin says:

      ” Pointer values are often a multiple of 4. Using mod 4 or mod 8 or any multiple of 2 makes no sense here”

      It doesn’t matter if after you apply the hash function, the hash value doesn’t become a multiple of 4 anymore.

  9. thom says:

    hashtables and hash functions (of the kind you’re talking about) are two separate, not necessarily connected things.

    hashtables receive hash values. those hash values typically aren’t computed using strong hash functions of the kind you’re referring to, for various reasons. the hashtable uses the mod p as a slightly safer way to distribute the received hash value among its buckets. it is slightly safer in the event that the received hash values aren’t well-distributed. since hash values typically aren’t computed using strong hash functions, this is an important thing to do.

    there are various reasons why hash values for a hashtable aren’t computed using strong hash functions. but mainly it’s just not seen as necessary in typical programming scenarios – the consequence of having collisions are not that bad. and the performance penalty for computing a strong hash function is fairly high.

    strong hash functions are really only necessary for cryptography, where avoiding collisions is necessary for security. hashtables have no such requirement.

    • admin says:

      thom, I tried to point that out that hash tables are not the same as hash functions.

      But Knuth used 2 modulo prime hash table examples, and Adler-32 uses modulo prime even though its simply a hash function. I am willing to bet many other people use mod prime on every hash.

      Many other modern algorithms of hash tables and hash functions don’t need mod prime and achieve even better mixing. (see Bob Jenkin’s website for statistics)

      I will have to disagree with you that hash tables have no requirement for collisions. Collisions are the bane of hash tables. Every time you need to repeat a lookup, your once O(1) cost becomes O(n).

      • thom says:

        sure i can agree with you on those two points. hashtables don’t need to use mod p. absolutely, there are other methods of placing received hash values into buckets. also, hash functions don’t need to use mod p. that’s what i said in my comment.

        in fact, contrary to your premise, the vast majority of hash functions do any mod p whatsoever, they just do lots of arithmetic and bit-mixing on a 32-bit value and return that directly. at the bottom of the link you provided: http://burtleburtle.net/bob/hash/doobs.html lists a number of hash functions. not a single one of those uses mod p. Adler-32 uses mod p for a very particular reason – and that reason has nothing to do with hash functions in general nor with hash tables. mod p just happens to be a useful operation within Adler-32.

  10. Moliate says:

    Interesting topic.

    Since “mod prime” is a hash function in itself (naive, but still a hash function) my assumption would be that this final step is just to reduce the weaknesses of an algorithm. For example: Adler-32 is just a checksum algorithm, not a cryptographic hash function. It was created for speed and may not produce such perfect hashes in itself.

    I must say that I’ve only seen the “mod prime” in quick algorithms suited for checksums and hash tables. I think it’s just a trade-off between speed and hash quality.

  11. primer says:

    let’s assume, for the sake of argument, that you’re stuck with a prime sized table…crucially, primes which aren’t nicely 2^k(+|-)1. One take away I have from what you’ve worked on is that collisions matter somewhat less than the hashing/bin selection. Do you have any suggestions on an alternate function which would less optimally but still reasonably distribute items into the prime sized table while avoiding mod? All the good hacks for computing mod without division seem to require that you be 1 away from a power of two, or that you have very small numbers. At least, that was what I gleaned from the various discussions of using fixed point math for mod. There’s other ways to do it, but they are slower than just doing the mod in the first place and seem to be designed for machines which lack division. Thoughts? One motivation I can give you is that the problem with ^2 sized tables is that they waste a lot more space, so if memory is at a premium, it’s probably not the best choice, speed notwithstanding.

    • admin says:

      I am unaware of any special tricks to avoid the modulus operator. Usually people just let the number overflow on a standard C type.

      There was a discussion on the speed of the modulus operator on Reddit, and the consensus seemed to be that it was negligible.

  12. Eric Hopper says:

    The problem is that many hash functions do not produce an even distribution. The other problem is that I feel you’ve chosen a poor problem statement.

    Cryptographically secure hash functions generally produce an even distribution or they’re broken. They are also slow, and nobody uses them for hashing in a small hash table with a limited number of buckets, they use them for hashing in essentially infinite sized hash tables distributed all over the net.

    I don’t know what the current state of the art is in hash functions typically used for in-memory hash tables, but I bet that while they might be good, there is still unevenness in their distribution. In particular, for one common case, the output of the hash function is likely to almost always be divisible by 4 (or even larger power of 2) since the hash function merely returns the address of the object being hashed.

    Your model of /dev/urandom as representing the output of a typical hash function is fatally flawed.

    Secondly, of course a bigger table size is going to have more of an effect. In order to have a fair test, I suggest choosing 510481 and 510510 as the table sizes. And use the output of a real hash function. And generate your inputs by taking combinations of two words from an english dictionary. Since you’re using Python, the string hash function would be adequate. 510510 is very composite, and 510481 is a very slightly smaller prime number. For a similar smaller table, choose 30030 (composite) and 30029 (prime).

    Personally, I would choose the prime that’s a little larger than the table size you would choose if you chose a composite.

RSS feed for comments on this post, TrackBack URI

Leave a Comment

(Cookies must be enabled)