<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Hash Functions: the modulo prime myth?</title>
	<atom:link href="http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/feed" rel="self" type="application/rss+xml" />
	<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth</link>
	<description></description>
	<lastBuildDate>Fri, 12 Mar 2010 01:46:05 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: admin</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-415</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Mon, 12 Oct 2009 04:21:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-415</guid>
		<description>Eric, see this followup link that was at the top of the page that dispels those myths. Not only did I use a real dataset (English Words) but I also used Donald Knuth&#039;s (of TAOCP fame) additive hash.

http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth-2</description>
		<content:encoded><![CDATA[<p>Eric, see this followup link that was at the top of the page that dispels those myths. Not only did I use a real dataset (English Words) but I also used Donald Knuth&#8217;s (of TAOCP fame) additive hash.</p>
<p><a  href="http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth-2" rel="nofollow">http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth-2</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Hopper</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-414</link>
		<dc:creator>Eric Hopper</dc:creator>
		<pubDate>Mon, 12 Oct 2009 00:44:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-414</guid>
		<description>The problem is that many hash functions do not produce an even distribution.  The other problem is that I feel you&#039;ve chosen a poor problem statement.

Cryptographically secure hash functions generally produce an even distribution or they&#039;re broken.  They are also slow, and nobody uses them for hashing in a small hash table with a limited number of buckets, they use them for hashing in essentially infinite sized hash tables distributed all over the net.

I don&#039;t know what the current state of the art is in hash functions typically used for in-memory hash tables, but I bet that while they might be good, there is still unevenness in their distribution.  In particular, for one common case, the output of the hash function is likely to almost always be divisible by 4 (or even larger power of 2) since the hash function merely returns the address of the object being hashed.

Your model of /dev/urandom as representing the output of a typical hash function is fatally flawed.

Secondly, of course a bigger table size is going to have more of an effect.  In order to have a fair test, I suggest choosing 510481 and 510510 as the table sizes.  And use the output of a real hash function.  And generate your inputs by taking combinations of two words from an English dictionary.  Since you&#039;re using Python, the string hash function would be adequate.  510510 is very composite, and 510481 is a very slightly smaller prime number.  For a similar smaller table, choose 30030 (composite) and 30029 (prime).

Personally, I would choose the prime that&#039;s a little larger than the table size you would choose if you chose a composite.</description>
		<content:encoded><![CDATA[<p>The problem is that many hash functions do not produce an even distribution.  The other problem is that I feel you&#8217;ve chosen a poor problem statement.</p>
<p>Cryptographically secure hash functions generally produce an even distribution or they&#8217;re broken.  They are also slow, and nobody uses them for hashing in a small hash table with a limited number of buckets, they use them for hashing in essentially infinite sized hash tables distributed all over the net.</p>
<p>I don&#8217;t know what the current state of the art is in hash functions typically used for in-memory hash tables, but I bet that while they might be good, there is still unevenness in their distribution.  In particular, for one common case, the output of the hash function is likely to almost always be divisible by 4 (or even larger power of 2) since the hash function merely returns the address of the object being hashed.</p>
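<p>A minimal sketch of the address case described above, under stated assumptions: the hash values are simulated as multiples of 8 (standing in for 8-byte-aligned object addresses), and the names <code>address_hashes</code> and <code>buckets_used</code> are hypothetical. It compares how many buckets are ever used with a power-of-two table size and with a nearby prime.</p>

```python
# Hypothetical stand-in for "hash = object address": every value is a
# multiple of 8, as with 8-byte-aligned allocations (an assumption).
address_hashes = [0x1000 + 8 * i for i in range(10_000)]

def buckets_used(hashes, table_size):
    """Count how many distinct buckets receive at least one key."""
    return len({h % table_size for h in hashes})

power_of_two_size = 1024   # shares the factor 8 with every hash value
prime_size = 1021          # prime, so it shares no factor with the stride

used_pow2 = buckets_used(address_hashes, power_of_two_size)
used_prime = buckets_used(address_hashes, prime_size)
print(used_pow2, used_prime)  # 128 1021
```

<p>With these inputs the power-of-two table only ever touches one bucket in eight, while the prime-sized table touches all of them.</p>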
<p>Your model of /dev/urandom as representing the output of a typical hash function is fatally flawed.</p>
<p>Secondly, of course a bigger table size is going to have more of an effect.  In order to have a fair test, I suggest choosing 510481 and 510510 as the table sizes.  And use the output of a real hash function.  And generate your inputs by taking combinations of two words from an English dictionary.  Since you&#8217;re using Python, the string hash function would be adequate.  510510 is very composite, and 510481 is a very slightly smaller prime number.  For a similar smaller table, choose 30030 (composite) and 30029 (prime).</p>
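<p>The proposed test could be sketched roughly as follows. This is only an outline under assumptions: the short <code>words</code> list is a hypothetical stand-in for a real English dictionary, <code>collisions</code> is a made-up helper, and Python 3 randomizes string hashes per process unless <code>PYTHONHASHSEED</code> is set, so the counts differ from run to run.</p>

```python
from collections import Counter
from itertools import product

# Hypothetical stand-in word list; a real run would load a full English dictionary.
words = ["apple", "brick", "cloud", "delta", "ember", "frost",
         "grape", "house", "ivory", "jolly", "knife", "lemon"]

# Keys are ordered two-word combinations, as suggested in the comment.
keys = [first + " " + second for first, second in product(words, repeat=2)]

def collisions(keys, table_size):
    """Count keys that land in a bucket already holding an earlier key."""
    counts = Counter(hash(key) % table_size for key in keys)
    return sum(count - 1 for count in counts.values())

# Compare the composite and the prime table size on the same keys.
for table_size in (30030, 30029):
    print(table_size, collisions(keys, table_size))
```

<p>With only 144 keys both tables will be almost empty, so meaningful numbers need a word list large enough to load the table properly.</p>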
<p>Personally, I would choose the prime that&#8217;s a little larger than the table size you would choose if you chose a composite.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-287</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Sat, 22 Aug 2009 03:53:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-287</guid>
		<description>I am unaware of any special tricks to avoid the modulus operator. Usually people just let the number overflow on a standard C type. 

There was a discussion on the speed of the modulus operator on Reddit, and the consensus seemed to be that it was negligible.</description>
		<content:encoded><![CDATA[<p>I am unaware of any special tricks to avoid the modulus operator. Usually people just let the number overflow on a standard C type. </p>
<p>There was a discussion on the speed of the modulus operator on Reddit, and the consensus seemed to be that it was negligible.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: primer</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-286</link>
		<dc:creator>primer</dc:creator>
		<pubDate>Fri, 21 Aug 2009 22:08:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-286</guid>
		<description>let&#039;s assume, for the sake of argument, that you&#039;re stuck with a prime sized table...crucially, primes which aren&#039;t nicely 2^k(+&#124;-)1.  One take away I have from what you&#039;ve worked on is that collisions matter somewhat less than the hashing/bin selection.  Do you have any suggestions on an alternate function which would less optimally but still reasonably distribute items into the prime sized table while avoiding mod?  All the good hacks for computing mod without division seem to require that you be 1 away from a power of two, or that you have very small numbers.  At least, that was what I gleaned from the various discussions of using fixed point math for mod.  There&#039;s other ways to do it, but they are slower than just doing the mod in the first place and seem to be designed for machines which lack division.  Thoughts?  One motivation I can give you is that the problem with ^2 sized tables is that they waste a lot more space, so if memory is at a premium, it&#039;s probably not the best choice, speed notwithstanding.</description>
		<content:encoded><![CDATA[<p>let&#8217;s assume, for the sake of argument, that you&#8217;re stuck with a prime sized table&#8230;crucially, primes which aren&#8217;t nicely 2^k(+|-)1.  One take away I have from what you&#8217;ve worked on is that collisions matter somewhat less than the hashing/bin selection.  Do you have any suggestions on an alternate function which would less optimally but still reasonably distribute items into the prime sized table while avoiding mod?  All the good hacks for computing mod without division seem to require that you be 1 away from a power of two, or that you have very small numbers.  At least, that was what I gleaned from the various discussions of using fixed point math for mod.  There&#8217;s other ways to do it, but they are slower than just doing the mod in the first place and seem to be designed for machines which lack division.  Thoughts?  One motivation I can give you is that the problem with ^2 sized tables is that they waste a lot more space, so if memory is at a premium, it&#8217;s probably not the best choice, speed notwithstanding.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Moliate</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-196</link>
		<dc:creator>Moliate</dc:creator>
		<pubDate>Sat, 11 Jul 2009 21:28:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-196</guid>
		<description>Interesting topic.

Since &quot;mod prime&quot; is a hash function in itself (naive, but still a hash function) my assumption would be that this final step is just to reduce the weaknesses of an algorithm. For example: Adler-32 is just a checksum algorithm, not a cryptographic hash function. It was created for speed and may not produce such perfect hashes in itself. 

I must say that I&#039;ve only seen the &quot;mod prime&quot; in quick algorithms suited for checksums and hash tables. I think it&#039;s just a trade-off between speed and hash quality.</description>
		<content:encoded><![CDATA[<p>Interesting topic.</p>
<p>Since &#8220;mod prime&#8221; is a hash function in itself (naive, but still a hash function) my assumption would be that this final step is just to reduce the weaknesses of an algorithm. For example: Adler-32 is just a checksum algorithm, not a cryptographic hash function. It was created for speed and may not produce such perfect hashes in itself. </p>
<p>I must say that I&#8217;ve only seen the &#8220;mod prime&#8221; in quick algorithms suited for checksums and hash tables. I think it&#8217;s just a trade-off between speed and hash quality.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: JM</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-32</link>
		<dc:creator>JM</dc:creator>
		<pubDate>Thu, 11 Jun 2009 09:38:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-32</guid>
		<description>&quot;The C standard library&quot;
...has no hashtable implementation (or anything related to hashing, for that matter) and is not, in any case, a framework of the kind I&#039;ve been discussing here.

The argument that simple implementations win because C was successful is a non sequitur. I can argue just as easily that better performing implementations win because C was successful (it held and in many cases still holds the performance crown). I&#039;m aware of the &quot;less is more&quot; philosophy, but you cannot maintain that that extends all the way down to a single decision in the implementation which does not affect the interface or use in the slightest (which is what that principle is all about). Also, I&#039;d challenge you to take a look at the implementations of these &quot;simple&quot; C functions nowadays -- they get quite complicated, to make maximum use of modern hardware.

If we extend the &quot;simpler implementation is better&quot; principle to its logical conclusion, you should not be using a hashtable someone else wrote in the first place, nor should you be writing a general one yourself, since that will always involve two hash functions while you could do with one optimized one. And we&#039;ve already established that in that scenario, the prime bucket approach is irrelevant. 

&quot;especially in the embedded world&quot;
The elevator controller argument? Now you&#039;re just moving the goalposts. When you move to embedded, you&#039;re working with a different set of constraints and a whole lot of things change, certainly not in the least the fact that you would not use a general hashtable implementation. You&#039;d want (or be forced to use) custom memory allocation and your application would probably be very limited in scope, so you can (and should) make the hashing much more specific.

If you&#039;re on an embedded system, then yes, again, the whole prime modulo situation becomes irrelevant, and I doubt anybody would argue otherwise. But having said that:

&quot;Embedding a big fat prime table (the real cost of getPrime plus a binary search)&quot;
You have obviously not even looked at implementations that take this approach. I have (one C#, one Java, one C++, quite clearly not based on each other). None of them used tables that could even remotely be called big, and none of them used binary search on these small tables. You&#039;re attacking a naive strawman.

All I&#039;ve been trying to do is explain why people do this in practice -- it&#039;s *not* based on superstition or premature optimization, at least not *this* particular use of a prime number of buckets. It&#039;s very much worth mentioning that this is by no means the only way to guard against poor primary hash functions. There are others -- http://weblogs.java.net/blog/tchangu/archive/2005/06/hashmap_impleme.html discusses one, and it even references Knuth about power-of-two bucket sizes!

The central point is this: hashtable implementers guard against poor hash functions because this pays off. Using a prime number of buckets is one way of doing this, and it does make sense. Using a prime number of buckets does not *always* make sense, and it&#039;s good to point this out too (there&#039;s plenty of superstition on prime numbers), but it&#039;s not true that it *never* makes sense.

That&#039;s all, folks! I don&#039;t think I can say any more on the subject.</description>
		<content:encoded><![CDATA[<p>&#8220;The C standard library&#8221;<br />
&#8230;has no hashtable implementation (or anything related to hashing, for that matter) and is not, in any case, a framework of the kind I&#8217;ve been discussing here.</p>
<p>The argument that simple implementations win because C was successful is a non sequitur. I can argue just as easily that better performing implementations win because C was successful (it held and in many cases still holds the performance crown). I&#8217;m aware of the &#8220;less is more&#8221; philosophy, but you cannot maintain that that extends all the way down to a single decision in the implementation which does not affect the interface or use in the slightest (which is what that principle is all about). Also, I&#8217;d challenge you to take a look at the implementations of these &#8220;simple&#8221; C functions nowadays &#8212; they get quite complicated, to make maximum use of modern hardware.</p>
<p>If we extend the &#8220;simpler implementation is better&#8221; principle to its logical conclusion, you should not be using a hashtable someone else wrote in the first place, nor should you be writing a general one yourself, since that will always involve two hash functions while you could do with one optimized one. And we&#8217;ve already established that in that scenario, the prime bucket approach is irrelevant. </p>
<p>&#8220;especially in the embedded world&#8221;<br />
The elevator controller argument? Now you&#8217;re just moving the goalposts. When you move to embedded, you&#8217;re working with a different set of constraints and a whole lot of things change, certainly not in the least the fact that you would not use a general hashtable implementation. You&#8217;d want (or be forced to use) custom memory allocation and your application would probably be very limited in scope, so you can (and should) make the hashing much more specific.</p>
<p>If you&#8217;re on an embedded system, then yes, again, the whole prime modulo situation becomes irrelevant, and I doubt anybody would argue otherwise. But having said that:</p>
<p>&#8220;Embedding a big fat prime table (the real cost of getPrime plus a binary search)&#8221;<br />
You have obviously not even looked at implementations that take this approach. I have (one C#, one Java, one C++, quite clearly not based on each other). None of them used tables that could even remotely be called big, and none of them used binary search on these small tables. You&#8217;re attacking a naive strawman.</p>
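<p>To illustrate how small such a table can be, here is a hypothetical sketch. The <code>PRIMES</code> tuple and <code>get_prime</code> are made up for this example and are not taken from any of the implementations mentioned; the point is only that a handful of roughly doubling primes and a linear scan are enough.</p>

```python
# Hypothetical growth table: each prime is roughly double the previous one.
PRIMES = (53, 97, 193, 389, 769, 1543, 3079, 6151, 12289)

def get_prime(min_size):
    """Return the first table prime >= min_size using a plain linear scan."""
    for prime in PRIMES:
        if prime >= min_size:
            return prime
    raise ValueError("requested size exceeds the prime table")

print(get_prime(100))  # 193
```

<p>Covering table sizes into the billions this way takes only a few dozen entries, and the scan runs only when the table grows.</p>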
<p>All I&#8217;ve been trying to do is explain why people do this in practice &#8212; it&#8217;s *not* based on superstition or premature optimization, at least not *this* particular use of a prime number of buckets. It&#8217;s very much worth mentioning that this is by no means the only way to guard against poor primary hash functions. There are others &#8212; <a  href="http://weblogs.java.net/blog/tchangu/archive/2005/06/hashmap_impleme.html" rel="nofollow" class="external">http://weblogs.java.net/blog/tchangu/archive/2005/06/hashmap_impleme.html</a> discusses one, and it even references Knuth about power-of-two bucket sizes!</p>
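<p>A minimal sketch of that alternative, assuming a simple xor-shift in the style of later Java <code>HashMap</code> versions (not necessarily the function the linked article describes): the table keeps a power-of-two size, and the incoming hash is mixed before masking so that high bits also influence the bucket. The names <code>spread</code> and <code>bucket_index</code> are hypothetical.</p>

```python
def spread(h):
    """Fold the high 16 bits of a 32-bit hash into the low 16 bits."""
    h &= 0xFFFFFFFF
    return h ^ (h >> 16)

def bucket_index(h, table_size):
    """table_size must be a power of two, so masking replaces modulo."""
    return spread(h) & (table_size - 1)

# Poor hashes that differ only in their high bits.
poor_hashes = [i << 16 for i in range(16)]
unspread = {h & 15 for h in poor_hashes}
spread_out = {bucket_index(h, 16) for h in poor_hashes}
print(len(unspread), len(spread_out))  # 1 16
```

<p>Without the extra step all sixteen keys share one bucket; with it each gets its own.</p>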
<p>The central point is this: hashtable implementers guard against poor hash functions because this pays off. Using a prime number of buckets is one way of doing this, and it does make sense. Using a prime number of buckets does not *always* make sense, and it&#8217;s good to point this out too (there&#8217;s plenty of superstition on prime numbers), but it&#8217;s not true that it *never* makes sense.</p>
<p>That&#8217;s all, folks! I don&#8217;t think I can say any more on the subject.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: thom</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-31</link>
		<dc:creator>thom</dc:creator>
		<pubDate>Thu, 11 Jun 2009 07:01:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-31</guid>
		<description>sure i can agree with you on those two points.  hashtables don&#039;t need to use mod p.  absolutely, there are other methods of placing received hash values into buckets.  also, hash functions don&#039;t need to use mod p.  that&#039;s what i said in my comment.

in fact, contrary to your premise, the vast majority of hash functions don&#039;t do any mod p whatsoever, they just do lots of arithmetic and bit-mixing on a 32-bit value and return that directly.  at the bottom of the link you provided:  http://burtleburtle.net/bob/hash/doobs.html lists a number of hash functions.  not a single one of those uses mod p.  Adler-32 uses mod p for a very particular reason - and that reason has nothing to do with hash functions in general nor with hash tables.  mod p just happens to be a useful operation within Adler-32.</description>
		<content:encoded><![CDATA[<p>sure i can agree with you on those two points.  hashtables don&#8217;t need to use mod p.  absolutely, there are other methods of placing received hash values into buckets.  also, hash functions don&#8217;t need to use mod p.  that&#8217;s what i said in my comment.</p>
<p>in fact, contrary to your premise, the vast majority of hash functions don&#8217;t do any mod p whatsoever, they just do lots of arithmetic and bit-mixing on a 32-bit value and return that directly.  at the bottom of the link you provided:  <a  href="http://burtleburtle.net/bob/hash/doobs.html" rel="nofollow" class="external">http://burtleburtle.net/bob/hash/doobs.html</a> lists a number of hash functions.  not a single one of those uses mod p.  Adler-32 uses mod p for a very particular reason &#8211; and that reason has nothing to do with hash functions in general nor with hash tables.  mod p just happens to be a useful operation within Adler-32.</p>
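<p>For reference, a straightforward sketch of Adler-32 showing where the modulus appears: both running sums are reduced modulo 65521, the largest prime below 2^16, so each fits in 16 bits. The function name is local to this example, and the result is checked against Python&#8217;s <code>zlib.adler32</code>.</p>

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16

def adler32(data):
    """Compute the Adler-32 checksum of a bytes object."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER   # running sum of the bytes, plus one
        b = (b + a) % MOD_ADLER      # running sum of the successive a values
    return (b << 16) | a

checksum = adler32(b"Wikipedia")
print(hex(checksum), checksum == zlib.adler32(b"Wikipedia"))  # 0x11e60398 True
```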
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-30</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Wed, 10 Jun 2009 22:27:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-30</guid>
		<description>&quot;Now, can you guess which framework will be more popular — the one going for the easier, cleaner implementation (which, let’s not overstate this, just omits a single “getPrime()” call when the table needs to expand) or the one with the cheap performance fix?&quot;

I&#039;d have to say the simpler one wins in real life. The C standard library went with the idea that simpler implementations were better than more complicated ones. Now it&#039;s one of the longest-lasting popular languages in existence.

Embedding a big fat prime table (the real cost of getPrime plus a binary search) does not seem like a good idea especially in the embedded world.</description>
		<content:encoded><![CDATA[<p>&#8220;Now, can you guess which framework will be more popular — the one going for the easier, cleaner implementation (which, let’s not overstate this, just omits a single “getPrime()” call when the table needs to expand) or the one with the cheap performance fix?&#8221;</p>
<p>I&#8217;d have to say the simpler one wins in real life. The C standard library went with the idea that simpler implementations were better than more complicated ones. Now it&#8217;s one of the longest-lasting popular languages in existence.</p>
<p>Embedding a big fat prime table (the real cost of getPrime plus a binary search) does not seem like a good idea especially in the embedded world.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: admin</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-28</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Wed, 10 Jun 2009 18:48:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-28</guid>
		<description>thom, I tried to point out that hash tables are not the same as hash functions.

But Knuth used 2 modulo prime hash table examples, and Adler-32 uses modulo prime even though it&#039;s simply a hash function. I am willing to bet many other people use mod prime on every hash.

Many other modern algorithms of hash tables and hash functions don&#039;t need mod prime and achieve even better mixing. (see Bob Jenkins&#039;s website for statistics)

I will have to disagree with you that hash tables have no requirement for collisions. Collisions are the bane of hash tables. Every time you need to repeat a lookup, your once O(1) cost becomes O(n).</description>
		<content:encoded><![CDATA[<p>thom, I tried to point out that hash tables are not the same as hash functions.</p>
<p>But Knuth used 2 modulo prime hash table examples, and Adler-32 uses modulo prime even though it&#8217;s simply a hash function. I am willing to bet many other people use mod prime on every hash.</p>
<p>Many other modern algorithms of hash tables and hash functions don&#8217;t need mod prime and achieve even better mixing. (see Bob Jenkins&#8217;s website for statistics)</p>
<p>I will have to disagree with you that hash tables have no requirement for collisions. Collisions are the bane of hash tables. Every time you need to repeat a lookup, your once O(1) cost becomes O(n).</p>
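<p>A small sketch of that worst case, assuming a chained table and hypothetical helper names (<code>build</code>, <code>probes_to_find</code>): when every key lands in the same bucket, finding the last key means walking the whole chain, while a size that spreads the same keys needs a single comparison.</p>

```python
def build(hashes, table_size):
    """Build a chained hash table that stores the hash values themselves."""
    table = [[] for _ in range(table_size)]
    for h in hashes:
        table[h % table_size].append(h)
    return table

def probes_to_find(table, table_size, key_hash):
    """Count the comparisons needed to find key_hash in its chain."""
    chain = table[key_hash % table_size]
    return chain.index(key_hash) + 1

hashes = [16 * i for i in range(17)]   # every hash is a multiple of 16
last = hashes[-1]

worst = probes_to_find(build(hashes, 16), 16, last)  # all keys share bucket 0
best = probes_to_find(build(hashes, 17), 17, last)   # keys spread out
print(worst, best)  # 17 1
```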
]]></content:encoded>
	</item>
	<item>
		<title>By: thom</title>
		<link>http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth/comment-page-1#comment-27</link>
		<dc:creator>thom</dc:creator>
		<pubDate>Wed, 10 Jun 2009 15:43:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.codexon.com/?p=95#comment-27</guid>
		<description>hashtables and hash functions (of the kind you&#039;re talking about) are two separate, not necessarily connected things.

hashtables receive hash values.  those hash values typically aren&#039;t computed using strong hash functions of the kind you&#039;re referring to, for various reasons.  the hashtable uses the mod p as a slightly safer way to distribute the received hash value among its buckets.  it is slightly safer in the event that the received hash values aren&#039;t well-distributed.  since hash values typically aren&#039;t computed using strong hash functions, this is an important thing to do.

there are various reasons why hash values for a hashtable aren&#039;t computed using strong hash functions.  but mainly it&#039;s just not seen as necessary in typical programming scenarios - the consequences of having collisions are not that bad.  and the performance penalty for computing a strong hash function is fairly high.

strong hash functions are really only necessary for cryptography, where avoiding collisions is necessary for security.  hashtables have no such requirement.</description>
		<content:encoded><![CDATA[<p>hashtables and hash functions (of the kind you&#8217;re talking about) are two separate, not necessarily connected things.</p>
<p>hashtables receive hash values.  those hash values typically aren&#8217;t computed using strong hash functions of the kind you&#8217;re referring to, for various reasons.  the hashtable uses the mod p as a slightly safer way to distribute the received hash value among its buckets.  it is slightly safer in the event that the received hash values aren&#8217;t well-distributed.  since hash values typically aren&#8217;t computed using strong hash functions, this is an important thing to do.</p>
<p>there are various reasons why hash values for a hashtable aren&#8217;t computed using strong hash functions.  but mainly it&#8217;s just not seen as necessary in typical programming scenarios &#8211; the consequences of having collisions are not that bad.  and the performance penalty for computing a strong hash function is fairly high.</p>
<p>strong hash functions are really only necessary for cryptography, where avoiding collisions is necessary for security.  hashtables have no such requirement.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
