[BUG][Performance] Symbol>>hash too slow [was: [BUG][FIX][Slowdown] Add preference too slow (Symbol>>hash too slow)]

Andrew P. Black black at cse.ogi.edu
Tue Feb 20 07:27:55 UTC 2001


Here is Dan's original comment:

At 15:10 -0700 2000.09.18, Dan Ingalls wrote:
>
>2630-BadHashFix-SqR -- Andres Valloud & Luciano Notarfrancesco -- 3 
>August 2000
>Removes Symbol>>hash.
>This is done since now symbols may be equal to strings and 
>viceversa. When they were equal, though, their hashes were not and 
>that's not good.
>Fixing this brought up some issues with the Interpreter. When the C 
>code is generated, some methods are inlined. Such inlined methods 
>have local variable names that mask global variable names. There is 
>a check for this (CCodeGenerator>>prepareMethods). The methods 
>dictionary had symbol keys, but it was queried for strings. Since 
>the hashes for equal variable names were different, those 
>overlapping names were never reported as errors by the 
>CCodeGenerator. We changed the names of the local variables by 
>prefixing them with 'local'.
>Also reimplements String>>hash. The previous implementations could 
>provide only 14 bits of hash, and this was just at its best. It did 
>not spread hash values enough either, so just about 12 bits were 
>being used by all the Strings in the system. The new implementation 
>of hash for Strings provides an excellent hash mentioned by Mark van 
>Gulik, and it is now used in more places through the Collection 
>hierarchy (check ByteArray>>hash and SequenceableCollection>>hash).
>Also reimplements Fraction>>hash so that it does not answer 
>LargeIntegers. A small method called hashMultiply was added to 
>SmallInteger to keep things clean. A convenient method was also 
>added to Set class to rehash all sets except MethodDictionaries 
>(takes too long and they are not affected by #hash).
>Finally, do not edit or fileOut this changeset! It contains hand 
>made postscripts and initialization code that will rehash all Sets 
>and Dictionaries in your system.
>A quick check to see if this changeset did its work properly is to 
>do the following:
>	* Try to get the Set class by typing 'Set' in a workspace and 
>then doing a print-it
>	* Open a workspace
>If that works, you are in good shape.
>SqR & len, 8/4/2000 14:41"

I half remember the original discussion too, but I don't recall the 
rationale that said _why_ it is ever reasonable to expect a String 
and a Symbol to ever be equal.   Could someone enlighten me?

In the particular example mentioned in Dan's comment, 
CCodeGenerator>>prepareMethods, the issue seems to be checking 
whether a string matches any symbol in the set of globals or method 
names.  Clearly, interning the string (which involves hashing in to 
the SymbolTable) is a waste of time, since if the string in question 
is not in the SymbolTable, there is no point in adding it -- it can't 
be in the globals or methods names sets either.

<Alert mode: 'Fools rush in where Angels fear to tread ...'>

So, it seems to me that the existing code, which computes the hash of 
a string (by looking at all the characters) in order to compare it 
against a set of symbols, should be used in exactly one place: 
String>>asSymbol.

Having to look at every character in a Symbol to compute its hash 
sounds like a bad idea.  In my mind, the whole reason for the 
existence of Symbols as a separate class is to make comparisons 
between them fast (by permitting us to use Symbol identity as a fast 
implementation of equality), which would lead to the very simple code

Symbol>>= another
	^ self == another

Symbol>>hash
	^self identityHash

This ought to maintain the required relationship between = and hash. 
The fact that Doug Way had to work around the performance of 
Symbol>>hash says that we are making the wrong tradeoff

<Alert endMode>

OK, what am I missing?

	Andrew






More information about the Squeak-dev mailing list