[Vm-dev] Making a Hash of things, er, strings

ken.dickey at whidbey.com ken.dickey at whidbey.com
Fri Aug 20 18:26:08 UTC 2021


A brief, ignorable, experience report..

One idea for speeding up lookups is to have a Selector class be a 
subclass
of Symbol and when a Symbol becomes a Selector to have hash value fields
in the Selector.  When a MethodDictionary is rehashed, there is a small
set of hashes which are tried and the most perfect (least collisions)
is used for that particular MethodDictionary.

Because of Spur, we can have several direct subclasses of 
MathodDictionary
differing only in which hash value is accessed.  Just change the ClassID
and the hash access method selects the proper hash value of symbols.
As long as the hash accessor stays polymorphic, the PIC mechanics will
work for us.

So the question becomes "Are there string/symbol hash functions where
this approach makes sense?".  I.e. is there enough difference in hash
functions that we can get close(r) to perfect hashes and have better
results choosing one over another?

As I wait for Lulu to deliver Andres' book, I found some string
hash functions
  [note: 
https://azrael.digipen.edu/~mmead/www/Courses/CS280/HashFunctions-1.html 
]
and took an a first look at this, counting collisions
for
	(hash \\ dictSize) + 1
or more efficiently [note 
https://craftinginterpreters.com/optimization.html]
	(hash bitAnd: (dictSize - 1)) + 1

Early results:

vvv=========vvv
"Count has collisions for all selectors in each of 1527 methodDicts"
"For each Class methodDict, which hash has the fewest collisions?"
HashCandidates globalCollisionsCounts.  "Lower is Better"

#hash collisions = 5466
#identityHash collisions = 6534
#hash0 collisions = 5331
#hash1 collisions = 5301
#hash2 collisions = 5419
#hash3 collisions = 5442
#hash4 collisions = 5436
#hash5 collisions = 14630
#hash6 collisions = 5442
#hash7 collisions = 5436
#hash8 collisions = 18708
#hash9 collisions = 25344
#hash10 collisions = 23518

#(5466 6534 5331 5301 5419 5442 5436 14630 5442 5436 18708 25344 23518)

"Compare hashs for each MethodDict; lowest collisions wins"
HashCandidates globalHashWinCounts. "Higher is Better"

#hash win count = 664
#identityHash win count = 610
#hash0 win count = 672
#hash1 win count = 680
#hash2 win count = 662
#hash3 win count = 679
#hash4 win count = 666
#hash5 win count = 429
#hash6 win count = 679
#hash7 win count = 666
#hash8 win count = 350
#hash9 win count = 247
#hash10 win count = 263

#(664 610 672 680 662 679 666 429 679 666 350 247 263)
^^^=========^^^

One quick observation is that string hash is better than identityHash 
and
hash1 (K&R) is better than current string hash.  Also, the hashes
used so far do not show much advantage against each other in the
better cases.

---

An unrelated idea is to "copy down" just megamorphic methods to 
subclasses to speed
up megamorphic lookups.

Neither of these ideas requires JIT/recompile to implement, just IDE 
changes.

Just playing around..
-KenD

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: String-*stringhashtests.st
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20210820/b714b5a7/attachment.ksh>


More information about the Vm-dev mailing list