Some ideas for Squeak 3
:-)
Currently the object header contains a 12 bit field for an identity
hash. This is too little and too much at the same time: it takes
valuable space in the object header and it is insufficient for large
IdentitySets or -Dictionaries.
Therefore, how about ditching this field altogether and
reimplementing IdentitySets as arrays sorted by oop?
Lookup should be reasonably fast using binary search.
Since the garbage collector only shifts objects but does not shuffle
them, the sort order would not be affected by compaction.
Become operations would necessitate resorting of affected sets
however.
IdentityDictionaries could be built using parallel arrays for keys
and values.
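The parallel-array idea can be sketched in a few lines. This is my illustration, not Squeak code: Python's id() stands in for an oop, and since id() is stable for an object's lifetime (unlike oops under a compacting collector), the sketch sidesteps the GC issues discussed below. The class and method names are hypothetical.

```python
import bisect

class SortedIdentityDict:
    """Sketch of an IdentityDictionary as parallel arrays sorted by
    object address. id() stands in for a Squeak oop; the key objects
    are retained so their addresses cannot be reused while stored."""

    def __init__(self):
        self._oops = []   # sorted addresses of the keys
        self._keys = []   # key objects, parallel to _oops
        self._vals = []   # values, parallel to _oops

    def _index(self, key):
        oop = id(key)
        i = bisect.bisect_left(self._oops, oop)   # O(log n) binary search
        return i, i < len(self._oops) and self._oops[i] == oop

    def at_put(self, key, value):
        i, found = self._index(key)
        if found:
            self._vals[i] = value
        else:
            # O(n) shifting on insertion -- the cost Georg notes below
            self._oops.insert(i, id(key))
            self._keys.insert(i, key)
            self._vals.insert(i, value)

    def at(self, key):
        i, found = self._index(key)
        if not found:
            raise KeyError(key)
        return self._vals[i]
```

Lookup is O(log n) as claimed; the O(n) array shifts on insertion and deletion are exactly the cost that motivates the two-stage scheme below.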
Insertion and deletion might become slow for very large sets. If this
really becomes an issue those sets could be split into a sorted array
of subsets with a two stage lookup.
Has this been done before?
Any flaws in the reasoning?
Georg -
People have ditched the hash field, and then constituted it lazily on demand for such objects that need an identityHash, either in an alternate header format, or in a look-aside cache. Both of these schemes add complexity, but they also do the job right.
Right now in Squeak, when hashing fails to discriminate, we use linear search, which leads to an access time proportional to N//4096. So binary search would be an improvement for tables larger than about 64K entries, and it would be about 10 times better for sets with a million entries.
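My back-of-envelope reading of those figures, as a quick arithmetic check (the probe-count model is a simplification, not a measurement):

```python
import math

def linear_probes(n, buckets=4096):
    # With a 12-bit hash, entries sharing a hash value average n // 4096,
    # so a failed discrimination degenerates into a scan of that length.
    return max(1, n // buckets)

def binary_probes(n):
    # Binary search over a sorted-by-oop array touches about log2(n) entries.
    return max(1, math.ceil(math.log2(n)))

for n in (64 * 1024, 1024 * 1024):
    print(n, linear_probes(n), binary_probes(n))
```

At 64K entries both approaches cost about 16 probes (the crossover); at a million entries it is 256 versus 20, roughly the order-of-magnitude improvement Dan cites.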
I had essentially the same idea as you are proposing a couple of years ago. At that time I needed an identityDictionary of potentially all objects in the system for the SystemTracer. Fortunately, the need is usually only for a relatively small portion of the objects (pointer objects for which not all the internal references have yet been written out). You can see that I bailed out and hacked a linear secondary probe with usage-based promotion for this requirement, instead of discarding identity hashes as you propose.
When I first thought of doing what you propose, I thought it would be too ugly because it would require exposing oops, and all sorts of things could go wrong because oops change over time. For instance, what if you hit a gc partway through comparing the oops of two elements? You would have to make sure you asked for their oops at "the same time"; from then on things should be OK because garbage collection preserves the order of objects in the system.
This last point led me to the realization that you could adopt this approach AND do it all safely in Squeak IF the primitive capability were packaged in the form of a comparison primitive:
ProtoObject >> identityPrecedes: otherObject
	<primitive: NNN>
	^ self primitiveFailed
You are right that such IdentitySets would be invalidated by become:, but this is true now for forwardBecome, which can change the hash associated with a reference (we made symmetric become preserve the hash associated with references).
Two more things to consider...
* What other use do you have for the hash bits in the header?
* Would you care as much if we did away with the compact class field and extended the identity hash from 12 to 17 bits? This would move the crossover point above from 64K to 2 million entries.
One proposal would be to at least implement the comparison primitive above. Then anyone who really cared could build their own GiantIdentitySet with log N performance. One could even get log (N//4096) by using a two-level structure with hashing in the first level and sorting in the second.
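A sketch of that two-level structure, again in Python with id() standing in for an oop and the low 12 address bits standing in for Squeak's 12-bit identity hash (class and method names are mine, not Squeak's):

```python
import bisect

class TwoLevelIdentitySet:
    """Sketch of the two-level idea: a first-level hash selects one of
    4096 buckets, and each bucket is an array sorted by address and
    searched with binary search, for roughly log2(n // 4096) probes.
    id() stands in for an oop; stored objects are retained so their
    addresses cannot be reused while they are in the set."""

    BUCKETS = 4096

    def __init__(self):
        # parallel arrays per bucket: sorted addresses plus the objects
        self._addrs = [[] for _ in range(self.BUCKETS)]
        self._objs = [[] for _ in range(self.BUCKETS)]

    def add(self, obj):
        oop = id(obj)
        b = oop & (self.BUCKETS - 1)        # first level: 12-bit hash
        addrs = self._addrs[b]
        i = bisect.bisect_left(addrs, oop)  # second level: binary search
        if i == len(addrs) or addrs[i] != oop:
            addrs.insert(i, oop)
            self._objs[b].insert(i, obj)

    def includes(self, obj):
        oop = id(obj)
        addrs = self._addrs[oop & (self.BUCKETS - 1)]
        i = bisect.bisect_left(addrs, oop)
        return i < len(addrs) and addrs[i] == oop
```

The first level also shrinks the O(n) insertion shifts by the same factor of 4096, which addresses Georg's worry about slow insertion and deletion in very large sets.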
- Dan
squeak-dev@lists.squeakfoundation.org