Some ideas for Squeak 3
:-)
Currently the object header contains a 12 bit field for an identity
hash. This is too little and too much at the same time: it takes
valuable space in the object header and it is insufficient for large
IdentitySets or -Dictionaries.
Therefore, how about ditching this field altogether and
reimplementing IdentitySets as arrays sorted by oop?
Lookup should be reasonably fast using binary search.
Since the garbage collector only shifts objects but does not shuffle
them, the sort order would not be affected by compaction.
Become operations would necessitate resorting of affected sets
however.
IdentityDictionaries could be built using parallel arrays for keys
and values.
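The parallel-array idea can be sketched in a few lines. This is my illustration, not Squeak code: Python's id() stands in for an oop, and since id() is stable for an object's lifetime (unlike oops under a compacting collector), the sketch sidesteps the GC issues discussed below. The class and method names are hypothetical.

```python
import bisect

class SortedIdentityDict:
    """Sketch of an IdentityDictionary as parallel arrays sorted by
    object address. id() stands in for a Squeak oop; the key objects
    are retained so their addresses cannot be reused while stored."""

    def __init__(self):
        self._oops = []   # sorted addresses of the keys
        self._keys = []   # key objects, parallel to _oops
        self._vals = []   # values, parallel to _oops

    def _index(self, key):
        oop = id(key)
        i = bisect.bisect_left(self._oops, oop)   # O(log n) binary search
        return i, i < len(self._oops) and self._oops[i] == oop

    def at_put(self, key, value):
        i, found = self._index(key)
        if found:
            self._vals[i] = value
        else:
            # O(n) shifting on insertion -- the cost Georg notes below
            self._oops.insert(i, id(key))
            self._keys.insert(i, key)
            self._vals.insert(i, value)

    def at(self, key):
        i, found = self._index(key)
        if not found:
            raise KeyError(key)
        return self._vals[i]
```

Lookup is O(log n) as claimed; the O(n) array shifts on insertion and deletion are exactly the cost that motivates the two-stage scheme below.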
Insertion and deletion might become slow for very large sets. If this
really becomes an issue those sets could be split into a sorted array
of subsets with a two stage lookup.
Has this been done before?
Any flaws in the reasoning?
Georg -
People have ditched the hash field, and then constituted it lazily on demand for such objects that need an identityHash, either in an alternate header format, or in a look-aside cache. Both of these schemes add complexity, but they also do the job right.
Right now in Squeak, when hashing fails to discriminate, we use linear search, which leads to an access time proportional to N//4096. So binary search would be an improvement for tables larger than about 64K entries, and it would be about 10 times better for sets with a million entries.
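My back-of-envelope reading of those figures, as a quick arithmetic check (the probe-count model is a simplification, not a measurement):

```python
import math

def linear_probes(n, buckets=4096):
    # With a 12-bit hash, entries sharing a hash value average n // 4096,
    # so a failed discrimination degenerates into a scan of that length.
    return max(1, n // buckets)

def binary_probes(n):
    # Binary search over a sorted-by-oop array touches about log2(n) entries.
    return max(1, math.ceil(math.log2(n)))

for n in (64 * 1024, 1024 * 1024):
    print(n, linear_probes(n), binary_probes(n))
```

At 64K entries both approaches cost about 16 probes (the crossover); at a million entries it is 256 versus 20, roughly the order-of-magnitude improvement Dan cites.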
I had essentially the same idea as you are proposing a couple of years ago. At that time I needed an identityDictionary of potentially all objects in the system for the SystemTracer. Fortunately, the need is usually only for a relatively small portion of the objects (pointer objects for which not all the internal references have yet been written out). You can see that I bailed out and hacked a linear secondary probe with usage-based promotion for this requirement, instead of discarding identity hashes as you propose.
When I first thought of doing what you propose, I thought it would be too ugly because it would require exposing oops, and all sorts of things could go wrong because oops change over time. For instance, what if you hit a gc partway through comparing the oops of two elements? You would have to make sure you asked for their oops at "the same time"; from then on things should be OK because garbage collection preserves the order of objects in the system.
This last point led me to the realization that you could adopt this approach AND do it all safely in Squeak IF the primitive capability were packaged in the form of a comparison primitive:
ProtoObject >> identityPrecedes: otherObject
	<primitive: NNN>
	^ self primitiveFailed
You are right that such IdentitySets would be invalidated by become:, but this is true now for forwardBecome, which can change the hash associated with a reference (we made symmetric become preserve the hash associated with references).
Two more things to consider...
* What other use do you have for the hash bits in the header?
* Would you care as much if we did away with the compact class field and extended the identity hash from 12 to 17 bits? This would move the crossover point above from 64K to 2 million entries.
One proposal would be to at least implement the comparison primitive above. Then anyone who really cared could build their own GiantIdentitySet with log N performance. One could even get log (N//4096) by using a two-level structure with hashing in the first level and sorting in the second.
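A sketch of that two-level structure, again in Python with id() standing in for an oop and the low 12 address bits standing in for Squeak's 12-bit identity hash (class and method names are mine, not Squeak's):

```python
import bisect

class TwoLevelIdentitySet:
    """Sketch of the two-level idea: a first-level hash selects one of
    4096 buckets, and each bucket is an array sorted by address and
    searched with binary search, for roughly log2(n // 4096) probes.
    id() stands in for an oop; stored objects are retained so their
    addresses cannot be reused while they are in the set."""

    BUCKETS = 4096

    def __init__(self):
        # parallel arrays per bucket: sorted addresses plus the objects
        self._addrs = [[] for _ in range(self.BUCKETS)]
        self._objs = [[] for _ in range(self.BUCKETS)]

    def add(self, obj):
        oop = id(obj)
        b = oop & (self.BUCKETS - 1)        # first level: 12-bit hash
        addrs = self._addrs[b]
        i = bisect.bisect_left(addrs, oop)  # second level: binary search
        if i == len(addrs) or addrs[i] != oop:
            addrs.insert(i, oop)
            self._objs[b].insert(i, obj)

    def includes(self, obj):
        oop = id(obj)
        addrs = self._addrs[oop & (self.BUCKETS - 1)]
        i = bisect.bisect_left(addrs, oop)
        return i < len(addrs) and addrs[i] == oop
```

The first level also shrinks the O(n) insertion shifts by the same factor of 4096, which addresses Georg's worry about slow insertion and deletion in very large sets.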
- Dan
squeak-dev@lists.squeakfoundation.org