Musings on Identity (hashing tricks)

Allen Wirfs-Brock Allen_Wirfs-Brock at Instantiations.com
Fri Mar 31 16:44:56 UTC 2000


"lazy hash fields" is a way to get a larger range of identify hash values 
without significantly increasing the average object size. We have 
successfully used all of the techniques described below in various 
implementations.

Basic idea: Asume that the identify hash is not actually accessed for most 
objects. So, use a extra object slot to hold the hash value but only 
allocate this slot when and if the hash value is actually accessed. Exploit 
the fact that the garbage collector moves objects to simplify dynamically 
adding the hash slot.

Details:
Allocate two bits (the hash tag) in the header with the following 
interpretation:
    0  hash value has not been accessed, no hash value assigned.
    1  hash value has been accessed, hash field not yet allocated
    2  hash value has been accessed and the hash field has been added
    3  unused, reserved, special, etc.

Most objects are initially allocated with a hash tag of 0. The first time 
an attempt is made to access the hash value of such an object, its hash tag 
is set to 1 and a hash value is assigned. The value must be assigned in 
such a way that it can be regenerated or retrieve on subsequent accesses. 
There are at least two possible ways to do this: a) use some function on 
the current address of the object as its hash value; b) assign an random 
value as the hash value and store that value in a "look-aside" table, keyed 
by the object's current address. (using the address as the value is simpler 
but may not provide a very good distribution of hash values)

When the hash value is accessed for an object whose hash tag is 1, the 
current address of the object is used to either recompute its hash function 
or to access the look-aside table, as appropriate to the implementation.

When the garbage collector needs to move an object it most check the value 
of the hash tag. If the object to be moved has a hash tag of 1 the GC must 
extent the object in a implementation defined way with an additional field 
to contain the hash value. Common approaches are to added the field at the 
end of the object or as a negative offset from the base of the object. 
After allocating the field the GC computes the hash value or retrieves it 
from the look-aside table and stores the value in the field. The hash tag 
of the object is set to 2. If a look-aside table is used, the object's 
table entry is freed at this time.

When the hash value is accessed for an object whose hash tag is 2, the hash 
value is simply retrieved from the extended hash field.

Various interesting elaboration are possible.  For example, if there are 
permanent objects that can not move the extra hash tag value might be used 
to say use the object's address as the hash value. (Or, if the 
implementation does not use a look-aside table, just use tag 1 to indicate 
this.)

It is sometimes useful, to have the ability to explicitly set an object's 
identify hash to a specific value. For example, the world is simpler if an 
externalized object can be reconstituted with the same identify hash value 
it had when it was externalized. Note that it is perfectly safe for a 
program to assign arbitrary identify hash values to an object up until the 
first time that the object's identify hash is actually accessed. The 
look-aside table variation makes its very easy to do this. Simply add a set 
identity hash primitive that creates look-aside entries for objects whose 
hash tag is 0. (Fail for other tag values)

Happy hashing,
Allen_Wirfs-Brock at instantiations.com





More information about the Squeak-dev mailing list