Musings on Identity (hashing tricks)
Allen Wirfs-Brock
Allen_Wirfs-Brock at Instantiations.com
Fri Mar 31 16:44:56 UTC 2000
"lazy hash fields" is a way to get a larger range of identify hash values
without significantly increasing the average object size. We have
successfully used all of the techniques described below in various
implementations.
Basic idea: Asume that the identify hash is not actually accessed for most
objects. So, use a extra object slot to hold the hash value but only
allocate this slot when and if the hash value is actually accessed. Exploit
the fact that the garbage collector moves objects to simplify dynamically
adding the hash slot.
Details:
Allocate two bits (the hash tag) in the header with the following
interpretation:
0 hash value has not been accessed, no hash value assigned.
1 hash value has been accessed, hash field not yet allocated
2 hash value has been accessed and the hash field has been added
3 unused, reserved, special, etc.
Most objects are initially allocated with a hash tag of 0. The first time
an attempt is made to access the hash value of such an object, its hash tag
is set to 1 and a hash value is assigned. The value must be assigned in
such a way that it can be regenerated or retrieve on subsequent accesses.
There are at least two possible ways to do this: a) use some function on
the current address of the object as its hash value; b) assign an random
value as the hash value and store that value in a "look-aside" table, keyed
by the object's current address. (using the address as the value is simpler
but may not provide a very good distribution of hash values)
When the hash value is accessed for an object whose hash tag is 1, the
current address of the object is used to either recompute its hash function
or to access the look-aside table, as appropriate to the implementation.
When the garbage collector needs to move an object it most check the value
of the hash tag. If the object to be moved has a hash tag of 1 the GC must
extent the object in a implementation defined way with an additional field
to contain the hash value. Common approaches are to added the field at the
end of the object or as a negative offset from the base of the object.
After allocating the field the GC computes the hash value or retrieves it
from the look-aside table and stores the value in the field. The hash tag
of the object is set to 2. If a look-aside table is used, the object's
table entry is freed at this time.
When the hash value is accessed for an object whose hash tag is 2, the hash
value is simply retrieved from the extended hash field.
Various interesting elaboration are possible. For example, if there are
permanent objects that can not move the extra hash tag value might be used
to say use the object's address as the hash value. (Or, if the
implementation does not use a look-aside table, just use tag 1 to indicate
this.)
It is sometimes useful, to have the ability to explicitly set an object's
identify hash to a specific value. For example, the world is simpler if an
externalized object can be reconstituted with the same identify hash value
it had when it was externalized. Note that it is perfectly safe for a
program to assign arbitrary identify hash values to an object up until the
first time that the object's identify hash is actually accessed. The
look-aside table variation makes its very easy to do this. Simply add a set
identity hash primitive that creates look-aside entries for objects whose
hash tag is 0. (Fail for other tag values)
Happy hashing,
Allen_Wirfs-Brock at instantiations.com
More information about the Squeak-dev
mailing list
|