[ENH] LargeArray, ExtendedAttributeDictionary, TupleAttributeDictionary

Stephen Pair spair at acm.org
Tue Aug 13 18:35:00 UTC 2002


Change Set:		EAD
Date:			13 August 2002
Author:		Stephen Pair
Squeak Version:	3.2

Required: EAD-preload.cs (should be filed in before this change set)

This change set adds support for a very large weakly keyed identity
dictionary called ExtendedAttributeDictionary.  It also adds the classes
LargeArray and TupleAttributeDictionary.  LargeArray can also be used as
the basis for other large collections.  This may be of interest to
developers of persistence or distributed object solutions, as each chunk
of a large array can be independently swapped.  Since most other
collections are based on Array, substituting a LargeArray should be a
simple matter.  See the class comments for more details on these three
classes.

The hash function for large (>4096) identity sets (IdentitySet,
IdentityDictionary, WeakIdentityKeyDictionary) has been changed to be
slightly more efficient (this change set will rehash any large instances
of those classes after filein).  The old hash function would spread the
identityHash by multiplying it by (array size // 4096).  This hash
function becomes less efficient as (array size \\ 4096) approaches 4096.
Instead, we now send #hashMultiply to the identityHash if the array size
is >4096.  This seems to spread a SmallInteger more evenly over 30 bits
for typical data sets.  A test that adds the first 100,000 objects in
memory as keys in an IdentityDictionary showed about a 9% improvement
when using #hashMultiply.
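To make the clustering problem concrete, here is a rough Python model of the two spreading strategies. The multiplier and the 30-bit mask below are illustrative assumptions, not Squeak's exact #hashMultiply implementation; the 12-bit (0..4095) identityHash range matches the thresholds described above.

```python
# Rough model of the two hash-spreading strategies. The multiplier and
# 30-bit mask are assumptions for illustration, not Squeak's exact code.

def old_spread(identity_hash, array_size):
    # Old scheme: scale the ~12-bit identityHash by (array size // 4096).
    # When (array size \\ 4096, i.e. mod) is close to 4096, the scaled
    # hashes cover only part of the table and the top slots go unused.
    return identity_hash * (array_size // 4096)

def new_spread(identity_hash):
    # New scheme: multiply and mask, spreading the identityHash over a
    # wide range regardless of the table size.
    return (identity_hash * 1664525) & ((1 << 30) - 1)

SIZE = 8100                     # 8100 \\ 4096 = 4004, close to 4096
hashes = range(4096)            # all possible 12-bit identity hashes
old_slots = {old_spread(h, SIZE) % SIZE for h in hashes}
new_slots = {new_spread(h) % SIZE for h in hashes}

# The old scheme never reaches past slot 4095 of an 8100-slot table;
# the new scheme reaches slots throughout the table.
print(max(old_slots), max(new_slots))
```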

This change set also refactors two things in the collections hierarchy
to support LargeArray:

1) Many collections and streams need to rebuild their underlying arrays
when they need to grow to make more room.  To accomplish this, they have
historically used code like 'array species new: newSize'.  This change
set refactors such cases to use one of four new methods added to Object:
#similarInstance, #similarInstance:, #similarSpeciesInstance, and
#similarSpeciesInstance:.  By making this change, we give the old
collection a chance to initialize the new collection based on its own
state; this is critical for LargeArray, which needs to initialize the
base array class and the chunk size.
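The idea can be sketched in Python (method names are illustrative stand-ins for the Smalltalk selectors, not the actual change-set code):

```python
# Sketch of the #similarInstance: idea: growth code asks the OLD
# collection for a similar new instance instead of calling a bare
# "species new: newSize", so subclasses can carry their state across.

class SimpleArray:
    def __init__(self, size):
        self._slots = [None] * size

    def similar_instance(self, new_size):
        # Default behavior: a fresh instance of the same class.
        return type(self)(new_size)

    def grow_to(self, new_size):
        new = self.similar_instance(new_size)   # hook, not a bare "new:"
        new._slots[:len(self._slots)] = self._slots
        return new

class ChunkedArray(SimpleArray):
    def __init__(self, size, chunk_size=16):
        self.chunk_size = chunk_size
        super().__init__(size)

    def similar_instance(self, new_size):
        # The critical override: propagate configuration that a plain
        # "type(self)(new_size)" call would silently reset to defaults.
        return type(self)(new_size, chunk_size=self.chunk_size)

a = ChunkedArray(4, chunk_size=100)
b = a.grow_to(8)
print(type(b).__name__, b.chunk_size)   # ChunkedArray 100
```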

2) The implementations of #replaceFrom:to:with:startingAt: have been
refactored to allow the replacement collection a chance to override the
default replacement behavior.  Most implementations of
#replaceFrom:to:with:startingAt: have been renamed to
#basicReplaceFrom:to:with:startingAt: and the abstract implementation of
#replaceFrom:to:with:startingAt: now calls a new method called
#startingAt:replace:from:to:, which inverts the receiver and the
replacement collection.  This enables both collections to override the
default behavior when necessary in order to get the maximum use out of
the replacement primitive.  This is a key refactoring for LargeArray,
which would otherwise have to fall back on much slower iterative methods
for handling replacement (and in fact, under certain circumstances, the
behavior would be completely incorrect).
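The inversion described above is a double dispatch, which can be sketched in Python (again with illustrative names standing in for the Smalltalk selectors):

```python
# Double-dispatch sketch of the replacement refactoring: the entry point
# forwards to the REPLACEMENT collection, which may override
# starting_at_replace() with a fast path before falling back on the
# target's basic (primitive-backed) copy.

class Seq:
    def __init__(self, items):
        self.items = list(items)

    def replace_from(self, start, stop, replacement, rep_start):
        # Abstract entry point: invert receiver and replacement so
        # either side gets a chance to supply the fast path.
        replacement.starting_at_replace(rep_start, self, start, stop)

    def starting_at_replace(self, rep_start, target, start, stop):
        # Default: delegate back to the target's basic replacement.
        target.basic_replace_from(start, stop, self, rep_start)

    def basic_replace_from(self, start, stop, replacement, rep_start):
        # Element-by-element fallback, 1-based like the Smalltalk
        # #replaceFrom:to:with:startingAt: protocol.
        for i in range(stop - start + 1):
            self.items[start - 1 + i] = replacement.items[rep_start - 1 + i]

a = Seq("abcdef")
b = Seq("XYZ")
a.replace_from(2, 4, b, 1)      # replace slots 2..4 with b's elements
print("".join(a.items))         # aXYZef
```

A chunked collection would override both hooks to copy chunk by chunk instead of element by element.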

These two refactorings should have general appeal because they make the
system much more receptive to new collection classes.

----- LargeArray class comments -----
I am an array that is segmented into smaller arrays (called chunks). My
chunks can in general be any indexable object.  My indexable
actually contain the chunks, which can be accessed with basicAt: and
basicAt:put:.  The methods at: and at:put: access my elements.

Here are a few examples:

	To create a large array for storing ordinary objects:
	LargeArray new: 1000 chunkSize: 10 arrayClass: Array

	To create a large byte array:
	LargeArray new: 1000 chunkSize: 10 arrayClass: ByteArray

	To create a large string:
	LargeArray new: 1000 chunkSize: 10 arrayClass: String

	To create an empty large array:
	LargeArray new

	To create a WriteStream on a LargeArray of Characters:
	WriteStream on: (LargeArray new: 0 chunkSize: 100 arrayClass: String)
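A toy Python model of the chunked indexing may help: element i (1-based, as in Smalltalk) lives in chunk (i - 1) // chunkSize at offset (i - 1) \\ chunkSize. The layout here is an assumed illustration, not LargeArray's actual code.

```python
class ToyLargeArray:
    """Toy model of an array segmented into fixed-size chunks."""

    def __init__(self, size, chunk_size):
        self.size = size
        self.chunk_size = chunk_size
        n_chunks = (size + chunk_size - 1) // chunk_size
        # basicAt:/basicAt:put: would see these chunk objects directly.
        self.chunks = [[None] * chunk_size for _ in range(n_chunks)]

    def at(self, i):
        # 1-based, like Smalltalk #at:
        c, offset = divmod(i - 1, self.chunk_size)
        return self.chunks[c][offset]

    def at_put(self, i, value):
        c, offset = divmod(i - 1, self.chunk_size)
        self.chunks[c][offset] = value
        return value

big = ToyLargeArray(1000, 10)    # 1000 slots in 100 chunks of 10
big.at_put(25, "x")
print(big.at(25), len(big.chunks))   # x 100
```

Because each chunk is an ordinary small object, a persistence layer can swap individual chunks in and out independently, as noted above.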

----- ExtendedAttributeDictionary class comments -----
I am an IdentityDictionary whose keys are held weakly.  I am designed
for use when you need to potentially associate some attribute with any
object in the system.  I am like a WeakIdentityKeyDictionary with the
following differences:

- my hash function incorporates the object's and its class's identity
hashes (therefore, objects that are subject to #become: operations
should not be added as keys; if the become leaves an object with a
different class, the dictionary must be rehashed)
- I do not provide for finalization and rehash will eliminate all nil
keys
- removeKey: performs much better than it does in
WeakIdentityKeyDictionary (it does not require a full rehash of the
dictionary)
- I aggressively nil out the values of associations whose key is
invalid (i.e., the key has been garbage collected).  During any scan,
or when removing a key, I will check for such associations and release
the value
- #findElementOrNil: might answer a slot that contains an association
with an invalid key if the element was not found.  Users of this method
need to check for that.
- If I grow large, I use a LargeArray to store my associations.  This
improves the performance of incremental GC, which degrades when there
are very large objects in young space or when there are very large
objects with references to objects in young space
- My hash is spread over 30 bits (and could thus map every single object
in memory)...the hash function is about as efficient as possible given
the limited number of identity hash bits
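The first difference above can be sketched with a purely hypothetical combining function (the change set's actual mixing code is in the attachment, not shown here); the point is only that the result depends on BOTH identity hashes, which is why a #become: that changes an object's class invalidates its slot:

```python
# Hypothetical combiner: the constants and packing are assumptions, not
# the change set's actual function. Two 12-bit identity hashes are mixed
# into one 30-bit value; change either input and the hash changes.

def combined_hash(object_hash, class_hash):
    mixed = object_hash * 4096 + class_hash      # pack two 12-bit hashes
    return (mixed * 1664525) & ((1 << 30) - 1)   # spread over 30 bits

h_before = combined_hash(7, 42)   # object (hash 7) with class A (hash 42)
h_after = combined_hash(7, 43)    # same object after a become: that
                                  # leaves it with class B (hash 43)
print(h_before != h_after)        # True: the entry must be rehashed
```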

Use this class when you expect your dictionary to grow very large, when
you don't need to finalize the weakly referenced objects, and/or when
you need better key removal performance.  You may want to periodically
#rehash instances to remove associations with nil keys.  Alternatively,
you can #clearDeadValues if you do not have time to rehash the whole
dictionary.

----- TupleAttributeDictionary class comments -----
I am like my parent in that I can grow efficiently.  I am different in
that my keys are tuples (one or more objects).  My keys are held weakly
(if any item in a tuple is collected, then that association is no longer
valid and lookups on that tuple will fail).  Lookup is based on the
identity of each item in the tuple.

Examples:

	dict at: #(10 11) put: (Point x: 10 y: 11).
	dict at: {vendor. product} put: price.

I can be more efficient than performing multiple dictionary lookups to
find data associated with more than one object.
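The weak, identity-based tuple lookup can be modeled in Python with the weakref module (this is a sketch of the behavior described above, not the Squeak implementation; names are illustrative):

```python
import weakref

class ToyTupleAttributeDict:
    """Keys are tuples of objects, compared by identity of each element
    and held weakly: if any element is collected, the entry dies."""

    def __init__(self):
        self._map = {}   # tuple of ids -> (weakrefs, value)

    def _on_collect(self, ids):
        def callback(_ref):
            self._map.pop(ids, None)   # any dead element kills the entry
        return callback

    def at_put(self, tuple_key, value):
        ids = tuple(id(x) for x in tuple_key)         # identity, not =
        refs = tuple(weakref.ref(x, self._on_collect(ids))
                     for x in tuple_key)
        self._map[ids] = (refs, value)

    def at(self, tuple_key):
        ids = tuple(id(x) for x in tuple_key)
        refs, value = self._map[ids]
        if any(r() is None for r in refs):
            raise KeyError(tuple_key)   # lookup fails on a dead tuple
        return value

class Vendor: pass
class Product: pass

v, p = Vendor(), Product()
d = ToyTupleAttributeDict()
d.at_put((v, p), 19.99)
print(d.at((v, p)))   # 19.99
```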

-------------- next part --------------
A non-text attachment was scrubbed...
Name: EAD-preload.1.cs
Type: application/octet-stream
Size: 3111 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20020813/09798e46/EAD-preload.1.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: EAD.cs
Type: application/octet-stream
Size: 81909 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20020813/09798e46/EAD.obj

