Set versus IdentitySet

Bijan Parsia bparsia at email.unc.edu
Thu Jan 24 16:01:09 UTC 2002


On Thu, 24 Jan 2002, Bijan Parsia wrote:

> Thu, 24 Jan 2002, Scott A Crosby wrote:
> 
> > On Thu, 24 Jan 2002, Bijan Parsia wrote:

[snip]
> > In theory, (IE, assuming very fast custom hash function), Sets give you
> > the ideal performance for large collections, and nearly ideal for
> > small collections, even without my identityHash patch.
> 
> Ok. Do need to watch for needlessly duplicated objects. E.g.,
> MethodDictionary>>scanFor: has a fair number of distinct words (and
> it's hardly a *large* method). If you pop a distinct 
> 	Array with: MethodDictionary with: #size
> into each word's set, you're going to have a fair bit of entirely
> pointless bloat. The identity distinct arrays are serving no purpose.

FWIW, the overhead of object refs over immediate objects (if I'm
understanding it correctly) can be considerable.

I replaced all my refs to IndexFileEntries in my megadbase with msgIDs
(smallints) and refstreamed out. Not only was it *considerably* faster,
but the generated file is smaller, to wit 14.1 megs vs. 18.7 megs.

(For our watchers, please recall that this is with *no* stop word
filtering, header exclusion, or other sensible measures.)

(Is there a way to get the bytesize of an object in memory?)

I'm not clear on RefStream vs. ImageSegments. Saving the image is, thus
far, the *clear* super speed winner. It takes seconds, whereas the
alternatives I've tried thus far take minutes at the least.

Some other numbers:

Smalltalk garbageCollect; bytesLeft.  301230180
dbase _ nil. "For IndexFileEntry one."
Smalltalk garbageCollect; bytesLeft.  343931768

So 42.7 megs? Is that right?

Smalltalk garbageCollect; bytesLeft.  344059424
dbase _ nil. "For msgId one."
Smalltalk garbageCollect; bytesLeft.  378220652

So 34.2 megs?
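
The measurement above can be wrapped into a single do-it. This is only a minimal sketch, assuming that dbase is a global or workspace variable holding the only reference to the index, and that := assignment is accepted alongside _ in the image at hand. The temporaries before and after are hypothetical names, not from the original post.

```smalltalk
"Sketch: estimate the heap retained by dbase by dropping it and comparing free space."
| before after |
before := Smalltalk garbageCollect; bytesLeft.   "free bytes with the index still alive"
dbase := nil.                                    "assumed to be the only reference"
after := Smalltalk garbageCollect; bytesLeft.    "free bytes once the index is collected"
Transcript show: (after - before) printString; cr.

"The same subtraction on the figures reported above:"
343931768 - 301230180.   "42701588 bytes, IndexFileEntry version"
378220652 - 344059424.   "34161228 bytes, msgId version"
```

Both differences agree with the rounded figures of 42.7 and 34.2 megs when a meg is read as 1,000,000 bytes.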

The disk size of the image with both of these is 64.5 megs.

Image without IndexFileEntry one is 43.8 megs.

Image without either is 27.4 megs.
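
The per-index cost on disk follows by subtraction from the three image sizes. A small sketch, assuming that the 43.8 meg image still contains the msgId index; sizes are written in tenths of a meg so the arithmetic stays in exact integers, and the variable names are hypothetical.

```smalltalk
"Image sizes in tenths of a meg, as reported above."
| both withoutIndexFileEntry withoutEither |
both := 645.
withoutIndexFileEntry := 438.
withoutEither := 274.
Transcript show: (both - withoutIndexFileEntry) printString; cr.            "207, i.e. 20.7 megs for the IndexFileEntry index"
Transcript show: (withoutIndexFileEntry - withoutEither) printString; cr.   "164, i.e. 16.4 megs for the msgId index"
```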

(I'll also note that these are using strings as keys.)

Nilling out the 'from' key (a perfect stop word for email address) saves
325068 bytes (IndexFileEntry) or 115328 bytes (msgID).

Just some rough figures for rough ideas :)

Cheers,
Bijan Parsia.



