Set versus IdentitySet

Bijan Parsia bparsia at email.unc.edu
Thu Jan 24 16:13:30 UTC 2002


On Thu, 24 Jan 2002, Scott A Crosby wrote:

> On Thu, 24 Jan 2002, Bijan Parsia wrote:
> 
> > Thu, 24 Jan 2002, Scott A Crosby wrote:
> >
> >
> > Right, but I think you're hitting problems due to popping actual strings
> > in the found sets. If you're using some sort of key into a large text
> > files or set of text files, it's a bit easier. In this respect, it's
> > becoming *really* worth it to generalize Celeste's DB.
> 
> Correct. Keys are good, but only if they have equivalence.

Yep.

[snip]
> > Absolutely, which is why one should use something else ;) Either a set for
> > the collection or some trick on the key. I've not profiled, but I suspect
> > that the Array solution is going to be undesirable from a memory
> > standpoint. I.e., if you used a fused class/selector symbol (e.g., (aClass
> > name, '>>', aSelector) asSymbol) then you avoid a buncha problems. (in
> 
> That'll *hurt*..  You'll spend about 98% of your time in
> String(Symbol)>>hash or String(Symbol)>>copy:from:to:. You'll also pay a
> *lot* of memory for the symbols.

Ok. I think you're more focused on indexing time. I'm sorta assuming
indexing time is free ;)

> ',' when used to concatenate more than a couple of things is not cheap.

Absolutely.

> The best bet would be a custom class that stored the symbol for the
> selector, the class object, and a hash-slot that caches the hash of both
> of the above.

Yes. I didn't mean that as an implementation strat (though it may be worth
trying it :)).

> You have the engine use Set's to index these.
> 
> Foo>>=
>   ^class == class & (selector == selector)
> 
> Does such a class exist already?

Don't think so.

> If not, I'll write it and include it in my next update. (It'll be public,
> and out sometime later today or tomorrow.)

Great.


[snip]
> > Ok. Do need to watch for needlessly duplicated objects. E.g.,
> > MethodDictionary>>scanFor: has a fair number of distinct words (and
> > it's hardly a *large* method). If you pop a distinct
> > 	Array with: MethodDictionary with: #size
> > into each word's set, you're going to have a fair bit of entirely
> > pointless bloat. The identity distinct arrays are serving no purpose.
> 
> If you're doing:
> 
>    di add: (Array with: MethodDictionary with: #size).
> 
> No bloat. You're using those 12-byte Arrray's as the Documents.. Which are
> shared for each term contained in that document.

Ah, yes, sorry, got confused. However, see my measurements for object refs
vs. smallintegers in my prior message. The savings is real for indexs of
large collections.

> Your adaptor will take
> one of those in, then read the source for that method, then spit out
> terms.

Right.

> Though, You won't want to actually use Arrays, because you'll pay all your
> time in hashing or string comparision.

Hmm?

> What do you mean by identity dictinct arrays?

Arrays that are = but not ==. Brainfart as they don't happen :)

I've not profile the speed difference in Id Sets vs. regular sets for
these cases, so I don't know if it's worth pursuing it in the common
case. However, you can recover mulimegs by using SmallInteger
keys. Celeste will already have a dictionary of msgId -> IndexFileEntry in
memory, so the move there is trivial. (Everything in celeste is already
working with sets of msgIds, mostly.)

Cheers,
Bijan Parsia.




More information about the Squeak-dev mailing list