Moore's law and why persistence may not be necessary.

Thu Jan 24 03:44:35 UTC 2002

On Wed, 23 Jan 2002, Scott A Crosby wrote:

> On Wed, 23 Jan 2002, Tim Rowledge wrote:
> 
> > How fast does the indexing run? 

Not especially. It's not unreasonable.

> > Fast enough to make it sensible to drop
> > the indices when saving an image and regenerating on a
> > lazy-initialisation basis?
> 
> Maybe.. 

Much depends on the size of the material indexed, etc. etc. My *guess* is
that ReferenceStreaming, given the right objects, is going to be faster
than regeneration...but that's just a guess. I've not gotten profile cases
together yet to prove this.
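
To make the trade-off concrete, here is a minimal sketch of the "drop the
index on save, regenerate lazily" option Tim asks about. It is written in
Python purely as an illustration, not as Squeak code; the class and method
names (LazyIndexedCorpus, save, load, lookup) are hypothetical and not part
of Scott's engine.

```python
import json


class LazyIndexedCorpus:
    """Documents plus a word index that is rebuilt on first use after loading."""

    def __init__(self, documents):
        self.documents = dict(documents)  # doc id -> text
        self._index = None                # not built until the first lookup
        self.rebuilds = 0                 # how many times the index was regenerated

    def save(self):
        # Persist only the documents; the index is deliberately dropped.
        return json.dumps(self.documents)

    @classmethod
    def load(cls, saved):
        # A freshly loaded corpus has no index yet.
        return cls(json.loads(saved))

    def _build_index(self):
        index = {}
        for doc_id, text in self.documents.items():
            for word in text.lower().split():
                index.setdefault(word, set()).add(doc_id)
        self.rebuilds += 1
        return index

    def lookup(self, word):
        if self._index is None:           # first query after load pays the rebuild cost
            self._index = self._build_index()
        return self._index.get(word.lower(), set())


corpus = LazyIndexedCorpus({"1": "squeak image save", "2": "index the image"})
restored = LazyIndexedCorpus.load(corpus.save())
hits = restored.lookup("image")           # triggers exactly one rebuild
```

The alternative (streaming the index out alongside the documents) swaps
that one-off rebuild cost for a larger saved form; which wins is exactly
what the profile cases would have to show.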

> With the stock VM[*], and using 10mb of method text, the indexing
> rate is about 25kb of text/second, with the index about 70% of the size of
> the text.

I suspect that this is with relatively unsophisticated dbs. I.e., the
default index stores all the indexed strings in memory
(...hmm. Repeatedly? I'm not in front of the machine I was working on
this with).

(Talking memory here.)
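
For a sense of scale, a back-of-envelope calculation with the figures Scott
quotes above. It assumes "mb" and "kb" mean binary megabytes and kilobytes,
and that the rate stays constant over the whole run; both are assumptions,
not measurements.

```python
TEXT_BYTES = 10 * 1024 * 1024       # "10mb of method text"
RATE_BYTES_PER_SEC = 25 * 1024      # "about 25kb of text/second"
INDEX_PERCENT = 70                  # "index about 70% of the size of the text"

# Time to regenerate the whole index from scratch on the stock VM.
rebuild_seconds = TEXT_BYTES / RATE_BYTES_PER_SEC      # 409.6 s, a little under 7 minutes

# Memory taken by the index itself (integer arithmetic to avoid rounding noise).
index_bytes = TEXT_BYTES * INDEX_PERCENT // 100        # 7340032 bytes, i.e. 7 MB

# With a VM "over twice as fast", the rebuild takes at most half as long.
faster_vm_seconds_upper_bound = rebuild_seconds / 2    # 204.8 s
```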

> According to results from Bijarn, 

Bijarn? That's a new one :)

> he indexes 90mb worth of squeak email in
> about 25mb, and that's with full text.

Less than that, I'd say. I also really suspect that using IndexFileEntries
rather than MsgIDs for the keys to the indexed document is bulking things
up a lot. Better tests, and I hope a Celeste you can try, tomorrow.

> One of the problems we're facing is we're building collections far bigger
> than normal. 

Where "normal" doesn't include Celeste :)

This is good though!

> Because few collections in squeak are even a hundred
> elements, this code is exposing all kinds of latent performance issues in
> Set, IdentitySet, Dictionary, #hash, and how they all interact.

Yes.

> Thus, performance can vary all over the map from what it should be.

I'll take Scott's word on this. I'm *very* sanguine about it all. It's
looking good.
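
One generic way a hashed collection can fall over once it grows is a hash
that yields too few distinct values for the table size. The sketch below is
in Python and is only an illustration: the linear-probing table and the
multiplicative scramble are assumptions chosen for the demonstration, not
Squeak's actual Set or #hash implementation.

```python
def average_probes(n_items, hash_bits, table_size):
    """Insert n_items integer keys into a linear-probing table.

    Returns the mean number of slots inspected per insertion.
    """
    mask = (1 << hash_bits) - 1
    table = [None] * table_size
    total_probes = 0
    for key in range(n_items):
        # Multiplicative scramble, then keep only the low hash_bits bits.
        h = ((key * 2654435761) & 0xFFFFFFFF) & mask
        slot = h % table_size
        probes = 1
        while table[slot] is not None:    # walk forward to the next free slot
            slot = (slot + 1) % table_size
            probes += 1
        table[slot] = key
        total_probes += probes
    return total_probes / n_items


# Same keys, same table; only the number of usable hash bits differs.
narrow = average_probes(n_items=2000, hash_bits=8, table_size=8192)
wide = average_probes(n_items=2000, hash_bits=32, table_size=8192)
```

With only 256 possible hash values, the 2000 keys pile into one long run of
occupied slots and each insertion has to walk through it, whereas the wide
hash gives every key its own slot. A collection of a hundred elements
would never show the difference.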

[snip]
> > Is there a filter/adaptor to exclude a (longish) list of 'common' words
> > yet? It seems to me that an index of emails could sensibly exclude many
> 
> My cheezy simple (test) filter has this for testing purposes, but someone
> should rewrite a new adaptor that does this cleanly.

Which I dumped for the email. And in the email I *am* excluding
attachments, but not excluding (some) headers. I.e., I'm actually indexing
MailMessage>>formattedText. One tuning is to exclude all headers and index
them separately, as much for the convenience of searching just
headers/specific headers as anything else. Keys like From will point to
*large* sets of documents (19000!); that takes up space and, I suspect,
gets slow in the indexing, for the reasons Scott alludes to above. Note
that Scott feels confident that one can improve large collection handling;
even if not, standard techniques can do a lot.
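
The two tunings discussed here (dropping common words, and indexing headers
separately from the body) can be sketched as follows. This is Python used
purely as an illustration; the stop-word list, the tokenizer and the
function names are hypothetical and are not the filter or adaptor from
Scott's code.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokens(text):
    """Lowercased alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def index_messages(messages):
    """Build separate header and body indexes.

    messages maps a message id to a (headers dict, body text) pair.
    The header index is keyed by (header name, word); the body index by word.
    Both map to sets of message ids.
    """
    header_index = defaultdict(set)
    body_index = defaultdict(set)
    for msg_id, (headers, body) in messages.items():
        for name, value in headers.items():
            for word in tokens(value):
                header_index[(name.lower(), word)].add(msg_id)
        for word in tokens(body):
            body_index[word].add(msg_id)
    return header_index, body_index


messages = {
    "<1@example.org>": ({"From": "alice@example.org", "Subject": "Indexing the image"},
                        "The index is fast"),
    "<2@example.org>": ({"From": "bob@example.org", "Subject": "Celeste"},
                        "Celeste and the index"),
}
header_index, body_index = index_messages(messages)
from_alice = header_index[("from", "alice")]   # header-only search
about_index = body_index["index"]              # body-only search
```

Note that the postings here hold only message id strings, in the spirit of
the MsgID remark earlier, and that a header search never touches the much
larger body index.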

> Patience; the code has only existed for a day. :)

And nice code it is. I really recommend trying it out. It's simple, *dead*
easy to use. Scott, wanna do up a SqueakNews article? I'll help!

> > words (d00d, l337, h4x0r among them :-) )and it might be useful to have
> > a separate index for header content. The latter might be small enough to
> > be useful even on ordinary machines with memory in the 64mb range.

Er...I think we've mentioned these :) All on the list.

For small mail collections, the existing code is Ok. Much better than I
believed. But I will integrate it along these lines with drop backs for
when you can't afford the memory.

> I wrote the engine which is now basically done.

And very nice it is. Have I mentioned how cool it is? It's really
nice. The large collection fixes are *essential*, IMHO.

> It just needs testing. 

And test cases, I'll definitely help with these.

> I
> lack the understanding of the system to actually integrate it into every
> part of the system that should use it.

I'm going to work on the parts I know :)

> I believe Bijarn 

Again? What's with the "Bijarn"? :)

> is working on an adaptor that only indexes mail headers.

Yes. Going to do full Celeste integration. #participateHas: etc. will get
DocumentIndexs of their own. Ooo, having a set of categories that are
email addresses/names is trivial. (I.e., replace the category list with a
list of email addresses, click on one to see that person's mail...a lead
pipe cinch!) Just need largelists to fix the display speed and Celeste, as
they say, will SUPER ROCK.

Three apps in my mind, good examples and testers:
	1) Celeste. If it can handle 19,000 messages with 90megs of text
	   it can handle an awful lot.
	2) Searching method text. No more "it'll take a few minutes"!!
	   This is a medium sized problem.
	3) Swikis. Lots of swikis are very small, but still! Enables
	   lots of things, IMHO.

There's enough variety in there to work out a good general interface and a
flexible useful engine. And they are, IMHO, absolutely compelling.

So, this is *my* first SqFoundation project :) I'll happily coordinate the
app side of things, if other folks want to get in. Scott can do engine
tweaks ;)

[snip]
> [*] My VM with my method cache and using ITIMER is over twice as fast at
> indexing. ITIMER could probably be replaced with Raab's fix to
> primitiveResponse.

Er...can I get a copy? 

Cheers,
Bijan Parsia.