Moore's law and why persistence may not be necessary.

Wed Jan 23 14:48:49 UTC 2002

On 23 Jan 2002, Cees de Groot wrote:

> Richard A. O'Keefe <ok at atlas.otago.ac.nz> said:
> >It's twice as much RAM as I have on this 64-bit machine, and you'd be
> >amazed/dismayed how much memory X11 and other things take up already.
> >In fact, I've only got about that much swapspace left.
> >
> The point that Scott was making, I think, is that for, say, a 256Mb
> index you'd start a ~300Mb image and add around the same amount of swap
> space. Now, apart from the fact that you need a 300Mb image file sitting
> on disk and need to provide 300Mb swap space (100% waste), you now have
> nicely delegated all disk I/O to the Linux kernel, which is bound to be
> much better/faster at it than you are on the Smalltalk level.

Thinking about this: the index, as constructed, does have the locality
necessary to swap extremely well. The kernel will do things far faster
than I ever could.

But, fullGC has *no locality* and will just *die*.

But the gain from that locality could be useful if you have several
indexes in different images [see below] on the same machine. If you
have enough RAM to fit one or two images in their entirety AND the used
parts of several other images, and you don't expect fullGCs to happen in
more than one image concurrently, then yes, you may be able to
successfully index, say, 4gb of text in 2gb of RAM and 2gb of swap by
using 4 images.

(Shameless plug) Obviously, to make this work, you'll want to minimize
fullGCs by using the VM root-table-overflow patch and my antiGC patch. :)

>
> And Scott doesn't need to write a persistence engine, probably another
> important point :-)

I got a full text index working in 8 hours, rather than a few weeks.  :)

And does Squeak *really* need a persistence engine, other than the image?
Who, unless they're storing multimedia in the image, is going to have an
image that big in practice? Multimedia can easily be dumped into
ImageSegments.

And the memory use isn't that bad now that I measure it right: 10mb of
method source goes into 7mb of index (indexing all alphabetic words >3
chars). That would indicate that one could index 200mb of
email/swiki/smalltalk in <200mb.
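
To make "indexing all alphabetic words >3 chars" concrete, here is a
minimal sketch of that kind of in-memory full-text index. It is written in
Python for illustration only and is not the Squeak implementation discussed
in this thread; the names `build_index` and `search` and the sample
documents are made up for the example.

```python
import re
from collections import defaultdict

# Runs of ASCII letters count as "alphabetic words" in this sketch.
WORD = re.compile(r"[A-Za-z]+")


def build_index(documents, min_length=4):
    """Map each lowercased word of at least min_length letters
    (i.e. more than 3 chars) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD.findall(text):
            if len(word) >= min_length:
                index[word.lower()].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that contain every given word."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample data, not taken from the real corpus.
docs = {
    "a": "Moore's law makes memory cheap",
    "b": "The kernel handles swap and memory",
}
index = build_index(docs)
both = search(index, "memory")            # documents "a" and "b"
only_b = search(index, "kernel", "memory")  # document "b" only
```

Everything lives in ordinary in-memory collections, which is the point
being argued: there is no persistence layer to write, and paging is left
to the operating system.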

200mb is a huge amount of data.

1. 5 years of the Squeak mailing list (assuming the same traffic we have
     now: 30mb/9 months).

2. 20 Squeak images' worth of code, likely enough for every changeset
     version ever, and for many (all?) major external Smalltalk packages.

3. 5k swiki pages, each one several screenfuls (20kb).
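
The arithmetic behind these estimates can be checked with a short sketch.
It uses only the figures quoted in this message (30mb per 9 months of list
traffic, 5k swiki pages at 20kb each, and 7mb of index per 10mb of source).
The function names, and the choice of 1mb = 1000kb, are assumptions made
for the example; treating the 7:10 ratio measured on method source as
valid for email is also an assumption.

```python
def projected_size_mb(sample_mb, sample_months, horizon_months):
    """Extrapolate a corpus size linearly from a measured traffic sample."""
    return sample_mb * horizon_months / sample_months


def index_size_mb(source_mb, index_mb_per=7, source_mb_per=10):
    """Estimate index size from the measured 7mb index per 10mb source."""
    return source_mb * index_mb_per / source_mb_per


# 5 years of mailing list at 30mb per 9 months.
mailing_list_mb = projected_size_mb(30, 9, 5 * 12)

# 5k swiki pages at 20kb each (assuming 1mb = 1000kb).
swiki_mb = 5000 * 20 / 1000

# Index size for the mailing list, if the 7:10 ratio carries over.
mailing_list_index_mb = index_size_mb(mailing_list_mb)

print(mailing_list_mb, swiki_mb, mailing_list_index_mb)
```

By these figures the mailing list comes to 200mb and the swiki pages to
about 100mb, and the mailing list index would be about 140mb.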

I learned an important lesson from a friend: computers are getting so much
better that old decisions should often be rethought. I see this
lesson all over the place, from the delayed Itanium to diskbasing a
database. Wait too long, and your design decisions are crap. :)

In this case, I realized a year ago that memory is so cheap that you can
usually just stuff everything there... and it'll fit!! (Although you can
inflate memory usage by including binary literals like images, movies,
sound, etc.  Those don't count; they have no internal references and thus
are invisible to GC. They can be flushed to disk without GC noticing at
all.)

It's hard to find 200mb worth of human-made data that *requires* being in
memory. These examples, indexing parts of the universe of Squeak, are the
first personal example I've seen where I might need even a few hundred
megs.

But it looks like 500mb is more than enough to index ALL 3 OF THE ABOVE:
all the code Squeak has ever had in the image, the mailing list, the
swiki. That's <$100 of memory to index everything the Squeak universe has
done in the last *5 years*, perhaps *ever*.

Programmer time is always lacking. It will take concrete evidence to
convince me that diskbasing this, or other things, is worth
the effort and complexity. $100 in RAM is enough to index the entire
Squeak universe. $200 is twice that.

If someone does want a diskbased version of this, reply to this and
explain what your data looks like and why, and I'll give a few tricks
for writing adaptors that will allow manual or automatic (swap-based)
diskbasing.

If you think I'm in error in this case, I'm happy to be convinced
otherwise.

Scott