Cheap updates

Richard A. O'Keefe ok at atlas.otago.ac.nz
Wed Jun 7 02:27:52 UTC 2000


I wrote:
	> My suggestion was to prime the table of a dictionary-based compressor
			       *****
	> with the class and instance variable names (I didn't say "as 
	> one long string" but that's what I meant) and let it compress
	> the text as a string, in which case "rate" and "rate:" would 
	> _both_ benefit, and so would "primeRate" (because "ate" would 
	> be found).  It seems likely that this would do rather
	> better than compressWithTable: followed by gzipping.
	
Andreas Raab completely misunderstood.
	"This goes with seeding"
	gz _ GZipWriteStream on: String new.
	(BitBltSimulation allInstVarNames) do:[:iv| gz nextPutAll: iv].

Note that this ***physically includes*** the class and instance
variable names in the encoded stream.  That's not "priming".
Indeed, my message explicitly said that my suggestion was to *start*
the compressor with a non-empty dictionary.

The point is that once the sender and receiver *know* which class is
in question, they also *know* what the class and instance variable
names are, so *they don't have to be transmitted*.  Both the sender
and the receiver can start with non-empty dictionaries.
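
Concretely, here is a sketch of what I mean (illustrative only; the
name primingTextFor is just something made up for this sketch):

    "Both ends derive the priming text from the class name alone,
     so the names themselves never have to be transmitted."
    primingTextFor _ [:className | | cls ws |
        cls _ Smalltalk at: className.
        ws _ WriteStream on: String new.
        cls allInstVarNames do: [:name | ws nextPutAll: name].
        cls allClassVarNames do: [:name | ws nextPutAll: name].
        ws contents].
    primingTextFor value: #BitBltSimulation.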

For a comparison that shows what I did mean:

    gz _ nil.
    n _ 0.
    BitBltSimulation selectorsDo: [:sel|
        gz _ GZipWriteStream on: String new.
        gz nextPutAll: (BitBltSimulation sourceCodeAt: sel).
        gz close.
        n _ n + gz encodedStream size.
    ].
    "Methods compressed separately WITHOUT priming"
    n
=>  42462 (unpatched Squeak 2.7 on a Mac)

    d _ BitBltSimulation allInstVarNames ,
        BitBltSimulation allClassVarNames.
    gz _ GZipWriteStream on: String new.
    d do: [:name | gz nextPutAll: name].
    gz close.
    m _ gz encodedStream size.	"411"
    n _ 0.
    BitBltSimulation selectorsDo: [:sel|
        gz _ GZipWriteStream on: String new.
        d do: [:name | gz nextPutAll: name].	"simulate priming"
        gz nextPutAll: (BitBltSimulation sourceCodeAt: sel).
        gz close.
        n _ n - m + gz encodedStream size.
    ].
    "Methods compressed separately WITH (simulated) priming"
    n
=>  37645

So we have
    99993	Method texts with no compression
    42462	Method texts compressed separately WITHOUT priming (42%)
    37645	Method texts compressed separately WITH priming (38%)

This is _simulated_ priming because the point of priming is that you
_don't_ send the dictionary, which is why I subtract m off.

It is quite clear that compressing all the methods together does better
still, as one would expect.  That's appropriate for FileOuts.  But for
ChangeSets, where only a few methods might be affected, priming definitely
_does_ pay.  Priming could probably be taken a step or two further:  it
might pay to include the superclass method names as well.
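
For instance (a sketch only, nothing I have measured), the priming
material could be extended with the selectors understood higher up
the hierarchy:

    "Extend the priming material with the selectors defined in the
     superclasses, on the theory that they recur in method bodies."
    d _ OrderedCollection new.
    d addAll: BitBltSimulation allInstVarNames.
    d addAll: BitBltSimulation allClassVarNames.
    BitBltSimulation allSuperclasses do: [:cls |
        cls selectors do: [:sel | d add: sel asString]].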

Would priming help FileOuts?  If we compress all the BitBltSimulation
methods together, we get this:

    99993	Method texts
    24418	Method texts compressed together WITHOUT priming (24%)
    24312	Method texts compressed together WITH priming (24%)

When you've got a _lot_ of text (100kb) to compress, priming helps,
but helps very little.  So (simple-minded) priming wouldn't help
FileOuts much.
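
The "compressed together" figures can be reproduced along these lines
(a sketch; the primed case writes the simulated dictionary d first and
subtracts its compressed-alone size m, just as before, while the
unprimed case drops both):

    gz _ GZipWriteStream on: String new.
    d do: [:name | gz nextPutAll: name].	"simulate priming"
    BitBltSimulation selectorsDo: [:sel |
        gz nextPutAll: (BitBltSimulation sourceCodeAt: sel)].
    gz close.
    gz encodedStream size - m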

The conclusion

Andreas Raab wrote:
	One thing that can help for really huge amounts of data is to use a
	predefined fixed dictionary.

As we've seen, it's precisely with large amounts of data that priming
pays off less.

	Stuff like bzip...

but again, bzip only really pays off on _large_ data sets.

It's fairly clear that some sort of language-sensitive compression
could do better than gzipping raw text.  If you have
	x _ a foo: b bar: c.
	y _ x foo: c bar: b.
you have a repetition of foo:bar: which gzip won't see, because the
material in between is different.
	_x foo:bar: a b c _y foo:bar: x c b.
(prefix Polish) would in that case compress better.




