Cheap updates
Richard A. O'Keefe
ok at atlas.otago.ac.nz
Wed Jun 7 02:27:52 UTC 2000
I wrote:
> My suggestion was to prime the table of a dictionary-based compressor
*****
> with the class and instance variable names (I didn't say "as
> one long string" but that's what I meant) and let it compress
> the text as a string, in which case "rate" and "rate:" would
> _both_ benefit, and so would "primeRate" (becase "ate" would
> be found). It seems likely that this would do rather
> better than compressWithTable: followed by gzipping.
Andreas Raab completely misunderstood.
"This goes with seeding"
gz _ GZipWriteStream on: String new.
(BitBltSimulation allInstVarNames) do:[:iv| gz nextPutAll: iv].
Note that this ***physically includes*** the class and instance
variables names in the encoded stream. That's not "priming".
Indeed, my message explicitly said that my suggestion was to *start*
the compressor with a non-empty dictionary.
The point is that once the sender and receiver *know* which class is
in question, they also *know* what the class and instance variable
names are, so *they don't have to be transmitted*. Both the sender
and the receiver can start with non-empty dictionaries.
For a comparison that shows what I did mean:
gz _ nil.
n _ 0.
BitBltSimulation selectorsDo: [:sel|
gz _ GZipWriteStream on: String new.
gz nextPutAll: (BitBltSimulation sourceCodeAt: sel).
gz close.
n _ n + gz encodedStream size.
].
"Methods compressed separately WITHOUT priming"
n
=> 42462 (unpatched Squeak 2.7 on a Mac)
d _ BitBltSimulation allInstVarNames
addAll: (BitBltSimulation allClassVarNames).
gz _ GZipWriteStream on: String new.
d do: [:name | gz nextPutAll: name].
gz close.
m _ gz encodedStream size. "411"
n _ 0.
BitBltSimulation selectorsDo: [:sel|
gz _ GZipWriteStream on: String new.
d do: [:name | gz nextPutAll: name]. " simulate priming"
gz nextPutAll: (BitBltSimulation sourceCodeAt: sel).
gz close.
n _ n - m + gz encodedStream size.
].
"Methods compressed separately WITH (simulated) priming"
n
=> 37645
So we have
99993 Method texts with no compression
42462 Method texts compressed separately WITHOUT priming (42%)
37645 Method texts compressed separately WITH priming (38%)
This is _simulated_ priming because the point of priming is that you
_don't_ send the dictionary, which is why I subtract m off.
It is quite clear that compressing all the methods together does better
still, as one would expect. That's appropriate for FileOuts. But for
ChangeSets, where only a few methods might be affected, priming definitely
_does_ pay. Priming could probably be taken a step or two further: it
might pay to include the superclass method names as well.
Would priming help FileOuts? If we compress all the BitBltSimulation
methods together, we get this:
99993 Method texts
24418 Method texts compressed together WITHOUT priming (24%)
24312 Method texts compressed together WITH priming (24%)
When you've got a _lot_ of text (100kb) to compress, priming helps,
but helps very little. So (simple-minded) priming wouldn't help
FileOuts much.
The conclusion
Andreas Raab wrote:
One thing that can help for really huge amounts of data is to use a
predefined fixed dictionary.
As we've seen, it's precisely with large amounts of data that priming
pays off less.
Stuff like bzip...
but again, bzip only really pays off on _large_ data sets.
It's fairly clear that some sort of language-sensitive compression
could do better than gzipping raw text. If you have
x _ a foo: b bar: c.
y _ x foo: c bar: b.
you have a repetition of foo:bar: which gzip won't see, because the
material in between is different.
_x foo:bar: a b c _y foo:bar: x c b.
(prefix Polish) would in that case compress better.
More information about the Squeak-dev
mailing list
|