Token-based source compression

Stefan Matthias Aust sma at netsurf.de
Mon Aug 16 00:20:20 UTC 1999


I like the idea, mentioned on this list a while ago, of compressing
source files by tokenizing the most frequently used words.  Therefore, I
compiled the following list of the most frequent tokens.
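
A census along these lines can be taken with a Bag.  This is only a rough
sketch; the whitespace tokenization and the top-20 cut-off are my own
choices, not necessarily how the numbers below were produced:

  | bag sorted |
  "Count every whitespace-delimited token in the sources file."
  bag := Bag new.
  (FileStream readOnlyFileNamed: 'SqueakV2.sources') contentsOfEntireFile
      substrings do: [:token | bag add: token].
  "Bag>>sortedCounts answers count->element associations, largest first."
  sorted := bag sortedCounts.
  (sorted copyFrom: 1 to: 20) do: [:assoc |
      Transcript show: assoc key printString; tab; show: assoc value; cr]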

The list is sorted by maximal gain, assuming that each token is encoded
as a two-byte sequence, for example (255 asCharacter) as escape character
followed by another character that indexes the token.  I don't like the
idea of restricting tokens to 7-bit ASCII.
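
To make the scheme concrete, here is a rough sketch of an encoder and
decoder.  The method names are made up, and the naive copyReplaceAll:
would also match inside longer words, so a real tokenizer would need to
match whole tokens and substitute longer tokens first:

  compress: aString tokens: tokens
      "Replace each token by the escape character (255 asCharacter)
       followed by the character whose value is the token's index."
      | result |
      result := aString.
      tokens doWithIndex: [:token :i |
          result := result
              copyReplaceAll: token
              with: (String with: 255 asCharacter with: i asCharacter)].
      ^result

  decompress: aString tokens: tokens
      "Expand every escape pair back into the original token."
      | in out ch |
      in := ReadStream on: aString.
      out := WriteStream on: (String new: aString size).
      [in atEnd] whileFalse:
          [ch := in next.
           ch = 255 asCharacter
               ifTrue: [out nextPutAll: (tokens at: in next asInteger)]
               ifFalse: [out nextPut: ch]].
      ^out contents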

SqueakV2.sources (5602465 bytes)

token              occurrences  gain (bytes)
methodsFor:            13421   120789
self                   23578    47156
ifTrue:                 8203    41015
stamp:                  7192    28768
ifFalse:                4196    25176
the                    16931    16931
instanceVariableNames:   833    16660
receiver                2159    12954
class                   4225    12675
nextPutAll:             1083     9747
classVariableNames:      535     9095
accessing               1243     8701
DynamicInterpreter       518     8288
aStream                 1612     8060
poolDictionaries:        535     8025
successFlag              791     7119
Answer                  1692     6768
instance                1081     6486
selector                1026     6156
Smalltalk                838     5866
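
The gain column is simply occurrences * (size - 2), since an n-byte token
shrinks to a two-byte escape sequence, for example:

  13421 * ('methodsFor:' size - 2)   "=> 120789"
  23578 * ('self' size - 2)          "=> 47156"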

That is, the total gain for entries 1..20 is 406435 bytes.

1..32:  468155, 1..64:  586451, 1..96:  671421

That is, you can save about 0.6 MB of 5.6 MB, or roughly 10%.

Not as much as I'd hoped.  How disappointing.

For Squeak2.4c.changes (5631817 bytes) you get a very similar picture.
InterpreterProxy has 1549 occurrences, but otherwise you get nearly the
same top 20.  Here, the gain is:  1..20: 371375, 1..32: 424620,
1..64: 527356, 1..96: 603985, 1..255: 844875.


bye
--
Stefan Matthias Aust  //  Before we fall, we'd rather stand out.




