Token-based source compression
Stefan Matthias Aust
sma at netsurf.de
Mon Aug 16 00:20:20 UTC 1999
I like the idea of compressing source files by tokenizing the most
frequently used words, as mentioned on this list earlier. Therefore, I
compiled the following list of the most frequent tokens.
The list is sorted for maximal gain, assuming that each token is encoded
as a two-byte sequence, for example (255 asCharacter) as the escape
character followed by another character as the encoding. I don't like the
idea of restricting tokens to 7-bit ASCII.
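A minimal sketch of that escape scheme (in Python, since the post gives no code): byte 255 introduces a one-byte table index, so every replaced token costs two bytes in the output. The function names and the greedy first-match lookup are my assumptions, not anything from the post; a real implementation would also need to escape literal 255 bytes in the source and prefer longest-match, both omitted here.

```python
# Hypothetical sketch of the two-byte token encoding described above.
# ESCAPE (255) marks a token reference; the next byte is the table index.
ESCAPE = 0xFF

def encode(text: str, table: list[str]) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        for idx, tok in enumerate(table):
            if text.startswith(tok, i):
                out += bytes([ESCAPE, idx])   # 2 bytes replace len(tok) bytes
                i += len(tok)
                break
        else:
            out += text[i].encode('latin-1')  # 8-bit clean, not 7-bit ASCII
            i += 1
    return bytes(out)

def decode(data: bytes, table: list[str]) -> str:
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            out.append(table[data[i + 1]])    # expand table entry
            i += 2
        else:
            out.append(chr(data[i]))
            i += 1
    return ''.join(out)
```

So a token of length n occurring c times gains c * (n - 2) bytes, which is the "gain" column in the table below.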
SqueakV2.sources (5602465 bytes)
token                     occurrences   gain (bytes)
methodsFor:                     13421         120789
self                            23578          47156
ifTrue:                          8203          41015
stamp:                           7192          28768
ifFalse:                         4196          25176
the                             16931          16931
instanceVariableNames:            833          16660
receiver                         2159          12954
class                            4225          12675
nextPutAll:                      1083           9747
classVariableNames:               535           9095
accessing                        1243           8701
DynamicInterpreter                518           8288
aStream                          1612           8060
poolDictionaries:                 535           8025
successFlag                       791           7119
Answer                           1692           6768
instance                         1081           6486
selector                         1026           6156
Smalltalk                         838           5866
That is, the gain for entries 1..20 is 406435 bytes;
for 1..32: 468155, 1..64: 586451, 1..96: 671421.
In other words, you can save about 0.6 MB of 5.6 MB, or roughly 10%.
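The figures above can be reproduced from the per-token rule of two-byte encodings: gain = occurrences * (length - 2). A rough sketch of how such a ranking could be computed (the helper name and the token regex are my assumptions; a real tokenizer for Smalltalk source would be more careful):

```python
from collections import Counter
import re

def token_gains(source: str, top: int = 20) -> list[tuple[str, int, int]]:
    """Rank tokens by compression gain, assuming each replacement
    costs 2 bytes (escape byte + table index)."""
    # Crude approximation of identifiers and keyword selectors like 'stamp:'
    counts = Counter(re.findall(r"[A-Za-z][A-Za-z0-9]*:?", source))
    ranked = sorted(counts.items(),
                    key=lambda kv: kv[1] * (len(kv[0]) - 2),
                    reverse=True)
    return [(tok, c, c * (len(tok) - 2)) for tok, c in ranked[:top]]
```

Summing the gain column over the top k entries gives the cumulative savings quoted above.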
Not as much as I'd hoped. How disappointing.
For Squeak2.4c.changes (5631817 bytes) you get a very similar picture.
InterpreterProxy has 1549 occurrences, but otherwise the top 20 is nearly
the same. Here, the gains are: 1..20: 371375, 1..32: 424620,
1..64: 527356, 1..96: 603985, 1..255: 844875.
bye
--
Stefan Matthias Aust // Before we fall, we'd rather stand out.