Hi,
I uploaded a new version of the Multilingual package to the Inbox for review. It speeds up MultiByteFileStream >> #nextChunk by a factor of ~3.7 (if the file has UTF-8 encoding). The speedup doesn't come for free; the code assumes a few things about the file it's reading:
- it assumes that ! is encoded as byte 33, and whenever byte 33 occurs in the encoded stream that byte is an encoded ! character
- the stream doesn't convert line endings (though this can be worked around if necessary)
Are these assumptions valid? Can we have stricter assumptions? For example, can we say that every source file is UTF-8 encoded, just like CompressedSourceStreams?
Here is the benchmark which shows the speedup:

(1 to: 3) collect: [ :run |
	Smalltalk garbageCollect.
	[ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
Current: #(7039 7037 7051) New: #(1923 1903 1890)
(Note that further minor speedups are still possible, but I didn't bother with them.)
While digging through the code of FileStream and its subclasses, I found that it may be worth implementing MultiByteFileStream and MultiByteBinaryOrTextStream in a different way. Instead of subclassing existing stream classes (and adding cruft to the whole Stream hierarchy) we could use a separate class named MultiByteStream which would encapsulate a stream (a FileStream or an in-memory stream), the converter, line-end conversion, etc. This would let us:
- get rid of the basic* methods of the stream hierarchy (which are broken)
- remove duplicate code
- find, deprecate, and remove obsolete code
- achieve better performance
We may also be able to use two-level buffering.
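[Editor's illustration] To make the wrapper idea concrete, here is a rough sketch in Python rather than Squeak; MultiByteStream, UTF8Converter, and next_from are invented names for illustration only, not the actual Squeak API:

```python
import io

class UTF8Converter:
    """Minimal illustrative decoder: reads one UTF-8 character from a
    binary stream.  Stands in for Squeak's TextConverter."""
    def next_from(self, stream):
        first = stream.read(1)
        if not first:
            return None                  # end of stream
        b = first[0]
        if b < 0x80:
            return chr(b)                # plain ASCII byte
        # Lead byte 110xxxxx, 1110xxxx or 11110xxx tells the length.
        n = 2 if b >> 5 == 0b110 else 3 if b >> 4 == 0b1110 else 4
        return (first + stream.read(n - 1)).decode('utf-8')

class MultiByteStream:
    """Sketch of the proposed wrapper: encapsulate any byte stream plus
    a converter instead of subclassing a file stream class."""
    def __init__(self, byte_stream, converter, convert_line_ends=False):
        self.inner = byte_stream         # FileStream or in-memory stream
        self.converter = converter
        self.convert_line_ends = convert_line_ends

    def next(self):
        """Decode and return the next character, or None at end."""
        ch = self.converter.next_from(self.inner)
        if self.convert_line_ends and ch == '\r':
            ch = '\n'                    # simplistic CR -> LF conversion
        return ch

# The wrapper works over an in-memory stream just as well as a file:
s = MultiByteStream(io.BytesIO('héllo'.encode('utf-8')), UTF8Converter())
chars = []
while (c := s.next()) is not None:
    chars.append(c)
assert ''.join(chars) == 'héllo'
```

The design point is that decoding and line-end policy live in one place, regardless of where the bytes come from.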
What do you think? Should we do this (even if it will not be 100% backwards compatible)?
Levente
On 30 January 2010 04:09, Levente Uzonyi leves@elte.hu wrote:
What do you think? Should we do this (even if it will not be 100% backwards compatible)?
I am with you. Wrapping, or delegation, is what I think a MultiByteStream should use, i.e. use an existing stream to read the data from and do its own conversion. It may slow things down a little due to stream chaining, but it would clean a lot of cruft out of the implementation.
On Sat, 30 Jan 2010, Igor Stasenko wrote:
I am with you. Wrapping, or delegation, is what I think a MultiByteStream should use, i.e. use an existing stream to read the data from and do its own conversion. It may slow things down a little due to stream chaining, but it would clean a lot of cruft out of the implementation.
The chaining is already there (as sends):

MultiByteFileStream >> #next
	sends TextConverter >> #nextFromStream:
	sends MultiByteFileStream >> #basicNext
	sends StandardFileStream >> #next
With encapsulation it could be:

MultiByteFileStream(MultiByteStream) >> #next
	sends TextConverter >> #nextFromStream:
	sends StandardFileStream >> #next
So I expect it to be a bit faster.
Levente
2010/1/30 Igor Stasenko siguctua@gmail.com:
I am with you. Wrapping, or delegation, is what I think a MultiByteStream should use, i.e. use an existing stream to read the data from and do its own conversion. It may slow things down a little due to stream chaining, but it would clean a lot of cruft out of the implementation.
Yes, subclassing was the worst choice with respect to hacking. basicNext, bareNext, etc. should not exist. IMO the wrapper implementation will not only be cleaner, it should also be faster. http://www.squeaksource.com/XTream demonstrates that a comfortable speedup is possible:
{
	[ | tmp |
	tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
		ascii;
		wantsLineEndConversion: false;
		converter: UTF8TextConverter new.
	1 to: 10000 do: [ :i | tmp upTo: Character cr ].
	tmp close ] timeToRun.
	[ | tmp |
	tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
		readXtream ascii buffered
		decodeWith: (UTF8TextConverter new installLineEndConvention: nil)) buffered.
	1 to: 10000 do: [ :i | tmp upTo: Character cr ].
	tmp close ] timeToRun.
}

#(332 19)
Nicolas
On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi leves@elte.hu wrote:
- it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
the encoded stream that byte is an encoded ! character
The "whenever byte 33 occurs in the encoded stream that byte is an encoded ! character" part of this seems suspect to me. Are you checking the bytes for byte 33, or are you still checking characters, and when one of the characters is byte 33, you assume it is !? If you are just scanning bytes, I would assume that some UTF-8 characters could have a byte 33 encoded in them.
Although I'm not a UTF-8 expert.
-Chris
On 29.01.2010, at 20:07, Chris Cunningham wrote:
The "whenever byte 33 occurs in the encoded stream that byte is an encoded ! character" part of this seems suspect to me. Are you checking the bytes for byte 33, or are you still checking characters, and when one of the characters is byte 33, you assume it is !? If you are just scanning bytes, I would assume that some UTF-8 characters could have a byte 33 encoded in them.
Wrong.
Although I'm not a UTF-8 expert.
Obviously ;) See http://en.wikipedia.org/wiki/UTF-8#Description
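[Editor's illustration] The relevant UTF-8 property: every byte of a multi-byte sequence has its high bit set, so an ASCII byte like 33 can only ever encode the ASCII character itself. A brute-force check, sketched in Python rather than Smalltalk:

```python
# UTF-8 design property: in a multi-byte sequence the lead byte is
# 11xxxxxx and every continuation byte is 10xxxxxx, so all of them are
# >= 0x80.  An ASCII byte like 33 ('!') can therefore never appear
# inside the encoding of any other character.
def bytes_of_multibyte_chars(codepoints):
    """Collect every byte used by the multi-byte UTF-8 encodings."""
    used = set()
    for cp in codepoints:
        encoded = chr(cp).encode('utf-8')
        if len(encoded) > 1:
            used.update(encoded)
    return used

# Check every codepoint up to U+10FFFF, skipping the surrogate range.
all_bytes = bytes_of_multibyte_chars(
    cp for cp in range(0x80, 0x110000) if not 0xD800 <= cp <= 0xDFFF)
assert min(all_bytes) >= 0x80   # no ASCII byte ever occurs
assert 33 not in all_bytes      # in particular, byte 33 ('!') never does
```

This self-synchronizing design is exactly why scanning raw bytes for 33 is safe for UTF-8 input.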
- Bert -
On 30 January 2010 09:15, Bert Freudenberg bert@freudenbergs.de wrote:
Either way, the presence of the ! character should be tested after decoding the UTF-8 data.
On Sat, 30 Jan 2010, Igor Stasenko wrote:
Either way, the presence of the ! character should be tested after decoding the UTF-8 data.
Why? UTF-8 is ASCII compatible.
Levente
2010/1/31 Levente Uzonyi leves@elte.hu:
Why? UTF-8 is ASCII compatible.
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
Sure, due to the nature of the UTF-8 encoding you could take a shortcut, but then, because of such hacks, you won't be able to switch to a different encoding without pain:
<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'
On 2010-01-31, at 10:54 AM, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
+1
Bytes and characters are not the same thing.
Colin
On Sun, 31 Jan 2010, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
This is far from reality, because:
- #nextChunk doesn't work in binary mode:
	'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
	'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary
Sure, due to the nature of the UTF-8 encoding you could take a shortcut, but then, because of such hacks, you won't be able to switch to a different encoding without pain:
<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
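[Editor's illustration] For context, the chunk format ends a chunk at the first unescaped ! and escapes a literal ! by doubling it. A rough Python sketch (not the Squeak implementation) of why scanning raw UTF-8 bytes for byte 33 yields the same chunk boundaries as scanning decoded characters:

```python
def next_chunk(data: bytes) -> str:
    """Read one chunk from UTF-8 encoded bytes: '!' terminates the
    chunk and '!!' is an escaped literal '!'.  Scanning raw bytes for
    byte 33 is safe because all bytes of multi-byte UTF-8 sequences
    are >= 0x80 and can never collide with ASCII '!'."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 33:                       # byte 33 is always '!'
            if i + 1 < len(data) and data[i + 1] == 33:
                out.append(33)            # '!!' -> literal '!'
                i += 2
                continue
            break                         # unescaped '!' ends the chunk
        out.append(b)
        i += 1
    return out.decode('utf-8')

assert next_chunk(b'This is a chunk!!!') == 'This is a chunk!'
# Multi-byte characters pass through untouched:
assert next_chunk('péter!!s chunk!'.encode('utf-8')) == 'péter!s chunk'
```

Note this depends on both of the assumptions questioned above: the encoding must keep byte 33 unambiguous, which holds for UTF-8 but not for every encoding.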
Levente
Levente Uzonyi wrote:
Right, although I think Igor's point is slightly different. You could implement #upTo:, for example, by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as the argument. This would preserve the generality of #upTo: with the potential for more general speedup. I.e.,
upTo: aCharacter
	=> upToEncoded: bytes
	=> primitive read
	<= return encodedBytes
	<= converter decode: encodedBytes
	<= returns characters
(one assumption here is that the converter doesn't "embed" a particular character sequence as part of another one, which is true for UTF-8 but I'm not sure about other encodings).
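[Editor's illustration] Andreas's pipeline might be sketched like this, in Python with invented names (up_to, UTF8Converter); the real code would live on the stream and converter classes:

```python
class UTF8Converter:
    """Minimal stand-in for a text converter: encode/decode UTF-8."""
    def encode(self, ch: str) -> bytes:
        return ch.encode('utf-8')
    def decode(self, raw: bytes) -> str:
        return raw.decode('utf-8')

def up_to(raw_stream: bytes, converter, ch: str) -> str:
    """upTo: aCharacter, implemented as: encode the argument, scan the
    raw bytes for the encoded sequence, decode what came before it.
    Valid as long as no character's encoding is embedded inside
    another's (true for UTF-8 thanks to its self-synchronizing
    design)."""
    needle = converter.encode(ch)
    pos = raw_stream.find(needle)        # the 'upToEncoded:' byte scan
    raw = raw_stream if pos < 0 else raw_stream[:pos]
    return converter.decode(raw)         # decode only once, at the end

data = 'first line\nsecond'.encode('utf-8')
assert up_to(data, UTF8Converter(), '\n') == 'first line'
```

The speedup comes from decoding one contiguous slice instead of decoding character by character while searching.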
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
I think UTF-8 is going to be the only standard going forward. Precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will work going forward.
Cheers, - Andreas
I don't like at all having a String be a blob of bits subject to encoding interpretation. A String is a collection of characters, and there should be a canonical encoding known to the VM. utf8ToSqueak, squeakToUtf8 etc. are quick and dirty hacks.
We should use ByteArray, or better, introduce an UTF8String if it becomes that important. Code will be much much much cleaner and foolproof.
Nicolas
2010/2/2 Andreas Raab andreas.raab@gmx.de:
I think UTF-8 is going to be the only standard going forward. Precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will work going forward.
On 03.02.2010, at 00:13, Nicolas Cellier wrote:
I don't like at all having a String being a blob of bits subject to encoding interpretation. String is a collection of characters, and there should be a canonical encoding known from the VM. utf8ToSqueak, squeakToUtf8 etc... are quick and dirty hacks.
We should use ByteArray, or better, introduce an UTF8String if it becomes that important. Code will be much much much cleaner and foolproof.
That's what Scratch did ...
- Bert -
On Wednesday 03 February 2010 01:50:05 pm Bert Freudenberg wrote:
On 03.02.2010, at 00:13, Nicolas Cellier wrote:
We should use ByteArray, or better, introduce an UTF8String if it becomes that important. Code will be much much much cleaner and foolproof.
That's what Scratch did ...
A much saner choice, I should say, after reading both the Squeak and Scratch sources. Bytes are strictly for memory objects and have no place in higher-level code. Higher-level code should deal only with encoded types - integer, character (ASCII), string (ASCII), utf8string (UTF-8), etc.
Nicolas, why "if it becomes that important" qualifier for UTF-8? Wake up :-).
Subbu
On Mon, 1 Feb 2010, Andreas Raab wrote:
Right, although I think Igor's point is slightly different. You could implement #upTo:, for example, by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as the argument. This would preserve the generality of #upTo: with the potential for more general speedup. I.e.,
Another way to do this is to let the converter read the next chunk. "TextConverter >> #nextChunkFrom: stream" could use the current implementation of MultiByteFileStream >> #nextChunk, while UTF8TextConverter could use #upTo: (this would also let us avoid the #basicUpTo: hack). So we could use any encoding, while speeding up the UTF-8 case.
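[Editor's illustration] The dispatch Levente describes might look like this in outline, as a Python analogue with invented names; in Squeak the two methods would live on TextConverter and UTF8TextConverter:

```python
class ByteStream:
    """Tiny read stream over bytes with a position."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0
    def next_byte(self):
        if self.pos >= len(self.data):
            return None
        b = self.data[self.pos]
        self.pos += 1
        return b
    def peek_byte(self):
        return self.data[self.pos] if self.pos < len(self.data) else None

class GenericConverter:
    """Generic (slow) path, here for a 1-byte-per-character encoding:
    decode one character at a time, as a TextConverter >>
    #nextChunkFrom: default could do for any encoding."""
    def next_char(self, stream):
        b = stream.next_byte()
        return None if b is None else chr(b)   # latin-1: 1 byte = 1 char
    def next_chunk_from(self, stream):
        out = []
        while (ch := self.next_char(stream)) is not None:
            if ch == '!':
                if stream.peek_byte() != 33:   # unescaped '!': chunk ends
                    break
                stream.next_byte()             # '!!' -> keep one '!'
            out.append(ch)
        return ''.join(out)

class UTF8Converter(GenericConverter):
    """Fast path: scan raw bytes for byte 33 and decode in one go,
    safe because UTF-8 never hides byte 33 in a longer sequence."""
    def next_chunk_from(self, stream):
        raw = bytearray()
        while (b := stream.next_byte()) is not None:
            if b == 33:
                if stream.peek_byte() != 33:
                    break
                stream.next_byte()
            raw.append(b)
        return raw.decode('utf-8')

data = 'chunk with !! bang!tail'.encode('utf-8')
assert UTF8Converter().next_chunk_from(ByteStream(data)) == 'chunk with ! bang'
```

The stream simply asks its converter for the next chunk, so each encoding can pick its own strategy without any #basicUpTo:-style hooks on the stream itself.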
Maybe we could also move the encoding/decoding related methods/tables from String and subclasses to the (class side of the) TextConverters.
Levente