[squeak-dev] #nextChunk speedup, the future of multibyte streams
Levente Uzonyi
leves at elte.hu
Sat Jan 30 02:09:42 UTC 2010
Hi,
I uploaded a new version of the Multilingual package to the Inbox for
reviewing. It speeds up MultiByteFileStream >> #nextChunk by a factor of
~3.7 (if the file has UTF-8 encoding).
The speedup doesn't come free: the code makes a few assumptions about the
file it's reading:
- ! is encoded as byte 33, and whenever byte 33 occurs in the encoded
stream, that byte is an encoded ! character
- the stream doesn't convert line endings (though this can be worked
around if necessary)
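For UTF-8 the first assumption holds by construction: every byte of a
multi-byte sequence has its high bit set (lead bytes 0xC2-0xF4,
continuation bytes 0x80-0xBF), so byte 33 can only ever be a standalone
encoded ! character. A minimal sketch in Python (the helper name is mine)
showing that a raw byte scan for ! is safe on UTF-8 data:

```python
# In UTF-8, every byte of a multi-byte sequence has its high bit set,
# so a raw scan for byte 33 (0x21, '!') can never match a byte that
# sits inside an encoded multi-byte character.
def bang_positions(data):
    """Byte offsets of '!' found by scanning the raw encoded bytes."""
    return [i for i, b in enumerate(data) if b == 0x21]

text = 'kígyóbűvölő!'            # accented text, chunk-terminated by '!'
encoded = text.encode('utf-8')

# The raw byte scan finds exactly one '!', matching the decoded text.
print(bang_positions(encoded))   # one offset: the final byte
print(text.count('!'))           # → 1
```

This is why #nextChunk can scan the undecoded bytes for the chunk
terminator and only run the converter afterwards.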
Are these assumptions valid? Can we have stricter assumptions? For
example, can we say that every source file is UTF-8 encoded, just like
CompressedSourceStreams?
Here is the benchmark which shows the speedup:
(1 to: 3) collect: [ :run |
	Smalltalk garbageCollect.
	[ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
Current: #(7039 7037 7051)
New: #(1923 1903 1890)
(Note that further minor speedups are still possible, but I didn't
bother with them.)
While digging through the code of FileStream and its subclasses, I found that
it may be worth implementing MultiByteFileStream and
MultiByteBinaryOrTextStream in a different way. Instead of subclassing
existing stream classes (and adding cruft to the whole Stream hierarchy), we
could use a separate class named MultiByteStream which would encapsulate a
stream (a FileStream or an in-memory stream), the converter, line-end
conversion, etc. This would let us:
- get rid of the basic* methods of the stream hierarchy (which are broken)
- remove duplicate code
- find, deprecate, and remove obsolete code
- achieve better performance
We may also be able to use two-level buffering.
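To make the wrapper idea concrete, here is a rough sketch in Python (class
and method names are mine, not a proposed API): one decorating class owns
the character converter and the line-end policy, while the wrapped stream
stays a plain byte stream, so no stream subclass needs to know about
encodings:

```python
import codecs
import io

class MultiByteStream:
    """Sketch: a decorating stream that encapsulates a byte stream,
    the converter, and line-end conversion, instead of pushing that
    logic into every class of the stream hierarchy."""

    def __init__(self, byte_stream, encoding='utf-8', line_end='\n'):
        self.byte_stream = byte_stream                 # file or in-memory
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.line_end = line_end

    def next(self):
        """Decode one character; CR and CRLF both become line_end."""
        while True:
            b = self.byte_stream.read(1)
            if not b:
                return None                            # end of stream
            ch = self.decoder.decode(b)
            if not ch:
                continue                               # mid-sequence byte
            if ch == '\r':
                pos = self.byte_stream.tell()
                if self.byte_stream.read(1) != b'\n':  # lone CR
                    self.byte_stream.seek(pos)
                return self.line_end
            return ch

s = MultiByteStream(io.BytesIO('é!\r\n'.encode('utf-8')))
print([s.next() for _ in range(4)])  # → ['é', '!', '\n', None]
```

The same decorator would wrap a FileStream or an in-memory stream alike,
which is where the duplicate-code removal would come from.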
What do you think? Should we do this (even if it will not be 100%
backwards compatible)?
Levente