[squeak-dev] #nextChunk speedup, the future of multibyte streams

Igor Stasenko siguctua at gmail.com
Sat Jan 30 03:17:05 UTC 2010


On 30 January 2010 04:09, Levente Uzonyi <leves at elte.hu> wrote:
> Hi,
>
> I uploaded a new version of the Multilingual package to the Inbox for
> reviewing. It speeds up MultiByteFileStream >> #nextChunk by a factor of
> ~3.7 (if the file has UTF-8 encoding).
> The speedup doesn't come free, the code assumes a few things about the file
> it's reading:
> - it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
>  the encoded stream that byte is an encoded ! character
> - the stream doesn't convert line endings (though this is can be worked
>  around if necessary)
>
> Are these assumptions valid? Can we have stricter assumptions? For example,
> can we say that every source file is UTF-8 encoded, just like
> CompressedSourceStreams?
>
> Here is the benchmark which show the speedup:
> (1 to: 3) collect: [ :run |
>        Smalltalk garbageCollect.
>        [ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
>
> Current: #(7039 7037 7051)
> New: #(1923 1903 1890)
>
> (Note that further minor speedups are still possible, but I didn't bother
> with them.)
>
>
> While digging through the code of FileStream and subclasses, I found that it
> may be worth implementing MultiByteFileStream and
> MultiByteBinaryOrTextStream in a different way. Instead of subclassing
> existing stream classes (and adding cruft to the whole Stream hierachy) we
> could use a separate class named MultiByteStream which would encapsulate a
> stream (a FileStream or an in-memory stream), the converter, line-end
> conversion, etc. This would let us
> - get rid of the basic* methods of the stream hierarchy (which are broken).
> - remove duplicate code
> - find, deprecate, remove obsolete code
> - achieve better performance
> We may also be able to use two level buffering.
>
> What do you think? Should we do this (even if it will not be 100% backwards
> compatible)?

I am with you. Wrapping or delegation, is what i think a
MultiByteStream should use, i.e.
use an existing stream to read the data from and do own conversion.
It may slow down things a little due to stream chaining, but clean up
a lot of cruft out of implementation.

>
>
> Levente
>


-- 
Best regards,
Igor Stasenko AKA sig.



More information about the Squeak-dev mailing list