[squeak-dev] #nextChunk speedup, the future of multibyte streams

Sat Jan 30 03:25:58 UTC 2010

On Sat, 30 Jan 2010, Igor Stasenko wrote:

> On 30 January 2010 04:09, Levente Uzonyi <leves at elte.hu> wrote:
>> Hi,
>>
>> I uploaded a new version of the Multilingual package to the Inbox for
>> reviewing. It speeds up MultiByteFileStream >> #nextChunk by a factor of
>> ~3.7 (if the file has UTF-8 encoding).
>> The speedup doesn't come for free: the code assumes a few things about the
>> file it's reading:
>> - it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
>>  the encoded stream that byte is an encoded ! character
>> - the stream doesn't convert line endings (though this can be worked
>>  around if necessary)
>>
>> Are these assumptions valid? Can we have stricter assumptions? For example,
>> can we say that every source file is UTF-8 encoded, just like
>> CompressedSourceStreams?
>>
>> Here is the benchmark which shows the speedup:
>> (1 to: 3) collect: [ :run |
>>        Smalltalk garbageCollect.
>>        [ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
>>
>> Current: #(7039 7037 7051)
>> New: #(1923 1903 1890)
>>
>> (Note that further minor speedups are still possible, but I didn't bother
>> with them.)
>>
>>
>> While digging through the code of FileStream and subclasses, I found that it
>> may be worth implementing MultiByteFileStream and
>> MultiByteBinaryOrTextStream in a different way. Instead of subclassing
>> existing stream classes (and adding cruft to the whole Stream hierarchy) we
>> could use a separate class named MultiByteStream which would encapsulate a
>> stream (a FileStream or an in-memory stream), the converter, line-end
>> conversion, etc. This would let us
>> - get rid of the basic* methods of the stream hierarchy (which are broken).
>> - remove duplicate code
>> - find, deprecate, remove obsolete code
>> - achieve better performance
>> We may also be able to use two-level buffering.
>>
>> What do you think? Should we do this (even if it will not be 100% backwards
>> compatible)?
>
> I am with you. Wrapping or delegation is what I think a
> MultiByteStream should use, i.e.
> use an existing stream to read the data from and do its own conversion.
> It may slow things down a little due to stream chaining, but it would
> clean a lot of cruft out of the implementation.

The chaining is already there (as sends):
MultiByteFileStream >> #next sends
TextConverter >> #nextFromStream: sends
MultiByteFileStream >> #basicNext sends
StandardFileStream >> #next

With encapsulation it could be:
MultiByteFileStream(MultiByteStream) >> #next sends
TextConverter >> #nextFromStream: sends
StandardFileStream >> #next

So I expect it to be a bit faster.
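
The shorter chain can be sketched as follows. This is only an illustration,
written in Python rather than Squeak code; Utf8Converter and MultiByteStream
and their methods are hypothetical stand-ins for TextConverter and the
proposed wrapper, and io.BytesIO plays the role of StandardFileStream. It is
not the actual implementation.

```python
import io


class Utf8Converter:
    """Hypothetical stand-in for a TextConverter: decodes one character."""

    def next_from_stream(self, stream):
        first = stream.read(1)
        if not first:
            return None  # end of stream
        lead = first[0]
        # Number of continuation bytes implied by the UTF-8 lead byte.
        if lead < 0x80:
            extra = 0
        elif lead >> 5 == 0b110:
            extra = 1
        elif lead >> 4 == 0b1110:
            extra = 2
        elif lead >> 3 == 0b11110:
            extra = 3
        else:
            raise ValueError("unexpected continuation byte")
        return (first + stream.read(extra)).decode("utf-8")


class MultiByteStream:
    """Hypothetical wrapper: holds a byte stream and a converter."""

    def __init__(self, stream, converter):
        self.stream = stream
        self.converter = converter

    def next(self):
        # The converter reads straight from the wrapped stream, so there is
        # no basicNext-style hop back through the wrapper itself.
        return self.converter.next_from_stream(self.stream)


source = io.BytesIO("a!\u00e9".encode("utf-8"))  # 'a', '!', e-acute
wrapped = MultiByteStream(source, Utf8Converter())
chars = [wrapped.next() for _ in range(4)]
```

The wrapper owns the converter and the underlying stream, so each call to
next costs one delegation to the converter and one read on the wrapped
stream.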

Levente
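
On the first assumption in the quoted message: for UTF-8 it holds by
construction, because every byte of a multi-byte sequence has its high bit
set, so byte 33 can only be an encoded ! character. It does not hold for
encodings such as UTF-16. A small check, again in Python as an illustration
rather than Squeak code (the helper name bang_is_unambiguous is made up):

```python
def bang_is_unambiguous(text, encoding):
    """True if byte 33 occurs in the encoded text only as often as '!' does.

    Assumes the encoding emits exactly one byte 33 per '!' character, which
    is the case for both UTF-8 and UTF-16.
    """
    encoded = text.encode(encoding)
    return encoded.count(b"!") == text.count("!")


# U+2100 and U+0121 each contain a 0x21 byte in their UTF-16 form.
sample = "Hi! \u2100\u0121"

utf8_ok = bang_is_unambiguous(sample, "utf-8")
utf16_ok = bang_is_unambiguous(sample, "utf-16-le")
```

With this sample, utf8_ok is True and utf16_ok is False, so scanning for
byte 33 is only safe when the file is known to be UTF-8 (or another
ASCII-compatible encoding with the same property).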

>
>>
>>
>> Levente
>>
>
>
> -- 
> Best regards,
> Igor Stasenko AKA sig.
>
>