Re: [squeak-dev] #nextChunk speedup, the future of multibyte streams

30 Jan 2010

      2010/1/30 Igor Stasenko siguctua@gmail.com:
...
On 30 January 2010 04:09, Levente Uzonyi leves@elte.hu wrote:
...
Hi,
I uploaded a new version of the Multilingual package to the Inbox for
reviewing. It speeds up MultiByteFileStream >> #nextChunk by a factor of
~3.7 (if the file has UTF-8 encoding).
The speedup doesn't come free, the code assumes a few things about the file
it's reading:

it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in

the encoded stream that byte is an encoded ! character

the stream doesn't convert line endings (though this is can be worked

around if necessary)
Are these assumptions valid? Can we have stricter assumptions? For example,
can we say that every source file is UTF-8 encoded, just like
CompressedSourceStreams?
Here is the benchmark which show the speedup:
(1 to: 3) collect: [ :run |
       Smalltalk garbageCollect.
       [ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
Current: #(7039 7037 7051)
New: #(1923 1903 1890)
(Note that further minor speedups are still possible, but I didn't bother
with them.)
While digging through the code of FileStream and subclasses, I found that it
may be worth implementing MultiByteFileStream and
MultiByteBinaryOrTextStream in a different way. Instead of subclassing
existing stream classes (and adding cruft to the whole Stream hierachy) we
could use a separate class named MultiByteStream which would encapsulate a
stream (a FileStream or an in-memory stream), the converter, line-end
conversion, etc. This would let us

get rid of the basic* methods of the stream hierarchy (which are broken).
remove duplicate code
find, deprecate, remove obsolete code
achieve better performance

We may also be able to use two level buffering.
What do you think? Should we do this (even if it will not be 100% backwards
compatible)?
I am with you. Wrapping or delegation, is what i think a
MultiByteStream should use, i.e.
use an existing stream to read the data from and do own conversion.
It may slow down things a little due to stream chaining, but clean up
a lot of cruft out of implementation.
Yes, subclassing was the worst choice wrt hacking.
basicNext bareNext etc... should not exist.
IMO the wrapper implementation will not be only cleaner, it shall also
be faster.
http://www.squeaksource.com/XTream demonstrate a comfortable speed up
is possible:
{
[| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) ascii; wantsLineEndConversion: false; converter:
UTF8TextConverter new.
       1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
[| tmp | tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) readXtream ascii buffered decodeWith: (UTF8TextConverter
new installLineEndConvention: nil)) buffered.
       1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
}
 #(332 19)
Nicolas
...
...
Levente
--
Best regards,
Igor Stasenko AKA sig.