2010/1/30 Igor Stasenko siguctua@gmail.com:
On 30 January 2010 04:09, Levente Uzonyi leves@elte.hu wrote:
Hi,
I uploaded a new version of the Multilingual package to the Inbox for reviewing. It speeds up MultiByteFileStream >> #nextChunk by a factor of ~3.7 (if the file has UTF-8 encoding). The speedup doesn't come free, the code assumes a few things about the file it's reading:
- it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
the encoded stream that byte is an encoded ! character
- the stream doesn't convert line endings (though this is can be worked
around if necessary)
Are these assumptions valid? Can we have stricter assumptions? For example, can we say that every source file is UTF-8 encoded, just like CompressedSourceStreams?
Here is the benchmark which show the speedup: (1 to: 3) collect: [ :run | Smalltalk garbageCollect. [ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
Current: #(7039 7037 7051) New: #(1923 1903 1890)
(Note that further minor speedups are still possible, but I didn't bother with them.)
While digging through the code of FileStream and subclasses, I found that it may be worth implementing MultiByteFileStream and MultiByteBinaryOrTextStream in a different way. Instead of subclassing existing stream classes (and adding cruft to the whole Stream hierachy) we could use a separate class named MultiByteStream which would encapsulate a stream (a FileStream or an in-memory stream), the converter, line-end conversion, etc. This would let us
- get rid of the basic* methods of the stream hierarchy (which are broken).
- remove duplicate code
- find, deprecate, remove obsolete code
- achieve better performance
We may also be able to use two level buffering.
What do you think? Should we do this (even if it will not be 100% backwards compatible)?
I am with you. Wrapping or delegation, is what i think a MultiByteStream should use, i.e. use an existing stream to read the data from and do own conversion. It may slow down things a little due to stream chaining, but clean up a lot of cruft out of implementation.
Yes, subclassing was the worst choice wrt hacking. basicNext bareNext etc... should not exist. IMO the wrapper implementation will not be only cleaner, it shall also be faster. http://www.squeaksource.com/XTream demonstrate a comfortable speed up is possible:
{ [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles at: 2) name) ascii; wantsLineEndConversion: false; converter: UTF8TextConverter new. 1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun. [| tmp | tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii buffered decodeWith: (UTF8TextConverter new installLineEndConvention: nil)) buffered. 1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun. } #(332 19)
Nicolas
Levente
-- Best regards, Igor Stasenko AKA sig.