Hi,
I uploaded a new version of the Multilingual package to the Inbox for review. It speeds up MultiByteFileStream >> #nextChunk by a factor of ~3.7 (if the file has UTF-8 encoding). The speedup doesn't come for free; the code assumes a few things about the file it's reading:
- it assumes that ! is encoded as byte 33, and whenever byte 33 occurs in the encoded stream that byte is an encoded ! character
- the stream doesn't convert line endings (though this can be worked around if necessary)
Are these assumptions valid? Can we have stricter assumptions? For example, can we say that every source file is UTF-8 encoded, just like CompressedSourceStreams?
Here is the benchmark which shows the speedup:

(1 to: 3) collect: [ :run |
	Smalltalk garbageCollect.
	[ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]
Current: #(7039 7037 7051) New: #(1923 1903 1890)
(Note that further minor speedups are still possible, but I didn't bother with them.)
While digging through the code of FileStream and its subclasses, I found that it may be worth implementing MultiByteFileStream and MultiByteBinaryOrTextStream in a different way. Instead of subclassing existing stream classes (and adding cruft to the whole Stream hierarchy) we could use a separate class named MultiByteStream which would encapsulate a stream (a FileStream or an in-memory stream), the converter, line-end conversion, etc. This would let us:
- get rid of the basic* methods of the stream hierarchy (which are broken)
- remove duplicate code
- find, deprecate, and remove obsolete code
- achieve better performance
We may also be able to use two-level buffering.
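[Editor's illustration] To make the wrapper idea concrete, here is a rough sketch in Python rather than Squeak; MultiByteStream, UTF8Converter, and next_from are invented names for illustration only, not the actual Squeak API:

```python
import io

class UTF8Converter:
    """Minimal illustrative decoder: reads one UTF-8 character from a
    binary stream.  Stands in for Squeak's TextConverter."""
    def next_from(self, stream):
        first = stream.read(1)
        if not first:
            return None                  # end of stream
        b = first[0]
        if b < 0x80:
            return chr(b)                # plain ASCII byte
        # Lead byte 110xxxxx, 1110xxxx or 11110xxx tells the length.
        n = 2 if b >> 5 == 0b110 else 3 if b >> 4 == 0b1110 else 4
        return (first + stream.read(n - 1)).decode('utf-8')

class MultiByteStream:
    """Sketch of the proposed wrapper: encapsulate any byte stream plus
    a converter instead of subclassing a file stream class."""
    def __init__(self, byte_stream, converter, convert_line_ends=False):
        self.inner = byte_stream         # FileStream or in-memory stream
        self.converter = converter
        self.convert_line_ends = convert_line_ends

    def next(self):
        """Decode and return the next character, or None at end."""
        ch = self.converter.next_from(self.inner)
        if self.convert_line_ends and ch == '\r':
            ch = '\n'                    # simplistic CR -> LF conversion
        return ch

# The wrapper works over an in-memory stream just as well as a file:
s = MultiByteStream(io.BytesIO('héllo'.encode('utf-8')), UTF8Converter())
chars = []
while (c := s.next()) is not None:
    chars.append(c)
assert ''.join(chars) == 'héllo'
```

The design point is that decoding and line-end policy live in one place, regardless of where the bytes come from.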
What do you think? Should we do this (even if it will not be 100% backwards compatible)?
Levente
On 30 January 2010 04:09, Levente Uzonyi leves@elte.hu wrote:
What do you think? Should we do this (even if it will not be 100% backwards compatible)?
I am with you. Wrapping, or delegation, is what I think a MultiByteStream should use, i.e. use an existing stream to read the data from and do its own conversion. It may slow things down a little due to stream chaining, but it would clean a lot of cruft out of the implementation.
On Sat, 30 Jan 2010, Igor Stasenko wrote:
I am with you. Wrapping, or delegation, is what I think a MultiByteStream should use, i.e. use an existing stream to read the data from and do its own conversion. It may slow things down a little due to stream chaining, but it would clean a lot of cruft out of the implementation.
The chaining is already there (as sends):

MultiByteFileStream >> #next
	sends TextConverter >> #nextFromStream:
	sends MultiByteFileStream >> #basicNext
	sends StandardFileStream >> #next
With encapsulation it could be:

MultiByteFileStream(MultiByteStream) >> #next
	sends TextConverter >> #nextFromStream:
	sends StandardFileStream >> #next
So I expect it to be a bit faster.
Levente
2010/1/30 Igor Stasenko siguctua@gmail.com:
I am with you. Wrapping, or delegation, is what I think a MultiByteStream should use, i.e. use an existing stream to read the data from and do its own conversion. It may slow things down a little due to stream chaining, but it would clean a lot of cruft out of the implementation.
Yes, subclassing was the worst choice with respect to hacking. basicNext, bareNext, etc. should not exist. IMO the wrapper implementation will not only be cleaner, it should also be faster. http://www.squeaksource.com/XTream demonstrates that a comfortable speedup is possible:
{
	[ | tmp |
	tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
		ascii;
		wantsLineEndConversion: false;
		converter: UTF8TextConverter new.
	1 to: 10000 do: [ :i | tmp upTo: Character cr ].
	tmp close ] timeToRun.
	[ | tmp |
	tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
		readXtream ascii buffered
		decodeWith: (UTF8TextConverter new installLineEndConvention: nil)) buffered.
	1 to: 10000 do: [ :i | tmp upTo: Character cr ].
	tmp close ] timeToRun.
}

#(332 19)
Nicolas
On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi leves@elte.hu wrote:
- it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
the encoded stream that byte is an encoded ! character
The "whenever byte 33 occurs in the encoded stream that byte is an encoded ! character" part of this seems suspect to me. Are you checking the bytes for byte 33, or are you still checking characters, and when one of the characters is byte 33, you assume it is !? If you are just scanning bytes, I would assume that some UTF-8 characters could have a byte 33 encoded in them.
Although I'm not a UTF-8 expert.
-Chris
On 29.01.2010, at 20:07, Chris Cunningham wrote:
The "whenever byte 33 occurs in the encoded stream that byte is an encoded ! character" part of this seems suspect to me. Are you checking the bytes for byte 33, or are you still checking characters, and when one of the characters is byte 33, you assume it is !? If you are just scanning bytes, I would assume that some UTF-8 characters could have a byte 33 encoded in them.
Wrong.
Although I'm not a UTF-8 expert.
Obviously ;) See http://en.wikipedia.org/wiki/UTF-8#Description
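[Editor's illustration] The relevant UTF-8 property: every byte of a multi-byte sequence has its high bit set, so an ASCII byte like 33 can only ever encode the ASCII character itself. A brute-force check, sketched in Python rather than Smalltalk:

```python
# UTF-8 design property: in a multi-byte sequence the lead byte is
# 11xxxxxx and every continuation byte is 10xxxxxx, so all of them are
# >= 0x80.  An ASCII byte like 33 ('!') can therefore never appear
# inside the encoding of any other character.
def bytes_of_multibyte_chars(codepoints):
    """Collect every byte used by the multi-byte UTF-8 encodings."""
    used = set()
    for cp in codepoints:
        encoded = chr(cp).encode('utf-8')
        if len(encoded) > 1:
            used.update(encoded)
    return used

# Check every codepoint up to U+10FFFF, skipping the surrogate range.
all_bytes = bytes_of_multibyte_chars(
    cp for cp in range(0x80, 0x110000) if not 0xD800 <= cp <= 0xDFFF)
assert min(all_bytes) >= 0x80   # no ASCII byte ever occurs
assert 33 not in all_bytes      # in particular, byte 33 ('!') never does
```

This self-synchronizing design is exactly why scanning raw bytes for 33 is safe for UTF-8 input.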
- Bert -
On 30 January 2010 09:15, Bert Freudenberg bert@freudenbergs.de wrote:
Either way, the presence of the ! character should be tested after decoding the UTF-8 data.
On Sat, 30 Jan 2010, Igor Stasenko wrote:
Either way, the presence of the ! character should be tested after decoding the UTF-8 data.
Why? UTF-8 is ASCII compatible.
Levente
2010/1/31 Levente Uzonyi leves@elte.hu:
Why? UTF-8 is ASCII compatible.
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
Sure, due to the nature of the UTF-8 encoding you could take a shortcut, but then, because of such hacks, you won't be able to switch to a different encoding without pain:
<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'
On 2010-01-31, at 10:54 AM, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
+1
Bytes and characters are not the same thing.
Colin
On Sun, 31 Jan 2010, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
This is far from reality, because:
- #nextChunk doesn't work in binary mode:
	'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
	'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary
Sure, due to the nature of the UTF-8 encoding you could take a shortcut, but then, because of such hacks, you won't be able to switch to a different encoding without pain:
<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
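[Editor's illustration] For context, the chunk format ends a chunk at the first unescaped ! and escapes a literal ! by doubling it. A rough Python sketch (not the Squeak implementation) of why scanning raw UTF-8 bytes for byte 33 yields the same chunk boundaries as scanning decoded characters:

```python
def next_chunk(data: bytes) -> str:
    """Read one chunk from UTF-8 encoded bytes: '!' terminates the
    chunk and '!!' is an escaped literal '!'.  Scanning raw bytes for
    byte 33 is safe because all bytes of multi-byte UTF-8 sequences
    are >= 0x80 and can never collide with ASCII '!'."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 33:                       # byte 33 is always '!'
            if i + 1 < len(data) and data[i + 1] == 33:
                out.append(33)            # '!!' -> literal '!'
                i += 2
                continue
            break                         # unescaped '!' ends the chunk
        out.append(b)
        i += 1
    return out.decode('utf-8')

assert next_chunk(b'This is a chunk!!!') == 'This is a chunk!'
# Multi-byte characters pass through untouched:
assert next_chunk('péter!!s chunk!'.encode('utf-8')) == 'péter!s chunk'
```

Note this depends on both of the assumptions questioned above: the encoding must keep byte 33 unambiguous, which holds for UTF-8 but not for every encoding.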
Levente
Levente Uzonyi wrote:
Right, although I think Igor's point is slightly different. You could implement #upTo:, for example, by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as the argument. This would preserve the generality of #upTo: with the potential for more general speedup. I.e.,
upTo: aCharacter
	=> upToEncoded: bytes
	=> primitive read
	<= return encodedBytes
	<= converter decode: encodedBytes
	<= returns characters
(one assumption here is that the converter doesn't "embed" a particular character sequence as part of another one, which is true for UTF-8 but I'm not sure about other encodings).
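[Editor's illustration] Andreas's pipeline might be sketched like this, in Python with invented names (up_to, UTF8Converter); the real code would live on the stream and converter classes:

```python
class UTF8Converter:
    """Minimal stand-in for a text converter: encode/decode UTF-8."""
    def encode(self, ch: str) -> bytes:
        return ch.encode('utf-8')
    def decode(self, raw: bytes) -> str:
        return raw.decode('utf-8')

def up_to(raw_stream: bytes, converter, ch: str) -> str:
    """upTo: aCharacter, implemented as: encode the argument, scan the
    raw bytes for the encoded sequence, decode what came before it.
    Valid as long as no character's encoding is embedded inside
    another's (true for UTF-8 thanks to its self-synchronizing
    design)."""
    needle = converter.encode(ch)
    pos = raw_stream.find(needle)        # the 'upToEncoded:' byte scan
    raw = raw_stream if pos < 0 else raw_stream[:pos]
    return converter.decode(raw)         # decode only once, at the end

data = 'first line\nsecond'.encode('utf-8')
assert up_to(data, UTF8Converter(), '\n') == 'first line'
```

The speedup comes from decoding one contiguous slice instead of decoding character by character while searching.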
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
I think UTF-8 is going to be the only standard going forward. Precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will work going forward.
Cheers, - Andreas
I don't like at all having a String be a blob of bits subject to encoding interpretation. A String is a collection of characters, and there should be a canonical encoding known to the VM. utf8ToSqueak, squeakToUtf8 etc. are quick and dirty hacks.
We should use ByteArray, or better, introduce an UTF8String if it becomes that important. Code will be much much much cleaner and foolproof.
Nicolas
2010/2/2 Andreas Raab andreas.raab@gmx.de:
I think UTF-8 is going to be the only standard going forward. Precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will work going forward.
On 03.02.2010, at 00:13, Nicolas Cellier wrote:
I don't like at all having a String being a blob of bits subject to encoding interpretation. String is a collection of characters, and there should be a canonical encoding known from the VM. utf8ToSqueak, squeakToUtf8 etc... are quick and dirty hacks.
We should use ByteArray, or better, introduce an UTF8String if it becomes that important. Code will be much much much cleaner and foolproof.
That's what Scratch did ...
- Bert -
On Wednesday 03 February 2010 01:50:05 pm Bert Freudenberg wrote:
On 03.02.2010, at 00:13, Nicolas Cellier wrote:
We should use ByteArray, or better, introduce an UTF8String if it becomes that important. Code will be much much much cleaner and foolproof.
That's what Scratch did ...
A much saner choice, I should say, after reading both the Squeak and Scratch sources. Bytes are strictly for memory objects and have no place in higher-level code. Higher-level code should deal only with encoded types - integer, character (ASCII), string (ASCII), utf8string (UTF-8), etc.
Nicolas, why "if it becomes that important" qualifier for UTF-8? Wake up :-).
Subbu
On Mon, 1 Feb 2010, Andreas Raab wrote:
Right, although I think Igor's point is slightly different. You could implement #upTo:, for example, by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as the argument. This would preserve the generality of #upTo: with the potential for more general speedup. I.e.,
Another way to do this is to let the converter read the next chunk. "TextConverter >> #nextChunkFrom: stream" could use the current implementation of MultiByteFileStream >> #nextChunk, while UTF8TextConverter could use #upTo: (this would also let us avoid the #basicUpTo: hack). So we could use any encoding, while speeding up the UTF-8 case.
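[Editor's illustration] The dispatch Levente describes might look like this in outline, as a Python analogue with invented names; in Squeak the two methods would live on TextConverter and UTF8TextConverter:

```python
class ByteStream:
    """Tiny read stream over bytes with a position."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0
    def next_byte(self):
        if self.pos >= len(self.data):
            return None
        b = self.data[self.pos]
        self.pos += 1
        return b
    def peek_byte(self):
        return self.data[self.pos] if self.pos < len(self.data) else None

class GenericConverter:
    """Generic (slow) path, here for a 1-byte-per-character encoding:
    decode one character at a time, as a TextConverter >>
    #nextChunkFrom: default could do for any encoding."""
    def next_char(self, stream):
        b = stream.next_byte()
        return None if b is None else chr(b)   # latin-1: 1 byte = 1 char
    def next_chunk_from(self, stream):
        out = []
        while (ch := self.next_char(stream)) is not None:
            if ch == '!':
                if stream.peek_byte() != 33:   # unescaped '!': chunk ends
                    break
                stream.next_byte()             # '!!' -> keep one '!'
            out.append(ch)
        return ''.join(out)

class UTF8Converter(GenericConverter):
    """Fast path: scan raw bytes for byte 33 and decode in one go,
    safe because UTF-8 never hides byte 33 in a longer sequence."""
    def next_chunk_from(self, stream):
        raw = bytearray()
        while (b := stream.next_byte()) is not None:
            if b == 33:
                if stream.peek_byte() != 33:
                    break
                stream.next_byte()
            raw.append(b)
        return raw.decode('utf-8')

data = 'chunk with !! bang!tail'.encode('utf-8')
assert UTF8Converter().next_chunk_from(ByteStream(data)) == 'chunk with ! bang'
```

The stream simply asks its converter for the next chunk, so each encoding can pick its own strategy without any #basicUpTo:-style hooks on the stream itself.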
Maybe we could also move the encoding/decoding related methods/tables from String and subclasses to the (class side of the) TextConverters.
Levente