A new version of Multilingual was added to project The Inbox: http://source.squeak.org/inbox/Multilingual-jr.218.mcz
==================== Summary ====================
Name: Multilingual-jr.218
Author: jr
Time: 19 January 2017, 5:14:23.763655 pm
UUID: 36416c42-a4b4-554f-8203-aba25eee794f
Ancestors: Multilingual-tfel.217
support 'iso-8859-1' and do not let UTF8TextConverter expect that its input stream returns Characters from basicNext
A stream implementation might always return bytes from basicNext and expect the conversion to Character to be done solely by the TextConverter, so use asInteger instead of asciiValue to support both cases. Convert back with asCharacter.
=============== Diff against Multilingual-tfel.217 ===============
Item was changed:
  ----- Method: Latin1TextConverter class>>encodingNames (in category 'utilities') -----
  encodingNames

+     ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
-     ^ #('latin-1' 'latin1') copy.
  !

Item was changed:
  ----- Method: UTF8TextConverter>>nextFromStream: (in category 'conversion') -----
  nextFromStream: aStream

      | char1 value1 char2 value2 unicode char3 value3 char4 value4 |
      aStream isBinary ifTrue: [^ aStream basicNext].
      char1 := aStream basicNext.
      char1 ifNil:[^ nil].
+     value1 := char1 asInteger.
-     value1 := char1 asciiValue.
      value1 <= 127 ifTrue: [
          "1-byte char"
+         ^ char1 asCharacter
-         ^ char1
      ].

      "at least 2-byte char"
      char2 := aStream basicNext.
+     char2 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter)].
+     value2 := char2 asInteger.
-     char2 ifNil:[^self errorMalformedInput: (String with: char1)].
-     value2 := char2 asciiValue.

      (value1 bitAnd: 16rE0) = 192 ifTrue: [
          ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).
      ].

      "at least 3-byte char"
      char3 := aStream basicNext.
+     char3 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter)].
+     value3 := char3 asInteger.
-     char3 ifNil:[^self errorMalformedInput: (String with: char1 with: char2)].
-     value3 := char3 asciiValue.
      (value1 bitAnd: 16rF0) = 224 ifTrue: [
          unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63).
      ].

      (value1 bitAnd: 16rF8) = 240 ifTrue: [
          "4-byte char"
          char4 := aStream basicNext.
+         char4 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
+         value4 := char4 asInteger.
-         char4 ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
-         value4 := char4 asciiValue.
          unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd: 63) bitShift: 12) + ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63).
      ].

+     unicode ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
-     unicode ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
      unicode > 16r10FFFD ifTrue: [
+         ^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter).
-         ^self errorMalformedInput: (String with: char1 with: char2 with: char3).
      ].

      unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
      ^ Unicode value: unicode.
  !
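A minimal workspace sketch of the idea in the commit message, assuming a Squeak image: asInteger and asCharacter give the same result whether the stream handed over a Character or a plain byte value. The variable names are made up for illustration and are not part of the commit.

```smalltalk
"Hypothetical workspace snippet; the variable names are illustrative only."
| fromCharacterStream fromByteStream |
fromCharacterStream := Character value: 195.    "what a character stream might answer for byte 16rC3"
fromByteStream := 195.                          "what a byte-oriented stream might answer"

fromCharacterStream asInteger.    "195"
fromByteStream asInteger.         "195"

"Converting back also works for both kinds of receiver."
fromCharacterStream asCharacter = fromByteStream asCharacter.    "true"

"String with: needs Characters, hence the asCharacter sends in the error branches."
String with: fromByteStream asCharacter.
```

Whether a particular stream class really answers integers from basicNext is outside this sketch; it only shows that the conversion selectors used in the diff accept both receivers.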
Thanks Jacob.
Any objections if I put this into trunk? Looks good from here.
Best regards
-Tobias
On Thu, 19 Jan 2017, Tobias Pape wrote:

> Thanks Jacob.
> Any objections if I put this into trunk?

Yep. TextConverters are intended to work with MultiByte*Streams only. Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.

If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.

Levente
On 19.01.2017, at 23:30, Levente Uzonyi leves@caesar.elte.hu wrote:

> On Thu, 19 Jan 2017, Tobias Pape wrote:
>> Thanks Jacob.
>> Any objections if I put this into trunk?
>
> Yep. TextConverters are intended to work with MultiByte*Streams only.

Didn't know that.

> Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.

I see. However, using asInteger sounds more reasonable _even though_ it is a character. Said bluntly, the responsibility of the TextConverter is to make Characters from those bloody numbers in that stream.

I was confused to see that asciiValue returns something >127 in the first place.

> If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.

Clearly. But isn't this a step in the right direction?
On Fri, 20 Jan 2017, Tobias Pape wrote:

> On 19.01.2017, at 23:30, Levente Uzonyi leves@caesar.elte.hu wrote:
>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>> Thanks Jacob.
>>> Any objections if I put this into trunk?
>>
>> Yep. TextConverters are intended to work with MultiByte*Streams only.
>
> Didn't know that.
>
>> Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.
>
> I see. However, using asInteger sounds more reasonable _even though_ it is a character. Said bluntly, the responsibility of the TextConverter is to make Characters from those bloody numbers in that stream.
>
> I was confused to see that asciiValue returns something >127 in the first place.

#asInteger does the same thing as #asciiValue. While #asciiValue doesn't do what you would expect it to do, it has the advantage of clearly marking the class of the receiver (in this case).

>> If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.
>
> Clearly. But isn't this a step in the right direction?

Yes and no. There are at least two ways to go:

1. Enhance the current stream library, even at the cost of breaking things. A patch here and there won't work. There are fundamental changes required, like stackable streams, to make it desirable to use it over other libraries.
2. Integrate an existing stream library with better features (e.g. Xtreams). If we were to do this, we could gradually migrate existing code to the new library, and finally make the current stream library unloadable.

Levente
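The point about #asciiValue can be checked in a workspace. This is a small sketch that assumes Character>>asciiValue simply answers the character's value, as described in this thread; the variable name is illustrative.

```smalltalk
"Hypothetical workspace snippet; assumes asciiValue answers the plain character value."
| aUmlaut |
aUmlaut := Character value: 228.           "a-umlaut, outside the 7-bit ASCII range"

aUmlaut asciiValue.                        "228, despite the selector name"
aUmlaut asInteger.                         "228"
aUmlaut asciiValue = aUmlaut asInteger.    "true"
$a asciiValue.                             "97"
```

So for Character receivers the two selectors are interchangeable; the disagreement in the thread is about which receivers the method should accept, not about the numbers it computes.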
On 22.01.2017, at 16:10, Levente Uzonyi leves@caesar.elte.hu wrote:

> On Fri, 20 Jan 2017, Tobias Pape wrote:
>> On 19.01.2017, at 23:30, Levente Uzonyi leves@caesar.elte.hu wrote:
>>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>>> Thanks Jacob. Any objections if I put this into trunk?
>>>
>>> Yep. TextConverters are intended to work with MultiByte*Streams only.
>>
>> Didn't know that.
>>
>>> Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.
>>
>> I see. However, using asInteger sounds more reasonable _even though_ it is a character. Said bluntly, the responsibility of the TextConverter is to make Characters from those bloody numbers in that stream. I was confused to see that asciiValue returns something >127 in the first place.
>
> #asInteger does the same thing as #asciiValue. While #asciiValue doesn't do what you would expect it to do, it has the advantage of clearly marking the class of the receiver (in this case).

Yes, and that's exactly why we should use #asInteger. To _not_ limit the receiver. Because the receiver isn't actually a Character, but some number, encoded in a Character, whose meaning is to be determined by this very method.

Also, how do we know that _basic_Next will always return a Character? (Yes, I know there's a binary check, but doesn't that only say something about #next, not #basicNext?)

>>> If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.
>>
>> Clearly. But isn't this a step in the right direction?
>
> Yes and no. There are at least two ways to go:
> - Enhance the current stream library, even at the cost of breaking things. A patch here and there won't work. There are fundamental changes required, like stackable streams, to make it desirable to use it over other libraries.
> - Integrate an existing stream library with better features (e.g. Xtreams). If we were to do this, we could gradually migrate existing code to the new library, and finally make the current stream library unloadable.

I like the idea of Xtreams, but I also like going baby steps.

The changes here help at least one person, won't hurt others and seem future proof. So?

Best regards
-Tobias
Below, as a comparison, is the version in Pharo 5.0.

It is worth noting that one cannot speak of characters in a UTF-8 encoded stream that is read byte by byte until one has examined the bytes.

So the first thing I read is actually a byte. I can then check whether it is a one-byte character and, if so, return the character. Otherwise I go for the next byte. If it indicates that we have a two-byte encoded UTF-8 character, then I can return the character.

So I should have

    byte1 := aStream basicNext.
    ... check if we have a one-byte character, if yes return the character
    byte2 := aStream basicNext.
    ... check if we have a two-byte character, if yes return the character
    byte3 := aStream basicNext.
    ... check if we have a three-byte character, if yes return the character
    byte4 := aStream basicNext.
    ... check if we have a four-byte character, if yes return the character

The Pharo 5.0 method:

    nextFromStream: aStream
        | character1 value1 character2 value2 unicode character3 value3 character4 value4 |
        aStream isBinary ifTrue: [ ^ aStream basicNext ].
        character1 := aStream basicNext.
        character1 isNil ifTrue: [ ^ nil ].
        value1 := character1 asciiValue.
        value1 <= 127 ifTrue: [ "1-byte character" ^ character1 ].
        "at least 2-byte character"
        character2 := aStream basicNext.
        character2 isNil ifTrue: [ ^ self errorMalformedInput ].
        value2 := character2 asciiValue.
        (value1 bitAnd: 16rE0) = 192 ifTrue: [
            ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63) ].
        "at least 3-byte character"
        character3 := aStream basicNext.
        character3 isNil ifTrue: [ ^ self errorMalformedInput ].
        value3 := character3 asciiValue.
        (value1 bitAnd: 16rF0) = 224 ifTrue: [
            unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
        (value1 bitAnd: 16rF8) = 240 ifTrue: [ "4-byte character"
            character4 := aStream basicNext.
            character4 isNil ifTrue: [ ^ self errorMalformedInput ].
            value4 := character4 asciiValue.
            unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd: 63) bitShift: 12) + ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
        unicode isNil ifTrue: [ ^ self errorMalformedInput ].
        unicode > 16r10FFFD ifTrue: [ ^ self errorMalformedInput ].
        unicode = 16rFEFF ifTrue: [ ^ self nextFromStream: aStream ].
        ^ Unicode value: unicode
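The byte-first reading described above can be tried out as a workspace sketch over a ByteArray. It is not the Squeak or Pharo implementation: the block name decodeNext is hypothetical, it answers code points as plain integers, and it uses the lead byte to decide how many continuation bytes to read instead of naming byte2 to byte4 separately. It uses the same masks as the method above, and it leaves out end-of-stream checks, validation of continuation bytes, and the upper-bound and BOM handling.

```smalltalk
"Hypothetical workspace sketch; decodeNext is not part of Squeak or Pharo."
| decodeNext |
decodeNext := [:byteStream | | byte1 codePoint continuationCount |
    codePoint := nil.
    continuationCount := 0.
    byte1 := byteStream next.    "an Integer, because the stream is over a ByteArray"
    byte1 <= 127
        ifTrue: [codePoint := byte1]    "1-byte character"
        ifFalse: [
            (byte1 bitAnd: 16rE0) = 16rC0
                ifTrue: [codePoint := byte1 bitAnd: 31. continuationCount := 1].
            (byte1 bitAnd: 16rF0) = 16rE0
                ifTrue: [codePoint := byte1 bitAnd: 15. continuationCount := 2].
            (byte1 bitAnd: 16rF8) = 16rF0
                ifTrue: [codePoint := byte1 bitAnd: 7. continuationCount := 3]].
    codePoint ifNil: [self error: 'malformed lead byte'].
    "Each continuation byte contributes its low six bits."
    continuationCount timesRepeat: [
        codePoint := (codePoint bitShift: 6) + (byteStream next bitAnd: 63)].
    codePoint].

decodeNext value: (ReadStream on: #[97]).                 "97, the letter a"
decodeNext value: (ReadStream on: #[195 164]).            "228, a-umlaut"
decodeNext value: (ReadStream on: #[226 130 172]).        "8364, the euro sign"
decodeNext value: (ReadStream on: #[240 159 152 128]).    "128512, U+1F600"
```

Nothing here is a Character until the final code point is known, which is the point of the byte1 to byte4 naming.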
On 1/22/17, Tobias Pape Das.Linux@gmx.de wrote:
On 22.01.2017, at 16:10, Levente Uzonyi leves@caesar.elte.hu wrote:
On Fri, 20 Jan 2017, Tobias Pape wrote:
On 19.01.2017, at 23:30, Levente Uzonyi leves@caesar.elte.hu wrote:
On Thu, 19 Jan 2017, Tobias Pape wrote:
Thanks Jacob. Any objections here I put this into trunk?
Yep. TextConverters are intended to work with MultiByte*Streams only.
Didn't know that.
Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.
I see. however, using asInteger sounds more reasonable _even though_ it is a character. Said bluntly, the responsibility of the TextConverter is to make Characters from that bloody numbers in that stream. I was confused to see that asciiValue returns something >127 in the first place.
#asInteger does the same thing as #asciiValue. While #asciiValue doesn't do what you would expect it to do, it has the advantage to clearly mark the class of the receiver (in this case).
Yes, and that's exactly why we should use #asInteger. To _not_ limit the receiver. Because the receiver isn't actually a Character, but some number, encoded in a Character, whose meaning is to be determined by this very method.
Also, how do we know that _basic_Next will always return a Character? (Yes, I know there's a binary check, but doesn't that only say something about #next, not #basicNext?)
If there are plans to make TextConverters work with more general streams, then I persume these changes won't be enough.
Clearly. But isn't this a step in the right direction?
Yes and no. There are at least two ways to go:
- Enhance the current stream library, even at the cost of breaking
things. A patch here and there won't work. There are fundamental changes required, like stackable streams, to make it desirable to use it over other libraries.
- Integrate an existing stream library with better features (e.g.
Xtreams) If we were to do this, we could gradually migrate existing code to the new library, and finally make the current stream library unloadable.
I like the idea of Xtreams, but I also like going baby steps.
The changes here help at least one person, won't hurt others and seem future proof. So?
Best regards -Tobias
Levente
Levente
Looks good from here. Best regards -Tobias On 19.01.2017, at 17:14, commits@source.squeak.org wrote:
A new version of Multilingual was added to project The Inbox:
http://source.squeak.org/inbox/Multilingual-jr.218.mcz

==================== Summary ====================

Name: Multilingual-jr.218
Author: jr
Time: 19 January 2017, 5:14:23.763655 pm
UUID: 36416c42-a4b4-554f-8203-aba25eee794f
Ancestors: Multilingual-tfel.217

support 'iso-8859-1' and do not let UTF8TextConverter expect that its input stream returns Characters from basicNext

A stream implementation might always return bytes from basicNext and expect the conversion to Character to be done solely by the TextConverter, so use asInteger instead of asciiValue to support both cases. Convert back with asCharacter.

=============== Diff against Multilingual-tfel.217 ===============

Item was changed:
  ----- Method: Latin1TextConverter class>>encodingNames (in category 'utilities') -----
  encodingNames

+     ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
-     ^ #('latin-1' 'latin1') copy.
  !

Item was changed:
  ----- Method: UTF8TextConverter>>nextFromStream: (in category 'conversion') -----
  nextFromStream: aStream

      | char1 value1 char2 value2 unicode char3 value3 char4 value4 |
      aStream isBinary ifTrue: [^ aStream basicNext].
      char1 := aStream basicNext.
      char1 ifNil:[^ nil].
+     value1 := char1 asInteger.
-     value1 := char1 asciiValue.
      value1 <= 127 ifTrue: [
          "1-byte char"
+         ^ char1 asCharacter
-         ^ char1
      ].

      "at least 2-byte char"
      char2 := aStream basicNext.
+     char2 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter)].
+     value2 := char2 asInteger.
-     char2 ifNil:[^self errorMalformedInput: (String with: char1)].
-     value2 := char2 asciiValue.

      (value1 bitAnd: 16rE0) = 192 ifTrue: [
          ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).
      ].

      "at least 3-byte char"
      char3 := aStream basicNext.
+     char3 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter)].
+     value3 := char3 asInteger.
-     char3 ifNil:[^self errorMalformedInput: (String with: char1 with: char2)].
-     value3 := char3 asciiValue.
      (value1 bitAnd: 16rF0) = 224 ifTrue: [
          unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6)
              + (value3 bitAnd: 63).
      ].

      (value1 bitAnd: 16rF8) = 240 ifTrue: [
          "4-byte char"
          char4 := aStream basicNext.
+         char4 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
+         value4 := char4 asInteger.
-         char4 ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
-         value4 := char4 asciiValue.
          unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
                      ((value2 bitAnd: 63) bitShift: 12) +
                      ((value3 bitAnd: 63) bitShift: 6) +
                      (value4 bitAnd: 63).
      ].

+     unicode ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
-     unicode ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
      unicode > 16r10FFFD ifTrue: [
+         ^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter).
-         ^self errorMalformedInput: (String with: char1 with: char2 with: char3).
      ].

      unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
      ^ Unicode value: unicode.
  !
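As an illustration of the commit message, here is a small workspace sketch. It assumes a Squeak image in which both Character and Integer respond to asInteger and asCharacter; the variable names are made up for this example and do not come from the patch.

```smalltalk
"Workspace sketch (illustrative only): the same two messages work whether
 basicNext hands back a Character or a plain byte."
| fromCharacterStream fromByteStream |
fromCharacterStream := $A.	"what a MultiByte*Stream would return"
fromByteStream := 65.		"what a byte-oriented stream might return"

fromCharacterStream asInteger.	"65"
fromByteStream asInteger.	"65"

fromCharacterStream asCharacter.	"$A"
fromByteStream asCharacter.	"$A"
```

Because asInteger and asCharacter are understood by both kinds of receiver, the patched method no longer needs to know which of the two its input stream produces.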
The Pharo version seems to be the Squeak version optimized for VisualWorks (ifNil: -> isNil ifTrue:).
Levente
On Mon, 23 Jan 2017, H. Hirzel wrote:
Below as a comparison the version in Pharo 5.0.
It is worth noting that one cannot speak of characters in a UTF-8 encoded stream that is read byte by byte until the bytes have been examined.
So the first thing I read is actually a byte. I can then check whether it is a one-byte character and, if so, return the character. Then I go for the next byte. If it indicates that we have a two-byte encoded UTF-8 character, I can return that character.
So I should have
byte1 := aStream basicNext.
... check if we have a one byte character, if yes return the character
byte2 := aStream basicNext.
... check if we have a two byte character, if yes return the character
byte3 := aStream basicNext.
... check if we have a three byte character, if yes return the character
byte4 := aStream basicNext.
... check if we have a four byte character, if yes return the character
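The check in each of these steps depends only on the leading byte. A minimal sketch of that classification follows, using the same bit masks as the methods quoted in this thread. The block name sequenceLength is hypothetical and not part of any Squeak or Pharo image.

```smalltalk
"Hypothetical helper, not part of Squeak or Pharo: answer how many bytes
 a UTF-8 sequence occupies, judging from its leading byte alone.
 Answers 0 for a continuation byte or an illegal leading byte."
| sequenceLength |
sequenceLength := [:byte |
	byte <= 127
		ifTrue: [1]
		ifFalse: [(byte bitAnd: 16rE0) = 16rC0
			ifTrue: [2]
			ifFalse: [(byte bitAnd: 16rF0) = 16rE0
				ifTrue: [3]
				ifFalse: [(byte bitAnd: 16rF8) = 16rF0
					ifTrue: [4]
					ifFalse: [0]]]]].

sequenceLength value: 16r41.	"1"
sequenceLength value: 16rC3.	"2"
sequenceLength value: 16rE2.	"3"
sequenceLength value: 16rF0.	"4"
sequenceLength value: 16r80.	"0, a continuation byte cannot start a sequence"
```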
nextFromStream: aStream
    | character1 value1 character2 value2 unicode character3 value3 character4 value4 |
    aStream isBinary ifTrue: [ ^ aStream basicNext ].
    character1 := aStream basicNext.
    character1 isNil ifTrue: [ ^ nil ].
    value1 := character1 asciiValue.
    value1 <= 127 ifTrue: [
        "1-byte character"
        ^ character1 ].

    "at least 2-byte character"
    character2 := aStream basicNext.
    character2 isNil ifTrue: [ ^ self errorMalformedInput ].
    value2 := character2 asciiValue.
    (value1 bitAnd: 16rE0) = 192 ifTrue: [
        ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63) ].

    "at least 3-byte character"
    character3 := aStream basicNext.
    character3 isNil ifTrue: [ ^ self errorMalformedInput ].
    value3 := character3 asciiValue.
    (value1 bitAnd: 16rF0) = 224 ifTrue: [
        unicode := ((value1 bitAnd: 15) bitShift: 12)
            + ((value2 bitAnd: 63) bitShift: 6)
            + (value3 bitAnd: 63) ].

    (value1 bitAnd: 16rF8) = 240 ifTrue: [
        "4-byte character"
        character4 := aStream basicNext.
        character4 isNil ifTrue: [ ^ self errorMalformedInput ].
        value4 := character4 asciiValue.
        unicode := ((value1 bitAnd: 16r7) bitShift: 18)
            + ((value2 bitAnd: 63) bitShift: 12)
            + ((value3 bitAnd: 63) bitShift: 6)
            + (value4 bitAnd: 63) ].

    unicode isNil ifTrue: [ ^ self errorMalformedInput ].
    unicode > 16r10FFFD ifTrue: [ ^ self errorMalformedInput ].
    unicode = 16rFEFF ifTrue: [ ^ self nextFromStream: aStream ].
    ^ Unicode value: unicode
On 1/22/17, Tobias Pape Das.Linux@gmx.de wrote:
On 22.01.2017, at 16:10, Levente Uzonyi leves@caesar.elte.hu wrote:
On Fri, 20 Jan 2017, Tobias Pape wrote:
On 19.01.2017, at 23:30, Levente Uzonyi leves@caesar.elte.hu wrote:
On Thu, 19 Jan 2017, Tobias Pape wrote:
Thanks Jacob. Any objections if I put this into trunk?
Yep. TextConverters are intended to work with MultiByte*Streams only.
Didn't know that.
Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.
I see. However, using asInteger sounds more reasonable _even though_ it is a character. Said bluntly, the responsibility of the TextConverter is to make Characters from those bloody numbers in that stream. I was confused to see that asciiValue returns something >127 in the first place.
#asInteger does the same thing as #asciiValue. While #asciiValue doesn't do what you would expect it to do, it has the advantage of clearly marking the class of the receiver (in this case).
Yes, and that's exactly why we should use #asInteger. To _not_ limit the receiver. Because the receiver isn't actually a Character, but some number, encoded in a Character, whose meaning is to be determined by this very method.
Also, how do we know that _basic_Next will always return a Character? (Yes, I know there's a binary check, but doesn't that only say something about #next, not #basicNext?)
If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.
Clearly. But isn't this a step in the right direction?
Yes and no. There are at least two ways to go:
- Enhance the current stream library, even at the cost of breaking things. A patch here and there won't work. There are fundamental changes required, like stackable streams, to make it desirable to use it over other libraries.
- Integrate an existing stream library with better features (e.g. Xtreams). If we were to do this, we could gradually migrate existing code to the new library, and finally make the current stream library unloadable.
I like the idea of Xtreams, but I also like going baby steps.
The changes here help at least one person, won't hurt others and seem future proof. So?
Best regards -Tobias
Levente
Levente
Looks good from here.
Best regards
-Tobias
Interesting in this context is the UTF-8 decoding implementation in Pharo 5, ZnUTF8Encoder (an alternative to UTF8TextConverter, it seems):
ZnUTF8Encoder>>nextFromStream: stream
    | code byte next |
    (byte := stream next) < 128
        ifTrue: [ ^ Character codePoint: byte ].
    (byte bitAnd: 2r11100000) == 2r11000000
        ifTrue: [
            code := byte bitAnd: 2r00011111.
            ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                ifFalse: [ ^ self errorIllegalContinuationByte ].
            code < 128 ifTrue: [ self errorOverlong ].
            ^ Character codePoint: code ].
    (byte bitAnd: 2r11110000) == 2r11100000
        ifTrue: [
            code := byte bitAnd: 2r00001111.
            2 timesRepeat: [
                ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                    ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                    ifFalse: [ ^ self errorIllegalContinuationByte ] ].
            code < 2048 ifTrue: [ self errorOverlong ].
            code = 65279 "Unicode Byte Order Mark"
                ifTrue: [
                    stream atEnd ifTrue: [ self errorIncomplete ].
                    ^ self nextFromStream: stream ].
            ^ Character codePoint: code ].
    (byte bitAnd: 2r11111000) == 2r11110000
        ifTrue: [
            code := byte bitAnd: 2r00000111.
            3 timesRepeat: [
                ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                    ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                    ifFalse: [ ^ self errorIllegalContinuationByte ] ].
            code < 65535 ifTrue: [ self errorOverlong ].
            ^ Character codePoint: code ].
    self errorIllegalLeadingByte
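To make the bit arithmetic concrete, here is a hand-worked sketch of two well-known UTF-8 sequences, using the same masks and shifts as the method above. It is plain integer arithmetic for a workspace, and the variable names are made up for the illustration.

```smalltalk
"Worked example (illustrative only): decode two UTF-8 sequences by hand."
| twoByte threeByte |

"16rC3 16rA9 encodes U+00E9 (e with acute accent)."
twoByte := ((16rC3 bitAnd: 2r00011111) bitShift: 6)
	+ (16rA9 bitAnd: 2r00111111).
twoByte.	"233, i.e. 16rE9"

"16rE2 16r82 16rAC encodes U+20AC (euro sign)."
threeByte := ((16rE2 bitAnd: 2r00001111) bitShift: 12)
	+ ((16r82 bitAnd: 2r00111111) bitShift: 6)
	+ (16rAC bitAnd: 2r00111111).
threeByte.	"8364, i.e. 16r20AC"
```

Both results pass the overlong checks in the method above (233 is not below 128, and 8364 is not below 2048), so neither sequence would be rejected as a longer-than-necessary encoding.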
squeak-dev@lists.squeakfoundation.org