[squeak-dev] The Inbox: Multilingual-jr.218.mcz

H. Hirzel hannes.hirzel at gmail.com
Mon Jan 23 18:24:25 UTC 2017


Interesting in this context is the UTF-8 decoding implementation in
Pharo 5, ZnUTF8Encoder (an alternative to UTF8TextConverter, it seems):

ZnUTF8Encoder>>nextFromStream: stream
	| code byte next |
	(byte := stream next) < 128
		ifTrue: [ ^ Character codePoint: byte ].
	(byte bitAnd: 2r11100000) == 2r11000000
		ifTrue: [
			code := byte bitAnd: 2r00011111.
			((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
				ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
				ifFalse: [ ^ self errorIllegalContinuationByte ].
			code < 128 ifTrue: [ self errorOverlong ].
			^ Character codePoint: code ].
	(byte bitAnd: 2r11110000) == 2r11100000
		ifTrue: [
			code := byte bitAnd: 2r00001111.
			2 timesRepeat: [
				((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
					ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
					ifFalse: [ ^ self errorIllegalContinuationByte ] ].
			code < 2048 ifTrue: [ self errorOverlong ].
			code = 65279 "Unicode Byte Order Mark" ifTrue: [
				stream atEnd ifTrue: [ self errorIncomplete ].
				^ self nextFromStream: stream ].
			^ Character codePoint: code ].
	(byte bitAnd: 2r11111000) == 2r11110000
		ifTrue: [
			code := byte bitAnd: 2r00000111.
			3 timesRepeat: [
				((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
					ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
					ifFalse: [ ^ self errorIllegalContinuationByte ] ].
			code < 65535 ifTrue: [ self errorOverlong ].
			^ Character codePoint: code ].
	self errorIllegalLeadingByte
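
As a worked example of the masks and shifts used above, here is a minimal
workspace sketch that hand-decodes the three bytes E2 82 AC (the euro sign,
U+20AC). It is not part of Zinc: it only uses Integer and ByteArray messages
that should exist in both Squeak and Pharo, and the error string is made up
for illustration.

```smalltalk
"Hand-decode one three-byte UTF-8 sequence with the same masks and shifts
 as the method above. Illustration only, not Zinc code."
| bytes code |
bytes := #[16rE2 16r82 16rAC].
"Lead byte 1110xxxx marks a three-byte sequence; keep its low 4 bits."
code := bytes first bitAnd: 2r00001111.
"Each continuation byte is 10xxxxxx and contributes 6 payload bits."
bytes allButFirst do: [ :next |
	(next bitAnd: 2r11000000) = 2r10000000
		ifFalse: [ self error: 'illegal continuation byte (hypothetical message)' ].
	code := (code bitShift: 6) + (next bitAnd: 2r00111111) ].
code "8364, i.e. 16r20AC"
```

The overlong checks in the method compare the result against the smallest
code point each length is allowed to encode: 128 for two bytes and 2048 for
three. As an aside, the smallest code point that needs four bytes is 16r10000
(65536), so the four-byte check against 65535 above appears to let an
overlong encoding of U+FFFF through.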

On 1/23/17, Levente Uzonyi <leves at caesar.elte.hu> wrote:
> The Pharo version seems to be the Squeak version optimized for
> VisualWorks (ifNil: -> isNil ifTrue:).
>
> Levente
>
> On Mon, 23 Jan 2017, H. Hirzel wrote:
>
>> Below, for comparison, is the version in Pharo 5.0.
>>
>> It is worth noting that one cannot speak of characters in a UTF-8
>> encoded stream that is read byte by byte until one has examined the
>> bytes.
>>
>> So the first thing I read is actually a byte. I can then check whether
>> it is a one-byte character and, if so, return the character. Otherwise
>> I go for the next byte. If the bytes so far form a two-byte encoded
>> UTF-8 character, I can return the character.
>>
>> So I should have
>>
>> byte1 := aStream basicNext.
>>
>> ... check if we have a one byte character, if yes return the character
>>
>> byte2 := aStream basicNext.
>>
>> ... check if we have a two byte character, if yes return the character
>>
>> byte3 := aStream basicNext.
>>
>> ... check if we have a three byte character, if yes return the character
>>
>>
>> byte4 := aStream basicNext.
>>
>> ... check if we have a four byte character, if yes return the character
>>
>>
>>
>>
>>
>> nextFromStream: aStream
>> 	| character1 value1 character2 value2 unicode character3 value3 character4 value4 |
>> 	aStream isBinary
>> 		ifTrue: [ ^ aStream basicNext ].
>> 	character1 := aStream basicNext.
>> 	character1 isNil
>> 		ifTrue: [ ^ nil ].
>> 	value1 := character1 asciiValue.
>> 	value1 <= 127
>> 		ifTrue: [
>> 			"1-byte character"
>> 			^ character1 ].	"at least 2-byte character"
>> 	character2 := aStream basicNext.
>> 	character2 isNil
>> 		ifTrue: [ ^ self errorMalformedInput ].
>> 	value2 := character2 asciiValue.
>> 	(value1 bitAnd: 16rE0) = 192
>> 		ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63) ].	"at least 3-byte character"
>> 	character3 := aStream basicNext.
>> 	character3 isNil
>> 		ifTrue: [ ^ self errorMalformedInput ].
>> 	value3 := character3 asciiValue.
>> 	(value1 bitAnd: 16rF0) = 224
>> 		ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
>> 	(value1 bitAnd: 16rF8) = 240
>> 		ifTrue: [
>> 			"4-byte character"
>> 			character4 := aStream basicNext.
>> 			character4 isNil
>> 				ifTrue: [ ^ self errorMalformedInput ].
>> 			value4 := character4 asciiValue.
>> 			unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd: 63) bitShift: 12)
>> 				+ ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
>> 	unicode isNil
>> 		ifTrue: [ ^ self errorMalformedInput ].
>> 	unicode > 16r10FFFD
>> 		ifTrue: [ ^ self errorMalformedInput ].
>> 	unicode = 16rFEFF
>> 		ifTrue: [ ^ self nextFromStream: aStream ].
>> 	^ Unicode value: unicode
>>
>>
>>
>>
>>
>> On 1/22/17, Tobias Pape <Das.Linux at gmx.de> wrote:
>>>
>>> On 22.01.2017, at 16:10, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>
>>>> On Fri, 20 Jan 2017, Tobias Pape wrote:
>>>>
>>>>>
>>>>> On 19.01.2017, at 23:30, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>>>
>>>>>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>>>>>> Thanks Jacob.
>>>>>>> Any objections if I put this into trunk?
>>>>>> Yep. TextConverters are intended to work with MultiByte*Streams only.
>>>>>
>>>>> Didn't know that.
>>>>>
>>>>>> Therefore #basicNext is expected to return a Character, provided the
>>>>>> stream is not binary. This is why the #isBinary check is the first
>>>>>> thing the method does.
>>>>>
>>>>> I see. However, using asInteger sounds more reasonable _even though_ it
>>>>> is a character. Said bluntly, the responsibility of the TextConverter
>>>>> is to make Characters from those bloody numbers in that stream.
>>>>> I was confused to see that asciiValue returns something >127 in the
>>>>> first place.
>>>>
>>>> #asInteger does the same thing as #asciiValue. While #asciiValue doesn't
>>>> do what you would expect it to do, it has the advantage of clearly
>>>> marking the class of the receiver (in this case).
>>>>
>>>
>>> Yes, and that's exactly why we should use #asInteger. To _not_ limit the
>>> receiver.
>>> Because the receiver isn't actually a Character, but some number, encoded
>>> in a Character, whose meaning is to be determined by this very method.
>>>
>>> Also, how do we know that _basic_Next will always return a Character?
>>> (Yes, I know there's a binary check, but doesn't that only say something
>>> about #next, not #basicNext?)
>>>
>>>
>>>>>
>>>>>> If there are plans to make TextConverters work with more general
>>>>>> streams, then I presume these changes won't be enough.
>>>>>
>>>>> Clearly.
>>>>> But isn't this a step in the right direction?
>>>>
>>>> Yes and no. There are at least two ways to go:
>>>>
>>>> 1. Enhance the current stream library, even at the cost of breaking
>>>> things. A patch here and there won't work. There are fundamental
>>>> changes required, like stackable streams, to make it desirable to use
>>>> it over other libraries.
>>>>
>>>> 2. Integrate an existing stream library with better features (e.g.
>>>> Xtreams). If we were to do this, we could gradually migrate existing
>>>> code to the new library, and finally make the current stream library
>>>> unloadable.
>>>
>>> I like the idea of Xtreams, but I also like taking baby steps.
>>>
>>> The changes here help at least one person, won't hurt others and seem
>>> future-proof.
>>> So?
>>>
>>> Best regards
>>> 	-Tobias
>>>
>>>
>>>
>>>> Levente
>>>>
>>>>>
>>>>>> Levente
>>>>>>> Looks good from here.
>>>>>>> Best regards
>>>>>>> 	-Tobias
>>>>>>> On 19.01.2017, at 17:14, commits at source.squeak.org wrote:
>>>>>>>> A new version of Multilingual was added to project The Inbox:
>>>>>>>> http://source.squeak.org/inbox/Multilingual-jr.218.mcz
>>>>>>>> ==================== Summary ====================
>>>>>>>> Name: Multilingual-jr.218
>>>>>>>> Author: jr
>>>>>>>> Time: 19 January 2017, 5:14:23.763655 pm
>>>>>>>> UUID: 36416c42-a4b4-554f-8203-aba25eee794f
>>>>>>>> Ancestors: Multilingual-tfel.217
>>>>>>>> support 'iso-8859-1' and do not let UTF8TextConverter expect that
>>>>>>>> its input stream returns Characters from basicNext
>>>>>>>> A stream implementation might always return bytes from basicNext and
>>>>>>>> expect the conversion to Character to be done solely by the
>>>>>>>> TextConverter, so use asInteger instead of asciiValue to support
>>>>>>>> both cases. Convert back with asCharacter.
>>>>>>>> =============== Diff against Multilingual-tfel.217 ===============
>>>>>>>> Item was changed:
>>>>>>>> ----- Method: Latin1TextConverter class>>encodingNames (in category 'utilities') -----
>>>>>>>> encodingNames
>>>>>>>> + 	^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>>>>>>>> - 	^ #('latin-1' 'latin1') copy.
>>>>>>>> !
>>>>>>>> Item was changed:
>>>>>>>> ----- Method: UTF8TextConverter>>nextFromStream: (in category 'conversion') -----
>>>>>>>> nextFromStream: aStream
>>>>>>>>
>>>>>>>> 	| char1 value1 char2 value2 unicode char3 value3 char4 value4 |
>>>>>>>> 	aStream isBinary ifTrue: [^ aStream basicNext].
>>>>>>>> 	char1 := aStream basicNext.
>>>>>>>> 	char1 ifNil:[^ nil].
>>>>>>>> + 	value1 := char1 asInteger.
>>>>>>>> - 	value1 := char1 asciiValue.
>>>>>>>> 	value1 <= 127 ifTrue: [
>>>>>>>> 		"1-byte char"
>>>>>>>> + 		^ char1 asCharacter
>>>>>>>> - 		^ char1
>>>>>>>> 	].
>>>>>>>>
>>>>>>>> 	"at least 2-byte char"
>>>>>>>> 	char2 := aStream basicNext.
>>>>>>>> + 	char2 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter)].
>>>>>>>> + 	value2 := char2 asInteger.
>>>>>>>> - 	char2 ifNil:[^self errorMalformedInput: (String with: char1)].
>>>>>>>> - 	value2 := char2 asciiValue.
>>>>>>>>
>>>>>>>> 	(value1 bitAnd: 16rE0) = 192 ifTrue: [
>>>>>>>> 		^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).
>>>>>>>> 	].
>>>>>>>>
>>>>>>>> 	"at least 3-byte char"
>>>>>>>> 	char3 := aStream basicNext.
>>>>>>>> + 	char3 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter)].
>>>>>>>> + 	value3 := char3 asInteger.
>>>>>>>> - 	char3 ifNil:[^self errorMalformedInput: (String with: char1 with: char2)].
>>>>>>>> - 	value3 := char3 asciiValue.
>>>>>>>> 	(value1 bitAnd: 16rF0) = 224 ifTrue: [
>>>>>>>> 		unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6)
>>>>>>>> 				+ (value3 bitAnd: 63).
>>>>>>>> 	].
>>>>>>>>
>>>>>>>> 	(value1 bitAnd: 16rF8) = 240 ifTrue: [
>>>>>>>> 		"4-byte char"
>>>>>>>> 		char4 := aStream basicNext.
>>>>>>>> + 		char4 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>>> + 		value4 := char4 asInteger.
>>>>>>>> - 		char4 ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
>>>>>>>> - 		value4 := char4 asciiValue.
>>>>>>>> 		unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>>>>> 					((value2 bitAnd: 63) bitShift: 12) +
>>>>>>>> 					((value3 bitAnd: 63) bitShift: 6) +
>>>>>>>> 					(value4 bitAnd: 63).
>>>>>>>> 	].
>>>>>>>> + 	unicode ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>>> - 	unicode ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
>>>>>>>> 	unicode > 16r10FFFD ifTrue: [
>>>>>>>> + 		^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter).
>>>>>>>> - 		^self errorMalformedInput: (String with: char1 with: char2 with: char3).
>>>>>>>> 	].
>>>>>>>>
>>>>>>>> 	unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
>>>>>>>> 	^ Unicode value: unicode.
>>>>>>>> !
>>>
>>>
>>>
>
>

