[squeak-dev] The Inbox: Multilingual-jr.218.mcz

Levente Uzonyi leves at caesar.elte.hu
Mon Jan 23 14:57:23 UTC 2017


The Pharo version seems to be the Squeak version optimized for 
VisualWorks (ifNil: -> isNil ifTrue:).
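
To make the rewrite concrete, here is a minimal workspace sketch. The
variable names and the #atEnd symbol are made up for illustration; only
the two nil-check forms come from the two versions of the method.

```smalltalk
"Hypothetical workspace snippet, not code from either image."
| char viaIfNil viaIsNil |
char := nil.	"what the quoted method expects from basicNext at the end of the stream"
viaIfNil := char ifNil: [#atEnd].		"Squeak style"
viaIsNil := char isNil ifTrue: [#atEnd].	"style used in the Pharo version"
viaIfNil = viaIsNil	"true when char is nil"
```

As statements with a ^ inside the block, the two forms behave the same.
As expressions they differ for a non-nil receiver: ifNil: answers the
receiver, while isNil ifTrue: answers nil.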

Levente

On Mon, 23 Jan 2017, H. Hirzel wrote:

> Below as a comparison the version in Pharo 5.0.
>
> It is worth noting that one cannot speak of characters in a
> UTF-8 encoded stream that is read byte by byte until one has examined
> the bytes.
>
> So the first thing I read is actually a byte. I can then examine
> whether it is a one-byte character and, if so, return the character.
> Otherwise I go for the next byte. If the bytes indicate a two-byte
> encoded UTF-8 character, I can return the character.
>
> So I should have
>
> byte1 := aStream basicNext.
>
> ... check if we have a one byte character, if yes return the character
>
> byte2 := aStream basicNext.
>
> ... check if we have a two byte character, if yes return the character
>
> byte3 := aStream basicNext.
>
> ... check if we have a three byte character, if yes return the character
>
>
> byte4 := aStream basicNext.
>
> ... check if we have a four byte character, if yes return the character
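
The steps above hinge on what the first byte says about the length of
the sequence. A minimal workspace sketch of that classification, using
the same masks as the converter (192, 224 and 240 written in hex). The
block name lengthFor and the sample bytes are assumptions for
illustration, and 0 is used here to mean "not a valid first byte":

```smalltalk
"Hypothetical workspace sketch, not the converter's code."
| lengthFor |
lengthFor := [:byte |
	byte <= 127
		ifTrue: [1]
		ifFalse: [
			(byte bitAnd: 16rE0) = 16rC0
				ifTrue: [2]
				ifFalse: [
					(byte bitAnd: 16rF0) = 16rE0
						ifTrue: [3]
						ifFalse: [
							(byte bitAnd: 16rF8) = 16rF0
								ifTrue: [4]
								ifFalse: [0]]]]].
#(16r41 16rC3 16rE2 16rF0 16r80) collect: [:each | lengthFor value: each]
"answers #(1 2 3 4 0)"
```

16r80 is a continuation byte, so it cannot start a sequence.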
>
>
>
>
>
> nextFromStream: aStream
> 	| character1 value1 character2 value2 unicode character3 value3
> character4 value4 |
> 	aStream isBinary
> 		ifTrue: [ ^ aStream basicNext ].
> 	character1 := aStream basicNext.
> 	character1 isNil
> 		ifTrue: [ ^ nil ].
> 	value1 := character1 asciiValue.
> 	value1 <= 127
> 		ifTrue: [
> 			"1-byte character"
> 			^ character1 ].	"at least 2-byte character"
> 	character2 := aStream basicNext.
> 	character2 isNil
> 		ifTrue: [ ^ self errorMalformedInput ].
> 	value2 := character2 asciiValue.
> 	(value1 bitAnd: 16rE0) = 192
> 		ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) +
> (value2 bitAnd: 63) ].	"at least 3-byte character"
> 	character3 := aStream basicNext.
> 	character3 isNil
> 		ifTrue: [ ^ self errorMalformedInput ].
> 	value3 := character3 asciiValue.
> 	(value1 bitAnd: 16rF0) = 224
> 		ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2
> bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
> 	(value1 bitAnd: 16rF8) = 240
> 		ifTrue: [
> 			"4-byte character"
> 			character4 := aStream basicNext.
> 			character4 isNil
> 				ifTrue: [ ^ self errorMalformedInput ].
> 			value4 := character4 asciiValue.
> 			unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd:
> 63) bitShift: 12)
> 				+ ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
> 	unicode isNil
> 		ifTrue: [ ^ self errorMalformedInput ].
> 	unicode > 16r10FFFD
> 		ifTrue: [ ^ self errorMalformedInput ].
> 	unicode = 16rFEFF
> 		ifTrue: [ ^ self nextFromStream: aStream ].
> 	^ Unicode value: unicode
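
As a worked example of the three-byte branch in the method above, the
bytes E2 82 AC (the UTF-8 encoding of the euro sign, U+20AC) can be run
through the same arithmetic in a workspace. The sample bytes are chosen
for illustration; the masks and shifts are the ones from the method.

```smalltalk
"Hypothetical workspace sketch reusing the method's arithmetic."
| value1 value2 value3 unicode |
value1 := 16rE2.
value2 := 16r82.
value3 := 16rAC.
(value1 bitAnd: 16rF0) = 224.	"true, so this is a 3-byte sequence"
unicode := ((value1 bitAnd: 15) bitShift: 12)
	+ ((value2 bitAnd: 63) bitShift: 6)
	+ (value3 bitAnd: 63).
unicode = 16r20AC	"true"
```

The three terms are 16r2000, 16r80 and 16r2C.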
>
>
>
>
>
> On 1/22/17, Tobias Pape <Das.Linux at gmx.de> wrote:
>>
>> On 22.01.2017, at 16:10, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>
>>> On Fri, 20 Jan 2017, Tobias Pape wrote:
>>>
>>>>
>>>> On 19.01.2017, at 23:30, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>>
>>>>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>>>>> Thanks Jacob.
>>>>>> Any objections if I put this into trunk?
>>>>> Yep. TextConverters are intended to work with MultiByte*Streams only.
>>>>
>>>> Didn't know that.
>>>>
>>>>> Therefore #basicNext is expected to return a Character, provided the
>>>>> stream is not binary. This is why the #isBinary check is the first thing
>>>>> the method does.
>>>>
>>>> I see. However, using asInteger sounds more reasonable _even though_ it
>>>> is a character. Said bluntly, the responsibility of the TextConverter is
>>>> to make Characters from those bloody numbers in that stream.
>>>> I was confused to see that asciiValue returns something >127 in the first
>>>> place.
>>>
>>> #asInteger does the same thing as #asciiValue. While #asciiValue doesn't
>>> do what you would expect it to do, it has the advantage of clearly marking
>>> the class of the receiver (in this case).
>>>
>>
>> Yes, and that's exactly why we should use #asInteger. To _not_ limit the
>> receiver.
>> Because the receiver isn't actually a Character, but some number, encoded in
>> a Character, whose meaning is to be determined by
>> this very method.
>>
>> Also, how do we know that _basic_Next will always return a Character?
>> (Yes, I know there's a binary check, but doesn't that only say something
>> about #next, not #basicNext?)
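
For reference, the two cases under discussion can be compared in a
workspace. A minimal sketch, assuming a current Squeak image; the value
195 and the variable names are picked for illustration.

```smalltalk
"Hypothetical workspace sketch: a Character versus a plain byte."
| char byte |
char := Character value: 195.	"what a MultiByte*Stream answers from basicNext"
byte := 195.			"what a byte-oriented stream might answer instead"
char asciiValue = char asInteger.	"true, as stated above"
byte asInteger = char asInteger.	"true, asInteger is also understood by Integers"
byte asCharacter = char asCharacter	"true, asCharacter on a Character answers itself"
```

This is the sense in which asInteger and asCharacter accept both kinds
of input, which is what the commit message relies on.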
>>
>>
>>>>
>>>>> If there are plans to make TextConverters work with more general
>>>>> streams, then I presume these changes won't be enough.
>>>>
>>>> Clearly.
>>>> But isn't this a step in the right direction?
>>>
>>> Yes and no. There are at least two ways to go:
>>>
>>> 1. Enhance the current stream library, even at the cost of breaking
>>> things.
>>> A patch here and there won't work. There are fundamental changes required,
>>> like stackable streams, to make it desirable to use it over other
>>> libraries.
>>>
>>> 2. Integrate an existing stream library with better features (e.g.
>>> Xtreams)
>>> If we were to do this, we could gradually migrate existing code to the new
>>> library, and finally make the current stream library unloadable.
>>
>> I like the idea of Xtreams, but I also like going baby steps.
>>
>> The changes here help at least one person, won't hurt others and seem future
>> proof.
>> So?
>>
>> Best regards
>> 	-Tobias
>>
>>
>>
>>> Levente
>>>
>>>>
>>>>> Levente
>>>>>> Looks good from here.
>>>>>> Best regards
>>>>>> 	-Tobias
>>>>>> On 19.01.2017, at 17:14, commits at source.squeak.org wrote:
>>>>>>> A new version of Multilingual was added to project The Inbox:
>>>>>>> http://source.squeak.org/inbox/Multilingual-jr.218.mcz
>>>>>>> ==================== Summary ====================
>>>>>>> Name: Multilingual-jr.218
>>>>>>> Author: jr
>>>>>>> Time: 19 January 2017, 5:14:23.763655 pm
>>>>>>> UUID: 36416c42-a4b4-554f-8203-aba25eee794f
>>>>>>> Ancestors: Multilingual-tfel.217
>>>>>>> support 'iso-8859-1' and do not let UTF8TextConverter expect that its
>>>>>>> input stream returns Characters from basicNext
>>>>>>> A stream implementation might always return bytes from basicNext and
>>>>>>> expect the conversion to Character to be done solely by the
>>>>>>> TextConverter, so use asInteger instead of asciiValue to support both
>>>>>>> cases. Convert back with asCharacter.
>>>>>>> =============== Diff against Multilingual-tfel.217 ===============
>>>>>>> Item was changed:
>>>>>>> ----- Method: Latin1TextConverter class>>encodingNames (in category
>>>>>>> 'utilities') -----
>>>>>>> encodingNames
>>>>>>> + 	^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>>>>>>> - 	^ #('latin-1' 'latin1') copy.
>>>>>>> !
>>>>>>> Item was changed:
>>>>>>> ----- Method: UTF8TextConverter>>nextFromStream: (in category
>>>>>>> 'conversion') -----
>>>>>>> nextFromStream: aStream
>>>>>>>
>>>>>>> 	| char1 value1 char2 value2 unicode char3 value3 char4 value4 |
>>>>>>> 	aStream isBinary ifTrue: [^ aStream basicNext].
>>>>>>> 	char1 := aStream basicNext.
>>>>>>> 	char1 ifNil:[^ nil].
>>>>>>> + 	value1 := char1 asInteger.
>>>>>>> - 	value1 := char1 asciiValue.
>>>>>>> 	value1 <= 127 ifTrue: [
>>>>>>> 		"1-byte char"
>>>>>>> + 		^ char1 asCharacter
>>>>>>> - 		^ char1
>>>>>>> 	].
>>>>>>>
>>>>>>> 	"at least 2-byte char"
>>>>>>> 	char2 := aStream basicNext.
>>>>>>> + 	char2 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter)].
>>>>>>> + 	value2 := char2 asInteger.
>>>>>>> - 	char2 ifNil:[^self errorMalformedInput: (String with: char1)].
>>>>>>> - 	value2 := char2 asciiValue.
>>>>>>>
>>>>>>> 	(value1 bitAnd: 16rE0) = 192 ifTrue: [
>>>>>>> 		^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd:
>>>>>>> 63).
>>>>>>> 	].
>>>>>>>
>>>>>>> 	"at least 3-byte char"
>>>>>>> 	char3 := aStream basicNext.
>>>>>>> + 	char3 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter with: char2 asCharacter)].
>>>>>>> + 	value3 := char3 asInteger.
>>>>>>> - 	char3 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>> char2)].
>>>>>>> - 	value3 := char3 asciiValue.
>>>>>>> 	(value1 bitAnd: 16rF0) = 224 ifTrue: [
>>>>>>> 		unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63)
>>>>>>> bitShift: 6)
>>>>>>> 				+ (value3 bitAnd: 63).
>>>>>>> 	].
>>>>>>>
>>>>>>> 	(value1 bitAnd: 16rF8) = 240 ifTrue: [
>>>>>>> 		"4-byte char"
>>>>>>> 		char4 := aStream basicNext.
>>>>>>> + 		char4 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>> + 		value4 := char4 asInteger.
>>>>>>> - 		char4 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>> char2 with: char3)].
>>>>>>> - 		value4 := char4 asciiValue.
>>>>>>> 		unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>>>> 					((value2 bitAnd: 63) bitShift: 12) +
>>>>>>> 					((value3 bitAnd: 63) bitShift: 6) +
>>>>>>> 					(value4 bitAnd: 63).
>>>>>>> 	].
>>>>>>> + 	unicode ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>> - 	unicode ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>> char2 with: char3)].
>>>>>>> 	unicode > 16r10FFFD ifTrue: [
>>>>>>> + 		^self errorMalformedInput: (String with: char1 asCharacter with:
>>>>>>> char2 asCharacter with: char3 asCharacter).
>>>>>>> - 		^self errorMalformedInput: (String with: char1 with: char2 with:
>>>>>>> char3).
>>>>>>> 	].
>>>>>>>
>>>>>>> 	unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
>>>>>>> 	^ Unicode value: unicode.
>>>>>>> !
>>
>>
>>

