[squeak-dev] The Inbox: Multilingual-jr.218.mcz
Levente Uzonyi
leves at caesar.elte.hu
Mon Jan 23 14:57:23 UTC 2017
The Pharo version seems to be the Squeak version optimized for
VisualWorks (ifNil: -> isNil ifTrue:).
Levente
On Mon, 23 Jan 2017, H. Hirzel wrote:
> Below, as a comparison, is the version in Pharo 5.0.
>
> It is worth noting that one cannot speak about characters in a
> UTF8 encoded stream which is read byte by byte until one has examined
> the bytes.
>
> So the first thing I read is actually a byte. I can then check whether
> it is a one-byte character and, if so, return the character. Then I go
> for the next byte. If it indicates that we have a two-byte encoded
> UTF8 character then I can return the character.
>
> So I should have
>
> byte1 := aStream basicNext.
>
> ... check if we have a one byte character, if yes return the character
>
> byte2 := aStream basicNext.
>
> ... check if we have a two byte character, if yes return the character
>
> byte3 := aStream basicNext.
>
> ... check if we have a three byte character, if yes return the character
>
>
> byte4 := aStream basicNext.
>
> ... check if we have a four byte character, if yes return the character
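
A minimal workspace sketch of the check outlined above, assuming plain integer byte values instead of a stream. The block name sequenceLength is hypothetical and not part of any TextConverter; it only illustrates how the high bits of the first byte tell how many bytes belong to the character:

```smalltalk
"Hypothetical helper: answer the expected UTF-8 sequence length for a
 lead byte, or 0 for a byte that cannot start a sequence."
| sequenceLength |
sequenceLength := [:byte |
	byte <= 16r7F
		ifTrue: [1]
		ifFalse: [(byte bitAnd: 16rE0) = 16rC0
			ifTrue: [2]
			ifFalse: [(byte bitAnd: 16rF0) = 16rE0
				ifTrue: [3]
				ifFalse: [(byte bitAnd: 16rF8) = 16rF0
					ifTrue: [4]
					ifFalse: [0]]]]].
#(16r41 16rC3 16rE2 16rF0 16r80)
	collect: [:each | sequenceLength value: each].
"evaluates to #(1 2 3 4 0)"
```

The last input, 16r80, is a continuation byte, so it cannot start a character and the sketch answers 0 for it.
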
>
>
>
>
>
> nextFromStream: aStream
> | character1 value1 character2 value2 unicode character3 value3
> character4 value4 |
> aStream isBinary
> ifTrue: [ ^ aStream basicNext ].
> character1 := aStream basicNext.
> character1 isNil
> ifTrue: [ ^ nil ].
> value1 := character1 asciiValue.
> value1 <= 127
> ifTrue: [
> "1-byte character"
> ^ character1 ]. "at least 2-byte character"
> character2 := aStream basicNext.
> character2 isNil
> ifTrue: [ ^ self errorMalformedInput ].
> value2 := character2 asciiValue.
> (value1 bitAnd: 16rE0) = 192
> ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) +
> (value2 bitAnd: 63) ]. "at least 3-byte character"
> character3 := aStream basicNext.
> character3 isNil
> ifTrue: [ ^ self errorMalformedInput ].
> value3 := character3 asciiValue.
> (value1 bitAnd: 16rF0) = 224
> ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2
> bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
> (value1 bitAnd: 16rF8) = 240
> ifTrue: [
> "4-byte character"
> character4 := aStream basicNext.
> character4 isNil
> ifTrue: [ ^ self errorMalformedInput ].
> value4 := character4 asciiValue.
> unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd:
> 63) bitShift: 12)
> + ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
> unicode isNil
> ifTrue: [ ^ self errorMalformedInput ].
> unicode > 16r10FFFD
> ifTrue: [ ^ self errorMalformedInput ].
> unicode = 16rFEFF
> ifTrue: [ ^ self nextFromStream: aStream ].
> ^ Unicode value: unicode
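
As a worked example of the bit arithmetic in the method above, here is a small workspace sketch that applies the same masks and shifts to two well-known sequences. The variable names are made up for illustration, and the byte values are literals rather than values read from a stream:

```smalltalk
| twoByte threeByte |
"Two-byte sequence C3 A9 encodes U+00E9."
twoByte := ((16rC3 bitAnd: 31) bitShift: 6) + (16rA9 bitAnd: 63).
"Three-byte sequence E2 82 AC encodes U+20AC."
threeByte := ((16rE2 bitAnd: 15) bitShift: 12)
	+ ((16r82 bitAnd: 63) bitShift: 6)
	+ (16rAC bitAnd: 63).
{twoByte = 16rE9. threeByte = 16r20AC}.
"evaluates to #(true true)"
```
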
>
>
>
>
>
> On 1/22/17, Tobias Pape <Das.Linux at gmx.de> wrote:
>>
>> On 22.01.2017, at 16:10, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>
>>> On Fri, 20 Jan 2017, Tobias Pape wrote:
>>>
>>>>
>>>> On 19.01.2017, at 23:30, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>>
>>>>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>>>>> Thanks Jacob.
>>>>>> Any objections to me putting this into trunk?
>>>>> Yep. TextConverters are intended to work with MultiByte*Streams only.
>>>>
>>>> Didn't know that.
>>>>
>>>>> Therefore #basicNext is expected to return a Character, provided the
>>>>> stream is not binary. This is why the #isBinary check is the first thing
>>>>> the method does.
>>>>
>>>> I see. However, using asInteger sounds more reasonable _even though_ it
>>>> is a character. Said bluntly, the responsibility of the TextConverter is
>>>> to make Characters from those bloody numbers in that stream.
>>>> I was confused to see that asciiValue returns something >127 in the first
>>>> place.
>>>
>>> #asInteger does the same thing as #asciiValue. While #asciiValue doesn't
>>> do what you would expect it to do, it has the advantage of clearly marking
>>> the class of the receiver (in this case).
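
A small workspace sketch of the point being discussed, assuming a current Squeak image. For a Character receiver both selectors answer the same number, while #asInteger and #asCharacter also accept an Integer, which is what the proposed change relies on:

```smalltalk
| char |
char := Character value: 233.
{char asciiValue. char asInteger. 233 asInteger. 233 asCharacter = char}.
"evaluates to #(233 233 233 true)"
```
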
>>>
>>
>> Yes, and that's exactly why we should use #asInteger. To _not_ limit the
>> receiver.
>> Because the receiver isn't actually a Character, but some number, encoded in
>> a Character, whose meaning is to be determined by
>> this very method.
>>
>> Also, how do we know that _basic_Next will always return a Character?
>> (Yes, I know there's a binary check, but doesn't that only say something
>> about #next, not #basicNext?)
>>
>>
>>>>
>>>>> If there are plans to make TextConverters work with more general
>>>>> streams, then I presume these changes won't be enough.
>>>>
>>>> Clearly.
>>>> But isn't this a step in the right direction?
>>>
>>> Yes and no. There are at least two ways to go:
>>>
>>> 1. Enhance the current stream library, even at the cost of breaking
>>> things.
>>> A patch here and there won't work. There are fundamental changes required,
>>> like stackable streams, to make it desirable to use it over other
>>> libraries.
>>>
>>> 2. Integrate an existing stream library with better features (e.g.
>>> Xtreams)
>>> If we were to do this, we could gradually migrate existing code to the new
>>> library, and finally make the current stream library unloadable.
>>
>> I like the idea of Xtreams, but I also like taking baby steps.
>>
>> The changes here help at least one person, won't hurt others and seem future
>> proof.
>> So?
>>
>> Best regards
>> -Tobias
>>
>>
>>
>>> Levente
>>>
>>>>
>>>>> Levente
>>>>>> Looks good from here.
>>>>>> Best regards
>>>>>> -Tobias
>>>>>> On 19.01.2017, at 17:14, commits at source.squeak.org wrote:
>>>>>>> A new version of Multilingual was added to project The Inbox:
>>>>>>> http://source.squeak.org/inbox/Multilingual-jr.218.mcz
>>>>>>> ==================== Summary ====================
>>>>>>> Name: Multilingual-jr.218
>>>>>>> Author: jr
>>>>>>> Time: 19 January 2017, 5:14:23.763655 pm
>>>>>>> UUID: 36416c42-a4b4-554f-8203-aba25eee794f
>>>>>>> Ancestors: Multilingual-tfel.217
>>>>>>> support 'iso-8859-1' and do not let UTF8TextConverter expect that its
>>>>>>> input stream returns Characters from basicNext
>>>>>>> A stream implementation might always return bytes from basicNext and
>>>>>>> expect the conversion to Character to be done solely by the
>>>>>>> TextConverter, so use asInteger instead of asciiValue to support both
>>>>>>> cases. Convert back with asCharacter.
>>>>>>> =============== Diff against Multilingual-tfel.217 ===============
>>>>>>> Item was changed:
>>>>>>> ----- Method: Latin1TextConverter class>>encodingNames (in category
>>>>>>> 'utilities') -----
>>>>>>> encodingNames
>>>>>>> + ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>>>>>>> - ^ #('latin-1' 'latin1') copy.
>>>>>>> !
>>>>>>> Item was changed:
>>>>>>> ----- Method: UTF8TextConverter>>nextFromStream: (in category
>>>>>>> 'conversion') -----
>>>>>>> nextFromStream: aStream
>>>>>>>
>>>>>>> | char1 value1 char2 value2 unicode char3 value3 char4 value4 |
>>>>>>> aStream isBinary ifTrue: [^ aStream basicNext].
>>>>>>> char1 := aStream basicNext.
>>>>>>> char1 ifNil:[^ nil].
>>>>>>> + value1 := char1 asInteger.
>>>>>>> - value1 := char1 asciiValue.
>>>>>>> value1 <= 127 ifTrue: [
>>>>>>> "1-byte char"
>>>>>>> + ^ char1 asCharacter
>>>>>>> - ^ char1
>>>>>>> ].
>>>>>>>
>>>>>>> "at least 2-byte char"
>>>>>>> char2 := aStream basicNext.
>>>>>>> + char2 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter)].
>>>>>>> + value2 := char2 asInteger.
>>>>>>> - char2 ifNil:[^self errorMalformedInput: (String with: char1)].
>>>>>>> - value2 := char2 asciiValue.
>>>>>>>
>>>>>>> (value1 bitAnd: 16rE0) = 192 ifTrue: [
>>>>>>> ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd:
>>>>>>> 63).
>>>>>>> ].
>>>>>>>
>>>>>>> "at least 3-byte char"
>>>>>>> char3 := aStream basicNext.
>>>>>>> + char3 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter with: char2 asCharacter)].
>>>>>>> + value3 := char3 asInteger.
>>>>>>> - char3 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>> char2)].
>>>>>>> - value3 := char3 asciiValue.
>>>>>>> (value1 bitAnd: 16rF0) = 224 ifTrue: [
>>>>>>> unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63)
>>>>>>> bitShift: 6)
>>>>>>> + (value3 bitAnd: 63).
>>>>>>> ].
>>>>>>>
>>>>>>> (value1 bitAnd: 16rF8) = 240 ifTrue: [
>>>>>>> "4-byte char"
>>>>>>> char4 := aStream basicNext.
>>>>>>> + char4 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>> + value4 := char4 asInteger.
>>>>>>> - char4 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>> char2 with: char3)].
>>>>>>> - value4 := char4 asciiValue.
>>>>>>> unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>>>> ((value2 bitAnd: 63) bitShift: 12) +
>>>>>>> ((value3 bitAnd: 63) bitShift: 6) +
>>>>>>> (value4 bitAnd: 63).
>>>>>>> ].
>>>>>>> + unicode ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>> - unicode ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>> char2 with: char3)].
>>>>>>> unicode > 16r10FFFD ifTrue: [
>>>>>>> + ^self errorMalformedInput: (String with: char1 asCharacter with:
>>>>>>> char2 asCharacter with: char3 asCharacter).
>>>>>>> - ^self errorMalformedInput: (String with: char1 with: char2 with:
>>>>>>> char3).
>>>>>>> ].
>>>>>>>
>>>>>>> unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
>>>>>>> ^ Unicode value: unicode.
>>>>>>> !
>>
>>
>>