[squeak-dev] The Inbox: Multilingual-jr.218.mcz
H. Hirzel
hannes.hirzel at gmail.com
Mon Jan 23 18:24:25 UTC 2017
Interesting in this context is the UTF-8 decoding implementation in
Pharo 5, ZnUTF8Encoder (an alternative to UTF8TextConverter, it seems):
ZnUTF8Encoder>>nextFromStream: stream
    | code byte next |
    (byte := stream next) < 128
        ifTrue: [ ^ Character codePoint: byte ].
    (byte bitAnd: 2r11100000) == 2r11000000
        ifTrue: [
            code := byte bitAnd: 2r00011111.
            ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                ifFalse: [ ^ self errorIllegalContinuationByte ].
            code < 128 ifTrue: [ self errorOverlong ].
            ^ Character codePoint: code ].
    (byte bitAnd: 2r11110000) == 2r11100000
        ifTrue: [
            code := byte bitAnd: 2r00001111.
            2 timesRepeat: [
                ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                    ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                    ifFalse: [ ^ self errorIllegalContinuationByte ] ].
            code < 2048 ifTrue: [ self errorOverlong ].
            code = 65279 "Unicode Byte Order Mark" ifTrue: [
                stream atEnd ifTrue: [ self errorIncomplete ].
                ^ self nextFromStream: stream ].
            ^ Character codePoint: code ].
    (byte bitAnd: 2r11111000) == 2r11110000
        ifTrue: [
            code := byte bitAnd: 2r00000111.
            3 timesRepeat: [
                ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                    ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                    ifFalse: [ ^ self errorIllegalContinuationByte ] ].
            code < 65535 ifTrue: [ self errorOverlong ].
            ^ Character codePoint: code ].
    self errorIllegalLeadingByte
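
As a cross-check of the arithmetic in the method above, here is a
minimal workspace sketch (an illustration only, not part of Zinc) that
decodes two byte sequences by hand. The variable names, the decode block
and the chosen byte sequences are assumptions made for the example. Only
bitAnd: and bitShift: are used, so it should evaluate the same way in
Squeak and Pharo.

```smalltalk
"Hypothetical workspace snippet, not Zinc code."
| euro overlong decode |
"Strip the length marker from the leading byte, then append the low
 6 bits of every continuation byte. No validation is done here."
decode := [:bytes :leadMask | | code |
    code := bytes first bitAnd: leadMask.
    2 to: bytes size do: [:i |
        code := (code bitShift: 6) + ((bytes at: i) bitAnd: 2r00111111)].
    code].
"E2 82 AC is the three-byte encoding of U+20AC (euro sign)."
euro := decode value: #[16rE2 16r82 16rAC] value: 2r00001111.
"F0 8F BF BF is an overlong four-byte encoding of U+FFFF."
overlong := decode value: #[16rF0 16r8F 16rBF 16rBF] value: 2r00000111.
"Both comparisons answer true."
{euro = 16r20AC. overlong = 16rFFFF}
```

The second result is worth a look. 16rFFFF fits in three bytes, so the
four-byte form is overlong, yet, if I read it correctly, the check
"code < 65535" in the method above does not flag it, because
65535 < 65535 is false. Comparing against 65536 (16r10000) would catch
that case.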
On 1/23/17, Levente Uzonyi <leves at caesar.elte.hu> wrote:
> The Pharo version seems to be the Squeak version optimized for
> VisualWorks (ifNil: -> isNil ifTrue:).
>
> Levente
>
> On Mon, 23 Jan 2017, H. Hirzel wrote:
>
>> Below as a comparison the version in Pharo 5.0.
>>
>> It is worth noting that one cannot speak of characters in a UTF-8
>> encoded stream that is read byte by byte until one has examined
>> the bytes.
>>
>> So the first thing I read is actually a byte. I can then check whether
>> it is a one-byte character and, if so, return the character. Otherwise
>> I read the next byte; if the two bytes form a two-byte encoded UTF-8
>> character, I can return that character.
>>
>> So I should have
>>
>> byte1 := aStream basicNext.
>>
>> ... check if we have a one byte character, if yes return the character
>>
>> byte2 := aStream basicNext.
>>
>> ... check if we have a two byte character, if yes return the character
>>
>> byte3 := aStream basicNext.
>>
>> ... check if we have a three byte character, if yes return the character
>>
>>
>> byte4 := aStream basicNext.
>>
>> ... check if we have a four byte character, if yes return the character
>>
>>
>>
>>
>>
>> nextFromStream: aStream
>>     | character1 value1 character2 value2 unicode character3 value3 character4 value4 |
>>     aStream isBinary
>>         ifTrue: [ ^ aStream basicNext ].
>>     character1 := aStream basicNext.
>>     character1 isNil
>>         ifTrue: [ ^ nil ].
>>     value1 := character1 asciiValue.
>>     value1 <= 127
>>         ifTrue: [
>>             "1-byte character"
>>             ^ character1 ].
>>     "at least 2-byte character"
>>     character2 := aStream basicNext.
>>     character2 isNil
>>         ifTrue: [ ^ self errorMalformedInput ].
>>     value2 := character2 asciiValue.
>>     (value1 bitAnd: 16rE0) = 192
>>         ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63) ].
>>     "at least 3-byte character"
>>     character3 := aStream basicNext.
>>     character3 isNil
>>         ifTrue: [ ^ self errorMalformedInput ].
>>     value3 := character3 asciiValue.
>>     (value1 bitAnd: 16rF0) = 224
>>         ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
>>     (value1 bitAnd: 16rF8) = 240
>>         ifTrue: [
>>             "4-byte character"
>>             character4 := aStream basicNext.
>>             character4 isNil
>>                 ifTrue: [ ^ self errorMalformedInput ].
>>             value4 := character4 asciiValue.
>>             unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd: 63) bitShift: 12)
>>                 + ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
>>     unicode isNil
>>         ifTrue: [ ^ self errorMalformedInput ].
>>     unicode > 16r10FFFD
>>         ifTrue: [ ^ self errorMalformedInput ].
>>     unicode = 16rFEFF
>>         ifTrue: [ ^ self nextFromStream: aStream ].
>>     ^ Unicode value: unicode
>>
>>
>>
>>
>>
>> On 1/22/17, Tobias Pape <Das.Linux at gmx.de> wrote:
>>>
>>> On 22.01.2017, at 16:10, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>
>>>> On Fri, 20 Jan 2017, Tobias Pape wrote:
>>>>
>>>>>
>>>>> On 19.01.2017, at 23:30, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>>>>>
>>>>>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>>>>>> Thanks Jacob.
>>>>>>> Any objections if I put this into trunk?
>>>>>> Yep. TextConverters are intended to work with MultiByte*Streams only.
>>>>>
>>>>> Didn't know that.
>>>>>
>>>>>> Therefore #basicNext is expected to return a Character, provided the
>>>>>> stream is not binary. This is why the #isBinary check is the first
>>>>>> thing
>>>>>> the method does.
>>>>>
>>>>> I see. However, using asInteger sounds more reasonable _even though_ it
>>>>> is a character. Said bluntly, the responsibility of the TextConverter
>>>>> is
>>>>> to make Characters from those bloody numbers in that stream.
>>>>> I was confused to see that asciiValue returns something >127 in the
>>>>> first
>>>>> place.
>>>>
>>>> #asInteger does the same thing as #asciiValue. While #asciiValue doesn't
>>>> do what you would expect it to do, it has the advantage of clearly
>>>> marking the class of the receiver (in this case).
>>>>
>>>
>>> Yes, and that's exactly why we should use #asInteger. To _not_ limit the
>>> receiver.
>>> Because the receiver isn't actually a Character, but some number, encoded
>>> in
>>> a Character, whose meaning is to be determined by
>>> this very method.
>>>
>>> Also, how do we know that _basic_Next will always return a Character?
>>> (Yes, I know there's a binary check, but doesn't that only say something
>>> about #next, not #basicNext?)
>>>
>>>
>>>>>
>>>>>> If there are plans to make TextConverters work with more general
>>>>>> streams, then I presume these changes won't be enough.
>>>>>
>>>>> Clearly.
>>>>> But isn't this a step in the right direction?
>>>>
>>>> Yes and no. There are at least two ways to go:
>>>>
>>>> 1. Enhance the current stream library, even at the cost of breaking
>>>> things.
>>>> A patch here and there won't work. There are fundamental changes
>>>> required,
>>>> like stackable streams, to make it desirable to use it over other
>>>> libraries.
>>>>
>>>> 2. Integrate an existing stream library with better features (e.g.
>>>> Xtreams)
>>>> If we were to do this, we could gradually migrate existing code to the
>>>> new
>>>> library, and finally make the current stream library unloadable.
>>>
>>> I like the idea of Xtreams, but I also like taking baby steps.
>>>
>>> The changes here help at least one person, won't hurt others and seem
>>> future
>>> proof.
>>> So?
>>>
>>> Best regards
>>> -Tobias
>>>
>>>
>>>
>>>> Levente
>>>>
>>>>>
>>>>>> Levente
>>>>>>> Looks good from here.
>>>>>>> Best regards
>>>>>>> -Tobias
>>>>>>> On 19.01.2017, at 17:14, commits at source.squeak.org wrote:
>>>>>>>> A new version of Multilingual was added to project The Inbox:
>>>>>>>> http://source.squeak.org/inbox/Multilingual-jr.218.mcz
>>>>>>>> ==================== Summary ====================
>>>>>>>> Name: Multilingual-jr.218
>>>>>>>> Author: jr
>>>>>>>> Time: 19 January 2017, 5:14:23.763655 pm
>>>>>>>> UUID: 36416c42-a4b4-554f-8203-aba25eee794f
>>>>>>>> Ancestors: Multilingual-tfel.217
>>>>>>>> support 'iso-8859-1' and do not let UTF8TextConverter expect that
>>>>>>>> its
>>>>>>>> input stream returns Characters from basicNext
>>>>>>>> A stream implementation might always return bytes from basicNext and
>>>>>>>> expect the conversion to Character to be done solely by the
>>>>>>>> TextConverter, so use asInteger instead of asciiValue to support
>>>>>>>> both
>>>>>>>> cases. Convert back with asCharacter.
>>>>>>>> =============== Diff against Multilingual-tfel.217 ===============
>>>>>>>> Item was changed:
>>>>>>>> ----- Method: Latin1TextConverter class>>encodingNames (in category
>>>>>>>> 'utilities') -----
>>>>>>>> encodingNames
>>>>>>>> + ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>>>>>>>> - ^ #('latin-1' 'latin1') copy.
>>>>>>>> !
>>>>>>>> Item was changed:
>>>>>>>> ----- Method: UTF8TextConverter>>nextFromStream: (in category
>>>>>>>> 'conversion') -----
>>>>>>>> nextFromStream: aStream
>>>>>>>>
>>>>>>>> | char1 value1 char2 value2 unicode char3 value3 char4 value4 |
>>>>>>>> aStream isBinary ifTrue: [^ aStream basicNext].
>>>>>>>> char1 := aStream basicNext.
>>>>>>>> char1 ifNil:[^ nil].
>>>>>>>> + value1 := char1 asInteger.
>>>>>>>> - value1 := char1 asciiValue.
>>>>>>>> value1 <= 127 ifTrue: [
>>>>>>>> "1-byte char"
>>>>>>>> + ^ char1 asCharacter
>>>>>>>> - ^ char1
>>>>>>>> ].
>>>>>>>>
>>>>>>>> "at least 2-byte char"
>>>>>>>> char2 := aStream basicNext.
>>>>>>>> + char2 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>>> asCharacter)].
>>>>>>>> + value2 := char2 asInteger.
>>>>>>>> - char2 ifNil:[^self errorMalformedInput: (String with: char1)].
>>>>>>>> - value2 := char2 asciiValue.
>>>>>>>>
>>>>>>>> (value1 bitAnd: 16rE0) = 192 ifTrue: [
>>>>>>>> ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2
>>>>>>>> bitAnd:
>>>>>>>> 63).
>>>>>>>> ].
>>>>>>>>
>>>>>>>> "at least 3-byte char"
>>>>>>>> char3 := aStream basicNext.
>>>>>>>> + char3 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>>> asCharacter with: char2 asCharacter)].
>>>>>>>> + value3 := char3 asInteger.
>>>>>>>> - char3 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>>>> char2)].
>>>>>>>> - value3 := char3 asciiValue.
>>>>>>>> (value1 bitAnd: 16rF0) = 224 ifTrue: [
>>>>>>>> unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd:
>>>>>>>> 63)
>>>>>>>> bitShift: 6)
>>>>>>>> + (value3 bitAnd: 63).
>>>>>>>> ].
>>>>>>>>
>>>>>>>> (value1 bitAnd: 16rF8) = 240 ifTrue: [
>>>>>>>> "4-byte char"
>>>>>>>> char4 := aStream basicNext.
>>>>>>>> + char4 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>>> + value4 := char4 asInteger.
>>>>>>>> - char4 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>>> with:
>>>>>>>> char2 with: char3)].
>>>>>>>> - value4 := char4 asciiValue.
>>>>>>>> unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>>>>> ((value2 bitAnd: 63) bitShift: 12) +
>>>>>>>> ((value3 bitAnd: 63) bitShift: 6) +
>>>>>>>> (value4 bitAnd: 63).
>>>>>>>> ].
>>>>>>>> + unicode ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>>> - unicode ifNil:[^self errorMalformedInput: (String with: char1
>>>>>>>> with:
>>>>>>>> char2 with: char3)].
>>>>>>>> unicode > 16r10FFFD ifTrue: [
>>>>>>>> + ^self errorMalformedInput: (String with: char1 asCharacter with:
>>>>>>>> char2 asCharacter with: char3 asCharacter).
>>>>>>>> - ^self errorMalformedInput: (String with: char1 with: char2 with:
>>>>>>>> char3).
>>>>>>>> ].
>>>>>>>>
>>>>>>>> unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
>>>>>>>> ^ Unicode value: unicode.
>>>>>>>> !
>>>
>>>
>>>
>
>