[squeak-dev] #nextChunk speedup, the future of multibyte streams

Sun Jan 31 18:54:47 UTC 2010

2010/1/31 Levente Uzonyi <leves at elte.hu>:
> On Sat, 30 Jan 2010, Igor Stasenko wrote:
>
>> On 30 January 2010 09:15, Bert Freudenberg <bert at freudenbergs.de> wrote:
>>>
>>> On 29.01.2010, at 20:07, Chris Cunningham wrote:
>>>>
>>>> On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi <leves at elte.hu> wrote:
>>>>>
>>>>> - it assumes that ! is encoded as byte 33 and whenever byte 33 occurs
>>>>> in
>>>>>  the encoded stream that byte is an encoded ! character
>>>>
>>>> The "whenever byte 33 occurs in the encoded stream that byte is an
>>>> encoded ! character" part of this seems suspect to me.  Are you
>>>> checking the bytes for byte 33, or are you still checking characters,
>>>> and one of the characters is byte 33, then you assume it is ! ?  If
>>>> you are just scanning bytes, I would assume that some UTF-8 characters
>>>> could have a byte 33 encoded in them.
>>>
>>> Wrong.
>>>
>>>> Although I'm not a UTF-8 expert.
>>>
>>> Obviously ;) See
>>>
>>> http://en.wikipedia.org/wiki/UTF-8#Description
>>>
>> Either way, the presence of ! character should be tested after
>> decoding utf8 data.
>
> Why? UTF-8 is ASCII compatible.
>

Well, UTF-8 is an octet stream (bytes), not characters, while we are
looking for the '!' character, not a byte.
Logically, the data flow should be as follows:
<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
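
As a rough illustration of that flow (a sketch in Python rather than Squeak, and not the actual #nextChunk implementation), the reader below decodes the bytes incrementally and only then looks for '!'. The names next_chunk and block_size are made up for the example, and it ignores any escaping of '!' inside a chunk.

```python
import codecs
import io


def next_chunk(byte_stream, encoding="utf-8", block_size=4096):
    """Return the decoded text up to (not including) the first '!' character.

    Hypothetical helper: bytes -> decoder -> characters -> search for '!'.
    Text decoded after the '!' is simply dropped here; a real reader
    would keep it buffered for the next chunk.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    while True:
        block = byte_stream.read(block_size)
        # An empty block means end of stream, so flush the decoder.
        text = decoder.decode(block, final=not block)
        head, bang, _rest = text.partition("!")
        parts.append(head)
        if bang or not block:
            return "".join(parts)


print(next_chunk(io.BytesIO("h\u00e9llo!world".encode("utf-8"))))
```

Since the search runs on decoded characters, passing a different encoding name (for example "utf-16") needs no other change.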

Sure, given the nature of the UTF-8 encoding you could take a shortcut,
but then, because of such hacks, you won't be able to
switch to a different encoding without pain:

<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'
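
To make the pain concrete, here is a small Python sketch (the function names are invented for the example) comparing a raw scan for byte 33 with a search done after decoding. With UTF-8 both find the same '!', because byte 33 never appears inside a multibyte sequence. With UTF-16 as the stand-in for <XYZ>, the byte scan stops in the middle of another character.

```python
BANG_BYTE = 0x21  # byte 33, which is '!' in ASCII and in UTF-8


def find_bang_in_bytes(data):
    """Shortcut: offset of the first byte 33, without decoding anything."""
    return data.find(bytes([BANG_BYTE]))


def find_bang_after_decoding(data, encoding):
    """Decode first, then return the character index of the first '!'."""
    return data.decode(encoding).find("!")


text = "\u0121b!"  # U+0121, then 'b', then the real '!'

utf8 = text.encode("utf-8")        # C4 A1 62 21
utf16 = text.encode("utf-16-be")   # 01 21 00 62 00 21

# UTF-8: the byte scan lands on the real '!' (offset 3, after 2 + 1 bytes).
print(find_bang_in_bytes(utf8), find_bang_after_decoding(utf8, "utf-8"))

# UTF-16: the byte scan hits the low byte of U+0121 at offset 1,
# while the real '!' is the third character.
print(find_bang_in_bytes(utf16), find_bang_after_decoding(utf16, "utf-16-be"))
```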

>
> Levente
>

-- 
Best regards,
Igor Stasenko AKA sig.