[squeak-dev] #nextChunk speedup, the future of multibyte streams

Levente Uzonyi leves at elte.hu
Sun Jan 31 23:26:11 UTC 2010


On Sun, 31 Jan 2010, Igor Stasenko wrote:

> 2010/1/31 Levente Uzonyi <leves at elte.hu>:
>> On Sat, 30 Jan 2010, Igor Stasenko wrote:
>>
>>> On 30 January 2010 09:15, Bert Freudenberg <bert at freudenbergs.de> wrote:
>>>>
>>>> On 29.01.2010, at 20:07, Chris Cunningham wrote:
>>>>>
>>>>> On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi <leves at elte.hu> wrote:
>>>>>>
>>>>>> - it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
>>>>>>   the encoded stream that byte is an encoded ! character
>>>>>
>>>>> The "whenever byte 33 occurs in the encoded stream that byte is an
>>>>> encoded ! character" part of this seems suspect to me.  Are you
>>>>> checking the bytes for byte 33, or are you still checking characters,
>>>>> and one of the characters is byte 33, then you assume it is ! ?  If
>>>>> you are just scanning bytes, I would assume that some UTF-8 characters
>>>>> could have a byte 33 encoded in them.
>>>>
>>>> Wrong.
>>>>
>>>>> Although I'm not a UTF-8 expert.
>>>>
>>>> Obviously ;) See
>>>>
>>>> http://en.wikipedia.org/wiki/UTF-8#Description
>>>>
>>> Either way, the presence of the ! character should be tested after
>>> decoding the UTF-8 data.
>>
>> Why? UTF-8 is ASCII compatible.
>>
>
> Well, UTF-8 is an octet stream (bytes), not characters, while we are
> looking for the '!' character, not a byte.
> Logically, the data flow should be the following:
> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'

This is far from reality, because
- #nextChunk doesn't work in binary mode:
   'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
   'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary
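
For readers without a Squeak image at hand, the chunk rule exercised by the
example above (a single ! ends the chunk, a doubled !! stands for one literal
!) can be sketched on raw bytes. This is only an illustrative Python sketch,
not Squeak's implementation: the name next_chunk_bytes is hypothetical, and
the sketch assumes the input is UTF-8 (or another ASCII-compatible encoding)
so that byte 33 always means !.

```python
BANG = 0x21  # byte 33, the ASCII/UTF-8 encoding of '!'


def next_chunk_bytes(data, pos=0):
    """Scan one chunk from raw bytes, starting at pos.

    Hypothetical helper for illustration only. Returns (chunk, new_pos).
    A single '!' terminates the chunk; '!!' yields one literal '!'.
    """
    n = len(data)
    # Skip leading whitespace before the chunk starts.
    while pos < n and data[pos] in b" \t\r\n":
        pos += 1
    out = bytearray()
    while pos < n:
        b = data[pos]
        pos += 1
        if b == BANG:
            if pos < n and data[pos] == BANG:
                pos += 1  # doubled terminator: keep one literal '!'
            else:
                return bytes(out), pos  # lone terminator: chunk is done
        out.append(b)
    return bytes(out), pos  # end of input without a terminator


chunk, _ = next_chunk_bytes(b"This is a chunk!!!")
print(chunk.decode("utf-8"))

# Non-ASCII text: scan the encoded bytes first, decode afterwards.
chunk, _ = next_chunk_bytes("Привет!!, мир!".encode("utf-8"))
print(chunk.decode("utf-8"))
```

Scanning first and decoding afterwards gives the same chunk as decoding first
only because of the stated assumption about the encoding.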

>
> Sure, due to the nature of the UTF-8 encoding you could take a shortcut,
> but then, because of such hacks, you won't be able to switch to a
> different encoding without pain:
>
> <primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'
>

That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be
   compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
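
Whether a given encoding is compatible with the byte-33 shortcut can at least
be probed empirically. The Python sketch below is an illustration under stated
assumptions, not a statement about what Squeak source files actually contain:
the helper name bang_byte_is_unambiguous and the sampled code point range are
hypothetical choices. It checks whether byte 33 ever appears in the encoding
of a character other than !. For UTF-8 the result also follows from the
design of the encoding (every byte of a multibyte sequence has its high bit
set); for the other encodings a passing sample is evidence, not proof.

```python
def bang_byte_is_unambiguous(encoding, sample):
    """Return True if no sampled character other than '!' encodes to
    a byte sequence containing byte 33 (0x21).

    Hypothetical helper for illustration only.
    """
    for ch in sample:
        if ch == "!":
            continue
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        if 0x21 in encoded:
            return False
    return True


# Assumed sample: code points U+0020 up to (not including) U+3000.
sample = [chr(cp) for cp in range(0x20, 0x3000)]

for name in ("utf-8", "latin-1", "utf-16-le", "utf-16-be"):
    print(name, bang_byte_is_unambiguous(name, sample))

# One concrete UTF-16 counterexample: CYRILLIC CAPITAL LETTER ES (U+0421)
print("С".encode("utf-16-be"))
```

With UTF-16, a character such as U+0421 contains byte 33 in either byte
order, so a byte-level scan for ! would cut a chunk in the wrong place.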


Levente

>>
>> Levente
>>
>
>
> -- 
> Best regards,
> Igor Stasenko AKA sig.
>
>

