[squeak-dev] Re: #nextChunk speedup, the future of multibyte streams

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Wed Feb 3 08:13:21 UTC 2010


I don't at all like having a String be a blob of bits subject to
encoding interpretation.
A String is a collection of characters, and there should be a canonical
encoding known to the VM.
utf8ToSqueak, squeakToUtf8, etc. are quick and dirty hacks.

We should use ByteArray or, better, introduce a UTF8String if it
becomes that important.
The code will be much, much, much cleaner and more foolproof.
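
To make the distinction concrete, here is a minimal sketch, written in Python
rather than Squeak because its str/bytes split mirrors the String/ByteArray
split. The UTF8String class and its method names are hypothetical, invented
for illustration; nothing like it is claimed to exist in the image.

```python
class UTF8String:
    """Hypothetical wrapper: bytes that are guaranteed to be valid UTF-8."""

    def __init__(self, data: bytes):
        # Validate once at construction; raises UnicodeDecodeError if invalid.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def from_text(cls, text: str) -> "UTF8String":
        # Encoding is an explicit step, never an implicit reinterpretation.
        return cls(text.encode("utf-8"))

    def to_text(self) -> str:
        # Decoding is explicit too, and always uses the one known encoding.
        return self._data.decode("utf-8")

    def byte_size(self) -> int:
        # Size of the encoded form in bytes.
        return len(self._data)

    def __len__(self) -> int:
        # Length in characters (code points), not in bytes.
        return len(self.to_text())


s = UTF8String.from_text("d\u00e9j\u00e0 vu")
char_count = len(s)          # 7 characters
byte_count = s.byte_size()   # 9 bytes: the two accented letters take 2 each
```

The point of the sketch is that the encoding travels with the type, so a
character string can never be mistaken for its encoded bytes.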

Nicolas

2010/2/2 Andreas Raab <andreas.raab at gmx.de>:
> Levente Uzonyi wrote:
>>
>> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>>>
>>> Well, utf8 is an octet stream (bytes), not characters, while we are
>>> seeking a '!' character, not a byte.
>>> Logically, the data flow should be as follows:
>>> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
>>
>> This is far from reality, because
>> - #nextChunk doesn't work in binary mode:
>>  'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
>>  'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
>> - text converters don't do any conversion if the stream is binary
>
> Right, although I think Igor's point is slightly different. You could
> implement #upTo: for example by applying the encoding to the argument and
> then doing #upToAllEncoded:, which takes an encoded character sequence as the
> argument. This would preserve the generality of #upTo: with the potential
> for more general speedup. I.e.,
>
> upTo: aCharacter
>  => upToEncoded: bytes
>     => primitive read
>     <= return encodedBytes
>  <= converter decode: encodedBytes
> <= returns characters
>
> (one assumption here is that the converter doesn't "embed" a particular
> character sequence as a part of another one, which is true for UTF-8, but I'm
> not sure about other encodings).
>
>> That's what my original questions were about (which are still unanswered):
>> - is it safe to assume that the encoding of source files will be
>>  compatible with this "hack"?
>> - is it safe to assume that the source files are always UTF-8 encoded?
>
> I think UTF-8 is going to be the only standard going forward, precisely
> because it has such (often overlooked) extremely useful properties. So yes,
> I think it'd be safe to assume that this will work going forward.
>
> Cheers,
>  - Andreas
>
>
>
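
The #upTo: flow Andreas outlines, and the assumption he attaches to it, can
be sketched as follows. This is an illustration in Python, not Squeak code:
up_to and up_to_encoded are hypothetical names standing in for #upTo: and the
encoded search, and the "primitive read" is simulated by reading the rest of
an in-memory stream. It does not handle the doubled '!!' escaping that
#nextChunk deals with.

```python
import io


def up_to_encoded(stream, encoded_delimiter):
    """Stand-in for the primitive read: scan raw bytes for the delimiter."""
    start = stream.tell()
    data = stream.read()
    index = data.find(encoded_delimiter)
    if index < 0:
        return data  # delimiter absent: everything up to the end
    # Leave the stream positioned just past the delimiter.
    stream.seek(start + index + len(encoded_delimiter))
    return data[:index]


def up_to(stream, delimiter, encoding="utf-8"):
    """Encode the argument, search in bytes, decode only the result."""
    encoded_delimiter = delimiter.encode(encoding)
    encoded = up_to_encoded(stream, encoded_delimiter)
    return encoded.decode(encoding)


source = io.BytesIO("na\u00efve chunk!rest".encode("utf-8"))
first = up_to(source, "!")                 # text before the '!'
remainder = source.read().decode("utf-8")  # text after the '!'

# The "no embedding" assumption: the bytes of '!' must never show up
# inside, or straddling, the encodings of other characters.
tricky = "\u0100\u2100"  # two characters, neither of which is '!'
utf8_hit = tricky.encode("utf-8").find("!".encode("utf-8"))            # -1
utf16_hit = tricky.encode("utf-16-be").find("!".encode("utf-16-be"))  # 1
```

In UTF-8 the byte search finds nothing in the tricky string, because bytes
below 128 only ever encode ASCII characters. In UTF-16 (big-endian here) the
same search reports a false match straddling the two characters, so a
byte-level search is not safe for every encoding.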


