[squeak-dev] Re: #nextChunk speedup, the future of multibyte streams

Andreas Raab andreas.raab at gmx.de
Tue Feb 2 03:39:13 UTC 2010


Levente Uzonyi wrote:
> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>> Well, utf8 is an octet stream (bytes), not characters. While we are
>> seeking for '!' character, not byte.
>> Logically, the data flow should be following:
>> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
> 
> This is far from reality, because
> - #nextChunk doesn't work in binary mode:
>   'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
>   'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
> - text converters don't do any conversion if the stream is binary

Right, although I think Igor's point is slightly different. You could 
implement #upTo:, for example, by applying the encoding to the argument 
and then calling #upToAllEncoded:, which takes an encoded character 
sequence as the argument. This would preserve the generality of #upTo: 
with the potential for a more general speedup. I.e.,

upTo: aCharacter
   => upToEncoded: bytes
      => primitive read
      <= return encodedBytes
   <= converter decode: encodedBytes
<= returns characters

(One assumption here is that the encoding never "embeds" the encoded 
form of one character sequence inside the encoded form of another. That 
is true for UTF-8, but I'm not sure about other encodings.)
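
The flow above can be sketched roughly as follows. This is a minimal 
illustration in Python rather than Squeak, so none of it is Squeak API: 
the names up_to and up_to_encoded are hypothetical stand-ins for #upTo: 
and the encoded search, and bytes.find plays the role of the primitive 
read. It does not handle the doubled '!' escaping that #nextChunk needs.

```python
import io


def up_to_encoded(stream: io.BytesIO, delimiter: bytes) -> bytes:
    """Return the raw bytes before `delimiter`, leaving the stream
    positioned just past it (or at the end if it is not found)."""
    start = stream.tell()
    buf = stream.getvalue()
    end = buf.find(delimiter, start)  # stand-in for the primitive scan
    if end < 0:
        stream.seek(len(buf))
        return buf[start:]
    stream.seek(end + len(delimiter))
    return buf[start:end]


def up_to(stream: io.BytesIO, character: str, encoding: str = "utf-8") -> str:
    """Encode the argument once, search at the byte level, and decode
    only the bytes that were actually consumed."""
    encoded_bytes = up_to_encoded(stream, character.encode(encoding))
    return encoded_bytes.decode(encoding)
```

For example, on a stream holding the UTF-8 bytes of 'héllo wörld!rest', 
up_to(stream, '!') yields 'héllo wörld' and leaves 'rest' unread. The 
decode step only runs once per call, over the bytes in front of the 
delimiter.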

> That's what my original questions were about (which are still unanswered):
> - is it safe to assume that the encoding of source files will be
>   compatible with this "hack"?
> - is it safe to assume that the source files are always UTF-8 encoded?

I think UTF-8 is going to be the only standard going forward, precisely 
because it has such (often overlooked) extremely useful properties. So 
yes, I think it'd be safe to assume that this will work going forward.
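
One of those properties can be checked directly. The sketch below (in 
Python, with a hypothetical helper name has_false_match) asks whether a 
byte-level search for the encoded delimiter would hit inside text that 
does not contain the delimiter character at all. UTF-16 is used purely 
as a contrasting example; the thread itself does not discuss it.

```python
def has_false_match(text: str, delimiter: str, encoding: str) -> bool:
    """True if the encoded delimiter occurs in the encoded text even
    though the delimiter character itself does not occur in the text."""
    if delimiter in text:
        raise ValueError("text must not contain the delimiter")
    return delimiter.encode(encoding) in text.encode(encoding)


# U+2100 followed by U+0100; neither character is '!'.
sample = "\u2100\u0100"

# UTF-8: E2 84 80 C4 80. Every byte of a multi-byte sequence is >= 0x80,
# so the byte 0x21 ('!') cannot appear.
utf8_hit = has_false_match(sample, "!", "utf-8")

# UTF-16LE: 00 21 00 01. The bytes 21 00 (the encoding of '!') appear
# straddling the two characters, so a naive byte search would misfire.
utf16_hit = has_false_match(sample, "!", "utf-16-le")
```

Here utf8_hit is False and utf16_hit is True, which is the kind of 
embedding the byte-level search has to be able to rule out.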

Cheers,
   - Andreas



