Levente Uzonyi wrote:
On Sun, 31 Jan 2010, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following: <primitive> -> ByteArray -> UTF-8 reader -> character stream -> '!'
This is far from reality, because
- #nextChunk doesn't work in binary mode:
  'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
  'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary
Right, although I think Igor's point is slightly different. You could implement #upTo:, for example, by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as its argument. This would preserve the generality of #upTo: with the potential for a more general speedup. I.e.,
upTo: aCharacter
  => upToEncoded: bytes
  => primitive read
  <= return encodedBytes
  <= converter decode: encodedBytes
  <= returns characters
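The pipeline above can be sketched outside Smalltalk as well. Here is a minimal illustration in Python (the function name up_to and its return shape are hypothetical, chosen only to mirror the #upTo:/#upToEncoded: idea): the delimiter is encoded once, the raw byte stream is searched without decoding, and only the matched prefix is decoded back to characters.

```python
def up_to(byte_stream: bytes, delimiter: str, encoding: str = "utf-8"):
    """Return (characters before the first `delimiter`, remaining bytes),
    searching the *encoded* bytes and decoding only the matched prefix."""
    encoded_delim = delimiter.encode(encoding)
    idx = byte_stream.find(encoded_delim)  # byte-level search, no decoding
    if idx == -1:
        # No delimiter: decode and return everything, nothing left over.
        return byte_stream.decode(encoding), b""
    # Decode just the prefix; the tail stays as undecoded bytes.
    return (byte_stream[:idx].decode(encoding),
            byte_stream[idx + len(encoded_delim):])

data = "première chunk!rest".encode("utf-8")
head, rest = up_to(data, "!")
# head is 'première chunk', rest is b'rest'
```

The point of the sketch is that the expensive character-by-character decoding never happens during the scan; it happens once, on the prefix, after the byte search has found the boundary.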
(one assumption here is that the converter doesn't "embed" a particular character sequence as a part of another one which is true for UTF-8 but I'm not sure about other encodings).
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
I think UTF-8 is going to be the only standard going forward, precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will keep working.
Cheers, - Andreas