[squeak-dev] Re: #nextChunk speedup, the future of multibyte streams
Andreas Raab
andreas.raab at gmx.de
Tue Feb 2 03:39:13 UTC 2010
Levente Uzonyi wrote:
> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>> Well, utf8 is an octet stream (bytes), not characters. While we are
>> seeking for '!' character, not byte.
>> Logically, the data flow should be following:
>> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
>
> This is far from reality, because
> - #nextChunk doesn't work in binary mode:
> 'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
> 'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
> - text converters don't do any conversion if the stream is binary
Right, although I think Igor's point is slightly different. You could
implement #upTo: for example by applying the encoding to the argument and
then calling #upToAllEncoded:, which takes an encoded character sequence as
the argument. This would preserve the generality of #upTo: with the
potential for more general speedup. I.e.,
upTo: aCharacter
  => upToEncoded: bytes
    => primitive read
    <= return encodedBytes
  <= converter decode: encodedBytes
<= returns characters
(One assumption here is that the converter doesn't "embed" a particular
character sequence as a part of another one, which is true for UTF-8, but
I'm not sure about other encodings.)
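The flow above can be sketched roughly as follows. This is a minimal illustration in Python rather than Squeak, and the function name `up_to` is hypothetical; `bytes.find` merely stands in for whatever primitive would scan the raw bytes. It encodes the delimiter once, searches the undecoded bytes, and decodes only the part that is returned.

```python
def up_to(data: bytes, delimiter: str, encoding: str = "utf-8") -> str:
    """Return the decoded text that precedes the first delimiter.

    Hypothetical helper: the search runs on encoded bytes, and only the
    matched prefix is decoded afterwards.
    """
    # Apply the encoding to the argument instead of decoding the input.
    encoded_delimiter = delimiter.encode(encoding)
    # Byte-level search; stands in for the primitive read/scan.
    end = data.find(encoded_delimiter)
    if end == -1:
        # No delimiter present: treat the whole input as the result.
        end = len(data)
    # Decode only the bytes that are actually returned.
    return data[:end].decode(encoding)


raw = "naïve café! rest".encode("utf-8")
print(up_to(raw, "!"))  # naïve café
```

The sketch is only correct under the assumption stated above: the encoded delimiter must never occur inside the encoding of other characters, which holds for UTF-8.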
> That's what my original questions were about (which are still unanswered):
> - is it safe to assume that the encoding of source files will be
> compatible with this "hack"?
> - is it safe to assume that the source files are always UTF-8 encoded?
I think UTF-8 is going to be the only standard going forward, precisely
because it has such (often overlooked) extremely useful properties. So
yes, I think it'd be safe to assume that this will work going forward.
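One such property can be checked directly: in UTF-8 the byte for '!' never appears inside a multibyte sequence, so a raw byte search always lands on a character boundary. The sketch below is in Python, not Squeak, and the helper names are hypothetical. UTF-16 is used only as a contrasting encoding of my own choosing; it is not mentioned in the thread.

```python
def naive_find(text: str, delimiter: str, encoding: str) -> int:
    """Byte offset of the first raw match of the encoded delimiter, or -1."""
    return text.encode(encoding).find(delimiter.encode(encoding))


def true_find(text: str, delimiter: str, encoding: str) -> int:
    """Byte offset of the delimiter when matched on character boundaries."""
    index = text.find(delimiter)
    if index == -1:
        return -1
    # Length in bytes of everything before the real delimiter.
    return len(text[:index].encode(encoding))


# U+2141 and U+0100, followed by a real '!'.
sample = "\u2141\u0100!"

# UTF-8: the raw byte search agrees with the character-level search.
print(naive_find(sample, "!", "utf-8"), true_find(sample, "!", "utf-8"))  # 5 5

# UTF-16-LE: bytes are 41 21 | 00 01 | 21 00, so the pattern 21 00
# matches across the boundary of the first two characters.
print(naive_find(sample, "!", "utf-16-le"), true_find(sample, "!", "utf-16-le"))  # 1 4
```

In the UTF-16 case the raw search reports a match that straddles two unrelated characters, which is exactly the "embedding" problem a byte-level #upTo: would have to rule out for any non-UTF-8 encoding.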
Cheers,
- Andreas