[squeak-dev] Re: #nextChunk speedup, the future of multibyte
leves at elte.hu
Thu Feb 4 04:26:55 UTC 2010
On Mon, 1 Feb 2010, Andreas Raab wrote:
> Levente Uzonyi wrote:
>> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>>> Well, utf8 is an octet stream (bytes), not characters, while we are
>>> looking for a '!' character, not a byte.
>>> Logically, the data flow should be as follows:
>>> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
>> This is far from reality, because
>> - #nextChunk doesn't work in binary mode:
>> 'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
>> 'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
>> - text converters don't do any conversion if the stream is binary
> Right, although I think Igor's point is slightly different. You could
> implement #upTo: for example by applying the encoding to the argument and then
> doing #upToAllEncoded:, which takes an encoded character sequence as the
> argument. This would preserve the generality of #upTo: with the potential for
> a more general speedup. I.e.,
> upTo: aCharacter
> => upToEncoded: bytes
> => primitive read
> <= return encodedBytes
> <= converter decode: encodedBytes
> <= returns characters
> (one assumption here is that the converter doesn't "embed" a particular
> character sequence as a part of another one which is true for UTF-8 but I'm
> not sure about other encodings).
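
The flow quoted above (encode the delimiter, search the raw bytes, decode only the result) can be sketched as follows. This is a minimal illustration in Python rather than Squeak code; the function name up_to and its parameters are hypothetical and do not correspond to any existing Squeak method. It also shows why the "no embedding" assumption matters: with UTF-8 the byte search is safe, while with UTF-16 a naive byte search can match across a character boundary.

```python
def up_to(data: bytes, delimiter: str, encoding: str = "utf-8") -> str:
    """Return the text before the first delimiter.

    The delimiter is encoded once, the search runs on raw bytes,
    and only the bytes before the match are decoded.
    """
    encoded_delimiter = delimiter.encode(encoding)
    index = data.find(encoded_delimiter)
    prefix = data if index < 0 else data[:index]
    return prefix.decode(encoding)


# UTF-8: the byte 0x21 ('!') never occurs inside a multi-byte sequence,
# so the byte-level search finds only real '!' characters.
utf8_data = "héllo wörld!rest".encode("utf-8")
utf8_result = up_to(utf8_data, "!")

# UTF-16 (little endian): '!' encodes as 21 00, but the two characters
# U+2100 and U+0100 encode as 00 21 00 01, which contains 21 00 at an
# odd offset even though the text has no '!' in it.
utf16_data = "\u2100\u0100".encode("utf-16-le")
false_match_offset = utf16_data.find("!".encode("utf-16-le"))
```

Here utf8_result is the text before the '!', and false_match_offset is 1, a misaligned match that a byte-level search would wrongly treat as a delimiter.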
Another way to do this is to let the converter read the next chunk.
"TextConverter >> #nextChunkFrom: stream" could use the current
implementation of MultiByteFileStream >> #nextChunk, while
UTF8TextConverter could use #upTo: (this would also let us avoid the
#basicUpTo: hack). So we could use any encoding, while speeding up the
Maybe we could also move the encoding/decoding related methods/tables
from String and subclasses to the (class side of the) TextConverters.
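
A rough sketch of the "let the converter read the next chunk" idea, written in Python for illustration only. The class and method names mirror the ones mentioned above but are hypothetical here, the generic converter assumes a single-byte encoding for simplicity, and the chunk handling is reduced to the '!' terminator and the '!!' escape; it is not the actual Squeak implementation.

```python
import io


class TextConverter:
    """Generic slow path: decode and inspect one character at a time.

    Assumption for this sketch: a single-byte encoding (Latin-1).
    """
    encoding = "latin-1"

    def next_char_from(self, stream):
        byte = stream.read(1)
        return byte.decode(self.encoding) if byte else None

    def next_chunk_from(self, stream):
        chars = []
        while (ch := self.next_char_from(stream)) is not None:
            if ch == "!":
                position = stream.tell()
                if self.next_char_from(stream) != "!":
                    # A lone '!' ends the chunk; un-read the peeked character.
                    stream.seek(position)
                    break
                # '!!' is an escaped '!': fall through and keep one '!'.
            chars.append(ch)
        return "".join(chars)


class UTF8TextConverter(TextConverter):
    """Fast path: search the raw bytes for '!' and decode once at the end."""
    encoding = "utf-8"

    def next_chunk_from(self, stream):
        start = stream.tell()
        data = stream.read()  # simplification: read everything that is left
        parts, pos = [], 0
        while True:
            bang = data.find(b"!", pos)
            if bang < 0:  # no terminator: the rest of the data is the chunk
                parts.append(data[pos:])
                end = len(data)
                break
            parts.append(data[pos:bang])
            if data[bang + 1:bang + 2] == b"!":  # escaped '!'
                parts.append(b"!")
                pos = bang + 2
            else:  # terminator
                end = bang + 1
                break
        stream.seek(start + end)  # leave the stream just after the chunk
        return b"".join(parts).decode(self.encoding)


generic = TextConverter().next_chunk_from(io.BytesIO(b"This is a chunk!!!"))
fast = UTF8TextConverter().next_chunk_from(io.BytesIO(b"This is a chunk!!!"))
```

Both converters return 'This is a chunk!' for this input, matching the #nextChunk example quoted earlier. The stream only has to ask its converter for the next chunk, so other encodings keep the generic path while UTF-8 gets the byte-level search.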
>> That's what my original questions were about (which are still unanswered):
>> - is it safe to assume that the encoding of source files will be
>> compatible with this "hack"?
>> - is it safe to assume that the source files are always UTF-8 encoded?
> I think UTF-8 is going to be the only standard going forward. Precisely
> because it has such (often overlooked) extremely useful properties. So yes, I
> think it'd be safe to assume that this will work going forward.
> - Andreas