[squeak-dev] Re: #nextChunk speedup, the future of multibyte streams

Thu Feb 4 04:26:55 UTC 2010

On Mon, 1 Feb 2010, Andreas Raab wrote:

> Levente Uzonyi wrote:
>> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>>> Well, UTF-8 is an octet stream (bytes), not characters, while we are
>>> seeking a '!' character, not a byte.
>>> Logically, the data flow should be as follows:
>>> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
>> 
>> This is far from reality, because
>> - #nextChunk doesn't work in binary mode:
>>   'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
>>   'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
>> - text converters don't do any conversion if the stream is binary
>
> Right, although I think Igor's point is slightly different. You could 
> implement #upTo: for example by applying the encoding to the argument and then 
> do #upToAllEncoded: which takes an encoded character sequence as the 
> argument. This would preserve the generality of #upTo: with the potential for 
> more general speedup. I.e.,
>
> upTo: aCharacter
>  => upToEncoded: bytes
>     => primitive read
>     <= return encodedBytes
>  <= converter decode: encodedBytes
> <= returns characters
>
> (one assumption here is that the converter doesn't "embed" a particular 
> character sequence as a part of another one which is true for UTF-8 but I'm 
> not sure about other encodings).
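
The flow quoted above can be sketched roughly as follows. This is an 
illustration in Python rather than Squeak code: the function name up_to and 
its (data, pos) arguments are made up for the example and do not correspond 
to any existing stream or converter method.

```python
def up_to(data, pos, char, encoding):
    """Read characters from data[pos:] up to (not including) char.

    Hypothetical helper: encode the delimiter once, search the raw bytes,
    and decode only the slice that was found.
    Returns (decoded_text, position_after_delimiter).
    """
    needle = char.encode(encoding)       # the "encoded argument"
    hit = data.find(needle, pos)         # stands in for the primitive read
    if hit < 0:                          # delimiter absent: take the rest
        return data[pos:].decode(encoding), len(data)
    return data[pos:hit].decode(encoding), hit + len(needle)


if __name__ == "__main__":
    source = "héllo!rest".encode("utf-8")
    print(up_to(source, 0, "!", "utf-8"))
```

The byte search is only safe when an encoded character can never show up 
inside the encoding of other characters. UTF-8 guarantees this. UTF-16-LE 
does not: U+2100 encodes as the bytes 00 21 and '!' as 21 00, so in the 
sequence 00 21 00 21 the search finds a false '!' at byte offset 1.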

Another way to do this is to let the converter read the next chunk.
"TextConverter >> #nextChunkFrom: stream" could use the current 
implementation of MultiByteFileStream >> #nextChunk, while 
UTF8TextConverter could use #upTo: (this would also let us avoid the 
#basicUpTo: hack). So we could use any encoding, while speeding up the 
UTF-8 case.
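
A minimal sketch of that split, written in Python for illustration only. The 
class and method names mirror the ones mentioned above, but the code is 
hypothetical: it is not the Squeak implementation, it works on a byte string 
plus a position instead of a stream, and it leaves out the skipping of 
leading separators that #nextChunk performs. A doubled '!' is treated as an 
escaped '!', as in the chunk format.

```python
import codecs

TERMINATOR = "!"


class TextConverter:
    """Generic path: decode character by character, works for any codec."""

    def __init__(self, encoding):
        self.encoding = encoding

    def _chars(self, data, pos):
        # Yield (character, byte position just after it).
        decoder = codecs.getincrementaldecoder(self.encoding)()
        for i in range(pos, len(data)):
            for ch in decoder.decode(data[i:i + 1]):
                yield ch, i + 1

    def next_chunk_from(self, data, pos=0):
        out, end = [], len(data)
        chars = self._chars(data, pos)
        for ch, after in chars:
            if ch != TERMINATOR:
                out.append(ch)
                continue
            following = next(chars, None)
            if following is not None and following[0] == TERMINATOR:
                out.append(TERMINATOR)   # "!!" is an escaped "!"
                continue
            end = after                  # single "!" ends the chunk
            break
        return "".join(out), end


class UTF8TextConverter(TextConverter):
    """Fast path: search the raw bytes, decode the chunk once at the end."""

    def __init__(self):
        super().__init__("utf-8")

    def next_chunk_from(self, data, pos=0):
        bang = TERMINATOR.encode("utf-8")
        pieces = []
        while True:
            hit = data.find(bang, pos)
            if hit < 0:                      # no terminator: take the rest
                pieces.append(data[pos:])
                pos = len(data)
                break
            pieces.append(data[pos:hit])
            pos = hit + 1
            if data[pos:pos + 1] == bang:    # "!!" is an escaped "!"
                pieces.append(bang)
                pos += 1
                continue
            break
        return b"".join(pieces).decode("utf-8"), pos


if __name__ == "__main__":
    data = "This is a chunk!!!rest!".encode("utf-8")
    print(UTF8TextConverter().next_chunk_from(data))
    print(TextConverter("latin-1").next_chunk_from("caf\xe9!".encode("latin-1")))
```

Both classes answer the same (chunk, new position) pair for UTF-8 input, so 
a caller would not need to know which converter is in use.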

Maybe we could also move the encoding/decoding related methods/tables 
from String and subclasses to the (class side of the) TextConverters.

Levente

>
>> That's what my original questions were about (which are still unanswered):
>> - is it safe to assume that the encoding of source files will be
>>   compatible with this "hack"?
>> - is it safe to assume that the source files are always UTF-8 encoded?
>
> I think UTF-8 is going to be the only standard going forward. Precisely 
> because it has such (often overlooked) extremely useful properties. So yes, I 
> think it'd be safe to assume that this will work going forward.
>
> Cheers,
>  - Andreas