Levente Uzonyi wrote:
On Sun, 31 Jan 2010, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, while we are seeking the '!' character, not a byte. Logically, the data flow should be the following: <primitive> -> ByteArray -> UTF-8 reader -> character stream -> '!'
This is far from reality, because
- #nextChunk doesn't work in binary mode:
  'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
  'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary
Right, although I think Igor's point is slightly different. You could implement #upTo:, for example, by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as its argument. This would preserve the generality of #upTo: with the potential for a more general speedup. I.e.,
upTo: aCharacter
  => upToEncoded: bytes
  => primitive read
  <= return encodedBytes
  <= converter decode: encodedBytes
  <= returns characters
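The pipeline above can be sketched outside Smalltalk as well. Here is a minimal illustration in Python (the function name up_to and its return shape are hypothetical, chosen only to mirror the #upTo:/#upToEncoded: idea): the delimiter is encoded once, the raw byte stream is searched without decoding, and only the matched prefix is decoded back to characters.

```python
def up_to(byte_stream: bytes, delimiter: str, encoding: str = "utf-8"):
    """Return (characters before the first `delimiter`, remaining bytes),
    searching the *encoded* bytes and decoding only the matched prefix."""
    encoded_delim = delimiter.encode(encoding)
    idx = byte_stream.find(encoded_delim)  # byte-level search, no decoding
    if idx == -1:
        # No delimiter: decode and return everything, nothing left over.
        return byte_stream.decode(encoding), b""
    # Decode just the prefix; the tail stays as undecoded bytes.
    return (byte_stream[:idx].decode(encoding),
            byte_stream[idx + len(encoded_delim):])

data = "première chunk!rest".encode("utf-8")
head, rest = up_to(data, "!")
# head is 'première chunk', rest is b'rest'
```

The point of the sketch is that the expensive character-by-character decoding never happens during the scan; it happens once, on the prefix, after the byte search has found the boundary.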
(one assumption here is that the converter doesn't "embed" a particular character sequence as a part of another one which is true for UTF-8 but I'm not sure about other encodings).
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
I think UTF-8 is going to be the only standard going forward, precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will keep working.
Cheers, - Andreas