I really don't like having a String be a blob of bits subject to encoding interpretation. A String is a collection of characters, and there should be a canonical encoding known to the VM. utf8ToSqueak, squeakToUtf8, etc. are quick and dirty hacks.
We should use ByteArray or, better, introduce a UTF8String if it becomes that important. The code would be much, much cleaner and more foolproof.
Nicolas
2010/2/2 Andreas Raab andreas.raab@gmx.de:
Levente Uzonyi wrote:
On Sun, 31 Jan 2010, Igor Stasenko wrote:
Well, UTF-8 is an octet stream (bytes), not characters, whereas we are looking for the '!' character, not a byte. Logically, the data flow should be the following: <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
This is far from reality, because
- #nextChunk doesn't work in binary mode:
'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary
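For comparison, the layered flow Igor describes (bytes -> UTF-8 reader -> character stream -> '!') can be sketched outside Squeak. This is only an illustration in Python, not Squeak code; the helper name read_up_to and the sample text are made up for the example.

```python
import io

def read_up_to(byte_stream, delimiter="!", encoding="utf-8"):
    """Decode bytes into characters first, then scan characters for the delimiter."""
    # The text wrapper plays the role of the "utf8 reader" layer:
    # callers above it only ever see characters, never raw bytes.
    reader = io.TextIOWrapper(byte_stream, encoding=encoding)
    chars = []
    while True:
        ch = reader.read(1)  # one decoded character, or "" at end of stream
        if ch == "" or ch == delimiter:
            break
        chars.append(ch)
    return "".join(chars)

# Hypothetical input: a byte stream holding UTF-8 text with a '!' terminator.
raw = io.BytesIO("héllo wörld!rest".encode("utf-8"))
result = read_up_to(raw)
```

The comparison with '!' happens on decoded characters, so multi-byte characters before the delimiter are handled by the decoding layer alone.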
Right, although I think Igor's point is slightly different. You could implement #upTo: for example by applying the encoding to the argument and then doing #upToAllEncoded:, which takes an encoded character sequence as the argument. This would preserve the generality of #upTo: with the potential for more general speedup. I.e.,
upTo: aCharacter
  => upToEncoded: bytes
    => primitive read
    <= return encodedBytes
  <= converter decode: encodedBytes
<= returns characters
(one assumption here is that the converter doesn't "embed" a particular character sequence as a part of another one which is true for UTF-8 but I'm not sure about other encodings).
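To make that assumption concrete, here is a small sketch in Python (not Squeak; the function names up_to_encoded and up_to_decoded are hypothetical). It encodes the delimiter, searches the raw bytes, and decodes only the prefix, then compares that with decoding first. The last lines show an encoding where a plain byte search can hit in the middle of other characters.

```python
def up_to_encoded(data: bytes, delimiter: str, encoding: str) -> str:
    """Search for the encoded delimiter in the raw bytes, then decode only the prefix."""
    needle = delimiter.encode(encoding)
    index = data.find(needle)
    prefix = data if index < 0 else data[:index]
    return prefix.decode(encoding)

def up_to_decoded(data: bytes, delimiter: str, encoding: str) -> str:
    """Reference behaviour: decode everything first, then search for the character."""
    return data.decode(encoding).partition(delimiter)[0]

# With UTF-8 both approaches agree, even with multi-byte characters present.
utf8 = "naïve ✓ chunk!tail".encode("utf-8")
fast = up_to_encoded(utf8, "!", "utf-8")
slow = up_to_decoded(utf8, "!", "utf-8")

# Counter-example in UTF-16-BE: U+0100 followed by U+2122 encodes to the bytes
# 01 00 21 22, which contain 00 21 (the encoding of '!') at an odd offset.
utf16 = "\u0100\u2122".encode("utf-16-be")
false_hit = utf16.find("!".encode("utf-16-be"))  # matches although the text has no '!'
```

In UTF-8 the encoding of one character never appears inside the encoding of another, which is why the byte-level search is safe there. A byte search on UTF-16 data would have to check alignment as well.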
That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?
I think UTF-8 is going to be the only standard going forward. Precisely because it has such (often overlooked) extremely useful properties. So yes, I think it'd be safe to assume that this will work going forward.
Cheers, - Andreas