On 23.07.2014, at 02:01, tim Rowledge <tim@rowledge.org> wrote:
I guess Tim spent too much time in the Scratch image, which still was MacRoman.
Oh. Well, I was basing my comment on comments in code that I came across. Guess they need fixing. This isn't something I've ever felt a need to think about before so it's all new and clunky to me...
One thing springs to mind though - if the basic ByteString is Latin-1/UTF, why do we have any code to convert? Right now (in my 4.5) it looks like there is a relatively slow check for any non-compliant chars in the #squeakToUtf8 method. Can we drop that now? It would likely be nice if any old ByteString were acceptable to the Cairo/pango plugin.
Well, Latin1 matches the first 256 codepoints in Unicode, but only codepoints < 128 (a.k.a. ASCII) have a one-byte encoding in UTF-8. Codepoints 128 to 255 take two bytes each. That's why we need to check. If all chars are < 128 then the ByteString is returned unmodified.
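To make the check concrete, here is a minimal sketch in Python rather than Squeak. The function name squeak_to_utf8 is a hypothetical stand-in that models a ByteString as Latin-1 bytes; it is not the code in the image, just the same idea under that assumption.

```python
def squeak_to_utf8(latin1_bytes: bytes) -> bytes:
    """Hypothetical stand-in for #squeakToUtf8 on a Latin-1 ByteString."""
    # Fast path: pure ASCII is byte-identical in Latin-1 and UTF-8,
    # so the very same object can be answered unmodified.
    if all(b < 128 for b in latin1_bytes):
        return latin1_bytes
    # Otherwise every byte >= 128 turns into a two-byte UTF-8 sequence.
    return latin1_bytes.decode("latin-1").encode("utf-8")


ascii_only = b"Scratch"
e_acute = b"\xe9"  # LATIN SMALL LETTER E WITH ACUTE in Latin-1

print(squeak_to_utf8(ascii_only) is ascii_only)  # same object, no copy
print(squeak_to_utf8(e_acute))                   # two bytes in UTF-8
```

The scan over all bytes is the "relatively slow check": it cannot be dropped, because a single char >= 128 changes the result.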
Hopefully we don't really need to go back in my usage case - Scratch i18n short strings with very little editing. I can probably keep the 'real' string and convert as and when needed for the displaying methods, maybe even caching the converted form. For the longer term we should at least consider doing a better, cleaner job so as to live in a world where it at least appears that UTF8 is becoming a new standard. I have no idea how everyone is handling editing variable-length encoded texts.
UTF8 is only a standard for externalizing strings. Internally it's too cumbersome to work with.
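As a rough illustration of "cumbersome" (again a Python sketch, with utf8_char_offset being a made-up helper name): finding the n-th character in UTF-8 bytes needs a scan from the start, whereas in a fixed-width String it is a plain index.

```python
def utf8_char_offset(data: bytes, index: int) -> int:
    """Byte offset of the character at `index` (0-based) in UTF-8 data."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


data = "a\u00e9b".encode("utf-8")  # b"a\xc3\xa9b": 3 characters, 4 bytes
print(utf8_char_offset(data, 2))   # the third character starts at byte 3
```

Every insert or delete also shifts byte offsets by a varying amount, which is why editing directly on the encoded form gets awkward.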
Certainly a possibility. A simple version might just do a convert/edit/reconvert for every operation, but there has to be a better way.
A string-like class storing its chars in ByteArrays plus an encoding would be nice indeed. Not sure it should be a String subclass (like Scratch's "UTF8" class), because operations would be weird at least for UTF8 with its varying bytes-per-char. Rather have conversion methods from/to actual Strings.
That way we would have objects that know their encoding, rather than the current squeakToUtf8 which results in an invalid String and hence must be used only temporarily for passing to a primitive (file save, socket send etc).
But I'd say you should put the squeakToUtf8 sends in the primitive call code. If the repeated conversion is actually slowing things down, then replace the strings with some encoded thing which would return self in response to that message.
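A rough sketch of what I mean, in Python for brevity. EncodedString, to_utf8 and call_primitive are all hypothetical names, not existing Squeak or Scratch classes; the point is only that the object knows its encoding and answers itself when it is already UTF-8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedString:
    """Hypothetical: raw bytes plus the name of their encoding. Not a String subclass."""
    data: bytes
    encoding: str

    @classmethod
    def from_string(cls, text: str, encoding: str = "utf-8") -> "EncodedString":
        # Conversion from an actual String.
        return cls(text.encode(encoding), encoding)

    def as_string(self) -> str:
        # Conversion back to an actual String.
        return self.data.decode(self.encoding)

    def to_utf8(self) -> "EncodedString":
        # Already UTF-8: answer self, so repeated sends cost nothing.
        if self.encoding == "utf-8":
            return self
        return EncodedString.from_string(self.as_string(), "utf-8")


def call_primitive(text) -> bytes:
    """Hypothetical primitive wrapper: the conversion lives at the call site."""
    if isinstance(text, str):
        text = EncodedString.from_string(text)
    return text.to_utf8().data  # bytes that would be handed to the plugin


cached = EncodedString.from_string("caf\u00e9", "latin-1").to_utf8()
print(cached.to_utf8() is cached)  # no reconversion on later sends
print(call_primitive(cached) == call_primitive("caf\u00e9"))
```

Callers can keep passing ordinary strings; only the hot spots need to hold on to the already-encoded object.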