Squeak to/from UTF-8 conversions
andreas.raab at gmx.de
Tue Jun 26 18:22:19 UTC 2007
nicolas cellier wrote:
> However, if you strip the language tag, you will run into very minor
> bugs with the A macron and a macron, because their encodings have been
> hijacked as CrossedX and EndOfRun in the CharacterScanner family (a clever
> trick when there were only 256 Characters). I searched for how these damned
> characters could ever work in Squeak and Sophie, and found that the black
> magic was this language tag.
Ah, how interesting. I wasn't even aware of that but it makes good
sense. Which, in a sense, only emphasizes question 2) below given that
the default Latin-1 environment doesn't seem to set a language code.
> Andreas, maybe you could have a look at how RTF text is converted in
> Sophie; it seems to deal with the language tag correctly, at least with
> extended Latin characters.
Good point. Unfortunately, I don't have the time to get into Sophie in
detail (I was just trying to understand why UTF-8 conversion is lossy
and what to do about it) but if someone would give me a primer on how
Sophie deals with these issues I'd appreciate it.
> Bert Freudenberg wrote:
>> On Jun 26, 2007, at 9:19 , Andreas Raab wrote:
>>> Hi -
>>> I was working on a little improvement in UTF-8 conversion speed (so
>>> far it's about 150x faster for latin-1 text ;-) and for measuring the
>>> improvements was running a test that said:
>>> strings := String allSubInstances.
>>> 1 to: strings size do: [:i |
>>>     original := strings at: i.
>>>     utf8 := original squeakToUtf8.
>>>     copy := utf8 utf8ToSqueak.
>>>     original = copy ifFalse: [self error: 'Encoding problem']].
>>> When I ran this test it failed on each and every WideString instance.
>>> Digging into it, it seems that all of the WideStrings in Squeak have
>>> a language tag that is being supplied implicitly by the current
>>> 1) From what it looks like right now there is no way to preserve that
>>> language tag through a UTF-8 conversion. Is this indeed the case or
>>> am I missing something?
>>> 2) Given that my language environment is being set to Latin-1, how
>>> should clients treat UTF-8 to provide the "proper" language tag? For
>>> example, I expected that a client be able to read and write UTF-8
>>> text without implicitly providing that language tag. If that's the
>>> case, then how does one store these in common text files? (I could
>>> see how to do this for formatted text but not for "plain text files"
>>> without further attribution)
>>> 3) More generally asking, isn't the language tag here more of a
>>> "decorator" along the lines of text attributes? This would certainly
>>> model more closely the effect that I'm seeing here (some attributes
>>> are dropped by the squeak -> utf8 -> squeak conversion) *except* that
>>> I didn't expect any lossy conversion for strings (contrary to Text
>>> where dropping text attributes is obviously lossy).
>> Nice catch. We had the discussion before, and this to me is another
>> hint that we really should strip the language tag from Strings and
>> move it to Text attributes. For rendering bare strings the default
>> language could be taken from the current environment. The problem is,
>> IIUC, that currently a lot of bare strings are passed around so it was
>> simpler to just tag the language onto the string itself.
>> - Bert -
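
For readers unfamiliar with the mechanism discussed in this thread, here is a minimal workspace sketch of why the round trip can lose information. It assumes the Squeak 3.8/3.9-era Character layout, in which the language tag (leadingChar) is stored in the bits above the 22-bit character code. The tag value 1 is arbitrary and chosen only for illustration; it is not meant to be the tag any particular language environment supplies.

```smalltalk
| tagged bare |
"Assumption: the language tag lives above the low 22 bits of the character value."
"U+0100 is LATIN CAPITAL LETTER A WITH MACRON; the tag 1 is purely illustrative."
tagged := Character leadingChar: 1 code: 16r0100.
"Build a character with the same code point but no language tag."
bare := Character value: tagged charCode.

Transcript
	show: 'tag on original: ', tagged leadingChar printString; cr;
	show: 'tag after stripping: ', bare leadingChar printString; cr;
	show: 'same code point: ', (tagged charCode = bare charCode) printString; cr;
	show: 'equal as Characters: ', (tagged = bare) printString; cr.
```

UTF-8 carries only the code point, so a decoder has nothing to reconstruct the tag from other than whatever the current environment supplies. That would be consistent with the failures reported above for WideString instances, and with Bert's suggestion to move the tag into Text attributes.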