Squeak to/from UTF-8 conversions

Tue Jun 26 18:22:19 UTC 2007

nicolas cellier wrote:
> However, if you strip the language tag, you will run into very minor 
> bugs with the A macron and a macron, because their encodings have been 
> hijacked as CrossedX and EndOfRun in the CharacterScanner family (a clever 
> trick when there were only 256 Characters). I searched for how these damned 
> characters could ever work in Squeak and Sophie, and found that the black 
> magic was this language tag.

Ah, how interesting. I wasn't even aware of that, but it makes good 
sense. If anything, it only emphasizes question 2) below, given that 
the default Latin-1 environment doesn't seem to set a language code 
at all.
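
A minimal workspace sketch of the tag in question, for anyone who wants to 
poke at it. It assumes the Character protocol found in multilingual Squeak 
images (Character class>>leadingChar:code:, Character>>leadingChar and 
Character>>charCode); the tag value 255 is an arbitrary illustration, not 
something established in this thread.

```smalltalk
"Sketch only. Assumes Character class>>leadingChar:code:, Character>>leadingChar
 and Character>>charCode as found in multilingual Squeak images; the tag value
 255 is an arbitrary illustration."
| tagged bare |
tagged := Character leadingChar: 255 code: 16r0100.  "A macron (U+0100) carrying a tag"
bare := Character value: 16r0100.                    "same code point, no tag"

Transcript
	show: 'tag: ', tagged leadingChar printString; cr;
	show: 'code: ', tagged charCode printString; cr;
	show: 'tagged = bare: ', (tagged = bare) printString; cr.
```

If the tag takes part in the character value, the last line should report 
that the two characters differ. UTF-8 only has room for the code point, so 
the tag is the part that cannot survive the round trip, and the bare value 
is the one Nicolas describes as colliding with the scanner's stop codes.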

> Andreas, maybe you could have a look at how RTF text is converted in 
> Sophie; it seems to deal with the language tag correctly, at least with 
> extended Latin characters.

Good point. Unfortunately, I don't have the time to get into Sophie in 
detail (I was just trying to understand why UTF-8 conversion is lossy 
and what to do about it) but if someone would give me a primer on how 
Sophie deals with these issues I'd appreciate it.

Cheers,
   - Andreas

> 
> Nicolas
> 
> Bert Freudenberg a écrit :
>>
>> On Jun 26, 2007, at 9:19 , Andreas Raab wrote:
>>
>>> Hi -
>>>
>>> I was working on a little improvement in UTF-8 conversion speed (so 
>>> far it's about 150x faster for latin-1 text ;-) and for measuring the 
>>> improvements was running a test that said:
>>>
>>> strings := String allSubInstances.
>>> 1 to: strings size do:[:i|
>>>     original := strings at: i.
>>>     utf8 := original squeakToUtf8.
>>>     copy := utf8 utf8ToSqueak.
>>>     original = copy ifFalse:[self error: 'Encoding problem'].
>>> ].
>>>
>>> When I ran this test it failed on each and every WideString instance. 
>>> Digging into it, it seems that all of the WideStrings in Squeak have 
>>> a language tag that is being supplied implicitly by the current 
>>> LanguageEnvironment.
>>>
>>> Questions:
>>> 1) From what it looks like right now there is no way to preserve that 
>>> language tag through a UTF-8 conversion. Is this indeed the case or 
>>> am I missing something?
>>> 2) Given that my language environment is being set to Latin-1, how 
>>> should clients treat UTF-8 to provide the "proper" language tag? For 
>>> example, I expected that a client be able to read and write UTF-8 
>>> text without implicitly providing that language tag. If that's the 
>>> case, then how does one store these in common text files? (I could 
>>> see how to do this for formatted text but not for "plain text files" 
>>> without further attribution)
>>> 3) More generally asking, isn't the language tag here more of a 
>>> "decorator" along the lines of text attributes? This would certainly 
>>> model more closely the effect that I'm seeing here (some attributes 
>>> are dropped by the squeak -> utf8 -> squeak conversion) *except* that 
>>> I didn't expect any lossy conversion for strings (contrary to Text 
>>> where dropping text attributes is obviously lossy).
>>
>> Nice catch. We had the discussion before, and this to me is another 
>> hint that we really should strip the language tag from Strings and 
>> move it to Text attributes. For rendering bare strings the default 
>> language could be taken from the current environment. The problem is, 
>> IIUC, that currently a lot of bare strings are passed around so it was 
>> simpler to just tag the language onto the string itself.
>>
>> - Bert -