Squeak to/from UTF-8 conversions

Tue Jun 26 18:08:01 UTC 2007

However, if you strip the language tag, you will run into very minor 
bugs with A macron and a macron, because their encodings have been 
hijacked as CrossedX and EndOfRun in the CharacterScanner family (a 
clever trick back when there were only 256 Characters). I searched for 
how these damned characters could ever work in Squeak and Sophie, and 
found that the black magic was this language tag.

Andreas, maybe you could have a look at how RTF text is converted in 
Sophie; it seems to deal with the language tag correctly, at least for 
extended Latin characters.

Nicolas

Bert Freudenberg wrote:
> 
> On Jun 26, 2007, at 9:19 , Andreas Raab wrote:
> 
>> Hi -
>>
>> I was working on a little improvement in UTF-8 conversion speed (so 
>> far it's about 150x faster for latin-1 text ;-) and for measuring the 
>> improvements was running a test that said:
>>
>> strings := String allSubInstances.
>> 1 to: strings size do:[:i|
>>     original := strings at: i.
>>     utf8 := original squeakToUtf8.
>>     copy := utf8 utf8ToSqueak.
>>     original = copy ifFalse:[self error: 'Encoding problem'].
>> ].
>>
>> When I ran this test it failed on each and every WideString instance. 
>> Digging into it, it seems that all of the WideStrings in Squeak have a 
>> language tag that is being supplied implicitly by the current 
>> LanguageEnvironment.
>>
>> Questions:
>> 1) From what it looks like right now there is no way to preserve that 
>> language tag through a UTF-8 conversion. Is this indeed the case or am 
>> I missing something?
>> 2) Given that my language environment is being set to Latin-1, how 
>> should clients treat UTF-8 to provide the "proper" language tag? For 
>> example, I expected that a client be able to read and write UTF-8 text 
>> without implicitly providing that language tag. If that's the case, 
>> then how does one store these in common text files? (I could see how 
>> to do this for formatted text but not for "plain text files" without 
>> further attribution)
>> 3) More generally asking, isn't the language tag here more of a 
>> "decorator" along the lines of text attributes? This would certainly 
>> model more closely the effect that I'm seeing here (some attributes 
>> are dropped by the squeak -> utf8 -> squeak conversion) *except* that 
>> I didn't expect any lossy conversion for strings (contrary to Text 
>> where dropping text attributes is obviously lossy).
> 
> Nice catch. We had the discussion before, and this to me is another hint 
> that we really should strip the language tag from Strings and move it to 
> Text attributes. For rendering bare strings the default language could 
> be taken from the current environment. The problem is, IIUC, that 
> currently a lot of bare strings are passed around so it was simpler to 
> just tag the language onto the string itself.
> 
> - Bert -