Squeak to/from UTF-8 conversions
ncellier at ifrance.com
Tue Jun 26 18:08:01 UTC 2007
However, if you strip the language tag, you will run into very minor
bugs with the A macron and a macron, because their encodings have been
hijacked as CrossedX and EndOfRun in the CharacterScanner family (a clever
trick back when there were only 256 Characters). I looked into how these
damned characters could ever work in Squeak and Sophie, and found the black
magic was this
Andreas, maybe you could have a look at how RTF text is converted in
Sophie; it seems to deal with the language tag correctly, at least with
extended Latin characters.
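
To make the tag issue concrete, here is a minimal workspace sketch of a tagged versus an untagged A macron. It assumes that Squeak keeps the language tag (leadingChar) in the bits above the 22-bit code point of a Character, and that Character class>>leadingChar:code:, Character>>leadingChar and Character>>charCode are available in your image. The tag value 1 is an arbitrary example, not a claim about any particular language environment.

```smalltalk
| tagged bare |
"Assumption: leadingChar (the language tag) sits above the 22-bit code point."
tagged := Character leadingChar: 1 code: 16r0100.   "A macron carrying tag 1 (arbitrary)"
bare := Character value: tagged charCode.            "same code point, tag stripped"

"Print the two tags, then compare by code point and by full value."
Transcript
    show: tagged leadingChar printString; cr;
    show: bare leadingChar printString; cr;
    show: (tagged charCode = bare charCode) printString; cr;
    show: (tagged = bare) printString; cr.
```

If the assumption holds, the two characters should agree on charCode while differing as whole Characters, which is the kind of mismatch a plain = comparison would pick up after a UTF-8 round trip.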
Bert Freudenberg wrote:
> On Jun 26, 2007, at 9:19 , Andreas Raab wrote:
>> Hi -
>> I was working on a little improvement in UTF-8 conversion speed (so
>> far it's about 150x faster for latin-1 text ;-) and for measuring the
>> improvements was running a test that said:
>> strings := String allSubInstances.
>> 1 to: strings size do:[:i|
>>     original := strings at: i.
>>     utf8 := original squeakToUtf8.
>>     copy := utf8 utf8ToSqueak.
>>     original = copy ifFalse:[self error: 'Encoding problem'].
>> ].
>> When I ran this test it failed on each and every WideString instance.
>> Digging into it, it seems that all of the WideStrings in Squeak have a
>> language tag that is being supplied implicitly by the current
>> 1) From what it looks like right now there is no way to preserve that
>> language tag through a UTF-8 conversion. Is this indeed the case or am
>> I missing something?
>> 2) Given that my language environment is being set to Latin-1, how
>> should clients treat UTF-8 to provide the "proper" language tag? For
>> example, I expected that a client be able to read and write UTF-8 text
>> without implicitly providing that language tag. If that's the case,
>> then how does one store these in common text files? (I could see how
>> to do this for formatted text but not for "plain text files" without
>> further attribution)
>> 3) More generally asking, isn't the language tag here more of a
>> "decorator" along the lines of text attributes? This would certainly
>> model more closely the effect that I'm seeing here (some attributes
>> are dropped by the squeak -> utf8 -> squeak conversion) *except* that
>> I didn't expect any lossy conversion for strings (contrary to Text
>> where dropping text attributes is obviously lossy).
> Nice catch. We had the discussion before, and this to me is another hint
> that we really should strip the language tag from Strings and move it to
> Text attributes. For rendering bare strings the default language could
> be taken from the current environment. The problem is, IIUC, that
> currently a lot of bare strings are passed around so it was simpler to
> just tag the language onto the string itself.
> - Bert -
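
If the language tag is treated as a decorator rather than as part of the string, the round-trip test quoted above could compare code points only. The following is a rough sketch under that assumption: sameCodePoints is a hypothetical helper block (not an existing Squeak method), and squeakToUtf8 / utf8ToSqueak are assumed to behave as described in Andreas's message.

```smalltalk
| sameCodePoints strings original copy |
"Hypothetical helper: true when both strings hold the same code points,
 ignoring whatever language tag (leadingChar) each character carries."
sameCodePoints := [:a :b |
    a size = b size and: [
        (1 to: a size) allSatisfy: [:i |
            (a at: i) charCode = (b at: i) charCode]]].

strings := String allSubInstances.
strings do: [:each |
    original := each.
    copy := original squeakToUtf8 utf8ToSqueak.
    "Only flag a problem when the code points themselves changed."
    (sameCodePoints value: original value: copy)
        ifFalse: [self error: 'Encoding problem']].
```

This only checks that the conversion preserves the text itself; it deliberately says nothing about whether the tag survives, which is the open question in this thread.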