Squeak to/from UTF-8 conversions

Andreas Raab andreas.raab at gmx.de
Fri Jun 29 05:41:42 UTC 2007


Yoshiki Ohshima wrote:
>   As Bert suggested, the Right Thing is to build a system on the
> assumption that bare Strings and Characters cannot really be displayed.

I'm sure it is, but unfortunately we don't have the time to do the Right 
Thing since we need to get a product out the door ;-)

> For method source, the tag is encoded as a text property so it is
> retained.  An XML-like (or whatever) format in UTF-8 for storing Squeak
> Text, used almost always, is the consequence of that.

Thanks. So if I hear you correctly, you recommend preserving the 
language tag via additional attributes. Is that correct?

Cheers,
   - Andreas

> 
> -- Yoshiki
> 
> At Tue, 26 Jun 2007 00:19:04 -0700,
> Andreas Raab wrote:
>> Hi -
>>
>> I was working on a little improvement in UTF-8 conversion speed (so far 
>> it's about 150x faster for Latin-1 text ;-) and, to measure the 
>> improvements, was running a test that said:
>>
>> strings := String allSubInstances.
>> 1 to: strings size do:[:i|
>> 	original := strings at: i.
>> 	utf8 := original squeakToUtf8.
>> 	copy := utf8 utf8ToSqueak.
>> 	original = copy ifFalse:[self error: 'Encoding problem'].
>> ].
>>
>> When I ran this test it failed on each and every WideString instance. 
>> Digging into it, it seems that all of the WideStrings in Squeak have a 
>> language tag that is being supplied implicitly by the current 
>> LanguageEnvironment.
>>
>> Questions:
>> 1) From what it looks like right now there is no way to preserve that 
>> language tag through a UTF-8 conversion. Is this indeed the case or am I 
>> missing something?
>> 2) Given that my language environment is being set to Latin-1, how 
>> should clients treat UTF-8 to provide the "proper" language tag? For 
>> example, I expected that a client would be able to read and write UTF-8 text 
>> without implicitly providing that language tag. If that's the case, then 
>> how does one store these in common text files? (I could see how to do 
>> this for formatted text but not for "plain text files" without further 
>> attribution)
>> 3) More generally, isn't the language tag here more of a 
>> "decorator" along the lines of text attributes? This would certainly 
>> model more closely the effect that I'm seeing here (some attributes are 
>> dropped by the squeak -> utf8 -> squeak conversion) *except* that I 
>> didn't expect any lossy conversion for strings (contrary to Text where 
>> dropping text attributes is obviously lossy).
>>
>> Thanks for any help,
>>    - Andreas
> 
> 
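
The round-trip failure described in the quoted message can be reproduced in a small model. This is a Python sketch, not Squeak code: it assumes a wide character value packs the language tag above a 22-bit code point, and the names make_char, to_utf8, from_utf8 and same_code_points, as well as the tag value 5, are hypothetical and chosen only for illustration.

```python
# Model only (not Squeak code). Assumption: a character value carries a
# language tag ("leading char") in the bits above a 22-bit code point.
CODE_BITS = 22
CODE_MASK = (1 << CODE_BITS) - 1


def make_char(leading_char, code_point):
    """Pack a language tag and a code point into one integer value."""
    return (leading_char << CODE_BITS) | code_point


def to_utf8(tagged):
    """Encode only the code points; UTF-8 has no slot for the tag."""
    return "".join(chr(v & CODE_MASK) for v in tagged).encode("utf-8")


def from_utf8(data, leading_char=0):
    """Decode UTF-8; the tag must come from the caller or the environment."""
    return [make_char(leading_char, ord(c)) for c in data.decode("utf-8")]


def same_code_points(a, b):
    """Compare two tagged strings while ignoring their language tags."""
    return [v & CODE_MASK for v in a] == [v & CODE_MASK for v in b]


# Three CJK code points tagged with an arbitrary, hypothetical tag value 5.
original = [make_char(5, cp) for cp in (0x65E5, 0x672C, 0x8A9E)]
copy = from_utf8(to_utf8(original))

print(original == copy)                  # tags differ after the round trip
print(same_code_points(original, copy))  # the text itself survived
```

In this model the strict comparison fails while the code-point comparison succeeds, and passing the original tag back to from_utf8 restores equality. That matches the reading of the tag as out-of-band information that has to be carried separately, for example as a text attribute.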



