Squeak to/from UTF-8 conversions

Yoshiki Ohshima yoshiki at squeakland.org
Fri Jun 29 01:53:07 UTC 2007


  As Bert suggested, the Right Thing is to build the system on the
assumption that a bare String or Character cannot really be displayed.
For method source, the tag is encoded as a text property, so it is
retained.  The consequence is to store Squeak Text in an XML-like (or
whatever) format in UTF-8 and to use that format almost always.

-- Yoshiki

At Tue, 26 Jun 2007 00:19:04 -0700,
Andreas Raab wrote:
> 
> Hi -
> 
> I was working on a little improvement in UTF-8 conversion speed (so far 
> it's about 150x faster for Latin-1 text ;-) and, to measure the 
> improvements, was running a test that said:
> 
> strings := String allSubInstances.
> 1 to: strings size do:[:i|
> 	original := strings at: i.
> 	utf8 := original squeakToUtf8.
> 	copy := utf8 utf8ToSqueak.
> 	original = copy ifFalse:[self error: 'Encoding problem'].
> ].
> 
> When I ran this test it failed on each and every WideString instance. 
> Digging into it, it seems that all of the WideStrings in Squeak have a 
> language tag that is being supplied implicitly by the current 
> LanguageEnvironment.
> 
> Questions:
> 1) From what it looks like right now there is no way to preserve that 
> language tag through a UTF-8 conversion. Is this indeed the case or am I 
> missing something?
> 2) Given that my language environment is being set to Latin-1, how 
> should clients treat UTF-8 to provide the "proper" language tag? For 
> example, I expected that a client be able to read and write UTF-8 text 
> without implicitly providing that language tag. If that's the case, then 
> how does one store these in common text files? (I could see how to do 
> this for formatted text but not for "plain text files" without further 
> attributes)
> 3) More generally, isn't the language tag here more of a 
> "decorator" along the lines of text attributes? This would certainly 
> model more closely the effect that I'm seeing here (some attributes are 
> dropped by the squeak -> utf8 -> squeak conversion) *except* that I 
> didn't expect any lossy conversion for strings (contrary to Text where 
> dropping text attributes is obviously lossy).
> 
> Thanks for any help,
>    - Andreas
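
For anyone who wants to check that only the language tag is lost in the
round trip, here is a rough workspace sketch of a variant of the test
quoted above. It compares the strings code point by code point and
ignores the tag. It assumes that Character>>charCode answers the code
point without the tag bits, which is how I understand the current
multilingual images to work; the temporary names are made up for
illustration, and I have not run this against every image version.

```smalltalk
"Sketch only: round-trip every string through UTF-8 and compare
 code points, ignoring the language tag.
 Assumption: Character>>charCode strips the tag bits."
| strings mismatches |
strings := String allSubInstances.
mismatches := OrderedCollection new.
strings do: [:original | | copy firstDiff |
	copy := original squeakToUtf8 utf8ToSqueak.
	"firstDiff stays nil when sizes and all code points agree;
	 0 marks a size mismatch, otherwise it is the first differing index."
	firstDiff := original size = copy size
		ifTrue: [(1 to: original size)
			detect: [:i | (original at: i) charCode ~= (copy at: i) charCode]
			ifNone: [nil]]
		ifFalse: [0].
	firstDiff isNil ifFalse: [mismatches add: original]].
"Inspect or print this; the collection holds the strings that still differ."
mismatches
```

If the collection comes back empty while the original test still fails
on WideStrings, that would suggest the difference is confined to the
tag. Comparing (original at: 1) leadingChar with (copy at: 1)
leadingChar for one failing WideString should then show the two tags
directly, again assuming leadingChar is present in your image.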



More information about the Squeak-dev mailing list