Squeak to/from UTF-8 conversions
Andreas Raab
andreas.raab at gmx.de
Tue Jun 26 07:19:04 UTC 2007
Hi -
I was working on a little improvement in UTF-8 conversion speed (so far
it's about 150x faster for Latin-1 text ;-) and, to measure the
improvement, was running a test along these lines:
    strings := String allSubInstances.
    1 to: strings size do:[:i|
        original := strings at: i.
        utf8 := original squeakToUtf8.
        copy := utf8 utf8ToSqueak.
        original = copy ifFalse:[self error: 'Encoding problem'].
    ].
When I ran this test it failed on each and every WideString instance.
Digging into it, it seems that all of the WideStrings in Squeak have a
language tag that is being supplied implicitly by the current
LanguageEnvironment.
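The mismatch can be made visible by comparing the tags and the code
points separately before and after the round trip. What follows is only
a workspace sketch, not part of the test above: it assumes that
Character responds to leadingChar (the language tag) and charCode (the
code point without the tag), that String responds to isWideString, and
that the image contains at least one WideString.

    | original copy |
    "Pick any WideString; detect: raises an error if the image has none."
    original := String allSubInstances detect: [:each | each isWideString].
    copy := original squeakToUtf8 utf8ToSqueak.
    "Print the language tag of every character, before and after."
    Transcript
        show: 'original tags: ',
            (original asArray collect: [:c | c leadingChar]) printString; cr;
        show: 'copy tags: ',
            (copy asArray collect: [:c | c leadingChar]) printString; cr.
    "Compare the code points alone, ignoring the tags."
    Transcript
        show: 'same code points: ',
            ((original asArray collect: [:c | c charCode])
                = (copy asArray collect: [:c | c charCode])) printString; cr.

If the code points match while the tags differ, the difference between
original and copy comes from the language tag alone.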
Questions:
1) From what I can see right now, there is no way to preserve that
language tag through a UTF-8 conversion. Is this indeed the case or am I
missing something?
2) Given that my language environment is being set to Latin-1, how
should clients treat UTF-8 to provide the "proper" language tag? For
example, I expected that a client would be able to read and write UTF-8
text without implicitly providing that language tag. If that's the case,
then how does one store these tags in common text files? (I could see
how to do this for formatted text but not for "plain text files" without
further attributes)
3) More generally, isn't the language tag here more of a "decorator"
along the lines of text attributes? This would certainly model more
closely the effect that I'm seeing here (some attributes are dropped by
the squeak -> utf8 -> squeak conversion) *except* that I didn't expect
any lossy conversion for strings (contrary to Text, where dropping text
attributes is obviously lossy).
Thanks for any help,
- Andreas