Squeak to/from UTF-8 conversions
bert at freudenbergs.de
Tue Jun 26 07:48:53 UTC 2007
On Jun 26, 2007, at 9:19, Andreas Raab wrote:
> Hi -
> I was working on a little improvement in UTF-8 conversion speed (so
> far it's about 150x faster for latin-1 text ;-) and for measuring
> the improvements was running a test that said:
>    strings := String allSubInstances.
>    1 to: strings size do:[:i|
>      original := strings at: i.
>      utf8 := original squeakToUtf8.
>      copy := utf8 utf8ToSqueak.
>      original = copy ifFalse:[self error: 'Encoding problem'].
>    ].
> When I ran this test it failed on each and every WideString
> instance. Digging into it, it seems that all of the WideStrings in
> Squeak have a language tag that is being supplied implicitly by the
> current LanguageEnvironment.
> 1) From what it looks like right now there is no way to preserve
> that language tag through a UTF-8 conversion. Is this indeed the
> case or am I missing something?
> 2) Given that my language environment is being set to Latin-1, how
> should clients treat UTF-8 to provide the "proper" language tag?
> For example, I expected that a client be able to read and write
> UTF-8 text without implicitly providing that language tag. If
> that's the case, then how does one store these in common text
> files? (I could see how to do this for formatted text but not for
> "plain text files" without further attribution)
> 3) More generally asking, isn't the language tag here more of a
> "decorator" along the lines of text attributes? This would
> certainly model more closely the effect that I'm seeing here (some
> attributes are dropped by the squeak -> utf8 -> squeak conversion)
> *except* that I didn't expect any lossy conversion for strings
> (contrary to Text where dropping text attributes is obviously lossy).
Nice catch. We had the discussion before, and this to me is another
hint that we really should strip the language tag from Strings and
move it to Text attributes. For rendering bare strings the default
language could be taken from the current environment. The problem is,
IIUC, that currently a lot of bare strings are passed around so it
was simpler to just tag the language onto the string itself.
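To make the mismatch concrete, here is a rough workspace sketch. It
assumes the image packs the language tag into the upper bits of each
Character's value (readable via leadingChar) and keeps the code point in
the lower bits (readable via charCode). The tag values 0 and 5 and the
code point 16r3042 are arbitrary picks for illustration, not anything
taken from the test above.

```smalltalk
| untagged tagged |
"Two characters with the same code point but different language tags.
 The tag values 0 and 5 are arbitrary; only the difference matters."
untagged := Character leadingChar: 0 code: 16r3042.
tagged := Character leadingChar: 5 code: 16r3042.

"The code point is all that UTF-8 can carry; compare it on its own."
Transcript show: (untagged charCode = tagged charCode) printString; cr.

"Character equality looks at the whole value, tag included, so a
 decoder that re-tags from the current LanguageEnvironment can hand
 back a copy that no longer equals the original."
Transcript show: (untagged = tagged) printString; cr.
```

If that assumption holds, a round-trip test that compares charCode
values element by element instead of using original = copy should
sidestep the tag, at the cost of no longer checking it.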
- Bert -