[Seaside] [CONFUSED]: WAKom, WAKomEncoded or WAKomEncoded 3.9 - utf8 internal encoding?

Tue Feb 24 19:19:25 UTC 2009

2009/2/24 Michal <michal-list at auf.net>:
>
> Lukas wrote:
>> However since the internal encoding of Squeak is *not UTF-8* many
>> strings will appear scrambled when looking at them using an inspector.
>> It works well though as long as you do not perform heavy string
>> scrambling, because the strings are sent back as is. If you have
>> string literals with foreign characters in your application code you
>> need to make sure that these are valid UTF-8 as well. This is very
>> efficient, but you need to be aware of the implications.
>
> What happens if squeak is made to use UTF-8 internally?

String and Character lose all semantics. For example #size will
answer the number of bytes, not the number of characters. #at: will
answer the byte at the given index, not the Character at the given
index. For example ä will be represented as (String with: (Character
value: 195) with: (Character value: 164)), which displays as 'Ã¤'.
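
The byte arithmetic is easy to check outside of Squeak. Here is a
minimal sketch in Python (not Squeak code; the variable names are just
for illustration), assuming the image keeps the raw UTF-8 bytes and
shows each byte as one Latin-1 character:

```python
# "ä" is one character but two bytes in UTF-8.
text = "\u00e4"                      # ä
utf8_bytes = text.encode("utf-8")

char_count = len(text)               # size of a real character string
byte_count = len(utf8_bytes)         # size when the string holds raw UTF-8
byte_values = list(utf8_bytes)       # what indexed access hands back, byte by byte

# Reading the raw bytes as Latin-1 characters gives the scrambled
# two-character form you would see in an inspector.
as_seen_in_inspector = utf8_bytes.decode("latin-1")
```

Here char_count is 1 while byte_count is 2, and the two byte values
are 195 and 164, the same numbers as in the Character values above.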

> Ie the unix
> man page and various postings on squeak-dev/newbies suggest that a
> recent squeak VM/image combo started with '-encoding utf8' should work
> well as a utf8 image (provided the correct font is supplied, etc).

That's unrelated.

> In such a case, should plain WAKom be used?

If you're cool with the behavior described above, then use WAKom.

> With no issue wrt
> string operations like #=, #size and #copyFrom:to: ?

#= has limited usability due to missing Unicode normalization. It's
actually a bit more useful on UTF-8 bytes than on WideStrings, because
for WideStrings it would take the leadingChar into account, which is
more or less random. #size and #copyFrom:to: answer "random" data
unless you know the ins and outs of utf-8 and Unicode.
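
To make both problems concrete, here is a rough sketch in Python (again
not Squeak code; is_valid_utf8 is a hypothetical helper written for
this example). It shows two strings that render the same but compare
unequal without normalization, and a byte-index slice that cuts a
character in half:

```python
import unicodedata

composed = "\u00e4"       # ä as a single code point
decomposed = "a\u0308"    # a + COMBINING DIAERESIS, renders the same

# Comparing the UTF-8 bytes directly says the two differ.
same_without_normalization = (
    composed.encode("utf-8") == decomposed.encode("utf-8")
)

# After normalizing both to NFC they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Slicing by byte index can split a multi-byte sequence.
raw = "z\u00e4hlen".encode("utf-8")   # 'zählen': 6 characters, 7 bytes
first_two_bytes = raw[:2]             # 'z' plus only the first byte of 'ä'


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

is_valid_utf8(first_two_bytes) answers False, while the slice raw[:3]
is fine because it happens to end on a character boundary.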

> Or is there still
> a need to convert from the incoming utf-8 and squeak's WideString (and
> vice versa)?

Yes, utf-8 conversion won't happen automatically. If you want it, you
need to do it yourself.
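
Doing it yourself means decoding once on the way in and encoding once
on the way out, so that everything in between works on real
characters. A minimal sketch of the idea in Python (not Squeak or
Seaside code; decode_incoming and encode_outgoing are made-up names
for this example):

```python
def decode_incoming(raw: bytes) -> str:
    """Turn UTF-8 bytes from the request into a character string."""
    return raw.decode("utf-8")


def encode_outgoing(text: str) -> bytes:
    """Turn a character string back into UTF-8 bytes for the response."""
    return text.encode("utf-8")


incoming = b"gr\xc3\xbc\xc3\x9fe"    # UTF-8 bytes for 'grüße'
text = decode_incoming(incoming)     # 5 characters, indexable per character
reply = encode_outgoing(text[:3])    # slicing now works on characters: 'grü'
```

The slice text[:3] comes out as three whole characters, which would
not be guaranteed when slicing the seven raw bytes directly.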

>> WAKomEncoded converts incoming data from UTF-8 to the internal
>> encoding of Squeak, as well it converts outgoing data from the
>> internal encoding to UTF-8.
>
> The code and comments in #utf8ToSqueak: suggest that this is only true
> if squeak uses latin-1 internally (which it does by default), right?

Nope, it's required for non-ASCII input.

>> Since all incoming and outgoing data needs to be converted,
>> this approach is slightly less efficient.
>
> Has anybody quantified the inefficiency?

Not that I'm aware of.

> I'm starting a clean slate
> seaside server, so I'd like to pick the optimal configuration...

What do you want to optimize for?

Cheers
Philippe