UTF8 Squeak

Lukas Renggli renggli at gmail.com
Fri Jun 8 06:27:56 UTC 2007


Just as a side note: in Seaside, encoding and decoding turn out to
be very complicated and expensive; so expensive, in fact, that almost
nobody is willing to pay for it. What most people do is work with
(Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
is received, stored, and sent exactly as it comes from the socket, so
byte-identical strings are sent back as they were received.

There are many caveats:

1. Most string operations don't work (except concatenation), e.g.
asking a string for its #size might return the wrong number, because
it counts bytes rather than characters.

2. All literal strings have to be encoded manually to the right
format. This clutters the code and is ugly.

3. Data in inspectors is sometimes not readable without a manual conversion.
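
Caveats 1 and 2 can be sketched in a few lines. The example below is
Python rather than Smalltalk, only because its `bytes` type makes the
byte/character distinction explicit; `bytes` stands in for a ByteString
holding raw UTF-8 data from the socket. The sample text and variable
names are made up for illustration.

```python
# Text as a user would see it: 6 characters.
text = "Zürich"

# What arrives from the socket when the client sends UTF-8: raw bytes.
raw = text.encode("utf-8")

# Caveat 1: the size of the undecoded data counts bytes, not characters.
char_count = len(text)   # 6
byte_count = len(raw)    # 7, because "ü" occupies two bytes in UTF-8

# Concatenation still works: joining two valid UTF-8 sequences
# yields a valid UTF-8 sequence.
greeting = "Grüezi ".encode("utf-8") + raw
round_trip = greeting.decode("utf-8")

# Caveat 2: a literal must be encoded by hand before it is mixed with
# the raw data. Encoding the literal differently corrupts the result.
mixed = "Grüezi ".encode("latin-1") + raw
try:
    mixed.decode("utf-8")
    mixed_is_valid = True
except UnicodeDecodeError:
    mixed_is_valid = False
```

Here `mixed` is no longer valid UTF-8, which is the kind of breakage
that forces every literal to be converted manually.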

I am no expert on encodings, so I have no idea how this could be
cleanly solved. There is definitely a need for improvement.

Another issue I observed is that Characters in Squeak behave
inconsistently under #==. For characters with codePoint > 255,
identity is not preserved. This causes problems with code that uses
#== to compare characters, such as legacy code and code ported from
VisualWorks (SmaCC, for example). In VisualWorks, Characters are
unique, just like Symbols are.
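
The identity problem can be modelled with a toy class. This is a
sketch in Python, not Squeak's actual implementation: the `Character`
class, its shared table, and the table size are assumptions made for
illustration. Python's `is` plays the role of #== and `==` the role
of #=.

```python
class Character:
    """Toy model of a character object (hypothetical, for illustration)."""

    _table = {}  # shared instances for small code points (assumed size: 256)

    def __init__(self, code_point):
        self.code_point = code_point

    @classmethod
    def value(cls, code_point):
        # Small code points come from a shared table, so they are unique.
        if code_point < 256:
            if code_point not in cls._table:
                cls._table[code_point] = cls(code_point)
            return cls._table[code_point]
        # Larger code points get a fresh object on every request.
        return cls(code_point)

    def __eq__(self, other):
        return isinstance(other, Character) and self.code_point == other.code_point

    def __hash__(self):
        return hash(self.code_point)


a1, a2 = Character.value(97), Character.value(97)      # "a"
l1, l2 = Character.value(955), Character.value(955)    # Greek lambda

small_identical = a1 is a2   # True: both names refer to the same object
wide_identical = l1 is l2    # False: two distinct objects
wide_equal = l1 == l2        # True: equality still holds
```

Code that compares characters by identity keeps working for the small
code points and silently fails for the wide ones, which is why ported
code that relies on unique Characters breaks.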

Lukas

-- 
Lukas Renggli
http://www.lukas-renggli.ch


