UTF8 Squeak

Fri Jun 8 06:55:58 UTC 2007

Lukas Renggli wrote:
> Just as a side-note: In Seaside the encoding and decoding turns out to
> be very complicated and expensive. In fact so expensive, that almost
> nobody is willing to pay for it.

But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
first, just fix it ;-) If the second, which conversions are slow? If the
third, why not speed it up with a primitive? (UTF-8 translation isn't
that hard.)
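
To give a sense of scale for "isn't that hard", here is a minimal sketch in C of the decoding step such a primitive might perform. The function name `utf8_decode_one` and its interface are hypothetical; this is not existing Squeak VM code, just an illustration of decoding one UTF-8 sequence with the usual validity checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any Squeak VM.
 * Decodes one UTF-8 sequence from s (n bytes available).
 * Stores the code point in *out and returns the number of bytes
 * consumed, or 0 if the input is truncated or malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len, i;
    uint32_t cp, min;
    unsigned char b0;

    if (n == 0) return 0;
    b0 = s[0];

    if (b0 < 0x80) {                    /* plain ASCII, one byte */
        *out = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {   /* 110xxxxx: two bytes */
        len = 2; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {   /* 1110xxxx: three bytes */
        len = 3; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {   /* 11110xxx: four bytes */
        len = 4; cp = b0 & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or invalid lead byte */
    }

    if (n < len) return 0;              /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must be 10xxxxxx */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates, and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}
```

A caller would loop over the incoming buffer, advancing by the returned length and stopping on 0.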

> What most people do is to work with
> (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
> is received, stored, and sent exactly the way it comes from the
> socket. Byte identical strings are sent back as they were received.

I assume you mean Seaside 2.7 above, not Squeak 2.7.

> I am no expert with encodings, so I have no idea how this could be
> cleanly solved. There is definitely the need for improvement.

How about trying to improve the speed of conversions? You seem to imply
that this is the major issue here, so if the conversions were
blindingly fast (which I think they easily could be, by writing one or
two primitives) this should improve matters.
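
As a rough illustration of what one such primitive might contain, here is a sketch in C of encoding a single code point to UTF-8. The name `utf8_encode_one` and the calling convention are assumptions made for this example, not an existing Squeak primitive.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any Squeak VM.
 * Encodes code point cp as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not encodable
 * (a UTF-16 surrogate or a value above U+10FFFF). */
size_t utf8_encode_one(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                          /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                         /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                       /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                 /* beyond Unicode range */
}
```

The per-character work is a handful of shifts and masks, which is why doing it in a primitive rather than in interpreted code looks attractive.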

> Another issue I observed is that Characters in Squeak have an
> inconsistent behavior for #==. For characters with codePoint > 256 the
> identity is not preserved. This gives problems with code that uses #==
> to compare characters, legacy code and code ported from VisualWorks
> (SmaCC for example). In VisualWorks Characters are unique, just like
> Symbols are.

Yeah, but there isn't really an easy workaround unless you have
immediate characters, which Squeak doesn't, so fixing those comparisons
to use equality is really your only option. (FWIW, given that VW has a
good JIT, I would expect that they can inline this trivially, so there
shouldn't be a speed difference for VW.)
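
To make the identity-versus-equality point concrete, here is a toy model in C. It assumes (this is an assumption of the sketch, not something stated above) that small characters are handed out from a shared table while larger ones are allocated fresh on each request; all names are hypothetical. Pointer comparison stands in for #== and code point comparison for #=.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model of a boxed (non-immediate) character object. */
typedef struct { uint32_t code_point; } Character;

static Character char_table[256];   /* shared instances for 0..255 */
static int table_ready = 0;

/* Hypothetical analogue of asking for the character with a given value.
 * Small values return the shared instance; larger ones are allocated
 * fresh each time (caller frees), so two requests yield two objects. */
Character *character_value(uint32_t cp)
{
    Character *c;
    uint32_t i;

    if (!table_ready) {
        for (i = 0; i < 256; i++) char_table[i].code_point = i;
        table_ready = 1;
    }
    if (cp < 256) return &char_table[cp];

    c = malloc(sizeof *c);
    if (c != NULL) c->code_point = cp;
    return c;
}

/* Identity: same object in memory (models #==). */
int character_identical(const Character *a, const Character *b)
{
    return a == b;
}

/* Equality: same code point (models #=). */
int character_equal(const Character *a, const Character *b)
{
    return a->code_point == b->code_point;
}
```

In this model two requests for 'A' are identical, while two requests for a character such as U+20AC are equal but not identical, which is the behaviour that trips up code relying on #==.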

Cheers,
   - Andreas