UTF8 Squeak
Andreas Raab
andreas.raab at gmx.de
Fri Jun 8 06:55:58 UTC 2007
Lukas Renggli wrote:
> Just as a side-note: In Seaside the encoding and decoding turns out to
> be very complicated and expensive. In fact so expensive, that almost
> nobody is willing to pay for it.
But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
first, just fix it ;-) If the second, what conversions are slow? If the
third, why not speed it up by a primitive? (UTF-8 translation isn't that
hard)
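
To give a sense of how little code such a primitive would need, here is a minimal sketch of a UTF-8 decoder in C. It is an illustration only: the name utf8_decode, its signature, and its error convention are assumptions made for this example, not part of the Squeak VM or any existing plugin, and a real primitive would still need the usual VM glue around it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Squeak VM API): decode the UTF-8 bytes in
   src[0..srcLen) into code points stored in dst[0..dstCap).
   Returns the number of code points written, or -1 if the input is
   malformed or dst is too small. */
long utf8_decode(const unsigned char *src, size_t srcLen,
                 uint32_t *dst, size_t dstCap)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t minForLen[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, n = 0;

    while (i < srcLen) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;   /* number of continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;              /* stray continuation or invalid lead byte */

        if (extra > srcLen - i - 1) return -1;   /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80) return -1;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minForLen[extra]) return -1;              /* overlong encoding */
        if (cp > 0x10FFFF) return -1;                      /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;       /* UTF-16 surrogate */
        if (n >= dstCap) return -1;                        /* output buffer full */

        dst[n++] = cp;
        i += extra + 1;
    }
    return (long)n;
}
```

The loop makes a single pass over the input and allocates nothing, which is the property that would make it cheap as a primitive compared to a per-character conversion at the image level.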
> What most people do is to work with
> (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
> is received, stored, and sent exactly the way it comes from the
> socket. Byte identical strings are sent back as they were received.
I assume you mean Seaside 2.7 above, not Squeak 2.7.
> I am no expert with encodings, so I have no idea how this could be
> cleanly solved. There is definitely the need for improvement.
How about trying to improve the speed of conversions? You seem to imply
that this is the major issue here, so if the conversions were
blindingly fast (which I think they easily could be by writing one or two
primitives) this should improve matters.
> Another issue I observed is that Characters in Squeak have an
> inconsistent behavior for #==. For characters with codePoint > 256 the
> identity is not preserved. This gives problems with code that uses #==
> to compare characters, legacy code and code ported from VisualWorks
> (SmaCC for example). In VisualWorks Characters are unique, just like
> Symbols are.
Yeah, but there isn't really an easy workaround unless you have
immediate characters, which Squeak doesn't. So fixing those comparisons
to use equality is really your only option (FWIW, given that VW has a
good JIT I would expect that they can inline this trivially so there
shouldn't be a speed difference for VW).
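
To make the identity problem concrete, here is a small C model of the behavior described above. It is a sketch under assumptions, not Squeak's actual implementation: the Character struct and the functions char_value, char_identical and char_equal are hypothetical names invented for this example. Small code points come from a shared table, so two lookups yield the same object; larger ones are allocated fresh each time, so only a value comparison (the analogue of #=) is reliable.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a boxed (non-immediate) character object. */
typedef struct { uint32_t value; } Character;

/* One shared object per code point 0..255, mimicking a character table. */
static Character byteChars[256];

/* Return the character object for cp. Code points below 256 always map
   to the same table entry; anything larger gets a newly allocated object
   that the caller must free. Returns NULL if allocation fails. */
Character *char_value(uint32_t cp)
{
    if (cp < 256) {
        byteChars[cp].value = cp;
        return &byteChars[cp];
    }
    Character *c = malloc(sizeof *c);
    if (c != NULL)
        c->value = cp;
    return c;
}

/* Analogue of #==: are these the very same object? */
int char_identical(const Character *a, const Character *b)
{
    return a == b;
}

/* Analogue of #=: do these objects denote the same code point? */
int char_equal(const Character *a, const Character *b)
{
    return a->value == b->value;
}
```

With this model, two requests for code point 65 are identical, while two requests for code point 0x20AC are equal but not identical. Immediate characters would avoid the problem by encoding the code point in the reference itself, so that identity and equality coincide.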
Cheers,
- Andreas