UTF8 Squeak
Bert Freudenberg
bert at freudenbergs.de
Tue Jun 12 11:28:55 UTC 2007
On Jun 12, 2007, at 8:29 , Colin Putney wrote:
> Your proposal is actually to have strings encoded as ISO 8859-1,
> UCS-2 or UCS-4.
Actually, the idea is that a String has Unicode throughout, with no
encoding. A string is simply a flat array of Unicode code points.
To optimize space usage we choose the lowest number of bytes per
character that can encompass all code points in a String. This is
implemented as specialized subclasses of String. So for code points
below 256 we use ByteString (8 bits per char), for all others
WideString (32 bits per char). This is purely space optimization, not
a change in encoding.
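
To make the space optimization concrete, here is a minimal sketch in
Python rather than Squeak. The function names are made up for
illustration; the only rule taken from the text above is "one byte per
character if every code point is below 256, otherwise four".

```python
def bytes_per_char(text: str) -> int:
    """Return the narrowest fixed width (1 or 4 bytes) that fits every code point.

    Hypothetical helper: 1 stands in for ByteString, 4 for WideString.
    """
    # An empty string, or one whose code points are all below 256, fits in bytes.
    if all(ord(ch) < 256 for ch in text):
        return 1
    return 4


def storage_size(text: str) -> int:
    """Bytes needed for the characters alone, ignoring any object header."""
    return len(text) * bytes_per_char(text)
```

Note that a single character outside the 8-bit range widens the whole
string: the width is a property of the string, not of each character.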
Now, the proposal is to use an intermediate 2-byte representation for
code points below 65536. Nobody has demonstrated the general
usefulness of this optimization yet, in particular since the Squeak
VM does not support 16-bit arrays directly: they would have to be
emulated using 8-bit words or 32-bit words. For the latter, prims 144
and 145 might help, but the problem of odd lengths would have to be
addressed.
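
The odd-length problem can be illustrated with a small Python sketch
that packs 16-bit values two per 32-bit word. This is only an
illustration under assumptions: the helper names and the
"high half first" layout are invented here, and it says nothing about
how the VM primitives actually behave.

```python
def pack16(units):
    """Pack 16-bit values into 32-bit words, two per word, high half first.

    Returns (words, length). The length must be kept separately because
    an odd number of units leaves a padding half-word at the end.
    """
    words = []
    for i in range(0, len(units), 2):
        high = units[i]
        # For an odd-length input the last word gets a zero padding half.
        low = units[i + 1] if i + 1 < len(units) else 0
        words.append((high << 16) | low)
    return words, len(units)


def unpack16(words, length):
    """Inverse of pack16: split each word and drop any trailing padding."""
    units = []
    for word in words:
        units.append(word >> 16)
        units.append(word & 0xFFFF)
    return units[:length]
```

Without the separately stored length, a three-element string and a
four-element string ending in code point 0 would be indistinguishable.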
Also, the "purity" of Unicode strings does not translate directly
into the implementation, which reserves the most significant byte in
a WideString word for a "language code". That byte is otherwise
unused (code points range from 0 to 16r10FFFF) and is supposed to
help choose between glyph shapes that share a code point but differ
in appearance depending on the language. I suppose this was done to
restrict changes to the String hierarchy. A better place for language
info would be text attributes - but then potentially a lot of code
would have to be adapted to pass Texts rather than Strings. It might
be worth revising that design.
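
As a rough Python sketch of the tagging scheme as described here (a
language code in the most significant byte of a 32-bit word, the code
point below it): the function names and the example language number
are hypothetical, and the exact bit layout used by Squeak may differ
from this simplified picture.

```python
MAX_CODE_POINT = 0x10FFFF      # highest Unicode code point (16r10FFFF)
CODE_POINT_MASK = 0x00FFFFFF   # low three bytes of the word


def tag(code_point: int, language: int) -> int:
    """Combine a code point and an 8-bit language code into one 32-bit word."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of Unicode range")
    if not 0 <= language <= 0xFF:
        raise ValueError("language code must fit in one byte")
    return (language << 24) | code_point


def untag(word: int):
    """Split a tagged word back into (code_point, language)."""
    return word & CODE_POINT_MASK, word >> 24
```

With language code 0 the word equals the plain code point, so only
tagged characters deviate from pure Unicode values.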
For dealing with encodings perhaps it would be useful to wrap a
ByteArray with a codec into an EncodedString - that way encoded data
could be passed from a webserver and back unmodified. #asString would
use the codec to convert to a proper String, which might also be used
for displaying that EncodedString. I would not actually make it a
String subclass, so perhaps a name other than EncodedString would be
better.
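
A minimal Python sketch of such a wrapper, assuming the codec is
identified by name. The class name EncodedBytes and the method
as_string are placeholders chosen for this illustration (as_string
plays the role of #asString); none of this exists in Squeak.

```python
class EncodedBytes:
    """Raw bytes plus the codec needed to interpret them.

    Deliberately not a str subclass: the bytes stay untouched and are
    only decoded on request.
    """

    def __init__(self, data: bytes, codec: str):
        self.data = data      # passed through unmodified
        self.codec = codec    # e.g. "utf-8" or "latin-1"

    def as_string(self) -> str:
        """Decode to a proper string of code points."""
        return self.data.decode(self.codec)

    def __str__(self) -> str:
        # Displaying the object goes through the decoded form.
        return self.as_string()
```

Data received from a web server could be wrapped once and handed back
via the data attribute without ever being re-encoded.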
My €/50 ...
- Bert -