janko.mivsek at eranova.si
Mon Jun 11 20:51:32 UTC 2007
Yoshiki Ohshima wrote:
>> It seems that this was already Yoshiki's idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>> Yoshiki, am I right?
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding a 16-bit variation solves any real problems. But
> there may be something.
For the Slovenian language, written in the Latin-2 script, we need
TwoByteStrings; the same goes for all of Eastern Europe, Greek, and
Cyrillic. And because I'm using an image as a database, I just cannot
afford 4-byte strings... For shorter Slovenian strings even ByteStrings
suffice.
> - While the vast majority of strings for, say, Japanese can be
> represented with the characters in the BMP, you would use
> FourByteString for Chinese/Japanese/Korean and some others. Does
> this mean that you would *always* use FourByteString for these
> "languages" (and not scripts)?
My proposal allows strings to "scale" to support wider characters by
widening themselves from ByteString to TwoByteString and then to
FourByteString. The width of a string is determined automatically (as it
already is for WideString): you start with a ByteString, and when you put
in the first character with a code point above 255, the ByteString is
automatically converted to a TwoByteString or even a FourByteString. The
same goes for a TwoByteString when you add a character with a code point
of 2**16 or above.
Strings therefore don't need to be aware at all about languages they
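
To make the widening rule concrete, here is a rough sketch in Python rather than Squeak. The class name ScalingString and its methods are made up for illustration and are not part of any Squeak image; the only thing taken from the proposal is the idea of 1, 2 and 4 bytes per character with automatic promotion.

```python
class ScalingString:
    """Illustrative only: a string that widens its storage on demand."""

    def __init__(self, text=""):
        self.width = 1            # bytes per character, starts like a ByteString
        self.data = bytearray()   # fixed-width, big-endian code points
        for ch in text:
            self.append(ch)

    @staticmethod
    def width_for(code_point):
        """Smallest of the three assumed widths that can hold the code point."""
        if code_point <= 0xFF:
            return 1
        if code_point <= 0xFFFF:
            return 2
        return 4

    def code_points(self):
        w = self.width
        return [int.from_bytes(self.data[i:i + w], "big")
                for i in range(0, len(self.data), w)]

    def _widen(self, new_width):
        """Re-encode every stored character at the new, larger width."""
        old = self.code_points()
        self.width = new_width
        self.data = bytearray()
        for cp in old:
            self.data += cp.to_bytes(new_width, "big")

    def append(self, ch):
        cp = ord(ch)
        needed = self.width_for(cp)
        if needed > self.width:
            self._widen(needed)
        self.data += cp.to_bytes(self.width, "big")

    def __len__(self):
        return len(self.data) // self.width

    def __str__(self):
        return "".join(chr(cp) for cp in self.code_points())
```

A string of plain ASCII stays at one byte per character; appending "č" (U+010D) promotes the whole string to two bytes per character, and a character outside the BMP promotes it to four. Widening never goes back down in this sketch.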
> - Suppose you would like to use different line wrapping algorithms
> for different languages, how would you keep that information?
Line ends should internally be Character cr only (how is that in Squeak
anyway?). Different line ends are, again, the responsibility of the
streams facing the external world.
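
As a rough illustration of keeping CR inside and translating at the stream boundary, again in Python and not Squeak: LineEndWriter and read_normalized are hypothetical names invented for this sketch, and the external convention is simply passed in as a parameter.

```python
import io

CR = "\r"  # the single internal line end (Character cr in the text above)


class LineEndWriter:
    """Hypothetical wrapper: translates internal CR to an external line end."""

    def __init__(self, stream, external="\n"):
        self.stream = stream
        self.external = external

    def write(self, text):
        self.stream.write(text.replace(CR, self.external))


def read_normalized(text):
    """Map CRLF, LF or CR coming from outside to the internal CR."""
    return text.replace("\r\n", CR).replace("\n", CR)


out = io.StringIO()
LineEndWriter(out, external="\r\n").write("first" + CR + "second")
```

The strings themselves never see anything but CR; only the writer and the reader know which convention the outside world uses.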
What about a separate Locale object for all that language specific