UTF8 Squeak

Mon Jun 11 20:51:32 UTC 2007

Hi Yoshiki,

Yoshiki Ohshima wrote:
>   Janko,
> 
>> It seems that this was already a Yoshiki idea with WideString, so I'm 
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
> 
>   For storing the bare Unicode code points, I think so.  I'm not
> convinced that adding a 16-bit variation solves any real problems.  But
> there may be something.

For the Slovenian language, written in the Latin 2 script, we need 
TwoByteStrings; the same goes for all of Eastern Europe, Greek, and 
Cyrillic. And because I'm using an image as a database, I just cannot 
afford 4-byte strings... For shorter Slovenian strings even ByteStrings 
suffice.

>   - While the vast majority of strings for, say, Japanese can be
>     represented with the characters in the BMP, you would use
>     FourByteString for Chinese/Japanese/Korean and some others.  Does
>     this mean that you would *always* use FourByteString for these
>     "languages" (and not scripts)?

My proposal allows strings to "scale" to support wider characters by 
widening themselves, from ByteString to TwoByteString and then to 
FourByteString.

Determination of the width of a string is automatic (as it already is 
for WideString): you start with a ByteString, and when you put in the 
first character with a code point above 255, the ByteString is 
automatically converted to a TwoByteString or even a FourByteString. 
The same goes for a TwoByteString when you add a character with a code 
point above 2**16 - 1.

Strings therefore don't need to be aware at all of the languages they 
support.
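
As a rough illustration of the widening idea, here is a minimal sketch 
in Python rather than Smalltalk. The class ScalableString and its 
methods are hypothetical names made up for this example, not existing 
Squeak classes; the sketch only models storage growing from 1 to 2 to 4 
bytes per character, assuming one fixed width per string.

```python
class ScalableString:
    """Toy model of a string that widens its storage on demand.

    Hypothetical illustration only: width 1, 2 and 4 stand in for
    ByteString, TwoByteString and FourByteString.
    """

    def __init__(self, text=""):
        self.width = 1            # bytes per character, starts narrow
        self.data = bytearray()   # fixed-width code points, little-endian
        for ch in text:
            self.append(ch)

    @staticmethod
    def width_for(code_point):
        """Smallest of 1, 2 or 4 bytes that can hold the code point."""
        if code_point <= 0xFF:
            return 1
        if code_point <= 0xFFFF:
            return 2
        return 4

    def code_points(self):
        """Decode the storage back into a list of integer code points."""
        w = self.width
        return [int.from_bytes(self.data[i:i + w], "little")
                for i in range(0, len(self.data), w)]

    def _widen(self, new_width):
        """Re-encode every stored character at the new, larger width."""
        old = self.code_points()
        self.width = new_width
        self.data = bytearray()
        for cp in old:
            self.data += cp.to_bytes(new_width, "little")

    def append(self, ch):
        """Add one character, widening the whole string first if needed."""
        cp = ord(ch)
        needed = self.width_for(cp)
        if needed > self.width:
            self._widen(needed)
        self.data += cp.to_bytes(self.width, "little")

    def __len__(self):
        return len(self.data) // self.width

    def __str__(self):
        return "".join(chr(cp) for cp in self.code_points())


s = ScalableString("abc")   # fits in 1 byte per character
s.append("č")               # U+010D is above 255, so the string widens to 2
s.append("😀")              # U+1F600 is above 2**16 - 1, so it widens to 4
print(s.width, len(s), str(s))
```

The caller never picks a width: it is derived purely from the largest 
code point stored so far, which is why the string itself needs no 
knowledge of languages.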

>   - Suppose you would like to use different line wrapping algorithms
>     for different languages, how would you keep that information?

Line ends internally should be Character cr only (how is that in Squeak 
anyway?). Handling different line ends is again the responsibility of 
the streams to the external world.
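
A minimal sketch of that split, again in Python and under my own 
assumptions: the names read_external and LineEndWriter are hypothetical 
and only show where the conversion could live, namely at the stream 
boundary, with a bare CR used everywhere inside.

```python
import io

CR = "\r"  # the only line end used internally in this sketch


def read_external(text):
    """Normalise CRLF, LF or CR from the outside world to the internal CR."""
    return text.replace("\r\n", CR).replace("\n", CR)


class LineEndWriter:
    """Wraps a text sink and maps the internal CR to an external line end."""

    def __init__(self, sink, line_end="\n"):
        self.sink = sink
        self.line_end = line_end

    def write(self, text):
        self.sink.write(text.replace(CR, self.line_end))


internal = read_external("first\r\nsecond\nthird")
sink = io.StringIO(newline="")           # newline="" disables translation
LineEndWriter(sink, "\r\n").write(internal)
print(repr(internal), repr(sink.getvalue()))
```

The strings in between never see anything but CR, so no string class 
has to know which convention the outside world uses.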

What about a separate Locale object for all that language-specific 
information?

Best regards,
Janko

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si