multilingual Squeak (Re: Must _ go like the Dodo?)

Michael S. Klein mklein at alumni.caltech.edu
Tue Mar 16 15:31:25 UTC 1999


What about comparing two strings?

If you have two strings, each containing one character that maps to
the same code point in Unicode, but the strings are in different
encodings, are the strings equal or not?

For example, taking the first Han character in Unicode:

is U+4e00  =  G(5027) ?
is G(5027)  = J(1676) ? 

U means Unicode 2.0
G means GB 2312-80
J means JIS X 0208-1990
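
A minimal sketch of the comparison above, in Python rather than
Smalltalk. It assumes G(5027) and J(1676) are row/cell (kuten) numbers,
that adding 0xA0 to each half gives the EUC byte pair, and that Python's
"gb2312" and "euc_jp" codecs are acceptable stand-ins for GB 2312-80 and
JIS X 0208-1990. The helper name kuten_to_euc is made up for
illustration.

```python
def kuten_to_euc(kuten: int) -> bytes:
    """Turn a row/cell number such as 5027 into an EUC byte pair (assumed layout)."""
    row, cell = divmod(kuten, 100)
    return bytes([row + 0xA0, cell + 0xA0])

# Decode each legacy code through its own character set.
g = kuten_to_euc(5027).decode("gb2312")   # G(5027)
j = kuten_to_euc(1676).decode("euc_jp")   # J(1676)

# Compare only after both sides have been mapped to Unicode code points.
print(hex(ord(g)), hex(ord(j)))   # 0x4e00 0x4e00
print(g == j == "\u4e00")         # True
```

Under those assumptions both legacy codes land on U+4E00, so they
compare equal once mapped to Unicode. Whether they should compare equal
before that mapping is exactly the open question.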

For the Eurocentric amongst us (most of the list),

is a Unicode $r   the same as an ASCII $r  ?
is an English 'r' the same as a French 'r' ?

Still, representation is a start, independent of answering the
above questions.  This way you could have an object for which
Unicode $r ~= ASCII $r (even though Unicode says they are the same,
semantically).
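
One way to picture such an object, as a rough Python sketch. The class
EncodedChar and its method names are hypothetical, and tagging a
character with a Python codec name is only an assumption about how the
encoding might be recorded. Strict equality looks at the encoding and
the raw code together; a separate method asks the semantic question.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncodedChar:
    """Hypothetical character that remembers which encoding it came from."""
    encoding: str   # Python codec name, used here as the encoding tag
    raw: bytes      # the code as stored in that encoding

    def to_unicode(self) -> str:
        return self.raw.decode(self.encoding)

    def same_character_as(self, other: "EncodedChar") -> bool:
        # Semantic comparison: go through Unicode on both sides.
        return self.to_unicode() == other.to_unicode()

ascii_r = EncodedChar("ascii", b"r")
unicode_r = EncodedChar("utf-16-be", b"\x00r")

print(ascii_r == unicode_r)                  # False: different representation
print(ascii_r.same_character_as(unicode_r))  # True: same Unicode character
```

The generated == compares (encoding, raw), so the two objects differ
even though they name the same letter. Which of the two answers "="
should give is a design decision, not something the representation
settles.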

In defence of Unicode, it preserves round trip transcoding.
It would be my standard of choice.
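
A small check of what round-trip transcoding means, sketched in Python.
It assumes Python's codecs as stand-ins for the character sets, and the
helper name round_trips is made up. The three samples are only spot
checks, not a proof for the whole repertoire.

```python
def round_trips(raw: bytes, codec: str) -> bool:
    """True if legacy bytes -> Unicode -> legacy bytes gives back the same bytes."""
    return raw.decode(codec).encode(codec) == raw

# One sample per encoding: the Han character 'one' in two sets, plus ASCII r.
samples = {
    "gb2312": b"\xd2\xbb",
    "euc_jp": b"\xb0\xec",
    "ascii": b"r",
}

results = {codec: round_trips(raw, codec) for codec, raw in samples.items()}
print(results)
```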

-- Mike Klein

mklein at alumni.caltech.edu

On Tue, 16 Mar 1999, Marcel Weiher wrote:

> >   Suppose there is an (imaginary) multilingual Smalltalk.
> > On the system, the instances of Character should carry
> > enough information about the character itself.  But, as I
> > wrote above, if the internal representation would be
> > Unicode, this couldn't be true.
> 
> Apple/NeXT's Yellow-Box has a nice way of handling this.  Strings  
> are defined to be in a certain encoding, with encoding objects  
> specifying just which encoding that could be.  There are methods for  
> (a) accessing the characters in the string's 'native' encoding (b)  
> converting the string to another encoding (c) accessing as Unicode  
> and (d) getting a string's encoding.  There are private subclasses to  
> deal with common formats (ASCII and other eight-bit encodings,  
> Unicode) efficiently.
> 
> This way, there need not be a common encoding scheme for all  
> characters, although it may be advantageous to stick to 7-bit ASCII  
> for some system level stuff...  Fonts handle the mapping of  
> character-codes to glyphs depending on the character encoding and  
> their own code->glyph maps.
> 
> Marcel
> 
> 
> 
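
The four operations (a) to (d) described in the quoted message can be
sketched roughly as follows, in Python. This is not the Yellow-Box API:
the class EncodedString and every method name are hypothetical, and
Python codec names stand in for encoding objects.

```python
class EncodedString:
    """Hypothetical string that carries its own encoding."""

    def __init__(self, raw: bytes, encoding: str):
        self._raw = raw
        self._encoding = encoding

    def native_units(self) -> bytes:
        # (a) access the contents in the string's native encoding
        return self._raw

    def converted_to(self, encoding: str) -> "EncodedString":
        # (b) convert to another encoding, going through Unicode
        return EncodedString(self.as_unicode().encode(encoding), encoding)

    def as_unicode(self) -> str:
        # (c) access the contents as Unicode
        return self._raw.decode(self._encoding)

    def encoding(self) -> str:
        # (d) report the string's encoding
        return self._encoding

gb = EncodedString(b"\xd2\xbb", "gb2312")
jis = gb.converted_to("euc_jp")

print(jis.encoding(), jis.native_units())
print(gb.as_unicode() == jis.as_unicode())   # True
```

Conversion here always passes through Unicode, which is a simplification
chosen for the sketch. The quoted description does not say how the real
conversion is done.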




