Unicode support

Todd Blanchard tblanchard at etranslate.com
Wed Sep 22 18:41:28 UTC 1999


> > Well, maybe.  Tokenizing is less easy than you might think.
>
> This is the point of having an abstract protocol.  Where the root behavior or
> performance is inadequate, the subclass fixes this by doing it right.

> > Some  languages do not use whitespace as word delimiters.
>
> This is the point of having delimiter parameters in the tokenizing routine.
> (See String>>findTokens:)

Uh, some languages do not have delimiters between words at all.
But rather than get pedantic about that, let's divorce "tokenization"
from the concept of word-spotting.  They are not always the same, and
a distinction needs to be made.  As long as the word-breaking routines
can be plugged in based on language, we are OK.
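
To make the "plugged in based on language" idea concrete, here is a
minimal sketch in Python rather than Squeak.  Everything in it is an
assumption made for illustration: the names (whitespace_breaker,
make_dictionary_breaker, BREAKERS, words), the language tags, and the
toy lexicon.  The greedy longest-match breaker is only one simple way to
split text that has no delimiters; it is not a real segmenter for any
language.

```python
from typing import Callable, Dict, List, Set

# A word breaker takes a string and returns its words in order.
WordBreaker = Callable[[str], List[str]]


def whitespace_breaker(text: str) -> List[str]:
    """Word spotting for languages that separate words with whitespace."""
    return text.split()


def make_dictionary_breaker(lexicon: Set[str]) -> WordBreaker:
    """Build a greedy longest-match breaker for text without word delimiters.

    Unknown characters fall back to one-character words.
    """
    longest = max((len(word) for word in lexicon), default=1)

    def breaker(text: str) -> List[str]:
        found = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1
                continue
            # Try the longest candidate first, shrinking down to one character.
            for size in range(min(longest, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in lexicon:
                    found.append(candidate)
                    i += size
                    break
        return found

    return breaker


# Hypothetical registry: language tag -> word-breaking routine.
BREAKERS: Dict[str, WordBreaker] = {
    "en": whitespace_breaker,
    "toy-unspaced": make_dictionary_breaker({"the", "cat", "sat"}),
}


def words(text: str, language: str) -> List[str]:
    """Dispatch to the breaker registered for the language, if any."""
    return BREAKERS.get(language, whitespace_breaker)(text)
```

With this arrangement, words("a fine mess", "en") splits on whitespace,
while words("thecatsat", "toy-unspaced") has to consult the lexicon.
Callers never need to know which strategy ran.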

> > Other languages  (Hebrew and arabic) are bi-directional in
> > nature.  The direction of  consideration can change within
> > the string.  For instance, Hebrew  reads from right to left
> > but if you embed an arabic-type number or  western language
> > phrase in it then you read the digits or phrase
> > conventionally left to right.  Tricky.
>
> Not sure why left-to-right or right-to-left matters with respect to
> tokenizing.  As understood, tokens are simply a way to "clump" consecutive
> characters together, based upon the equivalence classes defined by a
> relation, say, isDelimiter, or less generally, the equivalence relation
> imputed by membership in an existing string (per findTokens:)  If a sequence
> of characters follows:

Consider this:

hebr3 "This is a fine mess" hebr2 hebr1 hebr0

If you iterate through the tokens, in what order would you expect to
get tokens?
If you iterate through the words, is the order the same?
I'd argue that in the word case you want:

hebr0
hebr1
hebr2
this
is
a
fine
mess
hebr3
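
The reordering above can be sketched in a few lines of Python.  This is
a deliberately simplified illustration, not the Unicode Bidirectional
Algorithm: it assumes a right-to-left paragraph, it assumes each word
already carries a direction flag, and it reorders whole words rather
than characters.  The function name logical_order and the (word, is_ltr)
pairs are made up for the example.

```python
from itertools import groupby
from typing import List, Tuple


def logical_order(visual_words: List[Tuple[str, bool]]) -> List[str]:
    """Recover reading order from display order in a right-to-left paragraph.

    visual_words holds (word, is_ltr) pairs as they appear on screen from
    left to right.  Reading starts at the right edge, so the whole line is
    reversed first; each embedded left-to-right run is then flipped back so
    that it reads conventionally.
    """
    right_to_left = list(reversed(visual_words))
    result: List[str] = []
    for is_ltr, run in groupby(right_to_left, key=lambda pair: pair[1]):
        run_words = [word for word, _ in run]
        if is_ltr:
            run_words.reverse()
        result.extend(run_words)
    return result


# The line from the example, in display order.
display = [
    ("hebr3", False),
    ("This", True), ("is", True), ("a", True), ("fine", True), ("mess", True),
    ("hebr2", False), ("hebr1", False), ("hebr0", False),
]
reading = logical_order(display)
```

Here reading comes out as hebr0, hebr1, hebr2, This, is, a, fine, mess,
hebr3, which is the word order argued for above and differs from the
order in which the tokens sit on the display line.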

> Still, while all this raises lots of
> amusing issues, I don't see how it relates to tokenization.  However it is
> stored (so long as strings are processed consistently, the delimiters are in
> the same place, and the result will still be tokenized either as
>
> 	('MLCh' '123') or ('MLCh' '321')
>
> Of course, if the number '2' were a delimiter, the representation of the
> string would suddenly make a difference.  But I'm not sure that this is a
> concern for the String class -- the non-semantic operation is still
> well-defined, and the details can always be worked out when semantic meanings
> are intended to be imputed, w.l.o.g., either in a subclass or in particular
> codes.
>
> In short, I see nothing in the structure of Hebrew character strings anything
> that requires deviating from the more-or-less obvious tokenization
> protocols.

Well, hopefully you do now.

BTW, this also raises the issue of string rendering.  How do you
select an appropriate font/glyph?  I confess I know little about how
that works.

What we do here is use UTF-8/Unicode for internal storage and use
that as a pivot to convert among all the other encodings.  You can
generally resolve the issues with Unicode code point conflicts by
then specifying a font with the correct glyphs for the language you
are trying to render.  How that information might be associated with
a given String still has me thinking, and so far I don't have a good
solution.
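
For what the pivot idea looks like in code, here is a small Python
sketch.  It is an illustration only, not our actual implementation: the
function name transcode is invented, and the codec names are the ones
Python's standard library happens to use.  The point is that each
encoding only needs a converter to and from Unicode, instead of one
converter per pair of encodings.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert bytes between two encodings by pivoting through Unicode."""
    text = data.decode(source_encoding)   # legacy bytes -> Unicode string
    return text.encode(target_encoding)   # Unicode string -> target bytes


# "\u65e5\u672c\u8a9e" is the three-character Japanese word for "Japanese".
sample = "\u65e5\u672c\u8a9e"

# Shift_JIS -> EUC-JP, with no direct Shift_JIS/EUC-JP table involved.
sjis_bytes = sample.encode("shift_jis")
euc_bytes = transcode(sjis_bytes, "shift_jis", "euc_jp")

# The same pivot doubles as the route to and from internal UTF-8 storage.
stored = transcode(sjis_bytes, "shift_jis", "utf-8")
```

A character that the target encoding cannot represent makes encode
raise an error in this sketch; a real converter would need a policy for
that case.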



--
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
eTranslate, Inc.                                    The Power of Language 
Todd Blanchard                                  main +1.415.487.7850
Chief Technology Architect                      fax +1.415.371.0010
http://www.etranslate.com/
520 Third Street, Suite 505,      San Francisco, California 94107, U.S.A.




