Unicode support

agree at carltonfields.com agree at carltonfields.com
Wed Sep 22 17:29:04 UTC 1999


> Well, maybe.  Tokenizing is less easy than you might think.  

This is the point of having an abstract protocol.  Where the root behavior or performance is inadequate, the subclass fixes this by doing it right.

> Some  languages do not use whitespace as word delimiters.  

This is the point of having delimiter parameters in the tokenizing routine.  (See String>>findTokens:)
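
A minimal sketch of what a delimiter parameter buys you, written in Python rather than Squeak so that it stands alone. The name find_tokens and its behavior (maximal runs of non-delimiter characters, no empty tokens) are assumptions loosely modelled on String>>findTokens:, not a transcription of the Squeak source:

```python
def find_tokens(text, delimiters):
    """Return the maximal runs of characters in text that are not in delimiters."""
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the token in progress, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the final token when the text does not end with a delimiter.
    if current:
        tokens.append("".join(current))
    return tokens


# The caller, not the routine, decides what counts as a delimiter.
by_space = find_tokens("one two,three", " ")
by_space_or_comma = find_tokens("one two,three", " ,")
```

A language that does not delimit words with whitespace would pass a different delimiter set, or a subclass would override the routine entirely.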

> Other languages (Hebrew and arabic) are bi-directional in
> nature.  The direction of consideration can change within
> the string.  For instance, Hebrew reads from right to left
> but if you embed an arabic-type number or western language
> phrase in it then you read the digits or phrase
> conventionally left to right.  Tricky.

Not sure why left-to-right or right-to-left matters with respect to tokenizing.  As I understand it, tokens are simply a way to "clump" consecutive characters together, based upon the equivalence classes defined by a relation, say, isDelimiter, or, less generally, the equivalence relation imputed by membership in an existing string (per findTokens:).  If a sequence of characters follows:

	<delim> hebrew-letter-sequence <delim> arabic-number-sequence <delim>

how does it matter whether the tokenizing is done by determining delimiterness left to right or right to left?  Indeed, how does right-to-leftness relate to a string at all?  Isn't that solely a question of printing implementations?  Presumably, the Hebrew word Melach, which APPEARS when printed

	Chet  Lamud   Mem

(perhaps with vowels; sorry if I am butchering the transliterations) is still stored in ascending sequence in memory, with index 1 being the Mem ("M" sound), index 2 being the Lamud ("L" sound), and index 3 being the Chet.  The string for the word Melach, followed by the Arabic number 123, might well be stored "MLCh 123" or "MLCh 321", depending upon how you structure your world (though I imagine that it might best be stored with Arabic numbers non-reversed for collation purposes).  Still, while all this raises lots of amusing issues, I don't see how it relates to tokenization.  However it is stored (so long as strings are processed consistently), the delimiters are in the same place, and the result will still be tokenized as either

	('MLCh' '123') or ('MLCh' '321')
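
To make that concrete, here is a small sketch in Python (not Squeak). The helper names find_tokens and find_tokens_from_right are hypothetical, and the example assumes the string is held in logical order, Mem first, as described above. It groups characters by the isDelimiter relation and checks that scanning from either end yields the same tokens:

```python
from itertools import groupby


def find_tokens(text, delimiters):
    """Clump consecutive characters by the relation 'is a delimiter'."""
    return [
        "".join(run)
        for is_delimiter, run in groupby(text, key=lambda ch: ch in delimiters)
        if not is_delimiter
    ]


def find_tokens_from_right(text, delimiters):
    """Scan from the last index to the first, then restore storage order."""
    reversed_tokens = find_tokens(text[::-1], delimiters)
    return [token[::-1] for token in reversed(reversed_tokens)]


# Mem, Lamed, Chet in storage order (index 0 is the Mem), written as escapes
# so that bidirectional display does not obscure the in-memory order.
word = "\u05de\u05dc\u05d7"
stored = word + " 123"

left_to_right = find_tokens(stored, " ")
right_to_left = find_tokens_from_right(stored, " ")
```

Both scans produce the Hebrew word followed by '123'. How a terminal or widget chooses to paint those tokens is a separate matter from where the delimiters sit.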

Of course, if the number '2' were a delimiter, the representation of the string would suddenly make a difference.  But I'm not sure that this is a concern for the String class -- the non-semantic operation is still well-defined, and the details can always be worked out when semantic meanings are intended to be imputed, w.l.o.g., either in a subclass or in particular code.
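
A quick illustration of the '2' case, again as a Python sketch with a hypothetical find_tokens helper rather than the Squeak routine itself. The operation is well-defined for either stored form; only the order of the digit tokens changes:

```python
from itertools import groupby


def find_tokens(text, delimiters):
    """Return the maximal runs of characters in text that are not in delimiters."""
    return [
        "".join(run)
        for is_delimiter, run in groupby(text, key=lambda ch: ch in delimiters)
        if not is_delimiter
    ]


# Treat both the space and the digit '2' as delimiters.
delimiters = " 2"

digits_in_reading_order = find_tokens("MLCh 123", delimiters)
digits_reversed = find_tokens("MLCh 321", delimiters)
```

The first call gives 'MLCh', '1', '3' and the second gives 'MLCh', '3', '1'. Which of the two is "right" is a semantic question for whoever chose the storage convention, not for the tokenizer.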

In short, I see nothing in the structure of Hebrew character strings that requires deviating from the more-or-less obvious tokenization protocols.




