Unicode support

Todd Blanchard tblanchard at etranslate.com
Wed Sep 22 20:57:02 UTC 1999


> > Uh, some languages do not have delimiters between words at all.
>
> Neat.  Hebrew, of course, is not one of them.  (What languages apart from
> ideographic languages don't use delimiters?)  Further, such a class doesn't
> need the notion of tokenization, by definition, I suppose, unless there are
> END-OF-TOKEN forms of letters, in which case the model of tokens suggested would
> not be applicable.  They would, of course, resolve that issue in a subclass
> implementation that ignores the token parameter.

German has the lovely habit of running multiple words together to
make bigger and bigger words.  You often want to navigate on the
constituent words in the superword.  There are text editors that
do this correctly.  Some East Asian languages also have different
ways of breaking text up into words that don't relate to
whitespace.

> > But rather than get pedantic about that - let's divorce "tokenization"
> > from the concept of word-spotting.
>
> Why?  I think the point that was made here by others, and with which I agree, is that
> tokenization is an appropriate string operation, and semantic
> "word-spotting" is probably not.

I don't agree.

String is a mechanism for representing *language* and languages
typically have *words*.  Tokens are something else - more
arbitrary.  We got here because of this:

> A newbie recently asked how to compute the equivalent of:
>
> 	word 4 of line 7
>
> and
>
> 	set word 4 of line 7 to "foobar"

Which I think is certainly an appropriate operation for String.

OTOH, this operation cannot always be as simple-minded as
delimiter-based tokenization.

Although - by your definition of what a string is - perhaps that's
not appropriate.
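
To make the limitation concrete, here is a minimal sketch of the delimiter-based reading of "word 4 of line 7".  It is written in Python rather than Squeak purely for illustration, and the function names word_of_line and set_word_of_line are hypothetical, not an existing String protocol.  It assumes that whitespace is the only word delimiter, which is exactly the assumption in question.

```python
def word_of_line(text, word_no, line_no):
    """Return the 1-based word of the 1-based line, splitting on whitespace."""
    line = text.splitlines()[line_no - 1]
    return line.split()[word_no - 1]


def set_word_of_line(text, word_no, line_no, new_word):
    """Return a copy of text with one whitespace-delimited word replaced."""
    lines = text.splitlines()
    words = lines[line_no - 1].split()
    words[word_no - 1] = new_word
    # Rejoining with single spaces discards the line's original spacing.
    lines[line_no - 1] = " ".join(words)
    return "\n".join(lines)


sample = "one two three four\nDonaudampfschiff legt heute ab"
print(word_of_line(sample, 4, 1))
print(set_word_of_line(sample, 4, 1, "foobar"))
# The German compound is a single token here, although a reader sees
# Donau + Dampf + Schiff inside it.
print(word_of_line(sample, 1, 2))
```

For space-delimited text this does what the newbie asked for.  It has no way to reach the constituent words of a compound, and it returns a whole line as one "word" for text written without spaces.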

>
> > hebr3 "This is a fine mess" hebr2 hebr1 hebr0
> > if you iterate through the tokens, in what order would you expect to get
> > tokens?
>
> You are imposing a particular sequencing of information onto the string based
> upon the semantics of an underlying language, and then asking me to describe the
> tokens that are derived.  My suggestion is that if you want the tokens to be
> semantically meaningful, then your program (or a subclass of string) must first
> organize the sequence of characters so that the purely mechanical,
> non-semantic sequencing will yield a meaningful result.  Thus, the answer is
> this: they would be the tokens, read left to right or right to left in sequence, as
> defined by the delimiters.

Which is probably useless for anything but single-direction languages.
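
A rough sketch of the mixed-direction case, again in Python for illustration only.  It assumes the string is stored in logical (typing) order and that whitespace delimits tokens; token_directions is a hypothetical helper, and the four Hebrew letters are placeholders rather than real words.  The direction label comes from the Unicode bidirectional class of each character, via the standard unicodedata module.

```python
import unicodedata


def token_directions(text):
    """Split on whitespace and label each token by its strong bidi class."""
    labelled = []
    for token in text.split():
        classes = {unicodedata.bidirectional(ch) for ch in token}
        if classes & {"R", "AL"}:      # Hebrew or Arabic letters present
            direction = "rtl"
        elif "L" in classes:           # Latin and other left-to-right letters
            direction = "ltr"
        else:
            direction = "neutral"      # digits, punctuation only
        labelled.append((token, direction))
    return labelled


# Two placeholder Hebrew tokens around an English phrase, in storage order.
stored = "\u05d0\u05d1 fine mess \u05d2\u05d3"
for token, direction in token_directions(stored):
    print(direction, [hex(ord(ch)) for ch in token])
```

The tokens come back in storage order, which is neither a consistent left-to-right nor a consistent right-to-left reading of what appears on screen.  Recovering the displayed order needs direction information that delimiter-based splitting does not have.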

> It is not the duty of the String object to understand the semantics of the
> underlying language in which characters are represented, but only to provide
> underlying operations in which most reasonable operations (including
> semantics-based operations) might be accomplished.

Well, which is it?  String is either a class for representing chunks
of language, or it's a mechanism for representing arrays of
characters (whatever those are - a whole other topic).
I think String is implemented as the latter but used as the former,
and we English speakers are lucky in that the two just happen to
coincide.  Unfortunately, that coincidence is a fluke of English
and not something you can rely on globally.
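
A small illustration of the gap between "array of characters" and what a reader perceives, using Python only as an example of a string type that is a sequence of Unicode code points.  The specific example (an "e" followed by a combining acute accent) is my own, not something from the thread.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one perceived letter,
# two elements in the array.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(decomposed == composed)
```

Both strings display as the same letter, yet they differ in length and do not compare equal element by element.  For plain unaccented English text the two views never diverge, which is why the distinction is easy to overlook.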



--
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
eTranslate, Inc.                                    The Power of Language 
Todd Blanchard                                  main +1.415.487.7850
Chief Technology Architect                      fax +1.415.371.0010
http://www.etranslate.com/
520 Third Street, Suite 505,      San Francisco, California 94107, U.S.A.




