tblanchard at etranslate.com
Wed Sep 22 18:41:28 UTC 1999
> > Well, maybe. Tokenizing is less easy than you might think.
> This is the point of having an abstract protocol. Where the root
> performance is inadequate, the subclass fixes this by doing it right.
> > Some languages do not use whitespace as word delimiters.
> This is the point of having delimiter parameters in the tokenizing
> (See String>>findTokens:)
Uh, some languages do not have delimiters between words at all.
But rather than get pedantic about that - let's divorce "tokenization"
from the concept of word-spotting. They are not always the same and
a distinction needs to be made. As long as the word breaking routines
can be plugged in based on language - we are OK.
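
To make the distinction concrete, here is a rough sketch (in Python rather than Smalltalk, purely for illustration) of delimiter-based tokenizing kept separate from word breaking chosen per language. Everything in it is an assumption made for the example: find_tokens only loosely imitates findTokens:, and the lexicon, the language codes and the greedy longest-match breaker are toys, not anything that exists in Squeak.

```python
def find_tokens(text, delimiters=" \t\n"):
    """Clump characters between delimiter characters (loosely like findTokens:)."""
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def dictionary_words(text, lexicon):
    """Greedy longest-match word spotting for text that has no delimiters."""
    words, i = [], 0
    longest = max(len(entry) for entry in lexicon)
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon or size == 1:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon and registry; real word breakers need far more.
TOY_LEXICON = {"東京", "に", "行く"}
WORD_BREAKERS = {
    "en": find_tokens,
    "ja": lambda text: dictionary_words(text, TOY_LEXICON),
}


def words(text, language):
    """Word spotting: dispatch to whichever breaker is plugged in for the language."""
    return WORD_BREAKERS[language](text)
```

With these definitions, find_tokens("東京に行く") returns the whole string as one token because there is nothing to split on, while words("東京に行く", "ja") returns three words. That is the gap between tokenizing and word-spotting.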
> > Other languages (Hebrew and Arabic) are bi-directional in
> > nature. The direction of consideration can change within
> > the string. For instance, Hebrew reads from right to left
> > but if you embed an Arabic-type number or western language
> > phrase in it then you read the digits or phrase
> > conventionally left to right. Tricky.
> Not sure why left-to-right or right-to-left matters with respect to
> tokenizing. As understood, tokens are simply a way to "clump"
> characters together, based upon the equivalence classes defined by a
> relation, say, isDelimiter, or less generally, the equivalence
> imputed by membership in an existing string (per findTokens:) If
> of characters follows:
hebr3 "This is a fine mess" hebr2 hebr1 hebr0
if you iterate through the tokens, in what order would you expect to
if you iterate through the words, is the order the same?
I'd argue that in the word case you want:
> Still, while all this raises lots of
> amusing issues, I don't see how it relates to tokenization.
> However it is stored (so long as strings are processed consistently, the
> delimiters are in the same place, and the result will still be tokenized
> either as ('MLCh' '123') or ('MLCh' '321')
> Of course, if the number '2' were a delimiter, the representation
> string would suddenly make a difference. But I'm not sure that this is a
> concern for the String class -- the non-semantic operation is still
> well-defined, and the details can always be worked out when
> are intended to be imputed, w.l.o.g., either in a subclass or in
> In short, I see nothing in the structure of Hebrew character
> that requires deviating from the more-or-less obvious tokenization
Well, hopefully you do now.
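
For what it is worth, the ordering question can be sketched in a few lines. This is a deliberately crude model, not the Unicode bidirectional algorithm: it assumes a right-to-left paragraph, treats each word as an atomic placeholder, and assumes (my reading of the numbering in the example above) that hebr0 is the first word a reader would read. The function name and the 'R'/'L' flags are made up for the illustration.

```python
from itertools import groupby


def display_order(words):
    """Map reading-order words to left-to-right screen order.

    words: list of (word, direction) pairs in reading order, where
    direction is 'R' (right-to-left script) or 'L' (left-to-right).
    Assumes the paragraph as a whole runs right to left.
    """
    # Group consecutive words that share a direction into runs.
    runs = [(d, [w for w, _ in grp])
            for d, grp in groupby(words, key=lambda pair: pair[1])]
    shown = []
    # In a right-to-left paragraph the runs appear in reverse order;
    # an embedded left-to-right run keeps its own internal word order.
    for direction, run in reversed(runs):
        shown.extend(run if direction == "L" else reversed(run))
    return shown


reading_order = [
    ("hebr0", "R"), ("hebr1", "R"), ("hebr2", "R"),
    ("This", "L"), ("is", "L"), ("a", "L"), ("fine", "L"), ("mess", "L"),
    ("hebr3", "R"),
]
print(" ".join(display_order(reading_order)))
# prints: hebr3 This is a fine mess hebr2 hebr1 hebr0
```

The point of the sketch is only that the two sequences differ: a routine that walks the characters in display order and one that walks them in reading order will hand back the same clumps in different orders.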
BTW, this also raises the issue of string rendering. How do you
select an appropriate font/glyph? I confess I know little about how
What we do here is use UTF-8/Unicode for internal storage and use
that as a pivot to convert among all the other encodings. You can
generally resolve the issues with Unicode code point conflicts by
then specifying a font with the correct glyphs for the language you
are trying to render. How that information might be associated with
a given String still has me thinking and so far I don't have a good
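
The pivot idea itself is simple enough to sketch. The following is an illustration in Python using its standard codecs, not the code we run here; the function name convert and the sample word are assumptions made for the example.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two encodings by pivoting through Unicode."""
    text = data.decode(source)   # legacy bytes -> Unicode (the pivot)
    return text.encode(target)   # Unicode -> whatever the consumer needs


# A Hebrew word written with escapes so the source stays plain ASCII.
shalom = "\u05e9\u05dc\u05d5\u05dd"

legacy = shalom.encode("iso8859_8")               # as it might arrive
as_utf8 = convert(legacy, "iso8859_8", "utf-8")   # for internal storage
as_cp1255 = convert(as_utf8, "utf-8", "cp1255")   # for another consumer
```

Every encoding then needs one converter to and from the pivot, rather than one converter for each pair of encodings. Note that this settles only the code points; it carries no information about which font or glyphs to use.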
eTranslate, Inc. The Power of Language
Todd Blanchard main +1.415.487.7850
Chief Technology Architect fax +1.415.371.0010
520 Third Street, Suite 505, San Francisco, California 94107, U.S.A.