Unicode support
Todd Blanchard
tblanchard at etranslate.com
Wed Sep 22 20:57:02 UTC 1999
> > Uh, some languages do not have delimiters between words at all.
>
> Neat. Hebrew, of course, is not one of them. (What languages apart from
> ideographic languages don't use delimiters?) Further, such a class doesn't
> need the notion of tokenization, by definition, I suppose, unless there are
> END-OF-TOKEN forms of letters, in which case the model of tokens suggested would
> not be applicable. They would, of course, resolve that issue in a subclass
> implementation that ignores the token parameter.
German has the lovely habit of running multiple words together to
make bigger and bigger words. You often want to navigate on the
constituent words in the superword. There are software text editors
that do this correctly. Some East Asian languages also have
different ways of busting up things into words that don't relate to
whitespace.
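
To make the compound-word case concrete, here is a rough sketch (in Python rather than Smalltalk, purely for illustration). The function name `split_compound` and the tiny lexicon are hypothetical, and real German compounds also use linking letters that this toy ignores. It only shows that finding the constituent words needs a word list, not a delimiter.

```python
def split_compound(word, lexicon):
    """Split `word` into known constituent words, or return None.

    Hypothetical helper: tries the longest known prefix first and
    backtracks if the remainder cannot be split. `lexicon` is assumed
    to hold lowercase simple words only.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):  # longest prefix first
        head = word[:end]
        if head in lexicon:
            if end == len(word):
                return [head]
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None  # no split found with this lexicon


# Toy lexicon, assumed for the example only.
lexicon = {"dampf", "schiff", "fahrt"}
parts = split_compound("Dampfschifffahrt", lexicon)
```

Note that `"Dampfschifffahrt".split()` gives back the whole word as a single token, which is exactly the problem.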
> > But rather than get pedantic about that - let's divorce "tokenization"
> > from the concept of word-spotting.
>
> Why? I think the point that was made here by others, and with which I agree, is that
> tokenization is an appropriate string operation, and semantic
> "word-spotting" is probably not.
I don't agree.
String is a mechanism for representing *language* and languages
typically have *words*. Tokens are something else - more
arbitrary. We got here because of this:
> A newbie recently asked how to compute the equivalent of:
>
> word 4 of line 7
>
> and
>
> set word 4 of line 7 to "foobar"
Which I think is certainly an appropriate operation for String.
OTOH, this operation cannot always be as simpleminded as delimiter-based
tokenization.
Although - by your definition of what a string is - perhaps that's
not appropriate.
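
For reference, the simpleminded delimiter-based version of the newbie's two operations might look like the sketch below (Python for illustration; `word_of_line` and `set_word_of_line` are made-up names, not anything in Squeak). It assumes words are separated by whitespace, which is the very assumption in question.

```python
def word_of_line(text, line_no, word_no):
    """Return word `word_no` of line `line_no` (both 1-based).

    Assumes words are delimited by whitespace.
    """
    line = text.splitlines()[line_no - 1]
    return line.split()[word_no - 1]


def set_word_of_line(text, line_no, word_no, new_word):
    """Return a copy of `text` with one word replaced.

    Caveat: runs of whitespace on the edited line collapse to single
    spaces, because the line is rebuilt from its tokens.
    """
    lines = text.splitlines()
    words = lines[line_no - 1].split()
    words[word_no - 1] = new_word
    lines[line_no - 1] = " ".join(words)
    return "\n".join(lines)


sample = "one two three four\nalpha beta gamma delta"
fourth = word_of_line(sample, 2, 4)
edited = set_word_of_line(sample, 1, 4, "foobar")
```

This works for the English sample, but a line written without spaces between words comes back as a single "word", so `word 4` simply does not exist.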
>
> > hebr3 "This is a fine mess" hebr2 hebr1 hebr0
> > if you iterate through the tokens, in what order would you expect to get
> > tokens?
>
> You are imposing a particular sequencing of information onto the string based
> upon the semantics of an underlying language, and then asking me to describe the
> tokens that are derived. My suggestion is that if you want the tokens to be
> semantically meaningful, then your program (or a subclass of string) must first
> organize the sequence of characters so that the purely mechanical,
> non-semantic sequencing will yield a meaningful result. Thus, the answer is
> this: they would be the tokens, read left to right or right to left in sequence, as
> defined by the delimiters.
Which is probably useless for anything but single-direction languages.
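
A small sketch of why (Python for illustration; the helper names are hypothetical). Delimiter-based splitting hands back tokens in storage order and knows nothing about direction. Here each token is labelled by its first strongly directional character using `unicodedata.bidirectional`, which is a crude stand-in for the real Unicode bidirectional algorithm, not an implementation of it.

```python
import unicodedata


def token_direction(token):
    """Classify a token by its first strongly directional character."""
    for ch in token:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):  # Hebrew, Arabic letters
            return "RTL"
        if cls == "L":          # Latin and other left-to-right letters
            return "LTR"
    return "neutral"            # digits, punctuation, etc.


def tokens_with_direction(text):
    """Split on whitespace (storage order) and label each token."""
    return [(tok, token_direction(tok)) for tok in text.split()]


# Two Hebrew tokens around an English phrase, in storage order.
mixed = "\u05d0\u05d1 fine mess \u05d2\u05d3"
labelled = tokens_with_direction(mixed)
```

The token list alternates direction, so neither "left to right" nor "right to left" describes the order a reader would see them in.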
> It is not the duty of the String object to understand the semantics of the
> underlying language in which characters are represented, but only to provide
> underlying operations in which most reasonable operations (including
> semantics-based operations) might be accomplished.
Well, which is it? String is either a class for representing chunks
of languages, or it's a mechanism for representing arrays of
characters (whatever those are - a whole other topic).
I think String is implemented as the latter but used as the former,
and we English speakers are lucky in that these just happen to
coincide. Unfortunately, that coincidence is a fluke of English
and not something you can rely on globally.
--
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
eTranslate, Inc. The Power of Language
Todd Blanchard main +1.415.487.7850
Chief Technology Architect fax +1.415.371.0010
http://www.etranslate.com/
520 Third Street, Suite 505, San Francisco, California 94107, U.S.A.
More information about the Squeak-dev
mailing list