Unicode support
Todd Blanchard
tblanchard at etranslate.com
Wed Sep 22 17:46:41 UTC 1999
> From: agree at carltonfields.com
> > -----Original Message-----
> > From: MIME :rowledge at interval.com
> > Sent: Wednesday, September 22, 1999 12:01 PM
> > To: squeak at cs.uiuc.edu
> > Subject: Re: Unicode support
> > > > On Tue 21 Sep, Andrew C. Greenberg wrote:
> > > > A newbie recently asked how to compute the equivalent of:
> > > > word 4 of line 7
> > > > and
> > > > set word 4 of line 7 to "foobar"
> > I haven't been tracking this discussion too thoroughly, but the above point
> > filtered through even though I haven't yet been caffeinated this morning.
> > My claim is that the above sorts of action have nothing to do with String.
> > Strings do not have lines, nor even words. Paragraphs (or just maybe
> > FormattedSentences) have lines and words. A String is just a long list of
> > characters. Linebreaks, words etc only have meaning once the string is
> > formatted as part of a larger document-like concept.
>
> This makes sense to me. However, there exist certain underlying operations that
> may be performed on strings to facilitate such computations that may well be
> string-like. I raised this example to investigate what those underlying
> operations are or should be (beyond the obvious single-character and substring
> reads and writes). Should indexing be a part of the protocol? Searching? (that
> is, beyond the general collection facilities)? How about tokenizing with
> respect to certain delimiters (or predicates) and related operations?
>
> While I agree that "words" per se, are a semantic or syntactic notion not inherent
> in the mere linear aggregation of characters; perhaps less structure-imposing
> operations, such as the tokenizing operations are appropriate?
>
Well, maybe. Tokenizing is harder than you might think. Some
languages do not use whitespace as a word delimiter. Other languages
(Hebrew and Arabic) are bidirectional in nature: the reading
direction can change within the string. For instance, Hebrew
reads from right to left, but if you embed Arabic numerals or a
Western-language phrase in it, then you read the digits or phrase
conventionally, left to right. Tricky.
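
To make both problems concrete, here is a rough sketch in Python (not
Squeak, and not a proposal for the String protocol). The names tokenize,
coarse_direction and direction_runs are made up for illustration. The
direction grouping is a deliberate simplification and is NOT the Unicode
Bidirectional Algorithm: it only looks at each character's bidi class,
treats digits as left-to-right, and glues neutral characters such as
spaces onto the preceding run.

```python
import unicodedata
from itertools import groupby


def tokenize(text, is_delimiter=str.isspace):
    """Split text into maximal runs of characters that are not delimiters.

    is_delimiter is a one-character predicate (hypothetical protocol).
    """
    return [
        "".join(run)
        for is_delim, run in groupby(text, key=is_delimiter)
        if not is_delim
    ]


def coarse_direction(ch):
    """Map a character's Unicode bidi class to 'RTL', 'LTR', or None (neutral)."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):          # Hebrew letters, Arabic letters
        return "RTL"
    if bidi in ("L", "EN", "AN"):    # Latin letters and digits (simplified)
        return "LTR"
    return None                      # whitespace, punctuation, and so on


def direction_runs(text):
    """Group text (in logical order) into runs of one coarse direction."""
    runs = []
    for ch in text:
        d = coarse_direction(ch)
        if runs and (d is None or d == runs[-1][0]):
            runs[-1][1] += ch        # neutral or same direction: extend run
        else:
            runs.append([d, ch])     # direction changed: start a new run
    return [(d, s) for d, s in runs]


# Whitespace tokenizing works for the original example...
print(tokenize("word 4 of line 7"))
# ...but a sentence written without spaces comes back as one "word".
print(tokenize("\u6211\u559c\u6b22\u7f16\u7a0b"))
# Hebrew text followed by digits and an English word, in logical order.
print(direction_runs("\u05e9\u05dc\u05d5\u05dd 123 hello"))
```

The first call splits into five tokens, the second returns the whole
Chinese sentence as a single token, and the third reports one
right-to-left run followed by one left-to-right run inside the same
string. A delimiter predicate alone handles neither of the last two cases.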
--
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
eTranslate, Inc. The Power of Language
Todd Blanchard main +1.415.487.7850
Chief Technology Architect fax +1.415.371.0010
http://www.etranslate.com/
520 Third Street, Suite 505, San Francisco, California 94107, U.S.A.