Unicode support

Todd Blanchard tblanchard at etranslate.com
Wed Sep 22 17:46:41 UTC 1999


> From: agree at carltonfields.com
> > -----Original Message-----
> > From: MIME :rowledge at interval.com
> > Sent: Wednesday, September 22, 1999 12:01 PM
> > To: squeak at cs.uiuc.edu
> > Subject: Re: Unicode support
> > > > On Tue 21 Sep, Andrew C. Greenberg wrote:
> > > > A newbie recently asked how to compute the equivalent of:
> > > > 	word 4 of line 7
> > > > and
> > > > 	set word 4 of line 7 to "foobar"
> > I haven't been tracking this discussion too thoroughly, but the above point
> > filtered through even though I haven't yet been caffeinated this morning.
> > My claim is that the above sorts of action have nothing to do with String.
> > Strings do not have lines, nor even words. Paragraphs (or just maybe
> > FormattedSentences) have lines and words. A String is just a long list of
> > characters. Linebreaks, words etc only have meaning once the string is
> > formatted as part of a larger document-like concept.
>
> This makes sense to me.  However, there exist certain underlying operations that
> may be performed on strings to facilitate such computations that may well be
> string-like.  I raised this example to investigate what those underlying
> operations are or should be (beyond the obvious single-character and substring
> reads and writes).  Should indexing be a part of the protocol?  Searching?  (that
> is, beyond the general collection facilities)?  How about tokenizing with
> respect to certain delimiters (or predicates) and related operations?
>
> While I agree that "words" per se, are a semantic or syntactic notion not inherent
> in the mere linear aggregation of characters; perhaps less structure-imposing
> operations, such as the tokenizing operations are appropriate?
>
Well, maybe.  Tokenizing is harder than you might think.  Some
languages do not use whitespace as word delimiters.  Other languages
(Hebrew and Arabic) are bidirectional in nature: the reading
direction can change within the string.  For instance, Hebrew
reads from right to left, but if you embed an Arabic-type number or a
Western-language phrase in it, then you read the digits or phrase
conventionally left to right.  Tricky.
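
To make both problems concrete, here is a rough sketch in Python (not
Squeak, and not anyone's proposed String protocol).  The names tokenize
and direction_runs are made up for illustration.  The direction grouping
only looks at each character's Unicode bidi class and is nowhere near the
full Unicode Bidirectional Algorithm; treat it as an assumption-laden toy.

```python
import unicodedata


def tokenize(text, is_delimiter=str.isspace):
    """Split text into maximal runs of non-delimiter characters.

    is_delimiter is a one-character predicate (hypothetical protocol).
    """
    tokens, current = [], []
    for ch in text:
        if is_delimiter(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def direction_runs(text):
    """Group characters into coarse left-to-right / right-to-left runs.

    Assumption: bidi classes R and AL read right to left; L, EN and AN
    read left to right; anything else joins the run it follows.
    """
    runs = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            direction = "rtl"
        elif bidi in ("L", "EN", "AN"):
            direction = "ltr"
        else:
            direction = runs[-1][0] if runs else "neutral"
        if runs and runs[-1][0] == direction:
            runs[-1][1].append(ch)
        else:
            runs.append([direction, [ch]])
    return [(d, "".join(chars)) for d, chars in runs]


# Whitespace works for English...
english = tokenize("word 4 of line 7")

# ...but a Japanese phrase with no spaces comes back as a single token.
japanese = tokenize("\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8")

# A Hebrew word followed by digits and a Latin word changes direction
# partway through the same string.
mixed = direction_runs("\u05e9\u05dc\u05d5\u05dd 1999 Squeak")
```

With the default predicate the English example splits into five tokens,
while the Japanese one stays in one piece; the mixed string comes back as
a right-to-left run followed by a left-to-right run.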


--
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
eTranslate, Inc.                                    The Power of Language 
Todd Blanchard                                  main +1.415.487.7850
Chief Technology Architect                      fax +1.415.371.0010
http://www.etranslate.com/
520 Third Street, Suite 505,      San Francisco, California 94107, U.S.A.




