Unicode support

agree at carltonfields.com
Wed Sep 22 20:32:08 UTC 1999


> > I think the point that was made here by others, and with which I agree, is that
> > tokenization is an appropriate string operation, and semantic
> > "word-spotting" is probably not.
>
> I don't agree.
> String is a mechanism for representing *language* and languages typically have *words*.

This is simply a tautology.  It neither proves the point, nor explains why you disagree.

I agree that tokens are more arbitrary than words.  As you have pointed out, words may be an intractable problem, and certainly require special casing based on LANGUAGE (as opposed to character sets).

In my view, we should BEGIN with the more tractable problem, one that is readily implementable, and which admits efficient implementations.  Provided that it offers adequate functionality that can be used to build language and word stuff, great.  I'd like to have a class with a single doWhatIMean operation -- but that's not what we seem to be negotiating.

In my view, strings can do many things, perhaps represent language as well, but what is being proposed here seems to me more an application USING strings than part of the essence of what the string data type is or should be.

> Tokens are something else - more arbitrary.

Granted that tokens are different from words.  That was my point.

> We got here because of this:
>
> > A newbie recently asked how to compute the equivalent of:
> >
> > 	word 4 of line 7
> >
> > and
> >
> > 	set word 4 of line 7 to "foobar"
>
> Which I think is certainly an appropriate operation for String.

There were others who disagreed with this proposition, suggesting that it would be better to focus on simpler operations from which a "word" processor can be built.  I agree with them, for reasons set forth in the preceding messages.
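
To make the layering concrete, here is a minimal sketch of how "word 4 of line 7" might be built on top of a plain delimiter-based token operation.  It is written in Python purely for illustration; it is not Squeak code, and the names tokens, word_of_line and set_word_of_line are hypothetical.  It assumes that lines are separated by newlines, that a "word" is simply a run of characters between spaces or tabs, and that indices are 1-based.  That definition of "word" is exactly the simplification under discussion, so this shows only the easy case.

```python
def tokens(text, delimiters=" \t"):
    """Split text on any delimiter character, dropping empty tokens."""
    result, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        result.append("".join(current))
    return result


def word_of_line(text, word_no, line_no):
    """Return 'word word_no of line line_no' (both 1-based)."""
    line = text.split("\n")[line_no - 1]
    return tokens(line)[word_no - 1]


def set_word_of_line(text, word_no, line_no, replacement):
    """Return a copy of text with the given word replaced.

    Simplification: the edited line is rejoined with single spaces,
    so its original spacing is not preserved.
    """
    lines = text.split("\n")
    words = tokens(lines[line_no - 1])
    words[word_no - 1] = replacement
    lines[line_no - 1] = " ".join(words)
    return "\n".join(lines)
```

A language-aware notion of "word" could, under the same assumptions, be supplied as a different tokens function without touching the two operations built on top of it.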

> OTOH, this operation cannot always be as simpleminded as delimiter-based tokenization.

No one said that it could.

> Although - by your definition of what a string is - perhaps that's not appropriate.

I haven't really defined what a string is -- that's what we seem to be trying to reach consensus upon.  I note, with interest, that the present String class does not have a meaningful semantic "word" operation, even for English.  I take this to be probative evidence in favor of my position.


> Which is probably useless for anything but single-direction languages.

Again, I disagree, for reasons previously stated.  Agreed that such operations may not, without additional processing, provide trivial answers for non-trivial problems.  So what?  This criticism is true for every class ever written.

> > It is not the duty of the String object to understand the semantics of the
> > underlying language in which characters are represented, but only to provide
> > underlying operations in which most reasonable operations (including
> > semantics-based operations) might be accomplished.
>
> Well which is it?  String is either a class for representing chunks of languages, or it's a mechanism for representing arrays of characters (whatever those are - a whole other topic).

I disagree with this binary choice.  It is a straw man argument.  However, if you made me choose, I'd be more comfortable with the latter.  My answer, had a fairer question been asked, is that it is more like the latter, but with additional features and constraints, as described in earlier postings.

> I think string is implemented as the latter but used as the former and we English speakers are lucky in that these just happen to coincide.  Unfortunately, the coincidence is a rather lucky fluke with English and not something you can rely on globally.

Here, I disagree again.  The token operation is of general utility.  While it may not be capable of picking out words from a string of characters encoded as you would have it encoded, this does not mean that the code, together with other code, could not be used effectively to obtain your desired result.  Even if it couldn't be done, so what?  That would not make the token (or indexing) operations useless for other purposes.  To suggest that Strings have no utility because they don't do what you desire seems overblown.  A package that does what is suggested seems to me both too much and, perhaps, not enough for the myriad other purposes for which Strings are used in connection with a computer.





More information about the Squeak-dev mailing list