Unicode support

Wed Sep 22 18:13:58 UTC 1999

> Uh, some languages do not have delimiters between words at all.

Neat.  Hebrew, of course, is not one of them.  (What languages, apart from ideographic ones, don't use delimiters?)  Further, a string class for such a language doesn't need the notion of tokenization, by definition, I suppose, unless there are END-OF-TOKEN forms of letters, in which case the suggested token model would not apply.  Its implementers would, of course, resolve that issue in a subclass that ignores the token parameter.

> But rather than get pedantic about that - lets divorce "tokenization"
> from the concept of word-spotting.  

Why?  I think the point that was made here by others, and with which I agree, is that tokenization is an appropriate string operation, and semantic "word-spotting" is probably not.

> hebr3 "This is a fine mess" hebr2 hebr1 hebr0
>
> if you iterate through the tokens, in what order would you expect to
> get tokens?

You are imposing a particular sequencing of information onto the string based upon the semantics of an underlying language, and then asking me to describe the tokens that are derived.  My suggestion is that if you want the tokens to be semantically meaningful, then your program (or a subclass of string) must first organize the sequence of characters so that the purely mechanical, non-semantic sequencing will yield a meaningful result.  Thus, the answer is this: they would be the tokens, read left to right or right to left in sequence, as defined by the delimiters.
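To make "purely mechanical" concrete, here is a minimal sketch in Python rather than Smalltalk. The function name `tokens` and the sample string are my own illustrations, not anything from an existing String protocol. It splits on delimiters in storage order and knows nothing about the script or language of the characters.

```python
def tokens(text, delimiters=" "):
    """Split text on delimiter characters, in storage order.

    Purely mechanical: no knowledge of script, direction, or language.
    Runs of delimiters produce no empty tokens.
    """
    result, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:                      # close the token in progress
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # trailing token, if any
        result.append("".join(current))
    return result


# Hypothetical sample: Hebrew letters stored around an English phrase.
sample = "\u05d2\u05d2 This is a fine mess \u05d1\u05d1 \u05d0\u05d0"

forward = tokens(sample)            # first stored token first
backward = list(reversed(forward))  # same tokens, last stored token first
```

Either direction gives the same tokens; only the order of traversal changes. If a program wants the order to follow the reading order of a particular language, it would rearrange the characters (or the tokens) itself before or after this step.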

It is not the duty of the String object to understand the semantics of the underlying language in which characters are represented, but only to provide primitive operations from which most reasonable operations (including semantics-based operations) might be built.

Smalltalk does not presently provide this functionality for me in English with ASCIIStrings, so I am not particularly concerned with providing it in Hebrew with GeneralStrings.  To do so would unduly burden the abstract class, and would raise some interesting questions -- the same alphabet can be used for several languages, so which semantics do you use?

> > In short, I see nothing in the structure of Hebrew character strings
> > anything that requires deviating from the more-or-less obvious
> > tokenization protocols.
>
> Well, hopefully you do now.

Sorry, I don't see it.  Could you try again?

> BTW, this also raises the issues of string rendering.  How do you
> select an appropriate font/glyph?  I confess I know little about how
> that works.

What does this have to do with Strings?

> What we do here is use utf-8/unicode for internal storage and use
> that as a pivot to convert among all the other encodings.  You can
> generally resolve the issues with unicode code point conflicts by
> then specifying a font with the correct glyphs for the language you
> are trying to render.  How that information might be associated with
> a given String still has me thinking and so far I don't have a good
> solution.

And you can continue to do that with GeneralStrings, or define yourself a subclass that does this automatically for you.
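For what it's worth, the pivot approach might look roughly like the sketch below, in Python rather than Smalltalk. `convert` and `PivotString` are hypothetical names of mine, and I use a wrapper class rather than a true String subclass to keep the illustration short. It covers only storage and conversion, and says nothing about fonts or glyphs.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another via Unicode."""
    text = data.decode(source)   # source bytes -> Unicode code points (the pivot)
    return text.encode(target)   # Unicode code points -> target bytes


class PivotString:
    """Hypothetical wrapper: stores UTF-8 internally, converts on request."""

    def __init__(self, data: bytes, encoding: str):
        # Normalize whatever came in to UTF-8 for internal storage.
        self._utf8 = convert(data, encoding, "utf-8")

    def as_encoding(self, encoding: str) -> bytes:
        # Convert from the internal UTF-8 form to the requested encoding.
        return convert(self._utf8, "utf-8", encoding)


# Usage: a Hebrew word arrives as ISO-8859-8 and is handed on as Windows-1255.
word = "\u05e9\u05dc\u05d5\u05dd"
s = PivotString(word.encode("iso8859_8"), "iso8859_8")
as_cp1255 = s.as_encoding("cp1255")
```

The caller never handles the intermediate form; the class does the conversion automatically.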