[squeak-dev] #isBreakableAt:in:

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Thu Sep 26 20:42:37 UTC 2013


At Thu, 26 Sep 2013 22:37:04 +0200,
Nicolas Cellier wrote:
> 
> A Character codePoint contains both
> - a charCode
> - a language tag (so called #leadingChar)
> 
> The leadingChar can encode either a CharacterSet, or a LanguageEnvironment
> (see EncodedCharSet initialize).
> The CharacterSet tells how to interpret the charCode (whether 16r41 encodes
> a capital A or something else).
> 
> All this is complex, and has strange side effects, because a letter A in a
> given char set could be different from a character A in another char set
> (they don't have the same leadingChar, and possibly not the same charCode,
> though maybe not true for A since most encodings are supersets of ASCII)...
> With Unicode (ISO 10646) we can have a canonical (hem, almost) encoding for
> all languages, so all this is getting a bit obsolete, except for East
> Asian languages for historical reasons.

It is not so much "historical reasons": the leadingChar concept (again,
borrowed from Emacs) meets a practical need.  The idea of encoding the
tag in the character objects themselves should be retired, but some
applications (for example, a Chinese-Japanese dictionary) still need to
be able to distinguish unified characters.
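
To make the tag/charCode split concrete, here is a rough sketch in
Python rather than Smalltalk.  It is not the actual Squeak code: the
22-bit width of the charCode field and the two tag numbers are
assumptions chosen for illustration, and every function name is made
up.  It only shows that two code points can share a unified charCode
and still be told apart by their tag.

```python
# Illustrative sketch only. The 22-bit split and the tag numbers below are
# assumptions for this example, not values taken from the Squeak image.
CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1

TAG_JAPANESE = 5  # hypothetical language tag
TAG_CHINESE = 6   # hypothetical language tag


def make_code_point(leading_char, char_code):
    """Pack a language tag and a charCode into a single integer."""
    if not 0 <= char_code <= CHAR_CODE_MASK:
        raise ValueError("char_code does not fit in the charCode field")
    if not 0 <= leading_char <= 0xFF:
        raise ValueError("leading_char does not fit in 8 bits")
    return (leading_char << CHAR_CODE_BITS) | char_code


def leading_char(code_point):
    """Extract the language tag (the high bits)."""
    return code_point >> CHAR_CODE_BITS


def char_code(code_point):
    """Extract the charCode (the low bits)."""
    return code_point & CHAR_CODE_MASK


def same_unified_char(a, b):
    """True if both code points carry the same charCode, ignoring the tag."""
    return char_code(a) == char_code(b)


# U+76F4 is a Han-unified ideograph that is drawn differently in
# Japanese and Chinese typography.
ja = make_code_point(TAG_JAPANESE, 0x76F4)
zh = make_code_point(TAG_CHINESE, 0x76F4)

print(ja == zh)                   # tagged code points differ
print(same_unified_char(ja, zh))  # but the unified charCode is shared
```

With the tag stored outside the character (say, as a text attribute),
the same comparison would be made on the attribute instead of on the
high bits of the code point.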

-- Yoshiki

