[squeak-dev] #isBreakableAt:in:

Thu Sep 26 20:37:04 UTC 2013

A Character codePoint contains both
- a charCode
- a language tag (so called #leadingChar)

The leadingChar can encode either a CharacterSet, or a LanguageEnvironment
(see EncodedCharSet initialize).
The CharacterSet tells how to interpret the charCode (whether 16r41 encodes
a capital A or something else).
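
To make the packing concrete, here is a rough Python sketch (not Squeak code). It assumes the charCode sits in the low 22 bits of the codePoint and the leadingChar in the bits above it. That layout, the function names, and the tag value 5 are assumptions for illustration only, so check the Character class in your own image before relying on them.

```python
# Sketch only: ASSUMES a 22-bit charCode with the leadingChar stored above it.
CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1  # 0x3FFFFF


def make_code_point(leading_char, char_code):
    """Pack a language tag and a char code into one integer."""
    if not 0 <= char_code <= CHAR_CODE_MASK:
        raise ValueError("charCode does not fit in the assumed 22 bits")
    if not 0 <= leading_char <= 0xFF:
        raise ValueError("leadingChar is assumed to fit in 8 bits")
    return (leading_char << CHAR_CODE_BITS) | char_code


def char_code(code_point):
    """Low bits: the code within the character set."""
    return code_point & CHAR_CODE_MASK


def leading_char(code_point):
    """High bits: the language tag."""
    return code_point >> CHAR_CODE_BITS


plain_a = make_code_point(0, 0x41)   # tag 0, code 16r41
tagged_a = make_code_point(5, 0x41)  # 5 is an arbitrary, hypothetical tag
```

Both values carry the same charCode, but they compare as different integers because the tag differs. That is the kind of side effect you get when the same letter turns up under two different leadingChars.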

All this is complex and has strange side effects, because a letter A in a
given char set can differ from a letter A in another char set (they don't
have the same leadingChar, and possibly not the same charCode, though that
is probably not the case for A since most encodings are supersets of ASCII).
With Unicode (ISO 10646) we have a canonical (ahem, almost) encoding for
all languages, so all this is becoming a bit obsolete, except for East
Asian languages, for historical reasons.

I've tried to generalize the use of Unicode in the image, except for
East Asian environments.

The Latin-1 character set is a subset of Unicode (it matches the first 256
code points), so with the move to Unicode it is effectively obsolescent.
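
The subset relation is easy to check outside the image. A small Python sketch, using only the standard "latin-1" codec (nothing here is Squeak API):

```python
# Every Latin-1 byte value decodes to the Unicode code point of the same number.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# Going the other way, a character below 256 encodes to a single byte
# whose value is its code point.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
encoded = e_acute.encode("latin-1")
```

Here code_points is simply 0 through 255 in order, and encoded is the single byte 16rE9.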

2013/9/26 tim Rowledge <tim at rowledge.org>

>
> On 26-09-2013, at 10:19 AM, tim Rowledge <tim at rowledge.org> wrote:
>
> >
> > On 26-09-2013, at 7:14 AM, Bob Arning <arning315 at comcast.net> wrote:
> >
> >> Well, something is a little wrong
> >
> > I rather thought so. I'll use your StringHolder to work out something.
> Actually I reckon a quick hack to add #space and simply use
> #registerBreakableIndex should be good start.
>
>
> Well, that wasn't much fun.
>
> The current implementations of registerBreakableIndex and crossedX are
> nastily intertwined with assumptions about how they are used in such a way
> that I suspect laws of nature are being broken. Certainly I'm not going to
> spend any more time today trying to work out WTF is going on.
>
> So I've returned the use of isBreakableAt:in:in: & registerBreakableIndex
> to their previous status and it no longer makes nasty with widestrings and
> wrapping.
>
> It raises more questions (still lots from previous message unanswered
> folks!)-
> EncodeCharSets - there are several commented out in EncodeCharSet
> class>initialise Why?
> Why is Unicode also commented as 'Latin1Environment'?
> What is Latin2Environment?
> Why is there a separate Latin1 class?
> Why are there mixed up encodedcharset classes and language environment
> classes?
>
>
> tim
> --
> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> Oxymorons: Clearly misunderstood