[squeak-dev] #isBreakableAt:in:

tim Rowledge tim at rowledge.org
Thu Sep 26 23:20:59 UTC 2013


On 26-09-2013, at 1:37 PM, Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com> wrote:

> A Character codePoint contains both
> - a charCode
> - a language tag (so called #leadingChar)

I see the use of bits 22-30 (I think) as leadingChar in Character. Is it correct to say that any character extracted from a ByteString will actually be just the 8-bit value, or am I missing some devious encryption somewhere? If all ByteStrings contain only 8-bit-valued characters, then one would be safe in expecting only the character set with leadingChar 0 (aka Latin1Environment). Since that is simple and comprehensible, I have to anticipate that it isn't like that. Life would be too simple.
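
To make the question concrete, here is a minimal sketch of the layout being described, written in Python rather than Smalltalk. It assumes the charCode occupies the low 22 bits and the leadingChar is an 8-bit field starting at bit 22; the exact field width is an assumption (the bit range above is itself only a guess), and the function names are hypothetical, not Squeak selectors.

```python
# Assumed layout (not confirmed by the thread): low 22 bits hold the
# charCode, the next 8 bits hold the leadingChar (language tag).
CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1      # 0x3FFFFF
LEADING_CHAR_MASK = 0xFF << CHAR_CODE_BITS      # 0x3FC00000


def pack(leading_char, char_code):
    """Combine a leadingChar and a charCode into one integer value."""
    if not 0 <= leading_char <= 0xFF:
        raise ValueError("leading_char must fit in 8 bits")
    if not 0 <= char_code <= CHAR_CODE_MASK:
        raise ValueError("char_code must fit in 22 bits")
    return (leading_char << CHAR_CODE_BITS) | char_code


def leading_char_of(value):
    """Extract the language tag from a packed value."""
    return (value & LEADING_CHAR_MASK) >> CHAR_CODE_BITS


def char_code_of(value):
    """Extract the code within the tagged character set."""
    return value & CHAR_CODE_MASK
```

Under this layout any value that fits in 8 bits necessarily has a leadingChar of 0, which is the "life is simple" case for a ByteString: `leading_char_of(b)` is 0 for every `b` in `range(256)`.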

So far as I can work out from the code, we are very much assuming that a ByteString is simply ASCII-encoded (see basicScanCharactersFrom… etc). I'd love some assurance that life is simple.

> 
> The leadingChar can encode either a CharacterSet, or a LanguageEnvironment (see EncodedCharSet initialize).

Why? Why on earth would we make it that way? That seems crazy.

> The CharacterSet tells how to interpret the charCode (whether 16r41 encodes a capital A or something else).

Yeah, got that part. The interesting followup question is why, in #scanMultiCharactersFrom:to:in:rightX:stopConditions:kern:, did we 
a) find the encoding
b) check that it is the same as we started with (I see the point to the endOfRun if it changes)
c) insist that encoding == 0 before testing the stops (and I see that your latest suggested changes drop that)
d) ignore the encoding in favour of Latin1Environment when sending isBreakableAt… ?

I can't see any good reason for it but obviously someone did at some point. That clearly means I may be missing something important.


tim
--
tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
C++ is history repeated as tragedy. Java is history repeated as farce.
