[squeak-dev] leadingChar proposal

Andreas Raab andreas.raab at gmx.de
Fri Aug 28 04:09:48 UTC 2009


Folks -

I think it's time to do something about the leadingChar in Characters 
that has been on the TODO list for a while. I have been looking over 
this stuff for some time now, fixing things here and there and laying
some of the groundwork for the things to come.

Here is the good news: Squeak doesn't need the leadingChar any longer. 
If you are running an updated trunk image you can run entirely without 
the leadingChar being used, and I've done this for about a week now with 
no ill side effects (disclaimer: I haven't been using much of the m17n
support, so there may still be breakage, but at least it won't explode
in your face straightaway). If you would like to try it yourself, all
you need to do is to hack Character>>setValue: to say, e.g.,

	value := newValue bitClear: 16r3FC00000.

and you're good (and won't ever see a leadingChar). However, removing
the leadingChar also gives us the chance to do a couple of other things,
which I would like to discuss and solicit feedback on, in particular
from the folks who care about the leadingChar.
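
To make the mask in the setValue: hack above concrete, here is a small
workspace sketch. It uses only Integer arithmetic and no Character
methods. The tag value 255 and the variable names are chosen purely for
illustration; the sketch assumes nothing beyond the tag living in the
bits that the mask covers.

```smalltalk
"Workspace sketch: what 16r3FC00000 selects and what bitClear: removes."
| mask tag tagged |
mask := 16r3FC00000.                  "bits 22..29, the mask from the hack above"
tag := 255.                           "example tag value, for illustration only"
tagged := (tag bitShift: 22) + 353.   "tag placed in the bits the mask covers"
(tagged bitAnd: mask) bitShift: -22.  "recovers the tag: 255"
tagged bitClear: mask.                "tag stripped, leaving the plain code 353"
```

Evaluating the last two lines with "print it" shows that the low 22 bits
carry the character code and that clearing the mask leaves only that code.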

The main insight is that although we *can* run without the leadingChar, 
it doesn't mean we *have* to. As it stands, the leading char is used for 
two purposes: Character set selection (EncodedCharSet) and (parts of) 
language support. There is a significant amount of confusion between the
two, e.g., Latin1Environment and Latin2Environment are subclasses of
LanguageEnvironment (although these are character encodings, not
languages).

What I would propose here is to redefine "leadingChar = 0", which
currently means "Latin1 encoding, language neutral", to mean "Unicode
encoding, language neutral". The effect is that "Character value: 353"
and "Unicode value: 353" become the same if the environment is
considered language neutral, which by default it would be.
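
As a rough way to check this in a workspace, one could compare the two
expressions directly. This is only a sketch: it assumes that Character
responds to leadingChar in your image, and the comments describe what
the proposal intends, not what any particular image answers today.

```smalltalk
"Workspace sketch: the two ways of creating character 353 from the text."
| a b |
a := Character value: 353.
b := Unicode value: 353.
a = b.            "under the proposal: true in a language-neutral environment"
a leadingChar.    "under the proposal: 0, i.e. Unicode, language neutral"
b leadingChar.    "under the proposal: also 0 unless a language tag is in use"
```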

Everything except the environments that care about the connotations of
the language tag should be able to work with this definition without any
change whatsoever. The only things that change are that the default
LanguageEnvironment is Unicode based, using leadingChar=0; that most of
the subclasses go away (being replaced by the default
LanguageEnvironment); and that for those we care about, or that need a
transition plan (i.e., the CJK languages), we keep using the language
tag for the time being.

That means that *if* you set your language environment to be one of the 
CJK languages you get a language tag in your strings, but by default the 
language neutral environment will produce "plain Unicode", which should
make the server/seaside/aida people a lot happier when dealing with
this stuff.

For the CJK languages (or other languages requiring support that has
so far been expressed via the language tag) we can use this opportunity
to phase out the language tag in favor of text attributes (which would
have to be written first).

The main advantage of the proposal is that the people who would like to 
use plain Unicode get to use it, and the people who care about the 
language tag and its consequences can still use that as well.

How does that sound?

Cheers,
   - Andreas


