[squeak-dev] leadingChar proposal
Andreas Raab
andreas.raab at gmx.de
Fri Aug 28 04:09:48 UTC 2009
Folks -
I think it's time to do something about the leadingChar in Characters
that has been on the TODO list for a while. I have been looking over
this stuff for some time now, fixing things here and there and laying
some of the ground work for the things to come.
Here is the good news: Squeak doesn't need the leadingChar any longer.
If you are running an updated trunk image you can run entirely without
the leadingChar being used, and I've done this for about a week now with
no ill side effects (disclaimer: I haven't been using very much of m17n
support stuff so there may still be breakage but it means it won't
explode in your face straightaway). If you would like to try yourself,
all you need to do is to hack Character>>setValue: to say, e.g.,
value := newValue bitClear: 16r3FC00000.
and you're good (and won't ever see a leadingChar). However, removing
the leadingChar also opens up a couple of other possibilities that I
would like to discuss, and I would like feedback in particular from the
folks who care about the leadingChar.
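To make the mask above concrete, here is a minimal workspace sketch using plain integers only. It assumes nothing beyond what the mask itself implies: 16r3FC00000 covers the eight bits from bit 22 up, so the code point sits in the 22 bits below. The tag value 5 and the variable names are arbitrary and chosen only for illustration.

```smalltalk
"Workspace sketch: a leadingChar and a code point packed into one integer.
 The tag value 5 is arbitrary and used only for illustration."
| leadingChar code tagged stripped |
leadingChar := 5.
code := 16r3042.    "HIRAGANA LETTER A"
tagged := (leadingChar bitShift: 22) bitOr: code.
stripped := tagged bitClear: 16r3FC00000.
(tagged bitAnd: 16r3FC00000) bitShift: -22.    "recovers the tag: 5"
stripped = code.    "true: only the code point remains"
```

Evaluate the last two lines with "print it" to see the results.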
The main insight is that although we *can* run without the leadingChar,
it doesn't mean we *have* to. As it stands, the leading char is used for
two purposes: Character set selection (EncodedCharSet) and (parts of)
language support. There is a significant amount of confusion between the
two, with Latin1/Latin2Environment being subclasses of LanguageEnvironment
(although these are character encodings, not languages).
What I would propose to do here is to redefine "leadingChar = 0", which
currently means "Latin1 encoding, language neutral", to mean "Unicode
encoding, language neutral". The effect is that "Character value: 353"
and "Unicode value: 353" become the same if the environment is
considered language neutral, which by default it would be.
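As a rough illustration of that difference, the sketch below compares the two integer values for code point 353. The Unicode tag of 255 is an assumption made for this example, not something stated in the proposal (check Unicode class>>leadingChar in your image), and the variable names are hypothetical.

```smalltalk
"Sketch only. unicodeTag := 255 is an assumption, not part of the proposal."
| unicodeTag today proposed |
unicodeTag := 255.
today := (unicodeTag bitShift: 22) bitOr: 353.    "tagged value for code point 353"
proposed := 353.    "language neutral: leadingChar = 0"
today = proposed.    "false: the tag makes the values differ"
(today bitAnd: 16r3FFFFF) = proposed.    "true: the code points agree"
```

With leadingChar = 0 meaning Unicode, the tagged and untagged forms collapse into the same value, which is the point of the redefinition.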
All environments except those that care about the connotations of the
language tag should be able to work with this definition without any
change whatsoever. The only things that change are that the default
LanguageEnvironment is Unicode based, using leadingChar=0, and that most
of the subclasses go away (being replaced by the default
LanguageEnvironment). For those that we care about, or that need a
transition plan (i.e., the CJK languages), we keep using the language
tag for the time being.
That means that *if* you set your language environment to be one of the
CJK languages you get a language tag in your strings, but by default the
language neutral environment will produce "plain Unicode". That should
make the server/seaside/aida people a lot happier when dealing with
this stuff.
For the CJK languages (or other languages requiring support that has
so far been expressed via the language tag) we can use this opportunity
to phase out the use of the language tag in favor of text attributes
(which would have to be written first).
The main advantage of the proposal is that the people who would like to
use plain Unicode get to use it, and the people who care about the
language tag and its consequences can still use that as well.
How does that sound?
Cheers,
- Andreas