MacRoman, Latin1, squeak fonts, and non breaking spaces.

Mon Apr 10 09:35:20 UTC 2006

On 10.04.2006 at 02:19, Peace Jerome wrote:

> Hi Bert and other concerned folk.
>
> In reading Bert's post about fixing fonts to show the
> invisible characters I was reminded of tripping over
> the nonbreaking space (nbsp).
>
> See mantis report:
>
> http://bugs.impara.de/view.php?id=2446
>
>
> I use a Mac and MacRoman defines nbsp as char 202. And
> this can be gotten from Character nbsp.

That has nothing to do with the host operating system. We
switched to Unicode, of which Latin-1 (ISO-8859-1) is the 8-bit
subset (nitpicking aside).

> In the default font in 7021 this appears as the
> British pound sign.

It should be Ê (E circumflex).
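
As an illustration outside Squeak, here is a minimal Python sketch of why
value 202 reads differently under the two encodings. It assumes Python's
standard "mac_roman" and "latin-1" codecs; the variable names are made up
for the example.

```python
# Byte value 202 (0xCA) interpreted under two 8-bit encodings.
raw = bytes([202])

as_macroman = raw.decode("mac_roman")  # MacRoman: non-breaking space
as_latin1 = raw.decode("latin-1")      # Latin-1: E circumflex

# Latin-1 byte values coincide with Unicode code points 0-255,
# so the decoded character keeps the numeric value 202.
assert ord(as_latin1) == 202

print(hex(ord(as_macroman)), as_latin1)
```

So a Character with value 202 is E circumflex once values are read as
Unicode, and only the MacRoman reading makes it an nbsp.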

> There are some squeak fonts
> (atlantis for example) that will show a blank space
> for that character.

Only because Atlantis never had a glyph for "E circumflex". That's
why it was blank, and why my fix now replaces it with a rectangle.

> Now Bert's fix uses char 160. Which is used by
> browsers as nbsp but the Latin1 standard I was pointed
> to has 160 in a range of undefined character values.

Code points 128-159 do have a meaning in Unicode but no glyphs. 160 is
indeed the non-breaking space. It's "reserved" in that there is no
actual glyph associated with it; in that respect it's more like a
control character. However, for our particular implementation of
bitmap fonts it's convenient to just use a blank glyph.

See http://www.unicode.org/charts/PDF/U0080.pdf
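
The same information can be checked programmatically. This is a small
Python sketch, not Squeak code, and it assumes the Unicode database that
ships with Python's standard "unicodedata" module; the variable names are
illustrative only.

```python
import unicodedata

# U+0080..U+009F: the C1 control range, meaningful but without glyphs.
c1_categories = {unicodedata.category(chr(cp)) for cp in range(128, 160)}

# U+00A0: the non-breaking space.
nbsp = chr(160)
nbsp_name = unicodedata.name(nbsp)
nbsp_category = unicodedata.category(nbsp)

print(c1_categories, nbsp_name, nbsp_category)
```

Every code point in 128-159 reports the control category "Cc", while 160
is a named space separator.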

> So there is (at least one) bug in this. The question
> is: what is the bug?
>
> 1) Should nbsp be defined as the Latin1 value?

Yes.

> 2) Should squeak fonts have a way of saying what set
> of characters they represent?

I guess so.

> 3) Should the available fonts in squeak be consistent
> in choice of encoding?

In an ideal world, yes. For practical reasons I think we have to deal  
with whatever we get.

> 4) Should Character class be refactored to reflect the
>  ability to choose different encodings?

No. Characters are not encoded, they represent Unicode values.

Or at least by default they are. We support some non-Unicode 16-bit
encodings for Asian languages, too, IIRC. Yoshiki would know best.

> 5) Should Character class be debugged to reflect
> Latin1 rather than MacRoman encodings?

Yes.

> If so what do you do about MacRoman?

Use the appropriate converter class.
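
The idea of converting at the boundary, rather than teaching characters
about encodings, can be sketched as follows. This is Python, not the
Squeak converter classes; the two helper functions and the sample bytes
are hypothetical and rely only on Python's built-in "mac_roman" codec.

```python
def macroman_to_unicode(data: bytes) -> str:
    """Decode MacRoman bytes into Unicode text (hypothetical helper)."""
    return data.decode("mac_roman")


def unicode_to_macroman(text: str) -> bytes:
    """Encode Unicode text back into MacRoman bytes (hypothetical helper)."""
    return text.encode("mac_roman")


# Sample MacRoman data: 0x8E is e-acute, 0xCA is the non-breaking space.
legacy = b"caf\x8e\xcaau lait"

text = macroman_to_unicode(legacy)       # nbsp arrives as U+00A0 (160)
round_trip = unicode_to_macroman(text)   # and leaves again as byte 202

print([hex(ord(ch)) for ch in text[3:5]], round_trip == legacy)
```

Inside the image the text holds only Unicode values; the MacRoman value
202 exists only in the external bytes.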

> I have enough knowledge to know these questions are
> significant to the well being and maintenance of
> squeak. I am out of my depth in trying to suggest
> answers.
>
> It would be good if someone who understands the issue
> more deeply would formulate a mantis issue around it.

Sure. There's a whole lot still to do in that area.

- Bert -