[squeak-dev] MultiCharacterScanner>addCharToPresentation: and conversion to pre-composed unicode code points

Bert Freudenberg bert at freudenbergs.de
Mon Sep 23 12:10:46 UTC 2013


On 2013-09-22, at 19:50, Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com> wrote:

> As I understand it, MultiCharacterScanner is transforming a String of decomposed Unicode into a string of pre-composed Unicode code points, with the help of UnicodeCompositionStream.
> It stores the result in presentation.
> 
> As I understand it, this was necessary because some keyboards/VMs produce such decomposed sequences.
> I presume this once helped with measuring and displaying those codes with fonts having only pre-composed codes.
> 
> First remark: it is a pity that the base character comes first, before the diacritical.
> This forces the composition algorithm to look ahead.
> We can't change it, it's a standard, but I wonder about the motivation for such ordering...
> Ref: http://www.unicode.org/standard/principles.html
> 
> Second remark: transforming Unicode sequences to a canonical form is not only useful for measuring/displaying text.
> It's useful for comparing strings (for equality, for collation, ...).
> So the transformation could happen somewhere else than at display time.
> Unicode defines standard ways to do it and, bad news, UnicodeCompositionStream does not conform to them.
> Ref: https://en.wikipedia.org/wiki/Unicode_equivalence

Yep.
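
To illustrate the comparison point, here is a minimal sketch in Python rather than Squeak, since Python's standard unicodedata module implements the Unicode normalization forms. The helper name canonically_equal is made up for this example; it is not something that exists in Squeak or in Python.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

decomposed = "A\u0308"   # A followed by COMBINING DIAERESIS
precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS

print(decomposed == precomposed)                   # False: different code point sequences
print(canonically_equal(decomposed, precomposed))  # True: same text after normalization
```

A plain code-point comparison treats the two spellings as different strings, which is why normalizing before comparing (rather than only at display time) matters.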

> Third remark, I wonder if this composition is really necessary at all for measuring/displaying.
> Don't Unicode fonts provide special kerning pairs for those diacriticals?
> I couldn't find good references on this one...


This would work if we had the diacriticals in our fonts and if glyph rendering took kerning info into account. Neither is the case currently, so the next-best thing was composition, which allows us to use the pre-composed Latin-1 characters.

Just paste this into Squeak:

	A + combining diaeresis: Ä
	Precomposed: Ä

Both look the same in my email client but in Squeak I get: 

	

which indicates that the presentation mechanism is not working currently. In case this doesn't make it through via email, the combining diaeresis is Character value: 16r0308.
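
For reference, the look-ahead that composition needs can be sketched roughly as follows. This is in Python, not Squeak, and it is not how UnicodeCompositionStream works: the function name compose_stream is invented for the example, and it leans on unicodedata.normalize to decide whether a base plus a combining mark has a pre-composed form. It also ignores parts of the real normalization algorithm (reordering of marks, Hangul), so treat it as an illustration only.

```python
import unicodedata

def compose_stream(chars):
    """Yield characters, merging a held-back base with a following
    combining mark when a single pre-composed code point exists.
    Simplified sketch: not a full NFC implementation."""
    pending = None  # the base character we are holding back (the look-ahead)
    for ch in chars:
        if pending is not None and unicodedata.combining(ch):
            merged = unicodedata.normalize("NFC", pending + ch)
            if len(merged) == 1:
                # A pre-composed form exists; keep holding it in case
                # further combining marks follow.
                pending = merged
                continue
        if pending is not None:
            yield pending
        pending = ch
    if pending is not None:
        yield pending

# A + combining diaeresis (16r0308) followed by a plain letter
result = "".join(compose_stream("A\u0308b"))
print(result == "\u00C4b")  # True
```

The base character can only be emitted once the next character has been seen, which is the look-ahead Nicolas mentions above.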

- Bert -



