[squeak-dev] Unicode Character “à” (U+00E0) and XTreams-Parsing and just ignore the combining mark sequences?

gettimothy gettimothy at zoho.com
Sun Feb 28 17:41:54 UTC 2021


Hi folks,



TL;DR: in XTreams-Parsing, do I need to add support for the "combining marks" described in this regex tutorial? https://www.regular-expressions.info/unicode.html







The Unicode code point U+0300 (grave accent) is a "combining mark".

Any code point that is not a combining mark can be followed by any number of combining marks.

This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.





Per https://www.compart.com/en/unicode/U+00E0, “à” can be encoded several ways:



UTF-8 Encoding:   0xC3 0xA0
UTF-16 Encoding:  0x00E0
UTF-32 Encoding:  0x000000E0








I assume that the sequence 0xC3 0xA0  is the combination the regex dude refers to.
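
One way to test that assumption in a workspace is to compare the UTF-8 bytes of the single code point with the two-code-point sequence from the tutorial. This is only a sketch: it assumes String>>squeakToUtf8 is present in the image, and the temporary names are made up for illustration.

```smalltalk
| precomposed decomposed |
"Single code point U+00E0."
precomposed := (Character value: 16r00E0) asString.
"Base letter U+0061 followed by the combining grave accent U+0300."
decomposed := (Character value: 16r0061) asString , (Character value: 16r0300) asString.

"printIt each line below."

"UTF-8 bytes of the single code point, to compare with 0xC3 0xA0."
precomposed squeakToUtf8 asByteArray.

"Number of Characters in each string."
precomposed size.
decomposed size.

"UTF-8 bytes of the two-code-point sequence, for comparison."
decomposed squeakToUtf8 asByteArray.
```

If the first printIt shows the bytes 195 160, then 0xC3 0xA0 would be the UTF-8 byte form of the one code point U+00E0, whereas the combination the tutorial describes is the two-Character string.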


Here are some relevant printIt expressions (the results render correctly in Squeak with Unifont installed, less so in the browser where I am pasting them):





Character codePoint: 224

Character value: 16r0000E0

Character value: 16r0061

Character value: 16r0300

(Character value: 16r0061) asString , (Character value: 16r0300) asString




Character value: 16r00C3

Character value: 16r00A0

(Character value: 16r00C3) asString , (Character value: 16r00A0) asString













The reason I ask is that, just as Character does not (nor should it?) support combining marks, neither does XTreams-Parsing. From the PEG grammar I have the following rules:



Escape				<-	BACKSLASH [x] [0-9A-F]{6}  /	BACKSLASH [nrts\-\\\[\]\''\"] /	EscapeError

EscapeError		<-	BACKSLASH .





with callback:

Escape: backslash character: character hexes: hexes
	<action: 'Escape' arguments: #( 1 2 3 )>
	backslash = '\' ifTrue:
		[character = $s ifTrue: [^Character space].
		character = $t ifTrue: [^Character tab].
		character = $n ifTrue: [^Character cr].
		character = $r ifTrue: [^Character lf].
		character = $x ifTrue: [^('16r', (String withAll: hexes)) asNumber asCharacter]].
	^character



which, as you can see, does not capture the pair 0x00C3 0x00A0 and return “à”.
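
To see what the \x branch of that callback yields, its conversion expression can be tried on its own in a workspace. A sketch only: the block name fromHex and the hex strings are made-up inputs for illustration, and it assumes the six hex digits reach the callback as a collection of characters.

```smalltalk
| fromHex |
"Same conversion as the \x branch of Escape:character:hexes:."
fromHex := [:hexes | ('16r', (String withAll: hexes)) asNumber asCharacter].

"printIt each line below."

"\x0000E0 gives one Character, code point 224."
fromHex value: '0000E0'.

"\x000061 followed by \x000300 gives two Characters:
 the base letter and the combining mark."
(fromHex value: '000061') asString , (fromHex value: '000300') asString.
```

On that reading, each code point of a combining sequence would pass through Escape as its own Character, and displaying the pair as one grapheme would be left to whatever renders the resulting string.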



I am strongly leaning towards ignoring the pairs and assuming that all characters such as the one above are part of the extension.



Thoughts appreciated.



thx