[squeak-dev] specifying the character class range for some funky characters from İstanbul

Tobias Pape Das.Linux at gmx.de
Sun Jan 10 15:13:26 UTC 2021



> On 10. Jan 2021, at 14:17, gettimothy via Squeak-dev <squeak-dev at lists.squeakfoundation.org> wrote:
> 
> Hi Folks,
> 
> My parser rules are not being invoked for certain character classes.
> 
> For example, look at the İstanbul at this link:  https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit
> 
> 
> The PEG specifies its own grammar, and in it a "regex" character class is defined thusly.
> 
> Escape	<-	BACKSLASH [x] [0-9A-F]{6}
> 
> 
> which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
> 
> 
> I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul
> 
> I have been using a code smell rule I call DotNot, i.e. "not the dot", that has clearly outlived its usefulness....
> 
> DotNot	<-	[a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]  

I think the \w does not do what you think it does here.

What happens is that, in the URL, the upper-case I with dot above (İ) is encoded as a percent-encoded UTF-8 byte sequence.
So you need a parser that is (a) aware of URL percent-escaping and (b) Unicode/UTF-8-aware.
That is not what your DotNot does. It can only handle ASCII, I presume…
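
To make the two layers concrete, here is a minimal sketch in Python (not Squeak, and not your PEG library; it only illustrates the idea). It takes the title fragment from the URL above, undoes the percent-escaping, and then compares an ASCII-only character class with a Unicode-aware one. The variable names are made up for this example, and how \w behaves in your regex library may well differ from Python's.

```python
import re
from urllib.parse import unquote

# Title fragment as it appears in the Wikipedia URL (percent-encoded UTF-8).
encoded = "%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi"

# unquote() undoes the percent-escaping and decodes the bytes as UTF-8.
decoded = unquote(encoded)

# The first character is LATIN CAPITAL LETTER I WITH DOT ABOVE.
first = decoded[0]
codepoint = f"U+{ord(first):04X}"
utf8_bytes = first.encode("utf-8")

# An ASCII-only class, comparable in spirit to the letter part of DotNot.
ascii_only = re.compile(r"[a-zA-Z0-9_]+")

# In Python 3, \w on a str pattern also matches non-ASCII letters.
unicode_aware = re.compile(r"\w+")

ascii_match = ascii_only.match(decoded)
unicode_match = unicode_aware.match(decoded)

print(decoded, codepoint, utf8_bytes)
print("ASCII-only class matches:", ascii_match is not None)
print("Unicode-aware class matches:", unicode_match is not None)
```

The ASCII-only class already fails on the very first character, which is the same failure you see with DotNot on the wiki link text.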

What kind of Regex-lib do you use?


Best regards
	-Tobias

> 
> That rule is used by other rules, to say "accept these characters" and for the links in the linked example that rule looks like:
> 
> LinkFreeCaptioned	<- OPEN_BRACKET{2} DotNot*  PipeCaption
> 
> which does a great job on English, but barfs on İstanbul
> 
> [[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]
> As you can see, the funky "I" is not in DotNot.
> 
> 
> I have flailed around aimlessly here: https://regexr.com/ to no avail.
> 
> Pointers appreciated.
> 
> cordially,
> 
> 
> 
> 



