[squeak-dev] specifying the character class range for some funky characters from İstanbul

gettimothy gettimothy at zoho.com
Sun Jan 10 13:17:07 UTC 2021

Hi Folks,

My parser rules are not being invoked for certain character classes.

For example, look at the İstanbul at this link:  https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit

The PEG specifies its own grammar, and in that grammar there is a "regex"-style character class defined thusly:

Escape				<-	BACKSLASH [x] [0-9A-F]{6}

which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
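To make the Escape rule concrete, here is a minimal Python sketch of the same pattern. It assumes (the grammar does not say so) that the six hex digits are meant to be the character's Unicode code point; `to_escape` is a hypothetical helper name, and Python's `re` only stands in for the PEG engine.

```python
import re

# Python equivalent of the PEG rule: BACKSLASH [x] [0-9A-F]{6}
ESCAPE = re.compile(r"\\x[0-9A-F]{6}")

def to_escape(ch: str) -> str:
    """Format one character as backslash, 'x', six uppercase hex digits.

    Assumption: the six digits are the Unicode code point of the character.
    """
    return "\\x{:06X}".format(ord(ch))

# "\u0130" is the dotted capital I in "İstanbul" (U+0130).
escaped = to_escape("\u0130")
print(escaped, bool(ESCAPE.fullmatch(escaped)))  # \x000130 True
```

Under that assumption, the İ would be written \x000130 in the grammar's escape notation.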

I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul

I have been using a code-smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness:

DotNot				<-	[a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]  

Other rules use DotNot to say "accept these characters". For the links in the example above, the rule looks like:

LinkFreeCaptioned				<- OPEN_BRACKET{2} DotNot*  PipeCaption

which does a great job on English, but barfs on İstanbul:

[[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]

As you can see, the funky "I" is not in DotNot.
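A quick way to see exactly which characters fall outside DotNot is to test each one against an approximation of the class. The sketch below uses Python's `re` as a stand-in for the PEG engine, which is an assumption: the `re.ASCII` flag is there to make `\w` and `\s` behave as ASCII-only, which is how the rule appears to behave in my parser.

```python
import re

# Rough ASCII-only approximation of DotNot.
# re.ASCII keeps \w and \s from matching non-ASCII characters.
DOT_NOT = re.compile(r"""[a-zA-Z0-9_\s\-+.;:"&#?%!<>/,='`()\w]""", re.ASCII)

# "İstanbul Büyükşehir Belediyespor (basketball)", written with escapes
# so the precomposed code points are unambiguous.
link = "\u0130stanbul B\u00fcy\u00fck\u015fehir Belediyespor (basketball)"

# Collect the distinct characters the class rejects, ordered by code point.
rejected = sorted({ch for ch in link if not DOT_NOT.fullmatch(ch)})
for ch in rejected:
    print(ch, "U+{:04X}".format(ord(ch)))
```

This prints ü U+00FC, İ U+0130 and ş U+015F, so it is not only the "İ" that is missing from the class.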

I have flailed around aimlessly at https://regexr.com/ to no avail.

Pointers appreciated.
