[squeak-dev] specifying the character class range for some funky characters from İstanbul

gettimothy gettimothy at zoho.com
Sun Jan 10 13:17:07 UTC 2021


Hi Folks,



My parser rules are not being invoked for certain character classes.



For example, look at the İstanbul at this link:  https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit





The PEG specifies its own grammar, and in that grammar there is a "regex"-style character class defined thusly:



Escape				<-	BACKSLASH [x] [0-9A-F]{6}






which specifies a '\' followed by an 'x' followed by exactly 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
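
For readers who think in regex rather than PEG, here is a rough Python sketch of what the Escape rule above accepts. The names ESCAPE and decode_escape are made up for illustration, and treating the six hex digits as a Unicode code point is my assumption, not something the grammar states.

```python
import re

# Python regex mirroring the Escape rule: a backslash, an 'x', then exactly
# six uppercase hex digits. ESCAPE and decode_escape are illustrative names.
ESCAPE = re.compile(r"\\x([0-9A-F]{6})")

def decode_escape(text):
    """Return the character named by a \\xHHHHHH escape, or None if no match."""
    m = ESCAPE.fullmatch(text)
    if m is None:
        return None
    # Assumption: the six hex digits are a Unicode code point.
    return chr(int(m.group(1), 16))

print(decode_escape(r"\x000130"))  # U+0130, the dotted capital I in İstanbul
print(decode_escape(r"\x00013G"))  # None, because G is not a hex digit
```

Under that assumption, the "I" in İstanbul would be written \x000130.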





I am guessing, but do not know, that I need a character class similar to the above that will handle the funky dot above the "I" in İstanbul.
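
One possible shape for such a class, sketched in Python regex rather than in the PEG itself: add a code point range next to the ASCII ranges. WIDE_CLASS is a hypothetical name, the range U+00C0..U+017F is just one plausible choice, and whether the PEG in question accepts non-ASCII ranges inside brackets is something I do not know.

```python
import re

# Hypothetical widened class: a few ASCII characters plus U+00C0..U+017F,
# which spans the Latin-1 Supplement letters and Latin Extended-A.
# That range also picks up the signs U+00D7 and U+00F7.
WIDE_CLASS = re.compile(r"[a-zA-Z0-9_ ()\u00C0-\u017F]*")

text = "İstanbul Büyükşehir Belediyespor (basketball)"

# True when every character of the text falls inside the widened class.
print(bool(WIDE_CLASS.fullmatch(text)))
```

A range like this only covers Latin-script accents. Text in other scripts would need further ranges or a different approach.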



I have been using a code-smell rule I call DotNot (i.e. "not the dot"); it has clearly outlived its usefulness....



DotNot				<-	[a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]  




That rule is used by other rules to say "accept these characters". For the links in the linked example, the rule that uses it looks like:



LinkFreeCaptioned				<- OPEN_BRACKET{2} DotNot*  PipeCaption




which does a great job on English, but barfs on İstanbul:



[[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]


As you can see, the funky "I" is not in DotNot.
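
To see exactly which characters fall outside an ASCII-only class, a quick Python check along these lines may help. ASCII_CLASS is only a rough stand-in for DotNot, not a faithful translation: I left out \s and \w because Python's \w is Unicode-aware and may not behave like the PEG's.

```python
import re

# Rough ASCII-only stand-in for DotNot (an assumption, not the original rule).
ASCII_CLASS = re.compile(r"""[a-zA-Z0-9_ \t\-+.;:"&#?%!<>/,='`()]""")

link_text = "İstanbul Büyükşehir Belediyespor (basketball)"

# Collect each distinct character the ASCII class rejects, ordered by code point.
missing = sorted({ch for ch in link_text if not ASCII_CLASS.fullmatch(ch)})

for ch in missing:
    print(f"{ch!r} U+{ord(ch):04X}")
```

For this link the rejected characters are not just the capital "I" but also the ü and the ş, so a fix aimed at the single "I" would still fail further along the string.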




I have flailed around aimlessly at https://regexr.com/ to no avail.



Pointers appreciated.



cordially,