[squeak-dev] specifying the character class range for some funky characters from İstanbul
gettimothy
gettimothy at zoho.com
Sun Jan 10 13:17:07 UTC 2021
Hi Folks,
My parser rules are not being invoked for certain character classes.
For example, look at the word "İstanbul" at this link: https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit
The PEG specifies its own grammar, and in it there are "regex"-style character classes, defined thus:
Escape <- BACKSLASH [x] [0-9A-F]{6}
which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
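
For comparison, here is a minimal sketch of the same Escape rule as a Python regular expression. It only illustrates what the rule accepts; the name is_escape is hypothetical, and the PEG engine's own matching may differ in details.

```python
import re

# Python analogue of: Escape <- BACKSLASH [x] [0-9A-F]{6}
# A literal backslash, a literal 'x', then exactly six uppercase hex digits.
ESCAPE = re.compile(r"\\x[0-9A-F]{6}")

def is_escape(text: str) -> bool:
    """Return True if the whole string is one backslash-x escape of six hex digits."""
    return ESCAPE.fullmatch(text) is not None

print(is_escape(r"\x000F00"))  # True: six characters, all in 0-9 or A-F
print(is_escape(r"\x000FOO"))  # False: the letter O is outside A-F
print(is_escape(r"\x000f00"))  # False: lowercase f is outside A-F
```

Note that the class is case sensitive as written, so lowercase hex digits would need [0-9A-Fa-f].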
I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul
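
To make the guess concrete, this small Python sketch looks up the character in question. It is U+0130, and its UTF-8 bytes C4 B0 are the %C4%B0 that appears in the link above. The helper six_digit_escape is hypothetical, and whether the grammar's \x escape is interpreted as a Unicode code point is an assumption, not something confirmed here.

```python
import unicodedata

def six_digit_escape(ch: str) -> str:
    """Format a character's code point as backslash-x plus six uppercase hex digits (assumed semantics)."""
    return "\\x{:06X}".format(ord(ch))

dotted_i = "\u0130"  # the capital I with a dot above, as in the link

print(unicodedata.name(dotted_i))              # LATIN CAPITAL LETTER I WITH DOT ABOVE
print("U+{:04X}".format(ord(dotted_i)))        # U+0130
print(dotted_i.encode("utf-8").hex().upper())  # C4B0
print(six_digit_escape(dotted_i))              # \x000130
```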
I have been using a code-smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness:
DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]
That rule is used by other rules to say "accept these characters". For the links in the linked example, the rule that uses it looks like:
LinkFreeCaptioned <- OPEN_BRACKET{2} DotNot* PipeCaption
which does a great job on English, but barfs on İstanbul:
[[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]
As you can see, the funky "I" is not in DotNot.
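
A rough way to see which characters fall outside the ASCII ranges is the Python sketch below. It mirrors only the a-zA-Z0-9_ part of DotNot, and letters_outside_ascii is a hypothetical helper. It also shows that Python's \w is Unicode-aware by default and ASCII-only under re.ASCII. Whether the PEG engine's \w behaves like the ASCII-only variant is an assumption, although it would explain the failure.

```python
import re

# ASCII-only letters, digits and underscore, mirroring the a-zA-Z0-9_ part of DotNot.
ASCII_WORD = re.compile(r"[a-zA-Z0-9_]")

# Link text from the example, written with escapes so the code points are explicit.
LINK_TEXT = "\u0130stanbul B\u00fcy\u00fck\u015fehir Belediyespor (basketball)"

def letters_outside_ascii(text):
    """Return the distinct letters the ASCII class rejects, sorted by code point."""
    return sorted({c for c in text if c.isalpha() and not ASCII_WORD.fullmatch(c)})

for c in letters_outside_ascii(LINK_TEXT):
    print(c, "U+{:04X}".format(ord(c)))  # U+00FC, U+0130, U+015F

# Python's \w matches Unicode letters by default; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_ok = re.fullmatch(r"[\w\s()]+", LINK_TEXT) is not None
ascii_ok = re.fullmatch(r"[\w\s()]+", LINK_TEXT, re.ASCII) is not None
print(unicode_ok, ascii_ok)  # True False
```

So three distinct letters (ü, İ and ş) would need to be covered, either by listing them, by adding a range of code points, or by a Unicode-aware letter class if the grammar offers one.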
I have flailed around aimlessly here: https://regexr.com/ to no avail.
Pointers appreciated.
cordially,