[squeak-dev] specifying the character class range for some funky characters from İstanbul
Tobias Pape
Das.Linux at gmx.de
Sun Jan 10 15:13:26 UTC 2021
> On 10. Jan 2021, at 14:17, gettimothy via Squeak-dev <squeak-dev at lists.squeakfoundation.org> wrote:
>
> Hi Folks,
>
> My parser rules are not being invoked for certain character classes.
>
> For example, look at the İstanbul at this link: https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit
>
>
> The PEG specifies its own grammar, and in it there is a "regex" character class defined thusly.
>
> Escape <- BACKSLASH [x] [0-9A-F]{6}
>
>
> which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
>
>
> I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul
>
> I have been using a code-smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness....
>
> DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]
I think the \w does not do what you expect here.
What happens is that the upper-case I with dot above (İ, U+0130) appears in the URL as a percent-encoded UTF-8 sequence (%C4%B0).
So you need a parser that is (a) aware of URL percent-escaping and (b) Unicode/UTF-8 aware.
That is not what your DotNot does. It can only handle ASCII, I presume…
What kind of Regex-lib do you use?
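To make points (a) and (b) concrete, here is a minimal sketch in Python rather than Squeak, purely as an illustration. It assumes Python 3's standard `urllib.parse` and `re` modules; the `ascii_class` pattern is a hypothetical cut-down stand-in for DotNot, not the real rule, and whether `\w` in your PEG library is Unicode-aware the way Python's is remains an open question.

```python
import re
from urllib.parse import unquote

# Percent-encoded title fragment taken from the Wikipedia URL quoted above.
encoded = "%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir"

# (a) Undo the URL percent-escaping; unquote() decodes the bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # İstanbul_Büyükşehir

first = decoded[0]
print(hex(ord(first)))        # 0x130
print(first.encode("utf-8"))  # b'\xc4\xb0'

# (b) An ASCII-only class: a hypothetical, cut-down stand-in for DotNot.
ascii_class = re.compile(r"[a-zA-Z0-9_]")
# In Python 3, \w in a str pattern is Unicode-aware by default.
unicode_class = re.compile(r"\w")

print(bool(ascii_class.fullmatch(first)))    # False
print(bool(unicode_class.fullmatch(first)))  # True
```

The ASCII-only class rejects the İ because its code point lies outside every listed range, which is presumably why the rule is never invoked on that link.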
Best regards
-Tobias
>
> That rule is used by other rules, to say "accept these characters" and for the links in the linked example that rule looks like:
>
> LinkFreeCaptioned <- OPEN_BRACKET{2} DotNot* PipeCaption
>
> which does a great job on English, but barfs on İstanbul:
>
> [[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]
> As you can see, the funky "I" is not in DotNot.
>
>
> I have flailed around aimlessly here: https://regexr.com/ to no avail.
>
> Pointers appreciated.
>
> cordially,
>
>
>
>