[squeak-dev] specifying the character class range for some funky characters from İstanbul
gettimothy
gettimothy at zoho.com
Sun Jan 10 16:10:02 UTC 2021
Hi Tobias.
Thanks for the reply.
I think \w does not do what you think it does here.
What happens is that the upper-case I with dot above (İ) is encoded as a percent-encoded UTF-8 sequence.
So you need a parser that is (a) aware of URL percent-escaping and (b) aware of Unicode/UTF-8.
That is not what your DotNot does. It can only handle ASCII, I presume…
Which regex library do you use?
I have no idea.
I have basically inferred how the grammar works as I go, with valuable insight from Levente.
There are a couple of PEG grammar rules in Xtreams-Parsing that use character classes to define rules, for example:
whitespace <- [\s\t\n\r]
Identifier <- [a-zA-Z_] [a-zA-Z0-9_]*
NumLiteral <- "Infinity" / "0" / [1-9] [0-9]*
Escape <- BACKSLASH [x] [0-9A-F]{6} / BACKSLASH [nrts\-\\\[\]\''\"] / EscapeError
So it is whatever Xtreams or Squeak uses for character classes; which one that is, I have no idea.
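To make the point about percent-escaping and UTF-8 concrete, here is a minimal sketch in Python (not Squeak or Xtreams code, and it says nothing about how Xtreams implements character classes). It assumes the input is a URL fragment containing "İstanbul", and the names `ascii_identifier` and `unicode_identifier` are made up for illustration. It shows that an ASCII-only class like the one in the Identifier rule above rejects the decoded text, while a Unicode-aware class accepts it.

```python
import re
from urllib.parse import quote, unquote

# "İ" is U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
# Its UTF-8 encoding is the two bytes C4 B0, so a URL carries it as %C4%B0.
encoded = quote("İstanbul")   # percent-encode the UTF-8 bytes
decoded = unquote(encoded)    # reverse it: percent escapes -> bytes -> text

# ASCII-only class, mirroring the Identifier rule in the grammar
# (hypothetical name, for illustration only).
ascii_identifier = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Unicode-aware alternative: for str patterns in Python 3, \w matches
# Unicode word characters; [^\W\d] means "word character that is not a digit".
unicode_identifier = re.compile(r"[^\W\d]\w*")

print(encoded)
print(ascii_identifier.fullmatch(encoded))    # fails on the raw escapes
print(ascii_identifier.fullmatch(decoded))    # fails on the decoded İ as well
print(unicode_identifier.fullmatch(decoded))  # matches once decoded
```

The order matters: the percent escapes have to be decoded to text first, and only then can a Unicode-aware character class see İ as a single letter.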