[squeak-dev] specifying the character class range for some funky characters from İstanbul

Sun Jan 10 16:10:02 UTC 2021

Hi Tobias.

Thanks for the reply.

I think \w does not do what you think it does here.

What happens is that the upper-case I with dot above (İ, U+0130) is encoded as a percent-encoded UTF-8 byte sequence.

So you need a parser that is (a) aware of URL percent-escaping and (b) Unicode/UTF-8 aware.

That is not what your DotNot does. It can only handle ASCII, I presume…
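
To make the two requirements concrete, here is a minimal sketch in Python (not Squeak or Xtreams, whose behaviour I am not claiming to reproduce). It assumes the input is a URL-style string; the variable names are only illustrative. It shows that an ASCII-only character class fails on both the percent-encoded and the decoded form, while a Unicode-aware \w matches once the percent-escapes are decoded.

```python
import re
from urllib.parse import quote, unquote

name = "İstanbul"

# Percent-encode the UTF-8 bytes of the string (İ is the two bytes C4 B0).
encoded = quote(name)
# Reverse the escaping to get the original Unicode string back.
decoded = unquote(encoded)

# ASCII-only word class, similar in spirit to [a-zA-Z0-9_].
ascii_word = re.compile(r"[A-Za-z0-9_]+")
# In Python 3, \w on str patterns is Unicode-aware by default.
unicode_word = re.compile(r"\w+")

# The ASCII class trips over '%' in the encoded form and over 'İ' in the decoded form.
ascii_on_encoded = ascii_word.fullmatch(encoded)
ascii_on_decoded = ascii_word.fullmatch(decoded)
# Only decode-then-match with a Unicode-aware class accepts the whole word.
unicode_on_decoded = unicode_word.fullmatch(decoded)

print(encoded)
print(ascii_on_encoded, ascii_on_decoded, unicode_on_decoded)
```

The order matters: decode the percent-escapes first, then apply the character class to the resulting Unicode string.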

What kind of regex library do you use?

I have no idea. 

I have basically inferred the functionality of the grammar as I go, with valuable insight from Levente.

There are a couple of PEG grammar rules in Xtreams-Parsing that use character classes in their definitions, for example:

whitespace			<-	[\s\t\n\r]

Identifier			<-	[a-zA-Z_] [a-zA-Z0-9_]*

NumLiteral			<-	"Infinity" / "0" / [1-9] [0-9]*

Escape				<-	BACKSLASH [x] [0-9A-F]{6} /	BACKSLASH [nrts\-\\\[\]\''\"] /	EscapeError

So it is whatever Xtreams or Squeak uses for character classes. Which one that is, I have no idea.
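
For comparison, here is a rough sketch of the Identifier rule above transcribed into a Python regex, plus a hypothetical variant that widens the character class with the Latin Extended-A range (U+0100 to U+017F), which contains İ, ı, ş and ğ. This is only an illustration in Python; whether the Xtreams PEG character classes accept such a range is exactly what I do not know.

```python
import re

# Transcription of: Identifier <- [a-zA-Z_] [a-zA-Z0-9_]*
ascii_identifier = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Hypothetical widening: add Latin Extended-A (U+0100..U+017F) to both classes.
# Letters such as ö, ü and ç live in Latin-1 and are NOT covered by this range.
extended_identifier = re.compile(
    r"[a-zA-Z_\u0100-\u017F][a-zA-Z0-9_\u0100-\u017F]*"
)

word = "İstanbul"

# The ASCII-only rule rejects the word at its very first character.
ascii_result = ascii_identifier.fullmatch(word)
# The widened rule accepts the whole word.
extended_result = extended_identifier.fullmatch(word)
# A leading digit is still rejected, as in the original rule.
digit_result = extended_identifier.fullmatch("9lives")

print(ascii_result, extended_result, digit_result)
```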