[squeak-dev] specifying the character class range for some funky characters from İstanbul

gettimothy gettimothy at zoho.com
Sun Jan 10 16:10:02 UTC 2021


Hi Tobias.

Thanks for the reply.



> I think the \w does not do here what you think.
>
> What happens is that the upper-case I with dot above (İ) is encoded as a percent-encoded UTF-8 sequence.
> So you need a parser that is (a) aware of URL percent-escaping and (b) aware of Unicode/UTF-8.
> That is not what your DotNot does. It can only handle ASCII, I presume…
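To make the two layers concrete, here is a minimal sketch in Python rather than Squeak, using only the standard library. The helper names percent_encode and percent_decode are made up for illustration; nothing here is taken from Xtreams or from the grammar under discussion.

```python
from urllib.parse import quote, unquote

# U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE (the first letter of "İstanbul")
CAPITAL_I_DOT = "\u0130"


def percent_encode(text):
    """Encode text as UTF-8, then percent-escape every byte that is not URL-safe."""
    return quote(text, safe="", encoding="utf-8")


def percent_decode(escaped):
    """Undo the percent-escaping and interpret the resulting bytes as UTF-8."""
    return unquote(escaped, encoding="utf-8")


utf8_bytes = CAPITAL_I_DOT.encode("utf-8")           # layer (b): two bytes, 0xC4 0xB0
escaped = percent_encode(CAPITAL_I_DOT + "stanbul")  # layer (a): each byte becomes %XX
restored = percent_decode(escaped)

print(utf8_bytes.hex())  # c4b0
print(escaped)           # %C4%B0stanbul
print(restored == CAPITAL_I_DOT + "stanbul")  # True
```

A parser that only sees the escaped form is looking at the ASCII characters %, C, 4, B and 0, not at a single letter, which is why both layers have to be undone before a letter class can match.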



> What kind of regex library do you use?





I have no idea. 


I have basically been inferring how the grammar works as I go, with valuable insight from Levente.



There are a couple of PEG grammar rules in Xtreams-Parsing that use character classes, for example:

whitespace			<-	[\s\t\n\r]
Identifier			<-	[a-zA-Z_] [a-zA-Z0-9_]*
NumLiteral			<-	"Infinity" / "0" / [1-9] [0-9]*
Escape				<-	BACKSLASH [x] [0-9A-F]{6} /	BACKSLASH [nrts\-\\\[\]\''\"] /	EscapeError

So as for whatever Xtreams or Squeak uses for character classes: I have no idea.
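For comparison, here is a rough sketch of how an ASCII-only class like the one in the Identifier rule behaves on İ. It uses Python's re module, not Xtreams, so it says nothing about how Xtreams or Squeak actually implement character classes. The widened range U+0100-U+017F (Latin Extended-A, which contains U+0130) is purely an assumption for illustration; whether the PEG grammar accepts such a range is exactly the open question.

```python
import re

# ASCII-only class, same ranges as the Identifier rule above.
ASCII_IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Hypothetical widened class: also accepts Latin Extended-A (U+0100-U+017F),
# the block that contains U+0130.
WIDENED_IDENTIFIER = re.compile(
    "[a-zA-Z_\u0100-\u017F][a-zA-Z0-9_\u0100-\u017F]*"
)


def is_identifier(pattern, text):
    """Return True if the whole text matches the given identifier pattern."""
    return pattern.fullmatch(text) is not None


name = "\u0130stanbul"  # "İstanbul"

print(is_identifier(ASCII_IDENTIFIER, "Istanbul"))       # True
print(is_identifier(ASCII_IDENTIFIER, name))             # False
print(is_identifier(ASCII_IDENTIFIER, "%C4%B0stanbul"))  # False
print(is_identifier(WIDENED_IDENTIFIER, name))           # True
```

The ASCII class rejects both the decoded letter and its percent-encoded form, so widening the range only helps once the input has already been decoded to characters.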