[squeak-dev] The Inbox: Regex-Tests-Core-tobe.34.mcz

Thiede, Christoph Christoph.Thiede at student.hpi.uni-potsdam.de
Tue Feb 14 16:55:35 UTC 2023


Hi Tom,

Thank you for the context! In general, I believe that Squeak does not try to adhere to any particular regex flavor, and we are probably not consistent with any common flavor anyway because of special features like :isDigit: or the </> boundaries. This gives us great freedom to build our own (better) flavor. On the downside, the lack of compatibility is a pity, of course. I would love to hear a third opinion on this question. :-)

> Specifically, I am receiving ECMAScript regexes from an external, trusted source and have been parsing those quite happily with the Squeak parser so far, save for some minor quirks that could be fixed via string replace and now this issue

Well, I think sanitizing braces would also be just another quick string replacement ;P

(jsRegexString copyWithRegex: '\{(?!(\d+(,\d*)?|,\d+)\})' matchesReplacedWith: '\{') copyWithRegex: '(?<!\{(\d+(,\d*)?|,\d+))\}' matchesReplacedWith: '\}'
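
For illustration, the same sanitizing step can also be done in a single pass. The following is only a minimal sketch, written in Python rather than Smalltalk so that it can be tried outside of Squeak. It assumes ECMAScript rules, where only {n}, {n,} and {n,m} count as quantifiers. The name escape_stray_braces is hypothetical, and braces inside character classes are not treated specially, so they get escaped as well.

```python
import re

# One token is either an escaped character, a complete ECMAScript
# quantifier ({n}, {n,} or {n,m}), or a single brace that is neither.
_TOKEN = re.compile(r"\\.|\{\d+(?:,\d*)?\}|[{}]", re.DOTALL)


def escape_stray_braces(js_pattern: str) -> str:
    """Escape braces that do not belong to a complete quantifier.

    Hypothetical helper: escaped characters and complete quantifiers
    are copied unchanged, every other brace gets a backslash.
    """
    def fix(match: re.Match) -> str:
        token = match.group(0)
        # Only lone braces reach the last alternative of _TOKEN.
        return "\\" + token if token in ("{", "}") else token

    return _TOKEN.sub(fix, js_pattern)


if __name__ == "__main__":
    for pattern in ["x{1", "a{1,2}", "a{3}", "a{,x}", "a{1,2,}"]:
        print(pattern, "->", escape_stray_braces(pattern))
```

Because escaped characters are consumed as one token, a brace that is already written as \{ is left alone, and running the function twice gives the same result as running it once.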

(By the way, did you already take a look at regexp-tree transforms? :-))

Best,
Christoph

From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tom Beckmann <tomjonabc at gmail.com>
Sent: Tuesday, 14 February 2023 11:16
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] The Inbox: Regex-Tests-Core-tobe.34.mcz

Hi Christoph,

Good point, there are multiple ways to proceed given an input such as
`x{1`.

The current code would either fail with an unrelated error or assume
nil values for the ranges, which is of course not ideal. But you're
right: a solution that would reject the above example early and
explicitly is simpler to implement than my proposed solution.

The reason I chose to go for accepting incomplete ranges as literal
characters was to align with the behavior of ECMAScript regexes.
(Specifically, I am receiving ECMAScript regexes from an external,
trusted source and have been parsing those quite happily with the
Squeak parser so far, save for some minor quirks that could be fixed
via string replace, and now this issue.)

The major argument from a user's perspective I see for accepting
incomplete ranges is to reduce the need for escaping. I do also agree
with your points, so there's a tradeoff to be decided on :)

Best,
Tom

On Tue, 2023-02-14 at 08:32 +0000, Thiede, Christoph wrote:
> Hi Tom,
> 
> thank you for your contribution! However, could you maybe share some
> reasoning for the intended parser behavior with us? Why do you want
> to treat incomplete quantifier sequences as literal characters
> instead of raising a syntax error?
> 
> Here are some possible arguments in favor of raising a syntax error
> that come to my mind:
> 
> - Debugging incorrect expressions gets easier (e.g., if you missed a
> closing curly brace by accident).
> - Without backtracking, the design of the parser remains simpler and
> duplication-free, and its performance remains higher.
> - For other incomplete patterns such as '[a' or ':isDigit' we also
> raise a syntax error instead of parsing the pattern as literals.
> - Other parsers behave inconsistently: Some treat the incomplete
> examples as literals (e.g., JavaScript, .NET), while others raise
> syntax errors (e.g., Java).
> 
> Best,
> Christoph
> 
> From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on
> behalf of commits at source.squeak.org <commits at source.squeak.org>
> Sent: Monday, 13 February 2023 10:30:59
> To: squeak-dev at lists.squeakfoundation.org
> Subject: [squeak-dev] The Inbox: Regex-Tests-Core-tobe.34.mcz
>  
> A new version of Regex-Tests-Core was added to project The Inbox:
> http://source.squeak.org/inbox/Regex-Tests-Core-tobe.34.mcz
> 
> ==================== Summary ====================
> 
> Name: Regex-Tests-Core-tobe.34
> Author: tobe
> Time: 13 February 2023, 10:30:58.863775 am
> UUID: 897286b7-bca5-405c-92c0-52e09604b1fc
> Ancestors: Regex-Tests-Core-ct.33
> 
> Complements Regex-Core-tobe.86
> 
> =============== Diff against Regex-Tests-Core-ct.33 ===============
> 
> Item was added:
> + ----- Method: RxParserTest>>testNonQuantifier (in category 'tests') -----
> + testNonQuantifier
> +        "Test expressions that look like quantifier expressions but do not fully match"
> +        self assert: ('a{x}'  matchesRegex: 'a{x}').
> +        self assert: ('a{,x}'  matchesRegex: 'a{,x}').
> +        self assert: ('a{,}'  matchesRegex: 'a{,}').
> +        self assert: ('a{,,}'  matchesRegex: 'a{,,}').
> +        self assert: ('a{1,2,}'  matchesRegex: 'a{1,2,}').
> +        self assert: ('a{,'  matchesRegex: 'a{,').
> +        self assert: ('a{'  matchesRegex: 'a{').
> +        self assert: ('a{1'  matchesRegex: 'a{1').
> +        self assert: ('a{1,'  matchesRegex: 'a{1,').!
> 