The Inbox: Regex-Tests-Core-tobe.34.mcz

commits@source.squeak.org

13 Feb 2023

10:30 a.m.

A new version of Regex-Tests-Core was added to project The Inbox: http://source.squeak.org/inbox/Regex-Tests-Core-tobe.34.mcz

==================== Summary ====================

Name: Regex-Tests-Core-tobe.34 Author: tobe Time: 13 February 2023, 10:30:58.863775 am UUID: 897286b7-bca5-405c-92c0-52e09604b1fc Ancestors: Regex-Tests-Core-ct.33

Complements Regex-Core-tobe.86

=============== Diff against Regex-Tests-Core-ct.33 ===============

Item was added:
+ ----- Method: RxParserTest>>testNonQuantifier (in category 'tests') -----
+ testNonQuantifier
+ 	"Test expressions that look like quantifier expressions but do not fully match"
+ 	self assert: ('a{x}' matchesRegex: 'a{x}').
+ 	self assert: ('a{,x}' matchesRegex: 'a{,x}').
+ 	self assert: ('a{,}' matchesRegex: 'a{,}').
+ 	self assert: ('a{,,}' matchesRegex: 'a{,,}').
+ 	self assert: ('a{1,2,}' matchesRegex: 'a{1,2,}').
+ 	self assert: ('a{,' matchesRegex: 'a{,').
+ 	self assert: ('a{' matchesRegex: 'a{').
+ 	self assert: ('a{1' matchesRegex: 'a{1').
+ 	self assert: ('a{1,' matchesRegex: 'a{1,').!

Thiede, Christoph

14 Feb 2023

9:32 a.m.

Hi Tom,

thank you for your contribution! However, could you maybe share some reasoning for the intended parser behavior with us? Why do you want to treat incomplete quantifier sequences as literal characters instead of raising a syntax error?

Here are some possible arguments in favor of raising a syntax error that come to my mind:

- Debugging incorrect expressions gets easier (e.g., if you missed a closing curly brace by accident).

- Without backtracking, the design of the parser remains simpler and duplication-free, and its performance remains higher.

- For other incomplete patterns such as '[a' or ':isDigit' we also raise a syntax error instead of parsing the pattern as literals.

- Other parsers behave inconsistently: Some treat the incomplete examples as literals (e.g., JavaScript, .NET), while others raise syntax errors (e.g., Java).
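For comparison with testNonQuantifier, the error-raising alternative could be pinned down by a test along these lines. This is only a sketch: the selector testIncompleteQuantifierRaisesSyntaxError is hypothetical and not part of any submitted version, and it assumes that the parser would signal RegexSyntaxError for incomplete quantifiers, in the same way as for other malformed patterns such as '[a'.

```smalltalk
testIncompleteQuantifierRaisesSyntaxError
	"Hypothetical counterpart to testNonQuantifier (not part of Regex-Tests-Core-tobe.34).
	Assumes that incomplete quantifiers are rejected with RegexSyntaxError."
	#('a{' 'a{1' 'a{1,' 'a{,') do: [:pattern |
		self
			should: [pattern asRegex]
			raise: RegexSyntaxError]
```

Under the literal interpretation, where these patterns are accepted as plain characters, this test would fail, so only one of the two tests can pass for a given parser.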

Best,

Christoph

Tom Beckmann

11:16 a.m.

Hi Christoph,

good point, there are multiple ways to proceed given an input such as `x{1`.

The current code would either fail with an unrelated error or assume nil values for the ranges, which is of course not ideal. But you're right: a solution that would reject the above example early and explicitly is simpler to implement than my proposed solution.

The reason I chose to accept incomplete ranges as literal characters was to align with the behavior of ECMAScript regexes. (Specifically, I am receiving ECMAScript regexes from an external, trusted source and have been parsing those quite happily with the Squeak parser so far, save for some minor quirks that could be fixed via string replace and now this issue.)

The major argument from a user's perspective I see for accepting incomplete ranges is to reduce the need for escaping. I do also agree with your points, so there's a tradeoff to be decided on :)
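To illustrate the escaping that this tradeoff is about: if incomplete ranges were rejected, a pattern meant to match literal braces would have to escape them. A minimal sketch, assuming that a backslash before { or } makes the parser treat the brace as a literal character, as it does for other special characters:

```smalltalk
"Braces escaped explicitly, so they cannot be read as a quantifier.
Assumption: \{ and \} are parsed as literal characters."
'a{x}' matchesRegex: 'a\{x\}'.
'a{1,' matchesRegex: 'a\{1,'.
```

With the literal interpretation, the unescaped patterns from testNonQuantifier would be accepted as well, and the backslashes would become optional.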

Best, Tom

Thiede, Christoph

5:55 p.m.

Hi Tom,

thank you for the context! In general, I believe that Squeak does not try to adhere to any particular regex flavor, and we are probably not consistent with any common flavor because of special features like :isDigit: or the \< and \> word boundaries. This gives us great freedom to build our own (better) flavor. On the downside, the missing compatibility is a pity, of course. I would love to hear some third opinions on this question. :-)

> Specifically, I am receiving ECMAScript regexes from an external, trusted source and have been parsing those quite happily with the Squeak parser so far, save for some minor quirks that could be fixed via string replace and now this issue

Well, I think sanitizing braces would also be just another quick string replacement ;P

(jsRegexString copyWithRegex: '\{(?!(\d+,\d*|,\d+)\})' matchesReplacedWith: '\{') copyWithRegex: '(?<!\{(\d+,\d*|,\d+))\}' matchesReplacedWith: '\}'

(By the way, did you already take a look at regexp-tree transforms? :-))

Best, Christoph

christoph.thiede@student.hpi.uni-potsdam.de

25 Jun 2023

10:08 p.m.

Hi all,

quick bump, as I just stumbled upon this bug found by Tom again. :-)

Tom proposed treating incomplete quantifier syntax such as a{,} as literal braces. However, I would still prefer raising a syntax error instead (it is more in line with [], ::, (), etc. and provides better debuggability). Third opinions would be great so that we can finally define this behavior and avoid the current situation (which involves unexpected MNUs from the parser). :-)

Best, Christoph

--- Sent from Squeak Inbox Talk
