[squeak-dev] Second expression in sequence confusion FreeLink <- "[[" .{&[>\]]} "]]"

Wed Oct 9 20:10:30 UTC 2019

Hi Levente,

Quick status update. 

I have built through Links https://en.wikipedia.org/wiki/Help:Wikitext#Links_and_URLs

and Images https://en.wikipedia.org/wiki/Help:Visual_file_markup

And I am able to hit the db for Wikimedia markup, parse it, and serve it via Seaside at amazing speed.

I am now on Tables: https://en.wikipedia.org/wiki/Help:Table

And some of the cruft I have from the first iterations of the grammar is starting to bite me.

So! I am rebuilding the grammar from first principles, using grammarPEG as the baseline and building up.

Page <- (Break / Paragraph)*

Paragraph <-   .{1,"\n"}

Break <- "\n"{2}   /* https://en.wikipedia.org/wiki/Help:Wikitext#Line_breaks */

/* Rules from grammarPEG */

s					<-	S*                     /* s is zero or more whitespace */

S					<-	whitespace+   /* S is one or more whitespace */

/* Primals from grammarPEG */

whitespace			<-	[\s\t\n\r]

SLASH				<-	"/"

BACKSLASH		<-	"\\"

AND				<-	"&"

NOT				<-	"!"

COMMA				<-	","

QUESTION			<-	"?"

STAR				<-	"*"

PLUS				<-	"+"

DASH				<-	"-"

DOT				<-	"."

QUOTE				<-	"''"

DOUBLE_QUOTE	<-	''"''

OPEN_BRACKET	<-	"["

CLOSE_BRACKET	<-	"]"

OPEN_PAREN		<-	"("

CLOSE_PAREN		<-	")"

OPEN_BRACE		<-	"{"

CLOSE_BRACE		<-	"}"

'

Thanks to your help, I am able to reason through this rather than just hack at it.

Given this input:

testParagraph

^

'This is a test of paragraphs.

One hard return.

',

WikitextParserLibrary loremIpsum,

'

two hard returns above',

'

I get the following (correct!!!!) output.

<body>This is a test of paragraphs.One hard return.Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. two hard returns above three hard returns above</body>

Thank you again for your help.

One quick "style" question if you have a moment. Currently Paragraph looks like this:

Paragraph: anOrderedCollection

<action: 'Paragraph'>

|text|

Transcript show:'Paragraph';cr.

text := ''.

anOrderedCollection do:[:each | text := text,each asString].

^self newElementTag: Paragraph elements: (Array with: (self newText: text))

It's the iteration that feels odd to me. It works, but it feels odd.

Thoughts?

cheers,

tty

---- On Wed, 11 Sep 2019 11:38:58 -0400 Levente Uzonyi <mailto:leves at caesar.elte.hu> wrote ----

Hi Tim, 

On Wed, 11 Sep 2019, gettimothy wrote: 

> Hi Levente.
> 
> Thank you very much. This is a huge time saver. 
> 
> I was literally reading the wrong documentation. 
> 
> whew! 
> 
> Where is this documented? I do not see it on the wiki link you sent me.  

As far as I know, there's no documentation for it. 

> Is it inferable from a class in the XTreams packages?

Yes, it is, because PEGParser is written in itself. 
The grammar to write PEG grammars is in PEGParser class >> #grammarPEG. 
Here's the part related to cardinality parsing: 

Cardinality            <-    OPEN_BRACE s (CardinalityRange / CardinalityLoopMin / CardinalityRangeMin / CardinalityLoop) s CLOSE_BRACE 
CardinalityRangeMin    <-    NumLiteral 
CardinalityRange    <-    NumLiteral s COMMA s NumLiteral 
CardinalityLoopMin    <-    NumLiteral s COMMA s Expression 
CardinalityLoop        <-    Expression 

The first line defines 4 potential cardinality descriptions. All of these
start with { (OPEN_BRACE) and end with } (CLOSE_BRACE). There can be any
number of spaces between the braces (s nonterminal).

The first one, CardinalityRangeMin accepts a single NumLiteral, which is 
either a number with no leading zeroes, or the string "Infinity". Infinity 
doesn't make any sense here (and the parser doesn't handle it btw). 
When PEGParserParser >> #CardinalityRangeMin: processes this rule, it'll
create a block that will send #repeat:min:max: to PEGParser with the
parsed number as both min and max. So, the actual logic is in PEGParser >> #repeat:min:max:.
An example for this rule is 
     Foo <- "x" {3} 
which accepts Foo when there are three consecutive x characters on the 
input stream.

The second one, CardinalityRange is similar to CardinalityRangeMin, but 
accepts an upper bound as well: 
     Foo <- "x"{3, 5} 
accepts Foo when there are at least three consecutive x characters 
on the input stream, but it'll consume up to 5 when there are that many. 
Here the Infinity value makes sense: 
     Foo <- "x"{3, Infinity} 
accepts Foo when there are at least three consecutive x characters
on the input stream, but will consume all further x characters no matter
how many there are.
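
To make the first two forms concrete, here is a minimal Python model of the behaviour described above. It is only a sketch and not the PEGParser implementation; the helper name repeat_char and its parameters are hypothetical and chosen purely for illustration.

```python
import math

def repeat_char(text, pos, ch, minimum, maximum):
    """Model of "ch"{minimum, maximum} (hypothetical helper, not Xtreams code).

    Consumes up to `maximum` consecutive copies of `ch` starting at `pos`.
    Returns the new position, or None when fewer than `minimum` were found.
    """
    count = 0
    while count < maximum and pos + count < len(text) and text[pos + count] == ch:
        count += 1
    return pos + count if count >= minimum else None

# Foo <- "x"{3}: the same number is used as min and max
print(repeat_char("xxxx", 0, "x", 3, 3))             # 3
# Foo <- "x"{3, 5}: at least 3 required, at most 5 consumed
print(repeat_char("xxxxxxx", 0, "x", 3, 5))          # 5
print(repeat_char("xx", 0, "x", 3, 5))               # None
# Foo <- "x"{3, Infinity}: no upper bound
print(repeat_char("xxxxxxx", 0, "x", 3, math.inf))   # 7
```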

The third one, CardinalityLoopMin is the one you asked about in your other 
mail. Instead of an upper bound, it takes a stop expression, which can be 
anything from simple to advanced. The rule 
     Foo <- [a-z]{3, "foo" / "bar"} 
takes at least 3 lowercase ascii letters, then it will take further such 
characters up until "foo" or "bar" appears on the stream. It will read 
those characters as well, but will not yield them. By yield, I mean that 
your parser will not receive the characters of "foo" or "bar", so when 
you write your method processing the Foo rule, you will not know whether 
the input ended with "foo", "bar" or a non-ascii-letter character.

The fourth one is similar to the third one, but it works as if 0 
were added as the minimum number of repetitions of the pattern. 
E.g.: 
     Foo <- [a-z]{"foo" / "bar"} 
is equivalent to 
     Foo <- [a-z]{0, "foo" / "bar"}
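
The stop-expression forms can be modelled the same way. The Python sketch below is one reading of the description above, not the Xtreams code: loop_until and is_lower are hypothetical names, and it assumes that the first `minimum` items are taken before the stop expression is looked for, and that the loop also ends quietly when neither the stop nor a further item matches.

```python
import string

def loop_until(text, pos, item_ok, stops, minimum=0):
    """Model of item{minimum, stop} (hypothetical helper, not Xtreams code).

    Returns (yielded, new_pos), or None if fewer than `minimum` items match.
    """
    start = pos
    # Mandatory part: take `minimum` items without looking for the stop.
    for _ in range(minimum):
        if pos >= len(text) or not item_ok(text[pos]):
            return None
        pos += 1
    while True:
        for stop in stops:  # ordered choice: "foo" is tried before "bar"
            if text.startswith(stop, pos):
                # The stop is consumed but is not part of the yielded text.
                return text[start:pos], pos + len(stop)
        if pos >= len(text) or not item_ok(text[pos]):
            # No stop and no further item: the loop ends here.
            return text[start:pos], pos
        pos += 1

def is_lower(c):
    return c in string.ascii_lowercase

# Foo <- [a-z]{3, "foo" / "bar"}
print(loop_until("abcdbar!", 0, is_lower, ["foo", "bar"], minimum=3))  # ('abcd', 7)
print(loop_until("abcd!", 0, is_lower, ["foo", "bar"], minimum=3))     # ('abcd', 4)
# Foo <- [a-z]{"foo" / "bar"} behaves like a minimum of 0
print(loop_until("foo", 0, is_lower, ["foo", "bar"]))                  # ('', 3)
```

The first two calls yield the same text, 'abcd', even though one input ended with "bar" and the other with "!". Only the resulting position differs, which is why the method processing Foo cannot tell the two cases apart.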

> 
> I looked over the XTreams tests classes yesterday and did not see any clues. 
> Do the test cases need improving? (I can contribute by doing the grunt work under supervision) 

Yes, some things have no tests. For example, these cardinality rules. I'll 
push some fixes to the repository related to them soon. 

> 
> Also, what does "don't yield them" mean? 

I tried to explain above. I'll give you a more complete example if that 
doesn't make it clear. Just let me know. 

> 
> Does it mean the parser stays in its present spot? 

No. It means that the generated parser consumes the characters from the 
input, but doesn't pass them to the rule processor methods. 
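
As a small illustration of consumed versus yielded, here is a Python model of the FreeLink rule quoted further down, FreeLink <- "[[" .{"]]"}. It is a sketch only: parse_free_link is a hypothetical name, nothing here is Xtreams API, and input without a closing "]]" is deliberately not modelled.

```python
def parse_free_link(text, pos=0):
    """Model of FreeLink <- "[[" .{"]]"} (hypothetical helper, not Xtreams code).

    Returns (yielded, new_pos), or None if the text does not start with "[[".
    """
    if not text.startswith("[[", pos):
        return None
    start = pos + 2
    # Assumption: a closing "]]" exists; str.index raises ValueError otherwise.
    end = text.index("]]", start)
    # The closing "]]" is consumed (new_pos is past it) but not yielded.
    return text[start:end], end + 2

yielded, new_pos = parse_free_link("[[This Is A Link]] and more")
print(repr(yielded))  # 'This Is A Link'
print(new_pos)        # 18
```

The position has moved past the closing brackets, so the parser does not stay where it was, yet the brackets themselves never appear in the yielded text.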

> 
> thank you again. 
> 
> t. 
> 
> p.s. I have cc'ed squeak-beginners list on this message. 

I don't think that's the appropriate place for these messages, because 
this is anything but beginner stuff. 
I think squeak-dev would be a much better place for now. Should our 
discussion cause too much noise there, we can create a squeak-users list 
for them in the future. 

Levente 

> 
> 
> 
> 
> ---- On Tue, 10 Sep 2019 20:30:07 -0400 Levente Uzonyi <mailto:leves at caesar.elte.hu> wrote ---- 
> 
>       Hi Tim, 
> 
>       On Tue, 10 Sep 2019, gettimothy wrote: 
> 
>       > Hi Levente. 
>       > 
>       > If you don't have time for this, "No" is  a good answer. 
>       > 
>       > I have the WikiMedia freelinks working. https://en.wikipedia.org/wiki/Help:Wikitext#Free_links 
>       > 
>       > [[This Is A Link]] generates  
>       > 
>       > <a href="https://en.wikipedia.org/wiki/This_is_a_link">This is a link</a> 
>       > 
>       > I would like to translate the FreeLink <- "[[" .{&[>\]]} "]]" sequence into something like 
>       > 
>       > FreeLink <- LinkOpen  .{&[>\]]}  LinkClose 
>       > LinkOpen   <- BracketOpen BracketOpen 
>       > LinkClose   <- BracketClose BracketClose 
>       > BracketOpen <- "[" 
>       > BracketClose <- "]" 
>       > 
>       > so that I can iteratively build up to more complicated link styles. 
>       > 
>       > That Capture in the middle of the sequence is giving me fits. 
>       > Something as simple as: 
>       > 
>       > FreeLink <- LinkOpen  .{&[>\]]}  LinkClose 
>       > 
>       > does not parse, as neither LinkOpen nor LinkClose is consumed.
>       > 
>       > My interpretation of that middle sequence term is: 
>       > "." get the next character and consume it. 
>       > {&[>\]]} apply expression &[>\]] and capture the string that matched it for later use. 
> 
>       In Xtreams-Parsing, braces don't mean capture. They mean cardinality. It 
>       comes from common regular expression syntax. 
>       The regular expression x{1,3} means x 1 to 3 times, so it accepts x, 
>       xx, and xxx. 
>       You can also pass a single number x{3}, which is a shorthand for xxx. 
>       You can also omit the second argument like in x{3,}, which means x 3 or 
>       more times. 
>       This construct is extended in Xtreams-Parsing with a stop expression. 
>       "x"{"y"} means, accept any number of x up until y comes. Consume y too, 
>       but don't yield it. So, such expression accepts: xy, xxy, xxxy, xxxxy, 
>       etc, and yields x, xx, xxx, xxxx, etc. 
> 
>       I suspect having & inside {} probably causes problems, because {} tries to 
>       consume what it parses, but & tells the parser not to consume what comes 
>       after it. 
> 
>       If I were to write the FreeLink rule, it would be something like: 
> 
>       FreeLink <- "[[" .{"]]"} 
> 
>       It means: take two opening brackets, accept and yield everything up to two
>       closing brackets, then consume those too, but don't yield them.
> 
> 
>       Levente 
> 
>       > &[>\]] AND predicate : indicate success if expression [>\]] matches text ahead; otherwise indicate failure. do not consume text. 
>       > [>\]]  character range between ">" and "]" 
>       > 
>       > 
>       > Thanks for your time. 
>       > 
>       > cordially, 
>       > 
>       > t 
>       > 
>       > 
>       > 
>       > 
>       > 
>       > 
>       > 
> 
> 
> 
> 
>