[Newcompiler] Re: About getting one parser

Tue Aug 22 19:11:44 UTC 2006

>
>> So Mathieu will talk to you about that. Here are some of our  
>> questions.
>> - Does it make sense to have the Shout parser be a subclass of a
>> common superclass of the new clean parser?
>
> By the new clean parser I assume you mean the SmaCC-based one -  
> SqueakParser in
> the NewCompiler package?

Yes

>
> I have already done some investigation into this. I have a visitor  
> class which
> goes through the AST trying to gather the same information that the  
> ShoutParser
> produces. I haven't got very far with it, but it is immediately  
> obvious that the
> AST nodes do not, currently, contain enough information to  
> replicate the
> ShoutParser's output.

Excellent

>     The first thing I looked at was comments. These were easily  
> patched up, by
> storing extra information in the scanner. Rather than simply  
> recording the
> comment strings, I record the string + the start position. The  
> question of where
> comments should go in the AST is then avoided, I get them from the  
> scanner.

Mathieu is currently fixing that: the idea is that he stores the  
comment in the nodes in
the following consistent manner:
	- first comments belong to the method node
	- a comment after an expression on the same line belongs to the expression
	- a comment on its own line belongs to the sequence below (first node of the  
sequence)
But he is open to suggestions.
We were wondering whether comments should be handled at the parser level  
or as an annotation added by
a visitor.
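
The three attachment rules listed above can be sketched as a small decision block. This is only an illustration, not Mathieu's code: the block name ownerOf, its two boolean arguments and the answered symbols are all hypothetical, and it assumes the parser (or a visitor) can already tell whether a statement precedes the comment and whether an expression ends on the comment's line.

```smalltalk
| ownerOf |
"ownerOf answers a symbol naming the node that would own a comment.
 Both arguments are hypothetical facts supplied by the parser or a visitor:
   leading  - true when no statement precedes the comment
   sameLine - true when an expression ends on the comment's line, before it"
ownerOf := [:leading :sameLine |
	leading
		ifTrue: [#methodNode]
		ifFalse: [sameLine
			ifTrue: [#precedingExpression]
			ifFalse: [#firstNodeOfFollowingSequence]]].

ownerOf value: true value: false.	"#methodNode"
ownerOf value: false value: true.	"#precedingExpression"
ownerOf value: false value: false.	"#firstNodeOfFollowingSequence"
```

The same decision could live in a parser reduction action or in a visitor that annotates the nodes afterwards; the rule itself does not change.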

>     A good aim for the  parser/scanner/AST, would be to store  
> enough information
> so that the original source could be reproduced exactly. i.e. no  
> reformatting,
> whitespace the same, everything exactly the same as the original.

Indeed, this is a good idea.
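
A minimal workspace sketch of that goal, assuming a flat token list in which whitespace and comments are tokens of their own. The token kinds and the kind -> text representation are made up for the illustration; they are not what SqueakScanner produces today.

```smalltalk
| tokens source rebuilt |
source := 'foo "answer" ^ 42'.

"Hypothetical token list: each entry is kind -> source text.
 Nothing is thrown away, not even the spaces."
tokens := OrderedCollection new.
tokens
	add: #selector -> 'foo';
	add: #whitespace -> ' ';
	add: #comment -> '"answer"';
	add: #whitespace -> ' ';
	add: #return -> '^';
	add: #whitespace -> ' ';
	add: #number -> '42'.

"Concatenating the token texts in order gives back the original source."
rebuilt := tokens inject: '' into: [:acc :each | acc , each value].
rebuilt = source	"true"
```

If this round trip holds for every method, then no formatting information has been lost, wherever the whitespace and comments end up being stored.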

> Whitespace
> could be stored in the scanner, the same way that comments are.  
> Then the only
> question is how the non-whitespace/non-comment parts of the source  
> are broken
> down, and stored.
>
>     For example, #'&&&' could be simply stored as a [symbol] with  
> source
> [#'&&&'] and a certain range. Or, it could be stored as  
> [symbolStart] with
> source [#], followed by [string] with source ['&&&']. There are  
> some oddities in
> the current (old) Parser such as this being a valid symbol
>
>     # "can there really be a comment here?"
>     "and another one here?"
>     thisIsTheSymbol
>
> If this kind of weirdness is to be replicated in the new parser,  
> e.g. for
> backwards compatibility, then it is necessary to store it split up [#]
> [thisIsTheSymbol], with the comments stored in-between, or stored  
> elsewhere
> (e.g. in the scanner).

I would like to avoid this kind of problem. Maybe this is the time  
to stop allowing that!
> # "can there really be a comment here?"
>     "and another one here?"
>     thisIsTheSymbol

> Ok. Back to the question of whether it is a good idea to have the  
> Shout
> parser/scanner as subclasses of the SqueakParser/SqueakScanner.
>
>     If all the Shout parser is doing is some extra processing  
> before calling the
> super method, then it isn't a bad thing.
>     What would be nasty is if the ShoutScanner redefines the
> scannerDefinition/parserDefinition. One reason for doing this is to  
> record extra
> info, by introducing extra reduction actions. For example, pragmas  
> are on my
> mind at the moment, so the grammar currently is ...
>
>     Pragmas:
>          "<" PragmaMessage ">"      {#pragma:}
>         | Pragmas "<" PragmaMessage ">"    {#pragmas:};
>
> but could be expanded so the < and > tokens became non-terminals,  
> and had
> reduction actions...
>
>     PragmaStart: "<" {#pragmaStart: };
>     PragmaEnd: ">" {#pragmaEnd: };
>     Pragmas:
>          PragmaStart PragmaMessage PragmaEnd      {#pragma:}
>         | Pragmas PragmaStart PragmaMessage PragmaEnd    {#pragmas:};

Yes, having a real node for pragmas would help.

> In this way the source ranges of the < and > can be recorded. The  
> grammar can
> easily become very complex if too much of this kind of stuff is  
> done. If the
> grammar of the Smacc parser became overly complex (in either  
> SqueakParser, or a
> Shout subclass), then I would simply leave the Shout parser as it  
> is today, as a
> dedicated hand-written parser.

Indeed.

>> - Can Shout use a non-dedicated parser (and what would then be the
>> points to pay attention to)?
>
> I have covered some above. But there are some other gotchas.
>
> error recovery - if the source is incomplete, Shout still needs to  
> know the
> ranges of the tokens already processed, e.g. '(((((1+' has some  
> ranges for
> parentheses, number, binarySelector; but probably no AST.
>
> garbage - Shout parses on each keystroke. It is easy to create too  
> many objects,
> which take time to create, and time to garbage collect. Here is an  
> example which
> takes a *very* long time to parse...
>
>     [SqueakParser parseMethod: 'a ^10e999999' ] timeToRun.
>
> It takes about 4 minutes.
> So there needs to be a way to defer the creation of literals, until  
> they are
> needed. (Shout simply records the source range of the number, but  
> doesn't create
> the object. That is one of the reasons why it does its own number  
> parsing).
>
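
One way to picture deferring the literal, as a workspace sketch only: keep the source range of the number and build the value the first time it is asked for. The names range, cached and valueBlock are hypothetical, and a small exponent is used so the snippet evaluates quickly; this is not how SqueakParser or Shout is implemented.

```smalltalk
| source range cached valueBlock |
source := 'a ^10e3'.

"Record only where the number sits in the source (characters 4 to 7).
 A highlighter needs nothing more than this."
range := 4 to: 7.

"Build the number object on first request, then reuse it."
cached := nil.
valueBlock := [cached ifNil: [
	cached := (source copyFrom: range first to: range last) asNumber]].

"Text to colour, with no number object created:"
source copyFrom: range first to: range last.	"'10e3'"

"Only a compiler would ever evaluate this:"
valueBlock value.
```

With this split, parsing on each keystroke never pays for a literal such as 10e999999; the cost is moved to whoever actually needs the value.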
>> -....
>>
>> Now Mathieu is working on which nodes we put comments in, and this
>> reminded me that Philippe has the possibility
>> to annotate the AST with information. Is it in the AST?
>
> Consider this question - "Into which nodes should we put whitespace?"
> :)

:)

>> How could we proceed?
>
> I will keep on with my experiments to make Shout use the  
> SqueakParser, and
> hopefully things will become clearer.

By the way, Damien said that for Gutenberg he would need to know not  
only the comments
in a node but also where they were placed (if you read what I wrote above,  
you will see
that you need to distinguish whether the comment was on the same line or not).

Thanks, I really appreciate what you are doing and the spirit it  
brings to Squeak.
Being good is one thing, being friendly is another.

> Cheers,
> Andy
>
>>
>> Mathieu is writing some tests... and dealing with the comments :)
>>
>> Stef
>>
>