[Newcompiler] Fwd: About getting one parser

Tue Aug 22 19:00:15 UTC 2006

Are Andy and Damien on the new compiler list?

Begin forwarded message:

> From: "Andrew Tween" <amtween at hotmail.com>
> Date: 22 August 2006 20:47:58 CEST
> To: stéphane ducasse <ducasse at iam.unibe.ch>, "Marcus Denker"  
> <denker at iam.unibe.ch>
> Cc: "math su" <mathieusuen at yahoo.fr>, "Philippe Marschall"  
> <philippe.marschall at gmail.com>
> Subject: Re: About getting one parser
>
> Hi Stef and all,
>> We are discussing with Mathieu what to do about the parser.
>> It would indeed be good to have one parser that works well for all
>> the needs we have.
>
> Agreed. Not just the Parser, also the Compiler.
>
>>
>> So Mathieu will talk to you about that. Here are some of our
>> questions.
>> - Does it make sense to make the Shout parser a subclass of a common
>> superclass shared with the new clean parser?
>
> By the new clean parser I assume you mean the SmaCC-based one,
> SqueakParser in the NewCompiler package?
>
> I have already done some investigation into this. I have a visitor
> class which goes through the AST trying to gather the same information
> that the ShoutParser produces. I haven't got very far with it, but it
> is immediately obvious that the AST nodes do not currently contain
> enough information to replicate the ShoutParser's output.
>
>     The first thing I looked at was comments. These were easily
> patched up by storing extra information in the scanner. Rather than
> simply recording the comment strings, I record the string plus its
> start position. The question of where comments should go in the AST is
> then avoided: I get them from the scanner.
>
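The idea of recording each comment together with its start position can be sketched as below. This is only an illustration in Python, not the NewCompiler scanner: the function name scan_comments is hypothetical, and it assumes a simplified Smalltalk lexical model (comments in double quotes, strings in single quotes with '' as an escaped quote, character literals such as $" ignored).

```python
def scan_comments(source):
    """Return (start, text) pairs for every "..." comment in source.

    Hypothetical helper. String literals are skipped so that a double
    quote inside a string is not mistaken for a comment delimiter.
    """
    comments = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "'":
            # Skip a string literal; '' inside it is an escaped quote.
            i += 1
            while i < len(source):
                if source[i] == "'":
                    if i + 1 < len(source) and source[i + 1] == "'":
                        i += 2
                        continue
                    break
                i += 1
            i += 1  # step past the closing quote (or past the end)
        elif ch == '"':
            end = source.find('"', i + 1)
            if end == -1:
                # Unterminated comment: keep everything up to the end.
                comments.append((i, source[i:]))
                i = len(source)
            else:
                comments.append((i, source[i:end + 1]))
                i = end + 1
        else:
            i += 1
    return comments
```

Because the positions live beside the scanner rather than inside AST nodes, a highlighter could look them up by offset without deciding which node owns each comment.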
>     A good aim for the parser/scanner/AST would be to store enough
> information so that the original source could be reproduced exactly,
> i.e. no reformatting, whitespace the same, everything exactly the same
> as the original. Whitespace could be stored in the scanner, the same
> way that comments are. Then the only question is how the
> non-whitespace/non-comment parts of the source are broken down and
> stored.
>
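A minimal sketch of the "reproduce the source exactly" goal, again in Python and under assumptions: the names lossless_tokens and TOKEN_RE are hypothetical, and the token categories are far coarser than what a real Squeak scanner would produce. The only property it tries to show is that when whitespace and comments are kept as tokens, concatenating the token texts gives back the original source unchanged.

```python
import re

# Every character is whitespace, starts a comment, starts a string, or
# belongs to a run of "code", so the matches tile the source with no gaps.
TOKEN_RE = re.compile(
    r'(?P<whitespace>\s+)'
    r'|(?P<comment>"[^"]*"?)'
    r"|(?P<string>'(?:[^']|'')*'?)"
    r'|(?P<code>[^\s"\']+)'
)


def lossless_tokens(source):
    """Return (kind, start, text) triples covering all of source."""
    return [(m.lastgroup, m.start(), m.group())
            for m in TOKEN_RE.finditer(source)]


def rebuild(tokens):
    """Concatenate token texts; this should equal the original source."""
    return "".join(text for _kind, _start, text in tokens)
```

How finely the "code" runs are split (for instance [symbol] versus [symbolStart] plus [string]) is a separate decision that this sketch leaves open.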
>     For example, #'&&&' could be simply stored as a [symbol] with
> source [#'&&&'] and a certain range. Or it could be stored as
> [symbolStart] with source [#], followed by [string] with source
> ['&&&']. There are some oddities in the current (old) Parser, such as
> this being a valid symbol:
>
>     # "can there really be a comment here?"
>     "and another one here?"
>     thisIsTheSymbol
>
> If this kind of weirdness is to be replicated in the new parser, e.g.
> for backwards compatibility, then it is necessary to store it split up
> as [#] [thisIsTheSymbol], with the comments stored in between, or
> stored elsewhere (e.g. in the scanner).
>
> Ok. Back to the question of whether it is a good idea to have the
> Shout parser/scanner as subclasses of the SqueakParser/SqueakScanner.
>
>     If all the Shout parser is doing is some extra processing before
> calling the super method, then it isn't a bad thing.
>     What would be nasty is if the ShoutScanner redefines the
> scannerDefinition/parserDefinition. One reason for doing this is to
> record extra info by introducing extra reduction actions. For example,
> pragmas are on my mind at the moment, so the grammar currently is ...
>
>     Pragmas:
>          "<" PragmaMessage ">"      {#pragma:}
>         | Pragmas "<" PragmaMessage ">"    {#pragmas:};
>
> but could be expanded so that the < and > tokens become non-terminals
> with reduction actions...
>
>     PragmaStart: "<" {#pragmaStart: };
>     PragmaEnd: ">" {#pragmaEnd: };
>     Pragmas:
>          PragmaStart PragmaMessage PragmaEnd      {#pragma:}
>         | Pragmas PragmaStart PragmaMessage PragmaEnd    {#pragmas:};
>
> In this way the source ranges of the < and > can be recorded. The
> grammar can easily become very complex if too much of this kind of
> stuff is done. If the grammar of the SmaCC parser became overly complex
> (in either SqueakParser or a Shout subclass), then I would simply leave
> the Shout parser as it is today, as a dedicated hand-written parser.
>
>> - Can Shout use a non-dedicated parser (and what would then be the
>> points to pay attention to)?
>
> I have covered some above. But there are some other gotchas.
>
> error recovery - if the source is incomplete, Shout still needs to
> know the ranges of the tokens already processed, e.g. '(((((1+' has
> some ranges for parenthesis, number, binarySelector, but probably no
> AST.
>
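To illustrate the error-recovery point: token ranges can be collected by scanning alone, so they remain available even when no AST can be built. The sketch below is Python with hypothetical names (RANGE_RE, token_ranges); the kind names are borrowed from the message above, ranges are 0-based and half-open, and the character set for binary selectors is an assumption.

```python
import re

# Deliberately tiny token model: just enough for the '(((((1+' example.
RANGE_RE = re.compile(
    r"(?P<parenthesis>[()])"
    r"|(?P<number>\d+)"
    r"|(?P<binarySelector>[-+*/\\<>=~@%&?,|]+)"
    r"|(?P<identifier>[A-Za-z_]\w*)"
)


def token_ranges(source):
    """Return (kind, start, stop) for each token the scanner recognises.

    Nothing here depends on the source being a complete expression.
    """
    return [(m.lastgroup, m.start(), m.end())
            for m in RANGE_RE.finditer(source)]
```

For the incomplete input '(((((1+' this yields five parenthesis ranges, one number range and one binarySelector range, which is all a highlighter needs to colour what has been typed so far.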
> garbage - Shout parses on each keystroke. It is easy to create too
> many objects, which take time to create and time to garbage collect.
> Here is an example which takes a *very* long time to parse...
>
>     [SqueakParser parseMethod: 'a ^10e999999' ] timeToRun.
>
> It takes about 4 minutes.
> So there needs to be a way to defer the creation of literals until
> they are needed. (Shout simply records the source range of the number,
> but doesn't create the object. That is one of the reasons why it does
> its own number parsing.)
>
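One way to picture deferred literal creation is a node that stores only the source range and builds the number on first request. This is a Python sketch under assumptions, not how Shout or SqueakParser are implemented: the class LazyNumberLiteral is hypothetical, and it only handles integers with an optional non-negative "e" exponent.

```python
class LazyNumberLiteral:
    """Remember where a number is in the source; build it only on demand."""

    def __init__(self, source, start, stop):
        self.source = source
        self.start = start   # 0-based, inclusive
        self.stop = stop     # 0-based, exclusive
        self._value = None
        self._built = False

    @property
    def text(self):
        # Cheap: just a slice of the source, no arithmetic.
        return self.source[self.start:self.stop]

    def value(self):
        # The expensive conversion happens at most once, and only if asked.
        if not self._built:
            mantissa, _, exponent = self.text.partition("e")
            self._value = int(mantissa) * 10 ** int(exponent or 0)
            self._built = True
        return self._value
```

A highlighter would only ever read start, stop and text, so typing 'a ^10e999999' costs a slice per keystroke; the large integer is created only if something, such as the compiler, actually calls value().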
>> -....
>>
>> Now Mathieu is working on which nodes we should put comments in, and
>> this reminded me that Philippe has the possibility to annotate the AST
>> with information. Is it in the AST?
>
> Consider this question - "Into which nodes should we put whitespace?"
> :)
>
>>
>> How could we proceed?
>
> I will keep on with my experiments to make Shout use the SqueakParser,
> and hopefully things will become clearer.
> Cheers,
> Andy
>
>>
>> Mathieu writes some tests... and deals with the comments :)
>>
>> Stef
>>
>