[Newcompiler] Fwd: About getting one parser
stéphane ducasse
ducasse at iam.unibe.ch
Tue Aug 22 19:00:15 UTC 2006
Are Andy and Damien on the new compiler list?
Begin forwarded message:
> From: "Andrew Tween" <amtween at hotmail.com>
> Date: 22 August 2006 20:47:58 CEST
> To: stéphane ducasse <ducasse at iam.unibe.ch>, "Marcus Denker"
> <denker at iam.unibe.ch>
> Cc: "math su" <mathieusuen at yahoo.fr>, "Philippe Marschall"
> <philippe.marschall at gmail.com>
> Subject: Re: About getting one parser
>
> Hi Stef and all,
>> We are discussing with Mathieu what to do about the parser.
>> It would indeed be good to have one parser that works well
>> for all the needs we have.
>
> Agreed. Not just the Parser, also the Compiler.
>
>>
>> So mathieu will talk to you about that. Here are some of our
>> questions.
>> - Does it make sense to make the Shout parser a subclass of a common
>> superclass shared with the new clean parser?
>
> By the new clean parser I assume you mean the SmaCC-based one,
> SqueakParser in the NewCompiler package?
>
> I have already done some investigation into this. I have a visitor
> class which goes through the AST trying to gather the same information
> that the ShoutParser produces. I haven't got very far with it, but it
> is immediately obvious that the AST nodes do not currently contain
> enough information to replicate the ShoutParser's output.
>
> The first thing I looked at was comments. These were easily patched
> up by storing extra information in the scanner. Rather than simply
> recording the comment strings, I record the string plus its start
> position. The question of where comments should go in the AST is then
> avoided: I get them from the scanner.
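
A minimal sketch of recording comments together with their start positions, written in Python rather than Squeak and not taken from the original message. The function name `scan_comments` is hypothetical, offsets are 0-based, and only double-quoted comments, single-quoted strings, and `$` character literals are handled.

```python
from typing import List, Tuple


def scan_comments(source: str) -> List[Tuple[int, str]]:
    """Return (start, text) for every "..." comment found in source.

    start is the 0-based offset of the opening double quote.
    """
    comments = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "$":
            # Character literal such as $" or $': skip the next character.
            i += 2
        elif ch == "'":
            # String literal: skip it so a " inside is not seen as a comment.
            i += 1
            while i < n:
                if source[i] == "'":
                    if i + 1 < n and source[i + 1] == "'":
                        i += 2  # '' is an escaped quote inside the string
                        continue
                    break
                i += 1
            i += 1
        elif ch == '"':
            end = source.find('"', i + 1)
            if end == -1:
                end = n - 1  # unterminated comment runs to end of source
            comments.append((i, source[i:end + 1]))
            i = end + 1
        else:
            i += 1
    return comments
```

Because each comment carries its own offset, a consumer such as a syntax highlighter can locate it in the source without the AST having to say which node owns it.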
>
> A good aim for the parser/scanner/AST would be to store enough
> information that the original source could be reproduced exactly,
> i.e. no reformatting, the same whitespace, everything exactly the same
> as the original. Whitespace could be stored in the scanner, the same
> way that comments are. Then the only question is how the
> non-whitespace/non-comment parts of the source are broken down and
> stored.
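
The round-trip goal can be illustrated with a small Python sketch (not Squeak code, and not from the original message). It assumes a deliberately coarse split into whitespace, comments, strings, and everything else; `Segment`, `segment`, and `reproduce` are hypothetical names, and a real scanner would break the "code" pieces into proper tokens.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    kind: str   # "whitespace", "comment", "string", or "code"
    start: int  # 0-based offset into the source
    text: str   # exact source characters, never normalized


# Every character matches exactly one alternative, so nothing is dropped.
_SEGMENT = re.compile(
    r'(?P<whitespace>\s+)'
    r'|(?P<comment>"[^"]*"?)'
    r"|(?P<string>'(?:[^']|'')*'?)"
    r"|(?P<code>[^\s\"']+)"
)


def segment(source: str) -> List[Segment]:
    """Split source into contiguous segments that cover it completely."""
    return [Segment(m.lastgroup, m.start(), m.group())
            for m in _SEGMENT.finditer(source)]


def reproduce(segments: List[Segment]) -> str:
    """Rebuild the original source from its segments."""
    return "".join(s.text for s in segments)
```

As long as whitespace and comments are kept as segments in their own right, `reproduce(segment(source))` gives back the source character for character, whatever granularity is chosen for the remaining parts.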
>
> For example, #'&&&' could simply be stored as a [symbol] with source
> [#'&&&'] and a certain range. Or it could be stored as [symbolStart]
> with source [#], followed by [string] with source ['&&&']. There are
> some oddities in the current (old) Parser, such as this being a valid
> symbol:
>
> # "can there really be a comment here?"
> "and another one here?"
> thisIsTheSymbol
>
> If this kind of weirdness is to be replicated in the new parser, e.g.
> for backwards compatibility, then it is necessary to store it split up
> as [#] [thisIsTheSymbol], with the comments stored in between, or
> stored elsewhere (e.g. in the scanner).
>
> OK. Back to the question of whether it is a good idea to have the
> Shout parser/scanner as subclasses of the SqueakParser/SqueakScanner.
>
> If all the Shout parser is doing is some extra processing before
> calling the super method, then it isn't a bad thing.
> What would be nasty is if the ShoutScanner redefines the
> scannerDefinition/parserDefinition. One reason for doing this is to
> record extra info by introducing extra reduction actions. For example,
> pragmas are on my mind at the moment, so the grammar currently is ...
>
> Pragmas:
> "<" PragmaMessage ">" {#pragma:}
> | Pragmas "<" PragmaMessage ">" {#pragmas:};
>
> but could be expanded so that the < and > tokens become non-terminals
> with reduction actions...
>
> PragmaStart: "<" {#pragmaStart: };
> PragmaEnd: ">" {#pragmaEnd: };
> Pragmas:
> PragmaStart PragmaMessage PragmaEnd {#pragma:}
> | Pragmas PragmaStart PragmaMessage PragmaEnd {#pragmas:};
>
> In this way the source ranges of the < and > can be recorded. The
> grammar can easily become very complex if too much of this kind of
> thing is done. If the grammar of the SmaCC parser became overly
> complex (in either SqueakParser or a Shout subclass), then I would
> simply leave the Shout parser as it is today, as a dedicated
> hand-written parser.
>
>> - Can Shout use a non-dedicated parser (and if so, what points would
>> need attention)?
>
> I have covered some above. But there are some other gotchas.
>
> error recovery - if the source is incomplete, Shout still needs to
> know the ranges of the tokens already processed. e.g. '(((((1+' has
> some ranges for parentheses, number, binarySelector; but probably no
> AST.
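
One way to keep ranges available for incomplete source is to collect them while scanning, independently of whether a parse ever succeeds. The Python sketch below illustrates that idea only; it is not how Shout or SqueakParser work internally. The token classes are a small hypothetical subset, and ranges are 0-based with an exclusive end.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified token classes; the kind names follow the message.
_TOKEN = re.compile(
    r"(?P<parenthesis>[()])"
    r"|(?P<number>\d+)"
    r"|(?P<binarySelector>[-+*/\\<>=~,@%&|?!]+)"
    r"|(?P<identifier>[A-Za-z_]\w*)"
    r"|(?P<whitespace>\s+)"
    r"|(?P<unknown>.)",
    re.DOTALL,
)


def token_ranges(source: str) -> List[Tuple[str, int, int]]:
    """Return (kind, start, end) for each token, skipping whitespace.

    No tree is built, so unbalanced or truncated input still yields
    a range for every token seen so far.
    """
    return [(m.lastgroup, m.start(), m.end())
            for m in _TOKEN.finditer(source)
            if m.lastgroup != "whitespace"]
```

For the input `'(((((1+'` this yields five parenthesis ranges, one number range, and one binarySelector range, even though no complete expression exists.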
>
> garbage - Shout parses on each keystroke. It is easy to create too
> many objects, which take time to create and time to garbage collect.
> Here is an example which takes a *very* long time to parse...
>
> [SqueakParser parseMethod: 'a ^10e999999' ] timeToRun.
>
> It takes about 4 minutes.
> So there needs to be a way to defer the creation of literals until
> they are needed. (Shout simply records the source range of the number,
> but doesn't create the object. That is one of the reasons why it does
> its own number parsing.)
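
Deferring literal creation can be sketched as a token that stores only its source range and builds the value on first request. This is a Python illustration under simplifying assumptions, not the Shout or NewCompiler implementation: `LazyNumber` is a hypothetical class, and it understands only non-negative integers with an optional `e` exponent.

```python
class LazyNumber:
    """A number literal that records where it is, not what it equals."""

    def __init__(self, source: str, start: int, stop: int):
        self.source = source
        self.start = start   # 0-based, inclusive
        self.stop = stop     # exclusive
        self._value = None   # filled in on first access to .value

    @property
    def text(self) -> str:
        """The literal exactly as written in the source."""
        return self.source[self.start:self.stop]

    @property
    def evaluated(self) -> bool:
        """True once the numeric object has actually been built."""
        return self._value is not None

    @property
    def value(self) -> int:
        """Build the number on demand and cache it."""
        if self._value is None:
            mantissa, _, exponent = self.text.partition("e")
            self._value = int(mantissa) * 10 ** int(exponent or "0")
        return self._value
```

A highlighter only ever reads `start`, `stop`, and `text`, so a literal like `10e999999` costs nothing beyond the scan; the expensive arithmetic happens only if a compiler later asks for `value`.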
>
>> -....
>>
>> Now Mathieu is working on which nodes we should put comments in, and
>> this reminded me that Philippe has a way to annotate the AST with
>> information. Is it in the AST?
>
> Consider this question - "Into which nodes should we put whitespace?" :)
>
>>
>> How could we proceed?
>
> I will keep on with my experiments to make Shout use the SqueakParser,
> and hopefully things will become clearer.
> Cheers,
> Andy
>
>>
>> Mathieu writes some tests... and deals with the comments :)
>>
>> Stef
>>
>