[Seaside] [BUG] in IAHtmlParser

Damon Anderson damon@spof.net
Fri, 29 Mar 2002 14:37:10 -0600


Avi Bryant writes:
 > Hmm. Although I agree that outputting the same HTML you get in is
 > important, I'm not sure that leaving lone tags dangling is the best
 > way to do it. In a case like "<p>foo<p>bar", the structure of the
 > document really should be treated as "<p>foo</p><p>bar</p>", not
 > "<p></p>foo<p></p>bar". I don't believe the latter properly reflects
 > the document; it's certainly not how I parse the document
 > intuitively.

It was definitely designed with a different philosophy in mind. My goal
was to try not to assume anything. If the user wants to manipulate a
paragraph block, then they need to put in a closing </p>. The reason
I decided to leave dangling tags in the document is that it makes it
possible to handle cases like "<b><i>foo</b></i>". There's obviously no
tree structure which could possibly represent that without munging, so I
punt. By generating a tree like this you at least end up with the proper
output, even if you can't generate a good tree from that part of the
document:

  Tag{name=B}
  Tag{name=I}
    child: Text{"foo"}
    child: Text{"</B>"}

 > What the Seaside parser does is record whether or not a given tag
 > was explicitly closed, and only output a close tag for it if it was.
 > This ends up coming very close to outputting identical HTML to what
 > it's given, whitespace included. Except in the case of a dangling
 > close tag, I don't think I've ever seen it output anything other
 > than what it was given.

It sounds like that's the main way that our parsers differ. I did the
same thing (although using a tag stack instead of marking the tags,
for a couple of reasons), but put invalid HTML back into the stream as
text. You can't do any transformations on it, but you can't anyway if
the tree is invalid. I figured that trying to infer what the user meant
was going to doom my parser to bugginess and "browser catch-up". Just
look at the state of web browsers today. That's not a road I want my
code to go down! Yikes.
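
For comparison, here's the close-tag bookkeeping as I understand
Avi's description: the structure gets inferred, but a close tag is
only written back out if the source actually contained one. Again,
this is just illustrative Python, not Seaside's actual parser:

  import re

  class Node:
      def __init__(self, name=None, text=None):
          self.name, self.text = name, text
          self.children = []
          self.explicitly_closed = False
      def html(self):
          if self.text is not None:
              return self.text
          inner = ''.join(c.html() for c in self.children)
          end = '</%s>' % self.name if self.explicitly_closed else ''
          return '<%s>%s%s' % (self.name, inner, end)

  TOKEN = re.compile(r'<(/?)(\w+)>|([^<]+)')

  def parse(source):
      root = Node(name='#root')
      stack = [root]
      for slash, name, text in TOKEN.findall(source):
          if text:
              stack[-1].children.append(Node(text=text))
          elif not slash:
              # Crude structural inference: a repeated tag name
              # implicitly closes the element already open.
              if stack[-1].name == name:
                  stack.pop()
              node = Node(name=name)
              stack[-1].children.append(node)
              stack.append(node)
          else:
              # Pop back to the matching element and remember that it
              # really was closed; a close tag that matches nothing is
              # simply dropped in this sketch.
              for depth in range(len(stack) - 1, 0, -1):
                  if stack[depth].name == name:
                      stack[depth].explicitly_closed = True
                      del stack[depth:]
                      break
      return root

  doc = parse('<p>foo<p>bar')
  # The tree really is two sibling P elements with their text...
  print([(c.name, c.children[0].text) for c in doc.children])
  # ...but the output is still byte-for-byte what came in:
  print(''.join(c.html() for c in doc.children))   # -> <p>foo<p>bar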

 > Depending on what you're doing with the tree, having an incorrect
 > structure, and having some of the tags inlined, is fine. Seaside
 > includes a full macro system for its templates, which in theory could
 > perform arbitrary transformations on the tree. Having either an
 > incomplete tree or one which doesn't match the developer's intuitions
 > about the document's structure (isn't it great when a syntax is so
 > complex that you're never quite certain how it'll be parsed?),
 > restricts this power considerably. Now, this may be unnecessary power
 > - I've only used the macro system for very simple cases that would
 > almost certainly work with the parser you describe as well. But I
 > like having that power in reserve, and would be somewhat loath to
 > give it up without a very good reason.

I agree, that's a very useful thing to be able to do. And I'm not trying
to persuade anybody to switch to my parser, but consider: if the HTML
document is valid, you have that capability with my parsing philosophy
as well (remember that the "tag RLE compression" I mentioned before is
optional). The differences mostly come up when you start talking about
invalid documents which cannot be converted into trees. Do you "punt"
and offer less functionality, but still guarantee that you won't mangle
the document, or do you insist on correctness so that you can guarantee
full functionality?
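
Just to make "arbitrary transformations" concrete, here's a toy
example of the kind of tree rewrite that only makes sense once the
structure is real. The tuple representation is invented for this
example; it isn't either parser's:

  def transform(node, rule):
      if node[0] == 'text':
          return node
      _, name, children = node
      rebuilt = ('tag', name, [transform(c, rule) for c in children])
      return rule(rebuilt)

  # Example rule: rewrite every <i> element into an <em> element.
  def italics_to_em(node):
      _, name, children = node
      return ('tag', 'em', children) if name == 'i' else node

  doc = ('tag', 'p', [('text', 'hello '),
                      ('tag', 'i', [('text', 'world')])])
  print(transform(doc, italics_to_em))
  # -> ('tag', 'p', [('text', 'hello '),
  #                  ('tag', 'em', [('text', 'world')])])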

 > (Quick example - I had a discussion with Marcel a while ago about
 > templates that included no special identifiers or marking whatsoever,
 > with totally external definitions of which elements were to be
 > treated specially. For that, the parser simply can't make any
 > assumptions about what structural information is useful and what
 > isn't).

I've thought about that as well. Unless you have sophisticated pattern
matching (HTML "shape" detection, basically), doesn't that tie the
external definition to the formatting of your HTML?
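
A contrived illustration of the coupling I mean, with the external
definition reduced to "the Nth element is the interesting one":

  def nth_cell(cells, n):
      # stand-in for an external, marker-free binding by position
      return cells[n]

  row_v1 = ['<td>name</td>', '<td>price</td>']
  print(nth_cell(row_v1, 1))    # -> <td>price</td>

  # A designer adds a spacer cell, and the unchanged external
  # definition now quietly binds to the wrong element.
  row_v2 = ['<td>name</td>', '<td>&nbsp;</td>', '<td>price</td>']
  print(nth_cell(row_v2, 1))    # -> <td>&nbsp;</td>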

This is all very interesting stuff, either way.

-damon