[Seaside] [BUG] in IAHtmlParser

Fri, 29 Mar 2002 14:14:14 -0800 (PST)

On Fri, 29 Mar 2002, Damon Anderson wrote:

> One particularly invalid use of HTML that I see constantly is sticking
> <FORM> and related tags in between bits of the table. The browser
> doesn't render anything between <TR> and <TD>, for example, so the
> paragraph breaks that <FORM> normally puts into the document don't ruin
> the careful table-based formatting that the designer is attempting. And
> I've seen opening <FORM>s at a different "level" in the table than the
> closing </FORM>, which causes tag overlap like above. It's totally
> wrong, but if the browser accepts it why can't the parser? That's the
> reaction I get from the designers, anyway.

Hmm, yes, that breaks things pretty badly.  Thanks for the example, it's
about to be added to the unit tests.

> Also, I meant catch up to the browser, not to the spec. And not only the
> browser, but anything else which is going to be processing the HTML
> after my parser has gotten its hands on it: ColdFusion, JSP, etc. I
> don't think that a parser can realistically keep up with all of those
> variations. (This doesn't apply to seaside, of course, I'm just
> justifying my decisions at this point.)

Absolutely true.  In the case where you're dealing with non-HTML tags,
trying to follow a spec is madness.

> That's the kind of situation where I think a lax parser is necessary. It
> definitely isn't ideal at all, but thinking about how the next *TML
> standard will make things easier doesn't get the project done today. I
> don't mean any offense, I'm just trying to be realistic.

No offense taken.  I think we agree on the goals, just not on the way to
achieve them.  I too think the parser should be as lax as possible, but I
think of not requiring explicit </p> tags as being more lax, not less.  I
do want the parser to be able to reasonably handle whatever HTML the
designers throw at it, although in the case of Seaside we can assume that
this only includes known HTML tags, not CFML, JSP, etc.  The first step is
to take part of your advice and store dangling close tags as text - this
should satisfy the byte-identical output constraint.  However, if the
structure of something as awful as "<tr><form><td>foo</td></tr></form>" is
ever significant (say those tr and td are both repeating), I really have
no clue how to properly handle that... I don't think either of our
approaches will always build a useful tree from such stuff.
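
A rough illustration of the "dangling close tags as text" idea, written in
Python rather than Smalltalk and not taken from IAHtmlParser: the names
TAG, Node and parse are hypothetical, and the sketch assumes that a close
tag only matches the innermost open element and that each tag's original
source text is kept verbatim. Void tags such as <br> are not special-cased,
so they simply stay open.

```python
import re

# Hypothetical tokenizer: matches <name ...> and </name>, nothing fancier.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

class Node:
    def __init__(self, name, open_src=""):
        self.name = name
        self.open_src = open_src   # original text of the open tag
        self.close_src = ""        # stays empty if the element is never closed
        self.children = []         # Node instances or plain strings

    def serialize(self):
        inner = "".join(c if isinstance(c, str) else c.serialize()
                        for c in self.children)
        return self.open_src + inner + self.close_src

def parse(html):
    root = Node("#root")
    stack = [root]
    pos = 0
    for m in TAG.finditer(html):
        if m.start() > pos:
            stack[-1].children.append(html[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            node = Node(name, m.group(0))
            stack[-1].children.append(node)
            stack.append(node)
        elif len(stack) > 1 and stack[-1].name == name:
            stack.pop().close_src = m.group(0)
        else:
            # Dangling close tag: keep it as text so output is unchanged.
            stack[-1].children.append(m.group(0))
    if pos < len(html):
        stack[-1].children.append(html[pos:])
    return root

src = "<tr><form><td>foo</td></tr></form>"
assert parse(src).serialize() == src
```

The output round-trips, but the tree shows the problem described above: the
</tr> ends up as a text child of the form, and the tr itself is never
closed, so nothing useful can be said about where a repeating row ends.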

I wonder if there's some selective broadening that could be done to the
spec to handle such cases better while still being smart about things like
omitted </li> tags?
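
One way such a broadening might look, as a hedged Python sketch and not a
description of any existing parser: a small rule table says which open
elements a new start tag implicitly closes (so omitted </li> still works),
and a second, purely hypothetical set names tags such as form that are kept
as position markers instead of containers, so they can overlap table rows
without disturbing the tree. IMPLIED_END, MARKERS and outline are invented
names, and the rule table is deliberately tiny.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Assumed rules: a start tag named by the key implicitly closes the
# innermost open element if that element's name is in the value set.
IMPLIED_END = {
    "li": {"li"},
    "p": {"p"},
    "td": {"td"},
    "tr": {"td", "tr"},
}

# Assumed broadening: these tags never open or close a tree node.
MARKERS = {"form"}

def outline(html):
    """Return the tree as nested lists of the form [name, child, ...]."""
    root = ["#root"]
    stack = [root]
    pos = 0
    for m in TAG.finditer(html):
        text = html[pos:m.start()]
        if text:
            stack[-1].append(text)
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if name in MARKERS:
            stack[-1].append(m.group(0))      # marker kept as text
        elif closing:
            if any(node[0] == name for node in stack[1:]):
                # Close the nearest matching element and everything inside it.
                while stack.pop()[0] != name:
                    pass
            else:
                stack[-1].append(m.group(0))  # dangling close tag as text
        else:
            while len(stack) > 1 and stack[-1][0] in IMPLIED_END.get(name, ()):
                stack.pop()
            node = [name]
            stack[-1].append(node)
            stack.append(node)
    if html[pos:]:
        stack[-1].append(html[pos:])
    return root

print(outline("<ul><li>a<li>b</ul>"))
print(outline("<tr><form><td>foo</td></tr></form>"))
```

With these rules the two li elements become siblings, and the tr/td
structure of the awkward example comes out intact with <form> and </form>
left in place as text. Whether treating form as a marker is acceptable
depends on whether anything downstream needs form to be a real node.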