[Seaside] [BUG] in IAHtmlParser

Damon Anderson damon@spof.net
Fri, 29 Mar 2002 15:44:20 -0600


[Warning: I'm really starting to ramble. By now I'm inclined to just
"agree to disagree" about this, but it's an interesting thing to discuss
nonetheless.]

Avi Bryant writes:
 > How many designers bother putting in the </p>? Everybody knows where
 > a paragraph gets closed, I don't really see why the parser should
 > have to be babied by giving it extra information.

Because </p> is the easy case, and I didn't want to have to update the
parser as new tags are added (see below about ColdFusion/etc), as
browsers change their parsing slightly between versions, etc. The
correct assumption to make, as far as an HTML person is concerned, is
"the same assumption that the browser/etc makes", and I didn't want to
play catch-up there.

This was going to be put into a shrink-wrapped product, so I wouldn't
have had the opportunity to go back and fix invalid assumptions when
clients discovered that a tag they used was being munged by the parser.
Also, parsing wasn't done at page display time; it was part of a
"build" process, sort of, so what about all of the wacky tags in those
HTML documents that need to be passed through unharmed? I just didn't
(and still don't) believe that I could safely put assumptions into the
parser. I should have mentioned that this parser was part of a content
management system, not a dynamic templating engine. Not quite the same
context as Seaside.

 > Yes, that makes sense, although I'm skeptical that there are cases
 > where "<b><i>foo</i></b>", which is how the Seaside parser would
 > treat that, isn't as good or better. There may be pathological cases
 > of broken browsers, of course - do you have any examples of these?

It's not that the browser requires <b><i></b></i>, or equivalent; it's
that my goal was to keep the document intact by any means necessary.
The goal was that any HTML document without "marking" would come out
the other end of the parser byte-for-byte identical, and this was the
only way I could think of to do that successfully.
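
To make that concrete: the trick is basically never to throw away the
original bytes. Tokenize loosely, remember the exact source slice for
every token, and reserialize untouched tokens from those slices. A
rough sketch of the idea in Python (purely illustrative, nothing to do
with the actual parser):

import re

# Rough sketch, not the real thing: tokenize loosely into tags and
# text, but keep the exact source slice for every token so that
# re-serializing untouched tokens reproduces the input byte-for-byte.
TOKEN = re.compile(r'<[^>]*>?|[^<]+')

def tokenize(html):
    """Split into (kind, raw) pairs without imposing any tree structure."""
    tokens = []
    for match in TOKEN.finditer(html):
        raw = match.group(0)
        kind = 'tag' if raw.startswith('<') else 'text'
        tokens.append((kind, raw))
    return tokens

def serialize(tokens):
    """Concatenating the raw slices gives back the original document."""
    return ''.join(raw for _, raw in tokens)

doc = '<table><form><tr><td><b><i>foo</b></i></table></form>'
assert serialize(tokenize(doc)) == doc   # round-trips even "street HTML"

Anything the parser doesn't recognize, or doesn't want to touch, falls
straight through unchanged.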

One particularly invalid use of HTML that I see constantly is sticking
<FORM> and related tags in between bits of a table. The browser
doesn't render anything between <TR> and <TD>, for example, so the
paragraph breaks that <FORM> normally puts into the document don't
ruin the careful table-based formatting the designer is attempting.
I've also seen opening <FORM>s at a different "level" in the table
than the closing </FORM>, which causes tag overlap like the <b><i>
case above. It's totally wrong, but if the browser accepts it, why
can't the parser? That's the reaction I get from the designers,
anyway.
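
To be clear about what I mean, here's a made-up but representative
snippet (the ACTION is just a placeholder):

<TABLE>
  <TR>
    <FORM ACTION="...">            <!-- opened between <TR> and <TD> -->
    <TD><INPUT NAME="q"></TD>
    <TD><INPUT TYPE="submit"></TD>
  </TR>
  </FORM>                          <!-- closed at a different "level" -->
</TABLE>

The browser quietly swallows the stray markup between <TR> and <TD>,
renders the form controls inside the cells, and the designer never
sees a problem.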

 > No, you don't, and that's the point: "<ul><li>1<li>2</ul>" is a
 > *valid* HTML document, and it is crucial for manipulating it properly
 > that the 1 and 2 be children of the list items, not siblings. The
 > only way to correctly parse valid HTML documents is to follow the
 > HTML specification. Believe me, we tried to come up with simpler
 > heuristics, but it's not worth it. Having to "play catch up" with the
 > spec is, IMO, a reasonable price to pay, particularly since the next
 > transition will presumably be to XHTML, and make this all moot.

You're right, it is valid. What I meant when I said "valid" was
"including a close tag whenever a close tag is implied," which is a
much stricter definition of "valid" than HTML uses, so I'll just
retract that statement.
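
For what it's worth, the implied-close rule Avi is describing is easy
enough to illustrate. A toy sketch in Python (nothing like the real
Seaside parser, just the "a new <li> closes the open one" rule):

from html.parser import HTMLParser

# Toy tree builder: apply the implied-close rule for <li>, so that
# "<ul><li>1<li>2</ul>" parses with 1 and 2 as *children* of their
# list items rather than as siblings.
class ToyTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ('#root', [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag == 'li' and self.stack[-1][0] == 'li':
            self.stack.pop()      # new <li> implicitly closes the open one
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        while len(self.stack) > 1:
            if self.stack.pop()[0] == tag:
                break

    def handle_data(self, data):
        self.stack[-1][1].append(data)

builder = ToyTreeBuilder()
builder.feed('<ul><li>1<li>2</ul>')
print(builder.root)
# ('#root', [('ul', [('li', ['1']), ('li', ['2'])])])

That's exactly the kind of per-tag knowledge I didn't want to
maintain, but if you're willing to track the spec, it does give the
tree Avi describes.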

Also, I meant catch-up to the browser, not to the spec. And not only
the browser, but anything else that was going to be processing the
HTML after my parser had gotten its hands on it: ColdFusion, JSP, etc.
I don't think a parser can realistically keep up with all of those
variations. (This doesn't apply to Seaside, of course; I'm just
justifying my decisions at this point.) For less pathological cases
than "street HTML" we had other parsers (XML) that generated the same
type of node tree, so we wouldn't have had to change the HTML parser
to support XHTML; it would have been a different parser.

Finally, I should mention that I've used things like XMLC (in
Enhydra), which are stricter about the HTML they parse, and had a LOT
of trouble with them. I tried to import a few HTML pages from the
company's web site and got no fewer than 300 unrecoverable parse
errors on some pages, not to mention the scores of tags that were
dropped or added because they weren't valid where they appeared in the
document. Yes, the HTML was *completely invalid*, but the designers
weren't about to rewrite everything to satisfy my choice of HTML
parser.
That's the kind of situation where I think a lax parser is necessary. It
definitely isn't ideal at all, but thinking about how the next *TML
standard will make things easier doesn't get the project done today. I
don't mean any offense; I'm just trying to be realistic. I wish I could
dictate the variety or quality of HTML used, but we were trying to sell
our product to people with large quantities of existing HTML that they
were not willing to modify.

-damon