[Seaside] [BUG] in IAHtmlParser

Damon Anderson damon@spof.net
Fri, 29 Mar 2002 13:45:19 -0600

Julian Fitzell writes:
 > Yeah, sorry about that. I wouldn't say the parser is strictly
 > compliant but it ended up being necessary to make it somewhat
 > compliant. The reason is that in order to allow all the cases that
 > people use all the time, we would essentially not be able to support
 > valid HTML (even though it is probably never used).
 > The problem is, frankly that the HTML spec is insane! There are
 > ridiculous combinations of only allowing certain tags within others
 > and implicitly closing tags for you. This implicit closing is most of
 > the reason why I had to enforce some of the rules about what tags can
 > be contained inside others. This is why most people never use </p> or
 > </li>, allowing them to be closed implicitly by the next non-inline
 > tag (usually the next <p> or <li> in these cases).
 > But it sucks. XML is often overused but in this case, HTML so wants
 > to be XML anyway I wish browser developers would hurry up and start
 > adding support for XHTML so I can start writing my webpages with it.

I wrote an HTML parser last year which handled malformed HTML pretty
well (better than any other parser whose source I could find at the
time, which is why I ended up designing one from scratch). The
philosophy behind it was quite different. It made no attempt to
interpret the spec, so the resulting tree reflected the HTML document
exactly as written, not the nesting structure specified by the spec.

It used a tag stack, but it handled lone tags differently: it just left
them there in the stream, including dangling close tags. This meant that
the tree could look odd, but it also meant that when regenerating the
source you'd get back the same broken HTML (including whitespace in
most cases), which is the important thing, IMO. At least
for the project that I wrote it for, it would have been a sin for the
templating code to mangle the designer's carefully crafted (if totally
invalid) HTML. It did other things too, like optionally leaving runs of
uninstrumented tags unparsed as text nodes, which reduced the chance that
an isolated </b> would ruin the entire tree. If you didn't put any
markers into "<b><i>foo</b></i>" it would just leave it alone. If this
isn't clear I can go into more detail.
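
To make the tag-stack idea a bit more concrete, here is a minimal sketch
in Python. It is not the original Java/ANTLR code (that source is gone);
the names TAG, Element, parse and serialize are hypothetical, and the
regex tokenizer is a simplifying assumption that ignores comments,
doctypes and ">" inside attribute values. It only shows the one rule
described above: a tag with no partner stays in the stream as a plain
string, so serializing the tree gives back the input unchanged.

```python
import re

# Assumed tokenizer: "<", optional "/", a tag name, anything up to ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


class Element:
    """A matched open/close pair; children are str or Element."""

    def __init__(self, name, open_src):
        self.name = name
        self.open_src = open_src  # exact source text of the open tag
        self.close_src = ""       # exact source text of the close tag
        self.children = []


def _demote(stack):
    """Pop an element that was never closed and splice its open tag and
    children back into its parent as plain stream items."""
    unclosed = stack.pop()
    parent = stack[-1]
    parent.children.pop()  # the unclosed element is always the last child
    parent.children.append(unclosed.open_src)
    parent.children.extend(unclosed.children)


def parse(source):
    root = Element(None, "")
    stack = [root]
    pos = 0
    for m in TAG.finditer(source):
        if m.start() > pos:  # text (including whitespace) between tags
            stack[-1].children.append(source[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            el = Element(name, m.group(0))
            stack[-1].children.append(el)
            stack.append(el)
            continue
        # Look down the stack for the nearest matching open tag.
        depth = next((i for i in range(len(stack) - 1, 0, -1)
                      if stack[i].name == name), None)
        if depth is None:
            # Dangling close tag: leave it in the stream untouched.
            stack[-1].children.append(m.group(0))
            continue
        while len(stack) - 1 > depth:  # open tags that never got closed
            _demote(stack)
        stack[-1].close_src = m.group(0)
        stack.pop()
    if pos < len(source):
        stack[-1].children.append(source[pos:])
    while len(stack) > 1:  # anything still open at end of input
        _demote(stack)
    return root


def serialize(node):
    parts = [node.open_src]
    for child in node.children:
        parts.append(child if isinstance(child, str) else serialize(child))
    parts.append(node.close_src)
    return "".join(parts)


html = "<b><i>foo</b></i>"
tree = parse(html)
print(serialize(tree) == html)  # prints True
```

For "<b><i>foo</b></i>" this yields one <b> element whose children are
the strings "<i>" and "foo", followed by a stray "</i>" string at the top
level. The tree looks odd, but nothing is reordered or dropped. Leaving
uninstrumented runs unparsed is not shown here, since it depends on how
the template markers are defined.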

Now the bad news: I wrote it in Java (partially using ANTLR). I also
don't have the source anymore; I had to leave it behind when I switched
jobs. I've been thinking about recreating it lately, though. I spent
enough time designing it that I still remember it pretty clearly a year
later, and I need a good HTML parser for an unrelated project I'm
working on. There's a good chance I'll end up recreating it in Squeak.

I don't want to make any promises, but is there any interest in such a
thing? If it would be useful to a few other people that would certainly
be motivation to put more time into it.