--On Monday, August 27, 2001 1:48 AM +0100 John Hinsley jhinsley@telinco.co.uk wrote:
[snip]
Tidy is a nice tool. My main objections to it are that it balances tags where they really don't need to be balanced -- unnecessarily increasing the length of documents is _always_ a bad thing
No, it's not, especially if it makes parsing easier.
A stronger objection to Tidy is the two-pass rendering it would require -- for many machines this would be unbearably slow, especially as Scamper (to my knowledge) doesn't (yet) cache rendered pages.
-- and that it inserts its own meta tag (see above, plus it's rather rude).
[snip] But its source code is available, so the latter can change. The meta thing doesn't bug me a whit, especially for the use we'd be putting it to (it'd be nice to be able to check whether the doc's been tidied). And, I wouldn't be surprised if there were a cmd line option to suppress that.
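To make the "check whether the doc's been tidied" idea concrete, here is a rough sketch in Python (not Squeak, just for brevity). It assumes Tidy's mark is a <meta name="generator"> tag whose content mentions "HTML Tidy"; that string and the names GeneratorSniffer and looks_tidied are my own assumptions for illustration, not anything taken from the Tidy source.

```python
from html.parser import HTMLParser


class GeneratorSniffer(HTMLParser):
    """Collects the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)  # attribute names arrive lowercased; values may be None
        if (d.get("name") or "").lower() == "generator":
            self.generators.append(d.get("content") or "")


def looks_tidied(html_text):
    """Return True if some generator meta tag mentions HTML Tidy.

    Assumption: the mark contains the phrase "HTML Tidy".
    """
    sniffer = GeneratorSniffer()
    sniffer.feed(html_text)
    sniffer.close()
    return any("html tidy" in g.lower() for g in sniffer.generators)
```

A renderer could use something like this to skip the tidying pass for pages that already carry the mark.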
However, Tidy is still the place to look, IMHO, for how to handle the "normal" range of SUML (Screwed Up-ML) out there. Its design goal was to try to reproduce the parsing quirks of the major browsers (which, alas, is probably a better guide to what kind of HTML you'll find than any spec).
From what I can tell, the C code is very clean, so someone with a bit of C-ness in them might be able to just turn it into a plugin.
OTOH, it or the Java port could be used as the basis for, or inspiration for, a Squeak port. That would nicely eliminate the second pass, as the SqueakTidy parse could go directly to an internal whatever (tree, DOM abomination, what have you :)).
Done right, it would be easy to add extra "cleansing" rules to handle other pathological (or just irritating) cases.
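As a rough illustration of what I mean by a single pass with pluggable cleansing rules, here is a sketch in Python (a Squeak version would look different, and this is not how Tidy itself is structured). Every name in it (Node, TolerantTreeBuilder, the two rule functions, parse) is hypothetical. Each rule sees the stack of open elements plus the incoming tag and may close elements before the new one is attached.

```python
from html.parser import HTMLParser

# Elements that never have a close tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "meta", "input", "link"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Nodes and text strings, in document order


# Cleansing rules: called as rule(open_stack, incoming_tag).
def close_open_paragraph(stack, tag):
    """A new <p> implicitly closes a <p> that is still open."""
    if tag == "p" and stack[-1].tag == "p":
        stack.pop()


def close_open_list_item(stack, tag):
    """A new <li> implicitly closes an <li> that is still open."""
    if tag == "li" and stack[-1].tag == "li":
        stack.pop()


class TolerantTreeBuilder(HTMLParser):
    """Builds the tree while parsing, so no second pass is needed."""

    def __init__(self, rules=()):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # the root is never popped
        self.rules = list(rules)

    def handle_starttag(self, tag, attrs):
        for rule in self.rules:
            rule(self.stack, tag)
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Unwind to the matching open tag; a stray close tag is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data)


def parse(html_text, rules=(close_open_paragraph, close_open_list_item)):
    builder = TolerantTreeBuilder(rules)
    builder.feed(html_text)
    builder.close()  # flush any buffered trailing text
    return builder.root
```

Handling another pathological case then means writing one more small function and adding it to the rules list; the tree builder itself stays untouched.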
Ideally, we'd have a good Squeak-level representation of HTML that was standard enough and Squeakly enough that we'd all feel comfy using it from our Squeak apps *and/or* writing parsers for it (I could see, for example, writing XHTML-aware parsers that presumed that the source was valid XHTML). I suggest the Heeg code, without having looked at it, mostly because it covers HTML 4.0 and has a parser.
Cheers, Bijan Parsia.