Re: [ENH]Html table (second version)

27 Aug 2001


--On Monday, August 27, 2001 1:48 AM +0100 John Hinsley <jhinsley@telinco.co.uk> wrote:
[snip]
...
> Tidy is a nice tool. My main objections to it are that it balances tags
> where they really don't need to be balanced -- unnecessarily increasing
> the length of documents is _always_ a bad thing
No, it's not, especially if it makes parsing easier.
A stronger objection to Tidy is that it forces a two-pass rendering -- for 
many machines this would be unbearably slow, especially as Scamper (to my 
knowledge) doesn't (yet) cache rendered pages.
...
> -- and that it inserts
> its own meta tag (see above, plus it's rather rude).
[snip]
But its source code is available, so the latter can change. The meta thing 
doesn't bug me a whit, especially for the use we'd be putting it to (it'd 
be nice to be able to check whether the doc's been tidied). And I wouldn't 
be surprised if there were a cmd line option to suppress that.
However, Tidy is still the place to look, IMHO, for how to handle the 
"normal" range of SUML (Screwed Up-ML) out there. Its design goal was to 
try to reproduce the parsing quirks of the major browsers (which, alas, is 
probably a better guide to what kind of HTML you'll find than any spec).
...
> From what I can tell, the C code is very clean, so someone with a bit of
> Cness in them might be able to just turn it into a plugin.
OTOH, it or the Java port could be used as the basis for, or inspiration 
for, a Squeak port. That would nicely eliminate the second pass, as the 
SqueakTidy parse could go directly to an internal whatever (tree, DOM 
abomination, what have you :)).
Done right, it would be easy to add extra "cleansing" rules to handle other 
pathological (or just irritating) cases.
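
To make the one-pass idea a bit more concrete, here is a rough sketch, in Python rather than Squeak, of a forgiving parser that builds a tree directly and runs pluggable "cleansing" rules as it goes. Everything named here (Node, LenientTreeBuilder, close_open_paragraph, parse) is made up for illustration; it is not Tidy's algorithm or anyone's existing code, just one way such a thing might be shaped.

```python
from html.parser import HTMLParser

# All names below are hypothetical and exist only for this sketch.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class Node:
    """One element in the internal tree; children are Nodes or text strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


def close_open_paragraph(builder, tag):
    """Cleansing rule: a new <p> implicitly closes a <p> that is still open."""
    if tag == "p" and builder.current.tag == "p":
        builder.current = builder.current.parent


class LenientTreeBuilder(HTMLParser):
    """Builds the tree in a single pass, applying rules before each start tag."""

    def __init__(self, rules=()):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.rules = list(rules)

    def handle_starttag(self, tag, attrs):
        for rule in self.rules:
            rule(self, tag)
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def parse(source, rules=(close_open_paragraph,)):
    builder = LenientTreeBuilder(rules)
    builder.feed(source)
    builder.close()
    return builder.root
```

Adding another rule is just a matter of passing one more function in `rules`; each rule sees the builder and the incoming tag and may adjust `builder.current` before the new element is attached. With the default rule, `parse("<p>one<p>two <b>bold</p>")` gives two sibling paragraphs under the root instead of one nested inside the other.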
Ideally, we'd have a good Squeak-level representation of HTML that was 
standard enough and Squeakly enough that we'd all feel comfy using it from 
our Squeak apps *and/or* writing parsers for it (I could see, for example, 
writing XHTML-aware parsers that presumed that the source was valid XHTML). 
I suggest the Heeg code, sight unseen, mostly because it covers HTML 4.0 
and has a parser.
Cheers,
Bijan Parsia.