[ENH]Html table (second version)

Bijan Parsia bparsia at email.unc.edu
Mon Aug 27 13:52:18 UTC 2001


--On Monday, August 27, 2001 1:48 AM +0100 John Hinsley 
<jhinsley at telinco.co.uk> wrote:

[snip]
> Tidy is a nice tool. My main objections to it are that it balances tags
> where they really don't need to be balanced -- unnecessarily increasing
> the length of documents is _always_ a bad thing

No, it's not, especially if it makes parsing easier.

A stronger objection to Tidy is that it forces two-pass rendering (tidy 
first, then parse the result). On many machines this would be unbearably 
slow, especially as Scamper (to my knowledge) doesn't (yet) cache rendered 
pages.

> -- and that it inserts
> its own meta tag (see above, plus it's rather rude).
[snip]
But its source code is available, so the latter can change. The meta thing 
doesn't bug me a whit, especially for the use we'd be putting it to (it'd 
be nice to be able to check whether the doc's been tidied). And I wouldn't 
be surprised if there were a cmd line option to suppress that.

However, Tidy is still the place to look, IMHO, for how to handle the 
"normal" range of SUML (Screwed Up-ML) out there. Its design goal was to 
try to reproduce the parsing quirks of the major browsers (which, alas, are 
probably a better guide to what kind of HTML you'll find than any spec).

From what I can tell, the C code is very clean, so someone with a bit of 
Cness in them might be able to just turn it into a plugin.

OTOH, it or the Java port could be used as the basis for, or inspiration 
for, a Squeak port. That would nicely eliminate the second pass, as the 
SqueakTidy parse could go directly to an internal whatever (tree, DOM 
abomination, what have you :)).

Done right, it would be easy to add extra "cleansing" rules to handle other 
pathological (or just irritating) cases.
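
To make the idea concrete, here is a minimal sketch of a single-pass, 
tolerant parse that builds a tree directly and takes its "cleansing" rules 
as a pluggable list. It is written in Python purely for illustration, not 
Squeak, and it is not how Tidy itself works. The names Node, 
LenientTreeBuilder, close_implied and IMPLIED_CLOSE are all hypothetical, 
and the rule table covers only a handful of the implied-close cases.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial, illustrative list).
VOID = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical rule table: opening the key tag implicitly closes any of
# the listed tags that is currently the innermost open element.
IMPLIED_CLOSE = {
    "p": {"p"},
    "li": {"li"},
    "td": {"td", "th"},
    "th": {"td", "th"},
    "tr": {"tr", "td", "th"},
}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # a mix of Node instances and text strings


def close_implied(builder, tag):
    """Cleansing rule: pop open elements that the new tag implicitly ends."""
    closers = IMPLIED_CLOSE.get(tag, ())
    while builder.current.tag in closers:
        builder.current = builder.current.parent


class LenientTreeBuilder(HTMLParser):
    """Builds the tree while tokenizing, so there is no second pass."""

    def __init__(self, rules=()):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root
        self.rules = list(rules)  # each rule is called as rule(builder, tag)

    def handle_starttag(self, tag, attrs):
        for rule in self.rules:
            rule(self, tag)
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html, rules=(close_implied,)):
    builder = LenientTreeBuilder(rules)
    builder.feed(html)
    builder.close()  # flushes any buffered trailing text
    return builder.root


def outline(node):
    """Serialize the tree with every tag balanced, to show its shape."""
    inner = "".join(c if isinstance(c, str) else outline(c) for c in node.children)
    if node.tag == "#root":
        return inner
    if node.tag in VOID:
        return f"<{node.tag}>"
    return f"<{node.tag}>{inner}</{node.tag}>"
```

Adding another rule is then just a matter of writing one more function with 
the same (builder, tag) signature and passing it in the rules list; calling 
parse with an empty rules list shows what the unrepaired nesting looks like.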

Ideally, we'd have a good Squeak-level representation of HTML that was 
standard enough and Squeakly enough that we'd all feel comfy using it from 
our Squeak apps *and/or* writing parsers for it (I could see, for example, 
writing XHTML-aware parsers that presumed that the source was valid XHTML). 
I suggest the Heeg code, sight unseen, mostly because it covers HTML 4.0 
and has a parser.

Cheers,
Bijan Parsia.



