[ENH]Html table (second version)
Bijan Parsia
bparsia at email.unc.edu
Mon Aug 27 13:52:18 UTC 2001
--On Monday, August 27, 2001 1:48 AM +0100 John Hinsley
<jhinsley at telinco.co.uk> wrote:
[snip]
> Tidy is a nice tool. My main objections to it are that it balances tags
> where they really don't need to be balanced -- unnecessarily increasing
> the length of documents is _always_ a bad thing
No, it's not, especially if it makes parsing easier.
A stronger objection to Tidy is that it means two-pass rendering -- on many
machines this would be unbearably slow, especially as Scamper (to my
knowledge) doesn't (yet) cache rendered pages.
> -- and that it inserts
> its own meta tag (see above, plus it's rather rude).
[snip]
But its source code is available, so the latter can change. The meta thing
doesn't bug me a whit, especially for the use we'd be putting it to (it'd
be nice to be able to check whether the doc's been tidied). And I wouldn't
be surprised if there were a cmd line option to suppress that.
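
A rough sketch of the "has this doc been tidied?" check, in Python rather
than Squeak just to keep it short. It assumes the tag Tidy inserts is a
<meta name="generator"> whose content mentions "HTML Tidy"; that is my
assumption about Tidy's output, and GeneratorSniffer and looks_tidied are
made-up names, not anything from Tidy or Scamper.

```python
from html.parser import HTMLParser


class GeneratorSniffer(HTMLParser):
    """Collects the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def looks_tidied(html_text):
    # Assumption: a tidied page carries a generator string
    # that mentions "HTML Tidy" somewhere.
    sniffer = GeneratorSniffer()
    sniffer.feed(html_text)
    sniffer.close()
    return any("html tidy" in g.lower() for g in sniffer.generators)
```

A renderer could call looks_tidied(page) before deciding whether a cleanup
pass is worth the time.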
However, Tidy is still the place to look, IMHO, for how to handle the
"normal" range of SUML (Screwed Up-ML) out there. Its design goal was to
try to reproduce the parsing quirks of the major browsers (which, alas, is
probably a better guide to what kind of HTML you'll find than any spec).
From what I can tell, the C code is very clean, so someone with a bit of
Cness in them might be able to just turn it into a plugin.
OTOH, it or the Java port could be used as the basis for, or inspiration
for, a Squeak port. That would nicely eliminate the second pass, as the
SqueakTidy parse could go directly to an internal whatever (tree, DOM
abomination, what have you :)).
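
To make the single-pass idea concrete, here is a minimal sketch (Python
again, not Squeak) of a forgiving parser that builds a tree directly while
it reads, repairing unbalanced tags as it goes. The two repair rules shown
(a new <p> or <li> closes an open one, stray end tags are ignored) are my
own simplifications, not Tidy's actual rules, and Node, TreeBuilder, parse
and outline are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have content, so they never become the "current" node.
VOID_TAGS = {"br", "hr", "img", "meta", "input", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node instances or text strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A new <p> or <li> implicitly closes an open one of the same kind.
        if tag in ("p", "li") and self.current.tag == tag:
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        # If none is open, the end tag is a stray and is ignored.
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def parse(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node):
    """Nested-list view of the tree, handy for eyeballing the result."""
    return [node.tag] + [
        c if isinstance(c, str) else outline(c) for c in node.children
    ]
```

For example, outline(parse("<ul><li>one<li>two</ul><p>tail")) gives
['#document', ['ul', ['li', 'one'], ['li', 'two']], ['p', 'tail']]: the
unclosed <li> and <p> are balanced in the tree without a second pass over
the text.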
Done right, it would be easy to add extra "cleansing" rules to handle other
pathological (or just irritating) cases.
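
One way the extra rules could plug in, sketched in Python under my own
assumptions: each rule is a plain function that fixes up one node's
children, and a driver applies the whole list bottom-up. The two rules
here (unwrap <font>, drop empty <p>) are only illustrations, and Node,
RULES and cleanse are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node or str


def unwrap_font(node):
    """Replace each <font> child with that child's own children."""
    flattened = []
    for child in node.children:
        if isinstance(child, Node) and child.tag == "font":
            flattened.extend(child.children)
        else:
            flattened.append(child)
    node.children = flattened


def drop_empty_paragraphs(node):
    """Remove <p> children that have no content at all."""
    node.children = [
        c for c in node.children
        if not (isinstance(c, Node) and c.tag == "p" and not c.children)
    ]


# Adding a rule for a new pathological case means appending one function.
RULES = [unwrap_font, drop_empty_paragraphs]


def cleanse(node, rules=RULES):
    """Apply every rule bottom-up, so parents see already-cleaned children."""
    for child in node.children:
        if isinstance(child, Node):
            cleanse(child, rules)
    for rule in rules:
        rule(node)
    return node
```

Because the rules run over the tree rather than the source text, they
don't cost another parse.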
Ideally, we'd have a good Squeak-level representation of HTML that was
standard enough and Squeakly enough that we'd all feel comfy using it from
our Squeak apps *and/or* writing parsers for it (I could see, for example,
writing XHTML-aware parsers that presumed that the source was valid XHTML).
I suggest the Heeg code, without having looked at it, mostly because it
covers HTML 4.0 and has a parser.
Cheers,
Bijan Parsia.