Parsing HTML Recommendation
Todd Blanchard
tblanchard at mac.com
Sat Aug 25 04:21:22 UTC 2007
> But these two lines give me the headers of my table's columns.
> itemlist = soup.find('table', id=True)
> #gives me the only table with an ID
> headers = itemlist.findAll('th')
> #gives me the headers of that table.
>
> and to parse the table rows without recursing through the nested tables.
> rows = mytable.findAll('td', recursive=False)
>
In the HTML CSS parser, you want to look at tagsNamed:. For instance,
    dom tagsNamed: 'table'
will return a collection of table nodes that are children of the
receiver.
Look at the implementation of that in HtmlDOMNode. It uses a method
called nodesCollect:, which takes an arbitrary block and returns all
subnodes for which the block evaluates to true. It is very similar to
the findAll calls above.
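The pattern described here, a tag lookup built on top of a general
predicate search, can be sketched in Python. This is only an illustration
of the idea: Node, nodes_collect and tags_named are hypothetical stand-ins,
not the actual HtmlDOMNode code, and the sketch assumes the search covers
subnodes at every depth.

```python
# Illustrative sketch only: Node, nodes_collect and tags_named are
# hypothetical Python stand-ins, not the Squeak HtmlDOMNode API.

class Node:
    def __init__(self, tag, children=None, attrs=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = children or []

    def nodes_collect(self, predicate):
        """Return every subnode (at any depth) for which predicate is true."""
        found = []
        for child in self.children:
            if predicate(child):
                found.append(child)
            # Assumption: keep descending even below a matching node.
            found.extend(child.nodes_collect(predicate))
        return found

    def tags_named(self, name):
        """Tag lookup expressed as a predicate search."""
        return self.nodes_collect(lambda node: node.tag == name)


# A small hand-built tree: one table with an id, containing a nested table.
dom = Node('html', [
    Node('body', [
        Node('table', attrs={'id': 'items'}, children=[
            Node('tr', [Node('th'), Node('th')]),
            Node('tr', [Node('td', [Node('table')])]),
        ]),
    ]),
])

tables = dom.tags_named('table')
with_id = dom.nodes_collect(lambda node: 'id' in node.attrs)
```

Because the predicate is arbitrary, the same search covers both "all
tables" and "the only table with an ID" without a separate method for each.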
> The html is broken and has hundreds of tables. There are something like
> 6 nested tables in each of the primary table's rows. This is from a MS
> SharePoint website. The markup is awful.
HtmlCSSParser was designed to deal with just such markup (and tries
to explain what is wrong with it).
> I'm sure there is an easy way in Squeak to do the above, but I haven't
> spent enough time to master it.
>
> A problem I've had with both of the above and which makes them a problem
> for me, is that they have both popped up modal dialogs which I had to
> click on in order to proceed.
>
> They have fairly different APIs.
>
> The HTML-Parser popped up a box for every tag without a closing tag.
> The Html+CSS Validator popped a box it seemed when it couldn't connect
> to a site. I guess it was attempting to retrieve the CSS, in order to
> validate?
That would be the underlying transport layer. HtmlCSSParser never
tries to interact with the user.
You don't have to validate:
    dom := (HtmlValidator onUrl: 'http://something.com') dom.
Cheers,
-Todd