Parsing HTML Recommendation

Todd Blanchard tblanchard at mac.com
Sat Aug 25 04:21:22 UTC 2007


> But these two lines give me the headers of my table's columns.
>      itemlist = soup.find('table', id=True)
>        #gives me the only table with an ID
>      headers = itemlist.findAll('th')
>        #gives me the headers of that table.
>
> and to parse the table rows without recursing through the nested tables.
>      rows = mytable.findAll('td', recursive=False)
>

In the HTML CSS parser, the method you want to look at is tagsNamed:

For instance,
     dom tagsNamed: 'table'
will return a collection of the table nodes that are children of the
receiver.

Look at the implementation of that in HtmlDOMNode. It uses a method
called nodesCollect: that takes an arbitrary block and returns all
subnodes for which the block evaluates to true. It is very similar to
the findAll calls you quote above.
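
Since your examples are in Python, here is a rough sketch of the same
idea in that language. It is not the HtmlDOMNode code: the Node class
and the names nodes_collect and tags_named are hypothetical stand-ins,
and I am assuming the traversal visits descendants only (not the
receiver) in document order.

```python
class Node:
    """Hypothetical stand-in for a DOM node; not the HtmlDOMNode API."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])

    def nodes_collect(self, predicate):
        """Return every descendant for which predicate(node) is true.

        Assumption: depth-first, document order, receiver excluded.
        """
        found = []
        for child in self.children:
            if predicate(child):
                found.append(child)
            found.extend(child.nodes_collect(predicate))
        return found

    def tags_named(self, name):
        """Tag lookup expressed as one particular predicate."""
        return self.nodes_collect(lambda node: node.tag == name)


# A table nested inside a cell of another table, as in the SharePoint markup.
inner = Node('table', [Node('tr', [Node('td')])])
outer = Node('table', [Node('tr', [Node('td', [inner])])])
dom = Node('body', [outer])

tables = dom.tags_named('table')
```

The point is that a lookup by tag name is only one predicate; any other
test on a node (an attribute being present, a particular nesting depth)
can be passed to the same traversal.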

> The html is broken and has hundreds of tables. There are something  
> like
> 6 nested tables in each of the primary tables rows. This is from a MS
> SharePoint website. The markup is awful.

HtmlCSSParser was designed to deal with just such markup (and tries  
to explain what is wrong with it).

> I'm sure there is an easy way in Squeak to do the above, but I haven't
> spent enough time to master it.
>
> A problem I've had with both of the above and which makes them a  
> problem
> for me, is that they have both popped up modal dialogs which I had to
> click on in order to proceed.
>
> They have fairly different APIs.
>
> The HTML-Parser popped up a box for every tag without a closing tag.
> The Html+CSS Validator popped a box it seemed when it couldn't connect
> to a site. I guess it was attempting to retrieve the CSS, in order to
> validate?

That would be the underlying transport layer. HtmlCSSParser itself
never tries to interact with the user.

You don't have to validate. You can just ask for the DOM:

dom := (HtmlValidator onUrl: 'http://something.com') dom.

Cheers,
-Todd