Parsing HTML Recommendation

Jimmie Houchin j.squeak at cyberhaus.us
Sat Aug 25 12:29:11 UTC 2007


Hello Todd,

Thanks for the reply.

Todd Blanchard wrote:
>> But these two lines give me the headers of my table's columns.
>>      itemlist = soup.find('table', id=True)
>>        #gives me the only table with an ID
>>      headers = itemlist.findAll('th')
>>        #gives me the headers of that table.
>>
>> and to parse the table rows without recursing through the nested tables.
>>      rows = mytable.findAll('td', recursive=False)
> 
> In the HTML CSS parser - you want to look at tagsNamed: 
>
> for instance - dom tagsNamed: 'table'
> will return a collection of table nodes that are children of the receiver.

Yes, I've been doing that. But my problems have been:

1. Out of 1000+ tables I am looking for one which has an 'ID' attribute.
      In BeautifulSoup it is:  bs.findAll('table', id=True)

    I haven't yet figured out how to do that.
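For reference, the selection being asked for (keep only the tables that carry an id attribute) can be sketched with nothing but Python's standard library. This is only an illustration of the filter, not BeautifulSoup's or HtmlCSSParser's implementation; the names TablesWithId, find_table_ids and SAMPLE are made up for the example.

```python
from html.parser import HTMLParser


class TablesWithId(HTMLParser):
    """Collect the id value of every <table> start tag that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so 'ID' also matches.
        if tag == "table":
            attr_map = dict(attrs)
            if "id" in attr_map:
                self.ids.append(attr_map["id"])


def find_table_ids(html):
    """Return the ids of all tables in `html` that have one (hypothetical helper)."""
    parser = TablesWithId()
    parser.feed(html)
    parser.close()
    return parser.ids


# Made-up sample: one anonymous layout table and one table with an id.
SAMPLE = (
    "<table><tr><td>x</td></tr></table>"
    "<table id='items'><tr><td>y</td></tr></table>"
)
print(find_table_ids(SAMPLE))
```

The same predicate (a table node that has an id attribute) is what a block-based node filter would need to test.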

2. I haven't spent enough time with your parser yet, but my one table is 
made up of 331 rows, each with 6 nested tables.

    When I query the dom with tagsNamed: 'tr',
    does it return 331 or 1000+ rows?

    I want the 331. I want to be able to understand which 'column' I am 
in so that I can build objects out of the data. The columns represent 
object attributes. Some of the columns have tables as their td.
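The distinction between the outer rows and every nested row can be sketched by tracking table nesting depth. Again this uses only Python's standard library and is an assumption-laden illustration: OuterRows, outer_rows and SAMPLE are hypothetical names, the sketch assumes table tags are balanced, and it does not repair broken markup the way BeautifulSoup or HtmlCSSParser would.

```python
from html.parser import HTMLParser


class OuterRows(HTMLParser):
    """Collect rows and cells that sit directly inside a table that has an id."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 = outside the target table, 1 = directly inside it
        self.rows = []   # one list of cell texts per outer row

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Start counting at a table with an id; count every table nested in it.
            if self.depth > 0 or "id" in dict(attrs):
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.rows.append([])          # a new outer row
        elif self.depth == 1 and tag == "td" and self.rows:
            self.rows[-1].append("")      # a new outer cell; its index is the column

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text from nested tables is folded into the enclosing outer cell.
        text = data.strip()
        if text and self.depth >= 1 and self.rows and self.rows[-1]:
            cell = self.rows[-1][-1]
            self.rows[-1][-1] = (cell + " " + text).strip()


def outer_rows(html):
    """Return the outer rows as lists of cell text (hypothetical helper)."""
    parser = OuterRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample: a layout table, then an id'd table whose cells nest tables.
SAMPLE = (
    "<table><tr><td>layout</td></tr></table>"
    "<table id='items'>"
    "<tr><td>Widget</td><td><table><tr><td>4</td><td>blue</td></tr></table></td></tr>"
    "<tr><td>Gadget</td><td><table><tr><td>7</td><td>red</td></tr></table></td></tr>"
    "</table>"
)
rows = outer_rows(SAMPLE)
print(len(rows), rows)
```

Because only depth-1 rows are counted, the sample yields two rows even though it contains four tr elements inside the id'd table, and each cell's position in its row gives the column it belongs to.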

> Look at the implementation of that in HtmlDOMNode - it uses a method 
> called nodesCollect: 
> that will take an arbitrary block and returns all subnodes for which the 
> block evaluates to true. It is very similar.

Does this return the 331 or 1000+ rows (nodes)?

>> The html is broken and has hundreds of tables. There are something like 
>> 6 nested tables in each of the primary tables rows. This is from a MS 
>> SharePoint website. The markup is awful.
> 
> HtmlCSSParser was designed to deal with just such markup (and tries to 
> explain what is wrong with it).

In my case I am happy that HtmlCSSParser can deal with it, but it doesn't 
matter to me what is wrong. I just want the data.

[snip]
>>
>> The HTML-Parser popped up a box for every tag without a closing tag.
>> The Html+CSS Validator seemed to pop up a box when it couldn't connect 
>> to a site. I guess it was attempting to retrieve the CSS, in order to 
>> validate?
> 
> That would be the underlying transport layer - HtmlCSSParser never tries 
> to interact with the user.

Okay.

> You don't have to validate.
> 
> dom := (HtmlValidator onUrl: 'http://something.com') dom.

Okay. This is what I am doing.
dom := (HtmlValidator on: myHtmlString) dom.

But when I got the popups, I thought that the validation was going awry.

Again, thanks for your help. And thank you for providing this tool.

Jimmie




More information about the Squeak-dev mailing list