Parsing HTML Recommendation
Jimmie Houchin
j.squeak at cyberhaus.us
Sat Aug 25 12:29:11 UTC 2007
Hello Todd,
Thanks for the reply.
Todd Blanchard wrote:
>> But these two lines give me the headers of my table's columns.
>> itemlist = soup.find('table', id=True)
>> #gives me the only table with an ID
>> headers = itemlist.findAll('th')
>> #gives me the headers of that table.
>>
>> and to parse the table rows without recursing through the nested tables:
>> rows = mytable.findAll('td', recursive=False)
>
> In the HTML CSS parser - you want to look at tagsNamed:
>
> for instance - dom tagsNamed: 'table'
> will return a collection of table nodes that are children of the receiver.
Yes, I've been doing that. But my problems have been:
1. Out of 1000+ tables, I am looking for the one which has an 'id' attribute.
In BeautifulSoup it is: bs.findAll('table', id=True)
I haven't yet figured out how to do that with your parser.
2. I haven't spent enough time with your parser yet, but my one table is
a table composed of 331 rows, each with 6 nested tables.
When I send tagsNamed: 'tr' to the dom,
does it return 331 or 1000+ rows?
I want the 331. I want to be able to tell which 'column' I am
in so that I can build objects out of the data. The columns represent
object attributes. Some of the columns have tables inside their td.
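To make the two problems concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not the HtmlCSSParser API, and ElementTree needs well-formed markup, so it would not cope with the real SharePoint page. The SAMPLE markup and the names item_table, all_rows, outer_rows and records are made up for illustration. It only shows the difference between "every tr under the table" and "the tr elements that are direct children of the table".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page: one table with an id,
# two outer rows, and a nested table inside one cell of each outer row.
SAMPLE = """
<html><body>
  <table><tr><td>layout</td></tr></table>
  <table id="items">
    <tr><td>a1</td><td><table><tr><td>nested a</td></tr></table></td></tr>
    <tr><td>b1</td><td><table><tr><td>nested b</td></tr></table></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# The first table that carries an id attribute (compare find('table', id=True)).
item_table = next(t for t in root.iter("table") if "id" in t.attrib)

# Every <tr> below the table, rows of nested tables included.
all_rows = list(item_table.iter("tr"))

# Only the <tr> elements that are direct children of the table.
outer_rows = item_table.findall("tr")

# The column position comes from enumerating each outer row's direct <td> children.
records = [
    {col: (td.text or "").strip() for col, td in enumerate(row.findall("td"))}
    for row in outer_rows
]

print(len(all_rows), len(outer_rows))
```

In the real page the second count is the 331 I am after, and the first is the 1000+.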
> Look at the implementation of that in HtmlDOMNode - it uses a method
> called nodesCollect:
> that will take an arbitrary block and returns all subnodes for which the
> block evaluates to true. It is very similar.
Does this return the 331 or 1000+ rows (nodes)?
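As a rough illustration of why I ask: if a predicate-based collector visits all subnodes, as described above, it would also pick up the rows of the nested tables. The sketch below is a hypothetical Python analogue, not the HtmlDOMNode implementation. Node, nodes_collect and nested_row are names invented here, and whether nodesCollect: really behaves this way is exactly the open question.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus child nodes."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def nodes_collect(node: Node, predicate: Callable[[Node], bool]) -> List[Node]:
    """Return every descendant of node for which predicate is true (depth-first)."""
    found = []
    for child in node.children:
        if predicate(child):
            found.append(child)
        # Keep descending, so matches inside nested tables are included too.
        found.extend(nodes_collect(child, predicate))
    return found


def nested_row() -> Node:
    """An outer row whose single cell holds a nested one-row table."""
    return Node("tr", [Node("td", [Node("table", [Node("tr")])])])


table = Node("table", [nested_row(), nested_row()])

# All rows at any depth versus only the table's own rows.
every_tr = nodes_collect(table, lambda n: n.tag == "tr")
direct_tr = [child for child in table.children if child.tag == "tr"]

print(len(every_tr), len(direct_tr))
```

A collector like this returns the nested rows as well, so getting only the outer rows would need either a check on each node's parent or a non-recursive walk over the table's direct children.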
>> The html is broken and has hundreds of tables. There are something like
>> 6 nested tables in each of the primary tables rows. This is from a MS
>> SharePoint website. The markup is awful.
>
> HtmlCSSParser was designed to deal with just such markup (and tries to
> explain what is wrong with it).
In my case I am happy that HtmlCSSParser can deal with it, but it doesn't
matter to me what is wrong. I just want the data.
[snip]
>>
>> The HTML-Parser popped up a box for every tag without a closing tag.
>> The Html+CSS Validator popped up a box, it seemed, whenever it couldn't
>> connect to a site. I guess it was attempting to retrieve the CSS in order
>> to validate?
>
> That would be the underlying transport layer - HtmlCSSParser never tries
> to interact with the user.
Okay.
> You don't have to validate.
>
> dom := (HtmlValidator onUrl: 'http://something.com') dom.
Okay. This is what I am doing.
dom := (HtmlValidator on: myHtmlString) dom.
But when I got the popups, I thought that the validation was going awry.
Again, thanks for your help. And thank you for providing this tool.
Jimmie