Parsing HTML Recommendation
Todd Blanchard
tblanchard at mac.com
Sat Aug 25 17:37:00 UTC 2007
On Aug 25, 2007, at 5:29 AM, Jimmie Houchin wrote:
> Hello Todd,
> Yes, I've been doing that. But my problems have been:
>
> 1. Out of 1000+ tables I am looking for one which has an 'ID'
> attribute.
> In BeautifulSoup it is: bs.findAll('table', id=True)
dom nodesCollect: [:ea | ea tag = 'table' and: [ea id notNil]]
If you want one specific id, change the test to ea id = 'theId'.
> I haven't yet figured out how to do that.
>
> 2. I haven't spent enough time with your parser yet, but my one
> table is a table comprised of 331 rows each with 6 nested tables.
>
> When I build a dom with the tagsNamed: 'tr',
> Does it return 331 or 1000+ rows?
You need to get the right table first, then send it tagsNamed: or
nodesCollect: to search within it. Assuming that there is exactly one
table in the whole document with an id, you could do this:
rows := (dom nodesCollect: [:each | each tag = 'table' and: [each id notNil]]) first tagsNamed: 'tr'
Assuming these rows all contain fields that are plain text, you can get
the data as a list of lists by doing:
"convert rows list to list of lists of TD nodes"
data := rows collect: [:row | (row tagsNamed: 'td')].
"convert list of lists of TD nodes to list of lists of text - stripping all markup"
rows := data collect: [:row |
    row collect: [:cell |
        String streamContents: [:s |
            (cell nodesCollect: [:n | n isCDATA]) do: [:cdata |
                s nextPutAll: cdata asString]]]]
From here you can get the text of a cell with
string := (rows at: r) at: c
If your cell is a table itself, lather, rinse, repeat.
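The join-and-index steps above can be tried in a workspace without the parser. This is only a sketch with stand-in data: nested literal arrays of string fragments take the place of the TD nodes and their CDATA children, so the names data, rows and string mirror the ones above but nothing here touches the real DOM.

```smalltalk
"Stand-in data (an assumption for illustration): each row is a list of
 cells, and each cell is a list of text fragments playing the role of
 the CDATA nodes found inside a TD."
| data rows string |
data := #( #( #('Squeak' ' 3.9') #('2007') )
           #( #('Parsing') #('HTML' ' ' 'tables') ) ).

"Join the fragments of every cell into a single string."
rows := data collect: [:row |
    row collect: [:cell |
        String streamContents: [:s |
            cell do: [:fragment | s nextPutAll: fragment]]]].

"Smalltalk collections are 1-based: this picks row 2, column 2."
string := (rows at: 2) at: 2.
```

Note that at: counts from 1, so the first cell of the first row is (rows at: 1) at: 1.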
> Okay. This is what I am doing.
> dom := (HtmlValidator on: myHtmlString) dom.
>
> But when I got the popups, I thought that the validation was going
> awry.
In the interest of performance, the parser fetches CSS files in LINK
tags by queueing them in a separate thread as soon as the href is
encountered. Since you don't need this behavior - go into
HtmlLINKNode>>parseContents: and comment out the line:
self loader queueUrl: href. "Start download in another thread"
> Again, thanks for your help. And thank you for providing this tool.
I'm glad somebody found it useful.
-Todd