Parsing HTML Recommendation

Todd Blanchard tblanchard at mac.com
Sat Aug 25 17:37:00 UTC 2007


On Aug 25, 2007, at 5:29 AM, Jimmie Houchin wrote:

> Hello Todd,

> Yes, I've been doing that. But my problems have been:
>
> 1. Out of 1000+ tables I am looking for one which has an 'ID'  
> attribute.
>      In BeautifulSoup it is:  bs.findAll('table', id=True)

dom nodesCollect: [:ea | ea tag = 'table' and: [ea id notNil]]

If there is a specific id you want, make the test ea id = 'theId' instead.
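
For a single known id, something along these lines avoids an error when nothing matches. This is only a sketch: it assumes nodesCollect: answers an ordinary Squeak collection (so isEmpty and first apply), that dom is the parsed document from earlier, and 'theId' is a placeholder.

```smalltalk
"Collect every table whose id is 'theId' (placeholder value)."
matches := dom nodesCollect: [:ea | ea tag = 'table' and: [ea id = 'theId']].

"Answer the first match, or nil if the document has no such table."
table := matches isEmpty ifTrue: [nil] ifFalse: [matches first].
```

Checking for nil afterwards is cheaper than debugging an emptyCheck error from first.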

>    I haven't yet figured out how to do that.
>
> 2. I haven't spent enough time with your parser yet, but my one  
> table is a table comprised of 331 rows each with 6 nested tables.
>
>    When I build a dom with the tagsNamed: 'tr',
>    Does it return 331 or 1000+ rows?

You need to get the right table first, then send it tagsNamed: or
nodesCollect: to search within it.  Assuming that there is exactly one
table in the whole document with an id, you could do this:

rows := (dom nodesCollect: [:each | each tag = 'table' and: [each id notNil]]) first
	tagsNamed: 'tr'.

Assuming these all contain fields that are plain text, you can get
the data as a list of lists like this:

"convert rows list to list of lists of TD nodes"
data := rows collect: [:row | row tagsNamed: 'td'].

"convert that to a list of lists of text - stripping all markup"
rows := data collect: [:row |
	row collect: [:cell |
		String streamContents: [:s |
			(cell nodesCollect: [:n | n isCDATA]) do: [:cdata |
				s nextPutAll: cdata asString]]]].

From here you can get the text of a cell with
string := (rows at: r) at: c

If your cell is a table itself, lather, rinse, repeat.
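
As a rough sketch of that repetition: assuming a TD node answers tagsNamed: the same way a table node does, and that the first table inside the cell is the one you want (both are assumptions), you can start from the TD nodes kept in data rather than the text-only list. The names innerTable, innerRows and innerData are made up for illustration.

```smalltalk
"cell is a TD node taken from data, before the markup was stripped"
cell := (data at: r) at: c.

"Assumes the cell holds at least one nested table; first fails otherwise."
innerTable := (cell tagsNamed: 'table') first.

"Same two steps as for the outer table."
innerRows := innerTable tagsNamed: 'tr'.
innerData := innerRows collect: [:row | row tagsNamed: 'td'].
```

If tagsNamed: turns out to search all descendants rather than only direct children, rows from more deeply nested tables will show up in innerRows as well, so it is worth checking the counts against what you expect.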

> Okay. This is what I am doing.
> dom := (HtmlValidator on: myHtmlString) dom.
>
> But when I got the popups, I thought that the validation was going  
> awry.

In the interest of performance, the parser fetches CSS files in LINK  
tags by queueing them in a separate thread as soon as the href is  
encountered.  Since you don't need this behavior - go into  
HtmlLINKNode>>parseContents: and comment out the line:

self loader queueUrl: href. "Start download in another thread"

> Again, thanks for your help. And thank you for providing this tool.

I'm glad somebody found it useful.

-Todd


