Parsing HTML Recommendation

Jimmie Houchin j.squeak at cyberhaus.us
Thu Aug 23 20:23:44 UTC 2007


Kurt Thams wrote:
> Of the various Squeak parsers that will parse HTML, does anyone have any 
> recommendations as to what is the best at this point?

Hello,

I too was going to post about this.

I have been working with both HTML-Parser written by Julian Fitzell and 
also Html+CSS Validating Parser by Todd Blanchard. (according to the 
SqueakMap entries)

I've also been using BeautifulSoup, which is an excellent Python HTML parser.

I would prefer to use Squeak, but BeautifulSoup so far has a few features 
that have made my job easier. There are just a few things I haven't quite 
worked out how to do as easily in either of the Squeak tools. Now, I know 
someone proficient in Squeak could have done it just as easily.

But these two lines give me the headers of my table's columns:
     # gives me the only table with an ID
     itemlist = soup.find('table', id=True)
     # gives me the headers of that table
     headers = itemlist.findAll('th')

And this parses the table rows without recursing into the nested tables:
     rows = mytable.findAll('td', recursive=False)
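
To make the difference between a recursive search and a direct-children-only 
search concrete, here is a minimal sketch. It deliberately swaps in the 
standard library's xml.etree.ElementTree instead of BeautifulSoup so that it 
runs with nothing installed, and it assumes a small, well-formed sample 
document that I made up for illustration (the id "items" and the cell values 
are hypothetical). Real broken markup would still need a tolerant parser such 
as BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample: one table carries an id, and each of
# its data rows contains a nested layout table, as in the markup described.
SAMPLE = """
<html><body>
  <table><tr><td>navigation</td></tr></table>
  <table id="items">
    <tr><th>Name</th><th>Owner</th></tr>
    <tr><td>Report<table><tr><td>nested</td></tr></table></td><td>Ann</td></tr>
    <tr><td>Budget<table><tr><td>nested</td></tr></table></td><td>Bob</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# First table that has an id attribute, whatever its value.
table = root.find(".//table[@id]")

# Header cells: iter() walks every descendant of the table.
headers = [th.text for th in table.iter("th")]

# findall("tr") matches direct children only, so the rows of the
# nested tables are skipped.
direct_rows = table.findall("tr")

# iter("tr") is the recursive search and also returns the nested rows.
all_rows = list(table.iter("tr"))


def cell_text(cell):
    """Return the text at the start of a cell, before any nested table."""
    return (cell.text or "").strip()


# Data rows are the direct rows after the header row.
records = [[cell_text(td) for td in row.findall("td")]
           for row in direct_rows[1:]]

print(headers)
print(len(direct_rows), len(all_rows))
print(records)
```

With nested tables in every row, the recursive search returns more rows than 
the table really has, which is why limiting the search to direct children 
matters for this kind of markup.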

The HTML is broken and has hundreds of tables. There are something like 
6 nested tables in each of the primary table's rows. This is from an MS 
SharePoint website. The markup is awful.

I'm sure there is an easy way in Squeak to do the above, but I haven't 
spent enough time to master it.

A problem I've had with both of the Squeak parsers, and one that makes 
them hard for me to use, is that they have both popped up modal dialogs 
which I had to click on in order to proceed.

They have fairly different APIs.

The HTML-Parser popped up a box for every tag without a closing tag.
The Html+CSS Validator popped up a box, it seemed, whenever it couldn't 
connect to a site. I guess it was attempting to retrieve the CSS in order 
to validate it?

I would love to know if there is a way to silence the dialogs while 
proceeding through the parsing. Yes, I know the markup is awful. Do the 
best you can and it may be good enough for me to do my job. Apparently I 
can do that if I click, click, click. But I would just like to doit.

I've been doing the work in Python/BeautifulSoup but do not enjoy the 
dead system and would rather be working in a live environment.

Hopefully, I'll get smart enough to use Squeak effectively. :)

Any wisdom on this subject is greatly appreciated. I have lots of HTML 
scraping to do. I don't really need or care about validation, and a rich 
interface for this purpose would be great.

Thanks,

Jimmie




More information about the Squeak-dev mailing list