Parsing HTML Recommendation
Jimmie Houchin
j.squeak at cyberhaus.us
Thu Aug 23 20:23:44 UTC 2007
Kurt Thams wrote:
> Of the various Squeak parsers that will parse HTML, does anyone have any
> recommendations as to what is the best at this point?
Hello,
I too was going to post about this.
I have been working with both HTML-Parser written by Julian Fitzell and
also Html+CSS Validating Parser by Todd Blanchard. (according to the
SqueakMap entries)
I've also been using BeautifulSoup, which is an excellent Python HTML parser.
I would prefer to use Squeak, but BeautifulSoup so far has a few features
that have made my job easier. There are just a few things I haven't quite
worked out how to do as easily in either of the Squeak tools. I'm sure
someone proficient in Squeak could have done it just as easily.
But these two lines give me the headers of my table's columns.
itemlist = soup.find('table', id=True)
#gives me the only table with an ID
headers = itemlist.findAll('th')
#gives me the headers of that table.
And this parses the table's rows without recursing through the nested tables.
rows = mytable.findAll('td', recursive=False)
The HTML is broken and has hundreds of tables. There are something like
6 nested tables in each of the primary table's rows. This is from an MS
SharePoint website. The markup is awful.
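For anyone without BeautifulSoup at hand, here is a rough sketch of the same kind of extraction using only Python's standard-library html.parser. This is a swapped-in technique, not what the BeautifulSoup lines do internally. The class name TableScraper and the sample markup are made up for illustration, and it assumes the wanted table is the first one carrying an id attribute. It tracks table nesting depth so that cells inside nested tables are skipped, and it tolerates missing closing tags because it works on events rather than a tree.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Hypothetical helper: collect th/td text from the first table with an id,
    ignoring anything inside tables nested within it."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # 0 = outside the target table, 1 = directly inside it
        self.done = False     # set once the target table has been closed
        self.headers = []     # text of top-level <th> cells
        self.cells = []       # text of top-level <td> cells
        self._current = None  # (tag, text fragments) of the cell being read

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            if self.depth > 0:
                self.depth += 1          # a nested table: go one level deeper
            elif "id" in dict(attrs):
                self.depth = 1           # found the target table
        elif tag in ("th", "td") and self.depth == 1:
            self._flush()                # an unclosed previous cell ends here
            self._current = (tag, [])

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag in ("th", "td") and self.depth == 1:
            self._flush()
        elif tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self._flush()
                self.done = True

    def handle_data(self, data):
        # Keep only text that sits directly in a top-level cell.
        if self._current is not None and self.depth == 1:
            self._current[1].append(data)

    def _flush(self):
        if self._current is None:
            return
        tag, parts = self._current
        text = " ".join("".join(parts).split())  # normalize whitespace
        (self.headers if tag == "th" else self.cells).append(text)
        self._current = None


# Made-up sample: one layout table without an id, then the target table
# with unclosed <th> tags and a nested table inside a cell.
sample = """
<table><tr><td>layout</td></tr></table>
<table id="items">
  <tr><th>Name<th>Qty</tr>
  <tr><td>Widget<table><tr><td>nested</td></tr></table></td><td>3</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(sample)
scraper.close()
print(scraper.headers)
print(scraper.cells)
```

The depth counter is what stands in for recursive=False here: only cells seen at depth 1 are recorded, so the "nested" cell never shows up in the results.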
I'm sure there is an easy way in Squeak to do the above, but I haven't
spent enough time to master it.
A problem I've had with both of the Squeak parsers, and one which makes
them hard for me to use, is that they have both popped up modal dialogs
which I had to click on in order to proceed.
They have fairly different APIs.
The HTML-Parser popped up a box for every tag without a closing tag.
The Html+CSS Validator seemed to pop up a box when it couldn't connect
to a site. I guess it was attempting to retrieve the CSS in order to
validate?
I would love to know if there is a way to silence the dialogs while
proceeding through the parsing. Yes, I know the markup is awful. Do the
best you can and it may be good enough for me to do my job. Apparently I
can do that if I click, click, click. But I would like to just
doit.
I've been doing the work in Python/BeautifulSoup but do not enjoy the
dead system and would rather be working in a live environment.
Hopefully, I'll get smart enough to use Squeak effectively. :)
Any wisdom on this subject would be greatly appreciated. I have lots of HTML
scraping to do. I don't really need or care about validation, and a rich
interface for this purpose would be great.
Thanks,
Jimmie