[squeak-dev] [ANN] Soup 0.1
Zulq Alam
me at zulq.net
Thu Dec 25 09:02:59 UTC 2008
Soup is a port of Beautiful Soup [1]. If you're not familiar with
Beautiful Soup, it is a tolerant HTML/XML parser written in Python and
is extremely useful when you need to scrape data from a web page.
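    "Fetch and parse a Google results page."
    soup := Soup fromUrl: 'http://www.google.co.uk/search?q=squeak'.
    "Select every h3 tag whose class attribute is 'r'."
    results := soup findAllTags:
        [:e |
        e name = 'h3'
            and: [(e attributeAt: 'class') = 'r']].
    "Pair the text of each heading with the href of the a tag inside it."
    links := results collect: [:e | e text -> e a href].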
The resulting links are associations from heading text to URL:
    Squeak Smalltalk
        -> http://www.squeak.org/
    Squeak Smalltalk: Download
        -> http://www.squeak.org/Download/
    Squeak - Wikipedia, the free encyclopedia
        -> http://en.wikipedia.org/wiki/Squeak
    etc.
The main differences from the Beautiful Soup API are:
- find*Tag(s) for tags
- find*String(s) for strings, CData, declarations, and processing
  instructions
- the use of blocks for complex queries
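
A minimal sketch of how these three differences might look in a workspace. The block query mirrors the search example in this message. `Soup fromString:` and the forms that pass a plain tag name or string to `findTag:`, `findAllTags:` and `findAllStrings:` are assumptions inferred from the find*Tag(s) / find*String(s) naming, so check them against the protocols in your image before relying on them.

```smalltalk
"Sketch only. fromString: and the string-argument forms below are assumed,
 not confirmed by this announcement."
| soup firstItem allItems marked strings |
soup := Soup fromString: '<ul><li class="x">one</li><li>two</li></ul>'.

"find*Tag(s): one selector family for tags."
firstItem := soup findTag: 'li'.
allItems := soup findAllTags: 'li'.

"Blocks express complex queries, as in the search example."
marked := soup findAllTags:
    [:e | e name = 'li' and: [(e attributeAt: 'class') = 'x']].

"find*String(s): a separate selector family for strings, CData, etc."
strings := soup findAllStrings: 'two'.
```

Keeping tag and string searches in separate selector families means the selector itself says what kind of element comes back.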
For more usage information, browse the 'searching tags' and 'searching
strings' protocols on the SoupElement subclasses. Also look at the tests
in SoupElementTest, SoupTagTest and SoupParserTest. I will write/port
proper documentation later.
There are still many things to do:
- No attempt is made to deal with different character sets and
  encodings. This is a major feature of Beautiful Soup which I have
  so far ignored.
- The parser does not convert entity or character references. Although
  leaving them unconverted is also the default behavior of Beautiful
  Soup on HTML, conversion is still an important feature.
- The parser does not accept options such as whether to convert
  entities, which entities to convert, what to parse, etc.
- The parser only handles HTML. Unlike Beautiful Soup, there are no
  configurations for other XML flavors yet.
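
Until entity conversion exists, text taken from a parsed page may still contain references such as `&amp;`. The sketch below is a hypothetical post-processing step, not part of Soup: it decodes three common entities in a plain String using only `copyReplaceAll:with:`. The sample string stands in for whatever `e text` would return.

```smalltalk
"Hypothetical workaround, not part of Soup: decode a few entities by hand."
| raw decoded |
raw := 'Fish &amp; Chips &lt;fresh&gt;'.
decoded := raw.
"Decode &amp; last so that input such as &amp;lt; is not decoded twice."
{'&lt;' -> '<'. '&gt;' -> '>'. '&amp;' -> '&'} do: [:pair |
    decoded := decoded copyReplaceAll: pair key with: pair value].
"decoded is now: Fish & Chips <fresh>"
```

This covers only named entities listed in the table; numeric character references and the full HTML entity set would need a real decoder.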
The project [2] is globally writable. I look forward to your feedback and
contributions.
Thanks,
Zulq.
[1] http://www.crummy.com/software/BeautifulSoup/
[2] http://www.squeaksource.com/Soup.html