[squeak-dev] [ANN] Soup 0.1

Zulq Alam me at zulq.net
Thu Dec 25 09:02:59 UTC 2008


Soup is a port of Beautiful Soup [1]. If you're not familiar with 
Beautiful Soup, it is a tolerant HTML/XML parser written in Python and 
is extremely useful when you need to scrape data from a web page. For 
example, to collect the titles and links of some search results:

soup := Soup fromUrl: 'http://www.google.co.uk/search?q=squeak'.
results := soup findAllTags:
   [:e |
   e name = 'h3'
     and: [(e attributeAt: 'class') = 'r']].
links := results collect: [:e | e text -> e a href].

Squeak Smalltalk
   -> http://www.squeak.org/
Squeak Smalltalk: Download
   -> http://www.squeak.org/Download/
Squeak - Wikipedia, the free encyclopedia
   -> http://en.wikipedia.org/wiki/Squeak
etc...

The main differences in API are:

   - find*Tag(s) for tags
   - find*String(s) for strings, CData, declarations, processing
     instructions
   - the use of blocks for complex queries
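
As a rough sketch of how these fit together: only findAllTags: with a
block appears in the example above. The selectors findTag: and
findAllStrings: are assumed here from the find*Tag(s) and
find*String(s) naming patterns, so check the actual names in the
protocols before relying on them.

soup := Soup fromUrl: 'http://www.google.co.uk/search?q=squeak'.

"All h3 tags, selected with a block as in the example above."
headings := soup findAllTags: [:e | e name = 'h3'].

"Assumed selector: answer only the first matching tag."
firstHeading := soup findTag: [:e | e name = 'h3'].

"Assumed selector: every string node; this block accepts them all."
strings := soup findAllStrings: [:s | true].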

For more usage information, browse the 'searching tags' and 'searching 
strings' protocols on the SoupElement subclasses. Also look at the tests 
in SoupElementTest, SoupTagTest and SoupParserTest. I will write/port 
proper documentation later.

There are still many things to do:

   - No attempt is made to deal with different character sets and
     encodings. This is a major feature of Beautiful Soup which I have
     so far ignored.

   - The parser will not convert entity or character references.
     Leaving them unconverted is also the default behavior for
     Beautiful Soup on HTML, but conversion is still an important
     feature.

   - The parser will not accept options such as whether to convert
     entities, which entities to convert, what to parse, etc.

   - The parser only handles HTML. Unlike Beautiful Soup, there are no
     configurations for XML flavors yet.

The project [2] is globally writable. I look forward to your feedback 
and contributions.

Thanks,
Zulq.

[1] http://www.crummy.com/software/BeautifulSoup/
[2] http://www.squeaksource.com/Soup.html



