[squeak-dev] Re: Extracting data from web pages using Squeak

John Richards ajtr at us.ibm.com
Mon Jun 16 19:51:54 UTC 2008

HtmlTokenizer helps here.  Here's a bit of code I added to String class to 
give you an idea of how to use it.

tagsOfType: aString
        "return all tags found in self of type aString"
        | endTag |
        endTag := '</' , aString , '>'.
        ^ ((HtmlTokenizer  on: self) upToEnd 
                select: [ :ea | ea name = aString])
                reject: [ :ea | ea source = endTag]
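
A quick way to try this from a workspace might look like the sketch below. It assumes tagsOfType: has been added to String as above; the HTML string is a made-up sample, not something from a real page.

```smalltalk
"Workspace sketch: assumes tagsOfType: above is installed on String.
 The HTML below is a made-up sample for illustration."
| html cells |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
cells := html tagsOfType: 'td'.
"Print the source of each start tag that was found"
cells do: [:tag | Transcript show: tag source; cr]
```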

Here's another example that is slightly richer (and probably could be 
improved but what the heck).

textOfType: aString
        "return a collection of triples of all tags found in self of type
        aString with start tag, intermediate text if any, and end tag if any"
        | stream element endTag triple answer |
        endTag := '</' , aString , '>'.
        answer := OrderedCollection new.
        stream := ReadStream on: ((HtmlTokenizer  on: self) upToEnd).
        [stream atEnd] whileFalse: [
                element := stream next.
                "skip end tags that were not consumed as part of a triple"
                (element name = aString and: [element source ~= endTag]) ifTrue: [
                        "start tag found"
                        triple := Array new: 3.
                        triple at: 1 put: element.
                        stream peek class = HtmlText ifTrue: [
                                triple at: 2 put: stream next.
                                "guard against the input ending right after the text"
                                (stream atEnd not and: [stream peek source = endTag]) ifTrue: [
                                        triple at: 3 put: stream next]].
                        answer add: triple]].
        ^ answer
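
To get from those triples to something closer to table data, one option is to keep only the text slot of each triple. This is just a sketch: it assumes textOfType: is installed on String as above, that HtmlText answers its raw text to source (I have only used source on tags in the code above), and the HTML string is again a made-up sample.

```smalltalk
"Workspace sketch: assumes textOfType: above is installed on String
 and that HtmlText responds to #source (an assumption)."
| html texts |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
texts := (html textOfType: 'td') collect: [:triple |
        "slot 2 is nil when the tag was not followed directly by text"
        (triple at: 2) isNil
                ifTrue: ['']
                ifFalse: [(triple at: 2) source]].
texts do: [:each | Transcript show: each; cr]
```

Running the same thing with 'tr' first and then 'td' on each row would be one way to group the cells by row, but nested tags inside a cell will not be picked up by this simple approach.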

Louis LaBrunda <Lou at Keystone-Software.com> wrote on 06/16/08 11:57 AM
to squeak-dev at lists.squeakfoundation.org:

Hi Cédrick,

Thanks for the hint.

>I would use:
>HTTPClient httpGet: 'http://url.com' to get the html stream.
>Then you can parse it...

Are there parsers available to get say table data into some kind of 

Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:Lou at Keystone-Software.com http://www.Keystone-Software.com
