[squeak-dev] Re: Extracting data from web pages using Squeak

John Richards ajtr at us.ibm.com
Mon Jun 16 19:51:54 UTC 2008


HtmlTokenizer helps here: it splits an HTML string into a flat sequence of
tag and text tokens.  Here's a method I added to String to give you an idea
of how to use it.

tagsOfType: aString
        "return all tags found in self of type aString"
 
        | endTag |
        endTag := '</' , aString , '>'.
        ^ ((HtmlTokenizer  on: self) upToEnd 
                select: [ :ea | ea name = aString])
                reject: [ :ea | ea source = endTag]
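
As a rough usage sketch (the sample HTML is made up, and this assumes 
tagsOfType: has been added to String as shown above), you could try 
something like this in a workspace.  For a live page you would start from 
the string you got via HTTPClient instead.

```smalltalk
| html cells |
"Hypothetical sample input; substitute the HTML of the page you fetched."
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
"Collect the start tags of every table cell."
cells := html tagsOfType: 'td'.
"Show the original source text of each tag found."
cells do: [:tag | Transcript show: tag source; cr]
```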



Here's another example that is slightly richer (and probably could be 
improved but what the heck).

textOfType: aString
        "return a collection of triples, one for each start tag of type aString
        found in self: the start tag, the text that follows it if any, and the
        end tag if it comes right after that text"

        | stream element endTag triple answer |
        endTag := '</' , aString , '>'.
        answer := OrderedCollection new.
        stream := ReadStream on: ((HtmlTokenizer on: self) upToEnd).
        [stream atEnd] whileFalse: [
                element := stream next.
                "start tag found; end tags share the name, so skip them"
                (element name = aString and: [element source ~= endTag]) ifTrue: [
                        triple := Array new: 3.
                        triple at: 1 put: element.
                        "peek answers nil at end of stream, so this test is safe there"
                        stream peek class = HtmlText ifTrue: [
                                triple at: 2 put: stream next.
                                "guard against the text being the last token"
                                (stream atEnd not and: [stream peek source = endTag]) ifTrue: [
                                        triple at: 3 put: stream next
                                        ]
                                ].
                        answer add: triple
                        ]
                ].
        ^ answer
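
For the table-data question, here is a sketch of how the triples could be 
boiled down to a collection of cell texts.  The sample HTML is made up, and 
it assumes the text token in slot 2 answers source the same way the tag 
tokens do, which I have not double-checked.

```smalltalk
| html cellTexts |
"Hypothetical sample input; substitute the HTML of the page you fetched."
html := '<table><tr><td>one</td><td>two</td></tr></table>'.
"Keep only the text slot of each triple; cells without plain text become empty strings."
cellTexts := (html textOfType: 'td') collect: [:triple |
        (triple at: 2) isNil
                ifTrue: ['']
                ifFalse: [(triple at: 2) source]].
cellTexts do: [:each | Transcript show: each; cr]
```

Grouping the cells by row would need a similar pass over the 'tr' tags, 
which this sketch does not attempt.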




Louis LaBrunda <Lou at Keystone-Software.com> wrote to squeak-dev on 
06/16/08 11:57 AM:

Hi Cédrick,

Thanks for the hint.

>I would use:
>HTTPClient httpGet: 'http://url.com' to get the html stream.
>Then you can parse it...

Are there parsers available to get say table data into some kind of 
collection?

Lou
-----------------------------------------------------------
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:Lou at Keystone-Software.com http://www.Keystone-Software.com


