[squeak-dev] Re: Extracting data from web pages using Squeak

Tue Jun 17 04:44:45 UTC 2008

Or port BeautifulSoup :)

--
Hwee-Boon

On Tue, Jun 17, 2008 at 3:51 AM, John Richards <ajtr at us.ibm.com> wrote:
>
> HtmlTokenizer helps here.  Here's a bit of code I added to the String class
> to give you an idea of how to use it.
>
> tagsOfType: aString
>         "return all tags found in self of type aString"
>
>         | endTag |
>         endTag := '</' , aString , '>'.
>         ^ ((HtmlTokenizer  on: self) upToEnd
>                 select: [ :ea | ea name = aString])
>                 reject: [ :ea | ea source = endTag]
>
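> For instance, a minimal sketch of calling it (assuming the method above
> has been added to String; the HTML literal is invented for illustration):
>
>         "meant to answer the start-tag tokens only, since end tags are rejected"
>         '<p>one</p><p>two</p>' tagsOfType: 'p'
>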
>
>
> Here's another example that is slightly richer (and probably could be
> improved but what the heck).
>
> textOfType: aString
>         "return a collection of triples of all tags found in self of type
>         aString with start tag, intermediate text if any, and end tag if any"
>
>         | stream element endTag triple answer |
>         endTag := '</' , aString , '>'.
>         answer := OrderedCollection new.
>         stream := ReadStream on: ((HtmlTokenizer  on: self) upToEnd).
>         [stream atEnd] whileFalse: [
>                 (element := stream next) name = aString ifTrue: [  "start tag found"
>                         triple := Array new: 3.
>                         triple at: 1 put: element.
>                         stream peek class = HtmlText ifTrue: [
>                                 triple at: 2 put: stream next.
>                                 (stream atEnd not and: [stream peek source = endTag]) ifTrue: [
>                                         triple at: 3 put: stream next
>                                         ]
>                                 ].
>                         answer add: triple
>                         ]
>                 ].
>         ^ answer
>
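> As a rough usage sketch (assuming textOfType: above is installed on
> String and HtmlTokenizer is present in the image; the HTML literal and
> the variable names are made up for illustration), the text of each table
> cell could be gathered into a collection like this:
>
>         | html cells |
>         html := '<table><tr><td>one</td><td>two</td></tr></table>'.
>         "slot 2 of each triple is the HtmlText token, or nil if no text followed the tag"
>         cells := (html textOfType: 'td') collect: [ :triple | triple at: 2 ].
>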
>
>
> On 06/16/08 11:57 AM, Louis LaBrunda <Lou at Keystone-Software.com> wrote
> to squeak-dev at lists.squeakfoundation.org:
>
> Hi Cédrick,
>
> Thanks for the hint.
>
>>I would use:
>>HTTPClient httpGet: 'http://url.com' to get the html stream.
>>Then you can parse it...
>
> Are there parsers available to get say table data into some kind of
> collection?
>
> Lou
> -----------------------------------------------------------
> Louis LaBrunda
> Keystone Software Corp.
> SkypeMe callto://PhotonDemon
> mailto:Lou at Keystone-Software.com http://www.Keystone-Software.com
>