[squeak-dev] Re: Extracting data from web pages using Squeak
John Richards
ajtr at us.ibm.com
Mon Jun 16 19:51:54 UTC 2008
HtmlTokenizer helps here. Here's a bit of code I added to the String class to
give you an idea of how to use it.
tagsOfType: aString
    "return all tags found in self of type aString"
    | endTag |
    endTag := '</' , aString , '>'.
    ^ ((HtmlTokenizer on: self) upToEnd
        select: [ :ea | ea name = aString])
        reject: [ :ea | ea source = endTag]
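For what it's worth, a rough workspace sketch of calling it might look like the following. It assumes tagsOfType: has been installed on String as above, and that the tokenizer reports tag names in lower case; the sample HTML string and the variable names are made up for illustration.

```smalltalk
"Workspace sketch (assumption: tagsOfType: is installed on String)."
| html cells |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.

"Collect the start tags of type 'td'; end tags are rejected by the method."
cells := html tagsOfType: 'td'.

"Show the source text of each tag found on the Transcript."
cells do: [ :tag | Transcript show: tag source; cr ].
```

Inspecting cells afterwards is probably the quickest way to see what kind of token objects the tokenizer hands back.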
Here's another example that is slightly richer (and probably could be
improved but what the heck).
textOfType: aString
    "return a collection of triples of all tags found in self of type
    aString with start tag, intermediate text if any, and end tag if any"
    | stream element endTag triple answer |
    endTag := '</' , aString , '>'.
    answer := OrderedCollection new.
    stream := ReadStream on: ((HtmlTokenizer on: self) upToEnd).
    [stream atEnd] whileFalse: [
        element := stream next.
        "skip end tags that were not consumed as part of an earlier triple"
        (element name = aString and: [element source ~= endTag]) ifTrue: [
            "start tag found"
            triple := Array new: 3.
            triple at: 1 put: element.
            stream peek class = HtmlText ifTrue: [
                triple at: 2 put: stream next.
                "guard against running off the end of the token stream"
                (stream atEnd not and: [stream peek source = endTag]) ifTrue: [
                    triple at: 3 put: stream next
                ]
            ].
            answer add: triple
        ]
    ].
    ^ answer
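As a hedged illustration of getting table data into a collection with this, something along these lines might work in a workspace. It assumes textOfType: is installed on String as above and that text tokens answer source the same way tags do; the sample HTML and the variable names are invented for the example.

```smalltalk
"Workspace sketch (assumptions: textOfType: is installed on String,
 and HtmlText tokens respond to source)."
| html triples texts |
html := '<table><tr><td>one</td><td>two</td></tr></table>'.

"For a live page one could instead try something like
 html := (HTTPClient httpGet: 'http://url.com') contents.
 assuming httpGet: answers a stream, as suggested earlier in the thread."

triples := html textOfType: 'td'.

"Keep only the triples that actually captured some text, then pull the
 text out of the middle slot."
texts := (triples select: [ :t | (t at: 2) notNil ])
    collect: [ :t | (t at: 2) source ].
```

Cells whose content starts with another tag rather than plain text end up with nil in the middle slot, which is why the select: is there.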
Louis LaBrunda <Lou at Keystone-Software.com>
Sent by: squeak-dev-bounces at lists.squeakfoundation.org
06/16/08 11:57 AM
Please respond to: Lou at Keystone-Software.com; The general-purpose Squeak developers list <squeak-dev at lists.squeakfoundation.org>
To: squeak-dev at lists.squeakfoundation.org
Subject: [squeak-dev] Re: Extracting data from web pages using Squeak
Hi Cédrick,
Thanks for the hint.
>I would use:
>HTTPClient httpGet: 'http://url.com' to get the html stream.
>Then you can parse it...
Are there parsers available to get, say, table data into some kind of
collection?
Lou
-----------------------------------------------------------
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:Lou at Keystone-Software.com http://www.Keystone-Software.com