[GOODIE] Squeak talking to Google ?
David Salamon
david at myth.sdsu.edu
Tue Apr 16 20:53:13 UTC 2002
On 4/16/02 1:39 PM, "David Salamon" <david at myth.sdsu.edu> wrote:
> Actually, I was messing around about two weeks ago, and have an HTML-parsing-
> based Google search utility, although it's fairly amateurish.
>
> The parser is attached, along with a #getTokenizer method for HttpUrl.
>
> Tell me what you think,
> David Salamon
>
>
Hmm... My mistake, this is a goodie. Sorry about the repost.
-------------- next part --------------
'From Squeak3.2gamma of 15 January 2002 [latest update: #4743] on 16 April 2002 at 12:37:32 pm'!
!HttpUrl methodsFor: 'downloading' stamp: 'DS 3/31/2002 03:28'!
getTokenizer
	"Answer an HtmlTokenizer over the contents retrieved from this URL."
	^ HtmlTokenizer on: self retrieveContents contentStream! !
-------------- next part --------------
'From Squeak3.2gamma of 15 January 2002 [latest update: #4743] on 16 April 2002 at 12:34:55 pm'!
Object subclass: #GoogleParser
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'Anna-Project'!
!GoogleParser commentStamp: 'DS 4/4/2002 18:44' prior: 0!
Parses Google web search result pages for use from Smalltalk. All behavior is on the class side; the class has no instance variables.

Possible useful expressions for doIt or printIt (WordDistributionParser is not included in this fileOut):

	| tokens |
	Transcript clear.
	(GoogleParser search: 'text' startingIndex: 900 atRandom)
		do: [:page |
			tokens _ (WordDistributionParser parseOnUrl: page) sortedCounts.
			tokens do: [:each |
				Transcript show: each key printString; tab; show: each value printString; cr]]
!
"-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- "!
GoogleParser class
instanceVariableNames: ''!
!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'ds 3/31/2002 01:34'!
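createUrlForSearch: keyWords startingAt: anInt
	"Answer the Google search URL for keyWords, with results starting at index anInt. keyWords is inserted as given, without URL-encoding."
	^ ('http://www.google.com/search?q=', keyWords withoutQuoting, '&hl=en&start=', anInt printString withoutQuoting) asUrl! !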
!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'DS 4/2/2002 21:18'!
hitsReturnedFromSearch: keyWords
	"Answer the text of the fifth text token found directly after an opening <b> tag on the first results page (the method name suggests this is where Google prints the hit count), or 0 if there is none."
	| inBoldTag parseCounter |
	inBoldTag _ false.
	parseCounter _ 0.
	(GoogleParser createUrlForSearch: keyWords startingAt: 0) getTokenizer
		do: [:next |
			(inBoldTag and: [next isText])
				ifTrue: [
					parseCounter _ parseCounter + 1.
					parseCounter = 5 ifTrue: [^ next text]].
			inBoldTag _ next isTag and: [next name = 'b' and: [next isNegated not]]].
	^ 0! !
!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'ds 3/30/2002 23:35'!
search: word
	"returns the first pages matching the query"
	^ self search: word startingIndex: 0.! !
!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'DS 4/16/2002 12:34'!
search: word startingIndex: anInt
	"Answer a Bag of URLs, with the path cleared, for the results matching the query, starting from result index anInt. Only text tokens beginning with 'www.' are picked up."
	| tokenizer bag |
	bag _ Bag new.
	tokenizer _ (GoogleParser createUrlForSearch: word startingAt: anInt) getTokenizer.
	tokenizer do: [:next |
		(next isText and: [next text beginsWith: 'www.'])
			ifTrue: [bag add: ((next text asUrl) path: '')]].
	^bag.! !
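
A minimal workspace sketch of how the class-side methods above might be used once both attachments are filed in. It calls only selectors defined in the attachments, plus Transcript and standard collection protocol. The query string 'squeak' is an arbitrary example, and whether anything useful comes back depends on Google serving the page layout this parser expects.

```smalltalk
"Workspace sketch: select and doIt. Assumes network access and that both attachments are filed in."
| query hits pages |
query := 'squeak'.	"arbitrary example query"

"The search URL the parser would fetch for results starting at index 10."
Transcript show: (GoogleParser createUrlForSearch: query startingAt: 10) printString; cr.

"Text of the fifth bold text run on the first results page, or 0 if none was found."
hits := GoogleParser hitsReturnedFromSearch: query.
Transcript show: 'hits: ', hits printString; cr.

"URLs picked up from the first results page; asSet drops the duplicates a Bag keeps."
pages := GoogleParser search: query.
pages asSet do: [:url | Transcript show: url printString; cr].
```

The sketch uses := for assignment, which Squeak accepts alongside the _ used in the attachments.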