[GOODIE] Squeak talking to Google ?

David Salamon david at myth.sdsu.edu
Tue Apr 16 20:53:13 UTC 2002


On 4/16/02 1:39 PM, "David Salamon" <david at myth.sdsu.edu> wrote:

> Actually, I was messing around about two weeks ago, and have an HTML-parsing-based
> Google search utility, although it's fairly amateurish.
> 
> The parser is attached, along with a #getTokenizer method for HttpUrl.
> 
> Tell me what you think,
>   David Salamon
> 
> 

Hmm... My mistake, this is a goodie. Sorry about the repost.

-------------- next part --------------
'From Squeak3.2gamma of 15 January 2002 [latest update: #4743] on 16 April 2002 at 12:37:32 pm'!

!HttpUrl methodsFor: 'downloading' stamp: 'DS 3/31/2002 03:28'!
getTokenizer
	^ HtmlTokenizer on: self retrieveContents contentStream! !
-------------- next part --------------
'From Squeak3.2gamma of 15 January 2002 [latest update: #4743] on 16 April 2002 at 12:34:55 pm'!
Object subclass: #GoogleParser
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Anna-Project'!
!GoogleParser commentStamp: 'DS 4/4/2002 18:44' prior: 0!
Designed to parse result pages from the Google web search engine for use in Smalltalk.

The class has no instance variables; all of its methods are on the class side.

A possibly useful expression for doIt or printIt:

| tokens |
Transcript clear.
(GoogleParser search: 'text' startingIndex: 900 atRandom)
	do: [:page |
		tokens _ (WordDistributionParser parseOnUrl: page) sortedCounts.
		tokens do: [:each |
				Transcript show: each key printString; tab; show: each value printString; cr]]
!


"-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- "!

GoogleParser class
	instanceVariableNames: ''!

!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'ds 3/31/2002 01:34'!
createUrlForSearch: keyWords startingAt: anInt
	^(('http://www.google.com/search?q=',keyWords withoutQuoting,'&hl=en&start=',anInt printString withoutQuoting) asUrl)! !

!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'DS 4/2/2002 21:18'!
hitsReturnedFromSearch: keyWords
	"Answer the text of the fifth bold text token on the first results page, presumably the hit count. Answer 0 if no such token is found."
	| inBoldTag parseCounter |
	inBoldTag _ false.
	parseCounter _ 0.

	(GoogleParser createUrlForSearch: keyWords startingAt: 0) getTokenizer
		do: [:next |
			"Count text tokens that directly follow an opening <b> tag."
			(inBoldTag and: [next isText])
				ifTrue: [
					parseCounter _ parseCounter + 1.
					parseCounter = 5 ifTrue: [^ next text]].
			inBoldTag _ next isTag and: [next name = 'b' and: [next isNegated not]]].
	^ 0! !

!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'ds 3/30/2002 23:35'!
search: word
	"returns the first pages matching the query"
	^ self search: word startingIndex: 0.! !

!GoogleParser class methodsFor: 'as yet unclassified' stamp: 'DS 4/16/2002 12:34'!
search: word startingIndex: anInt
	"returns the pages matching the query, starting from the submitted index"
	| tokenizer bag |
	bag _ Bag new.
	tokenizer _ (GoogleParser createUrlForSearch: word startingAt: anInt) getTokenizer.

	tokenizer do: [:next |
		(next isText and: [next text beginsWith: 'www.'])
			ifTrue: [bag add: ((next text asUrl) path: '')]].
	^bag.! !

