Yahoo! Search API on top of VWXML

Matthew S. Hamrick mhamrick at cryptonomicon.net
Fri Apr 15 22:35:10 UTC 2005


FYI..
	So I spent a little time looking for tools to automate some simple web 
searches and alert me if certain conditions were met. For various 
reasons none of them seemed to work out. Then I noticed a blog entry 
about Yahoo!'s new search API web service (more info at 
http://developer.yahoo.net/ ). And then I thought... "hey... I just 
spent a fair amount of time creating web service debugging tools for my 
last job on top of the VisualWorks VWXML SAX Driver..." so... I whipped 
up a few classes to encapsulate the logic of querying the Yahoo! 
interface. Yahoo! is offering five types of searches: generic web 
search, news search, image search, local search, and video search.

	In what has become my style, I spent 2 hours on the first 
implementation before completely canning it. Then I spent 2 hours 
working on the Web Search followed by an hour of refactoring and new 
coding to create the News Search. But then the third search, the Image 
Search, took 20 minutes. The last two searches, local and video, took 
about 10 minutes each. On top of that I put an extra 30 minutes of 
debugging, and voila! a "reasonable" interface to the web service.

	You can find the classes at: 
http://www.cryptonomicon.net/msh/squeak/msh-yahoo.st . They require that 
the VWXML (and related) change sets be loaded (VWXML.1.cs, OX.1.cs, 
VWXMLSaxDriver-fix.st, and VWXMLTweaks.1.cs). If you have these 
loaded, it should be pretty straightforward to load the msh-yahoo.st 
classes.

	The current implementation is rather slow and does not demonstrate the 
best Smalltalk style, but one of the great joys of OO is that it tends 
to be easier to replace what's under the hood after the car is driving.

	Assuming you're interested, here are a few notes about using the 
msh-yahoo.st classes.

	I totally violated one of my fundamental rules with this small 
project: I didn't write the test code first. Had I whipped up a 
few SUnit tests, I believe I could probably have finished much more 
quickly than the 5 1/2 hours it took me to do this rev. Or rather... it 
probably would have taken me 1 hour of SUnit development, and 4 1/2 
hours of function implementation, but I would have wound up with a 
series of tests that would help track down bugs in the future.
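
	For illustration only, here is a rough sketch of what a couple of 
such SUnit tests might look like. The class name YahooWebSearchTest, the 
category name, and the test names are all hypothetical; the only 
msh-yahoo.st messages used are the ones described further down in this 
note (#query:, #type:, #size and #at:). Both tests talk to the live web 
service, so they need a network connection and can fail for reasons 
unrelated to the code.

TestCase subclass: #YahooWebSearchTest
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'MSH-Yahoo-Tests'

"YahooWebSearchTest>>testSizeIsNotNegative"
testSizeIsNotNegative
	| search |
	search _  ( YahooWebSearch new )
				query: 'star wars';
				type: 'phrase'.
	"The reported hit count should never be negative."
	self assert: ( search size >= 0 )

"YahooWebSearchTest>>testFirstResultExists"
testFirstResultExists
	| search |
	search _  ( YahooWebSearch new )
				query: 'star wars';
				type: 'phrase'.
	"Only check the first result if the service reports any hits."
	( search size > 0 )
		ifTrue: [ self deny: ( search at: 1 ) isNil ]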

	There are two sets of classes you have to worry about: YahooSearch 
(and its subclasses) and YahooSearchResult (and its subclasses). You 
use one of the subclasses of YahooSearch to represent a query (there's 
a subclass for each of the query types: YahooWebSearch, 
YahooImageSearch, YahooLocalSearch, YahooVideoSearch, and 
YahooNewsSearch). For each of the search types, there's a corresponding 
result type: YahooWebSearchResult, YahooImageSearchResult, etc. I 
probably should have represented results as dictionaries and allowed 
users to access the results with the #at: message, but I didn't. So sue 
me.
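
	As a rough sketch of that alternative (this is not how msh-yahoo.st 
works), a dictionary-style result could be built and read as shown 
below. The keys #title and #url are assumed for illustration, and the 
values are placeholders rather than real search results.

| result |
"Build a result as a plain Dictionary keyed by field name."
result _ Dictionary new.
result at: #title put: 'An example title'.
result at: #url put: 'http://www.example.com/'.
"Read a field back with #at: instead of a dedicated accessor."
Transcript
	nextPutAll: ( result at: #title );
	cr;
	flush.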

	YahooSearch is a subclass of VWXMLSaxDriver because I wanted to limit 
the number of classes hanging around that have to have knowledge of the 
Yahoo! web service interface. I personally find this mildly ugly, but 
hey... I didn't want to spend forever working on this project.

	YahooSearch and its subclasses also exhibit behaviors of collections. 
For instance, after setting up a query, you can use the #size message 
to find out how many responses there are to your query, and the #at: 
message to retrieve information about the Nth response. I even added a #do: 
message to iterate over all the results. I beg you to use this feature 
with caution as you can frequently find search terms that return 
millions of hits. I'm not responsible if you're silly enough to want to 
iterate over EVERY response for the query: 'star wars'.

	So here's a code sample that retrieves the first 100 results for "star 
wars" and prints their titles to the Transcript.

| search |
search _  ( YahooWebSearch new )
			query: 'star wars';
			type: 'phrase';
			results: 50.
1 to: 100 do: [ :index |
	Transcript
		nextPutAll: ( index asString );
		nextPutAll: ' - ';
		nextPutAll: ( ( search at: index ) title );
		cr;
		flush.
].

	And examining this code... the search object is instantiated with the 
#new message sent to YahooWebSearch. We tell that object that its 
query is 'star wars' with the #query: message. The #type: message tells 
the Yahoo! WS interface that 'star wars' is a phrase, not two different 
words we want to search on. The #results: message tells the query object 
to ask for 50 results at a time. There are a number of other options 
you can set, including #format:, which allows you to select the file 
type you're interested in. So, if you were interested in knowing how 
many PDF files on the web mention 'star wars', you could use the 
following code snippet:

( YahooWebSearch new )
	query: 'star wars';
	type: 'phrase';
	format: 'pdf';
	size

	I leave it to you to explore the other search types. The Yahoo! 
developer web page is a good read if you're trying to figure out what's 
what. I also ask you to kindly report bugs via email. Or don't if you 
don't want to.

	One other thing... Yahoo! limits the interface to 50 results per 
query. A particular search can certainly produce more than 50 hits, so 
you simply make multiple queries, passing a parameter on each 
subsequent query to tell the interface where in the sequence of results 
you want to start. Each query returns a "total hits" value indicating 
how many hits there are to a particular query. But the Yahoo! search 
API can get confused sometimes, and it will sometimes change the number 
of "total hits" on subsequent calls to the API. This leads to a 
situation where the first call to the API will tell you there are 150 
hits, but subsequent calls tell you there might be 120 or so. This is 
especially upsetting when you do something like:

| search |
search _  ( YahooWebSearch new )
			query: 'matthew s. hamrick';
			type: 'phrase'.
1 to: ( search size ) do: [ :index |
	Transcript
		nextPutAll: ( index asString );
		nextPutAll: ' - ';
		nextPutAll: ( ( search at: index ) title );
		cr;
		flush.
].

When I do this search, the first query the interface makes tells me 
there are 82 results. Subsequent calls tell me there are either 68 or 
69 results. But the size of the query's result space is calculated from 
the first value (82). We return a nil if we can't find a particular 
result, so the code above should fail with an error message about 
sending #title to UndefinedObject. To get around this, try something 
more like this:

| count |
count _ 1.
( ( YahooWebSearch new )
	query: 'matthew s. hamrick';
	type: 'phrase' ) do: [ :value |
		( value isNil ) ifFalse: [
			Transcript
				nextPutAll: ( count asString );
				nextPutAll: ' - ';
				nextPutAll: ( value title );
				cr;
				flush.
			count _ count + 1.
		].
	].
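
	If you'd rather keep the index-based loop, a minimal sketch along 
the same lines is to cap the number of results you visit and skip any 
nil entries. It uses only the #size and #at: messages described above 
plus the standard #min: message; the limit of 200 is an arbitrary 
number picked for illustration.

| search limit result |
search _  ( YahooWebSearch new )
			query: 'star wars';
			type: 'phrase'.
"Never walk more than 200 results, however many hits are reported."
limit _ ( search size ) min: 200.
1 to: limit do: [ :index |
	result _ search at: index.
	"Skip the holes left when the reported hit count shrinks."
	( result isNil ) ifFalse: [
		Transcript
			nextPutAll: ( index asString );
			nextPutAll: ' - ';
			nextPutAll: ( result title );
			cr;
			flush.
	].
].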

Anyway... hope this is of interest to someone else out there. Over the 
weekend, I'm going to try to find the time to hack together a Yahoo! 
web search browser, so I'll probably try to add a few SUnit tests 
before going down that path.

-Cheers,
-Matt H.




More information about the Squeak-dev mailing list