Yahoo! Search API on top of VWXML
Matthew S. Hamrick
mhamrick at cryptonomicon.net
Fri Apr 15 22:35:10 UTC 2005
FYI..
So I spent a little time looking for tools to automate some simple web
searches and alert me if certain conditions were met. For various
reasons none of them seemed to work out. Then I noticed a blog entry
about Yahoo!'s new search API web service (more info at
http://developer.yahoo.net/). And then I thought... "hey... I just
spent a fair amount of time creating web service debugging tools for my
last job on top of the VisualWorks VWXML SAX Driver..." so... I whipped
up a few classes to encapsulate the logic of querying the Yahoo!
interface. Yahoo! is offering five types of searches: generic web
search, news search, image search, local search, and video search.
In what has become my style, I spent 2 hours on the first
implementation before completely canning it. Then I spent 2 hours
working on the Web Search followed by an hour of refactoring and new
coding to create the News Search. But then the third search, the Image
Search, took 20 minutes. The last two searches, local and video, took
about 10 minutes each. On top of that I put an extra 30 minutes of
debugging, and voila! a "reasonable" interface to the web service.
You can find the classes at:
http://www.cryptonomicon.net/msh/squeak/msh-yahoo.st . They require the
VWXML (and related) change sets to be loaded (VWXML.1.cs, OX.1.cs,
VWXMLSaxDriver-fix.st, and VWXMLTweaks.1.cs). So if you have these
loaded, it should be pretty straightforward to load the msh-yahoo.st
classes.
The current implementation is rather slow and does not demonstrate the
best Smalltalk style, but one of the great joys of OO is that it tends
to be easier to replace what's under the hood after the car is driving.
Assuming you're interested, here are a few notes about using the
msh-yahoo.st classes.
I totally violated one of my fundamental rules with this small
project: I didn't write the test code first. Had I whipped up a few
SUnit tests first, I believe I could probably have finished much quicker
than the 5 1/2 hours it took me to do this rev. Or rather... it
probably would have taken me 1 hour of SUnit development, and 4 1/2
hours of function implementation, but I would have wound up with a
series of tests that would help track down bugs in the future.
There are two sets of classes you have to worry about: YahooSearch
(and its subclasses) and YahooSearchResult (and its subclasses). You
use one of the subclasses of YahooSearch to represent a query (there's
a subclass for each of the query types: YahooWebSearch,
YahooImageSearch, YahooLocalSearch, YahooVideoSearch, and
YahooNewsSearch). For each of the search types, there's a corresponding
result type: YahooWebSearchResult, YahooImageSearchResult, etc. I
probably should have represented results as dictionaries and allowed
users to access the results with the #at: message, but I didn't. So sue
me.
YahooSearch is a subclass of VWXMLSaxDriver because I wanted to limit
the number of classes hanging around that have to have knowledge of the
Yahoo! web service interface. I personally find this mildly ugly, but
hey... I didn't want to spend forever working on this project.
YahooSearch and its subclasses also exhibit behaviors of collections.
For instance, after setting up a query, you can use the #size and #at:
messages to find out how many responses there are to your query, and to
retrieve information about the Nth response. I even added a #do:
message to iterate over all the results. I beg you to use this feature
with caution as you can frequently find search terms that return
millions of hits. I'm not responsible if you're silly enough to want to
iterate over EVERY response for the query: 'star wars'.
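One way to keep such a loop under control is to cap it. What follows is a minimal sketch, not part of msh-yahoo.st: it assumes only that #size and #at: behave as described above, that each result answers #title, and that a missing result comes back as nil. The cap of 20 and the names limit and result are arbitrary choices for the illustration.

```smalltalk
| search limit result |
search _ ( YahooWebSearch new )
    query: 'star wars';
    type: 'phrase'.
"Walk at most 20 results (an arbitrary cap), however many hits there are."
limit _ ( search size ) min: 20.
1 to: limit do: [ :index |
    result _ search at: index.
    "Skip any slot the service did not fill in."
    result isNil ifFalse: [
        Transcript
            nextPutAll: ( result title );
            cr;
            flush ] ].
```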
So here's a code sample that retrieves the first 100 results for "star
wars" and prints their titles to the Transcript.
    | search |
    search _ ( YahooWebSearch new )
        query: 'star wars';
        type: 'phrase';
        results: 50.
    1 to: 100 do: [ :index |
        Transcript
            nextPutAll: ( index asString );
            nextPutAll: ' - ';
            nextPutAll: ( ( search at: index ) title );
            cr;
            flush.
    ].
Examining this code... the search object is instantiated with the
#new message sent to YahooWebSearch. We tell that object that its
query is 'star wars' with the #query: message. The #type: message tells
the Yahoo! WS interface that 'star wars' is a phrase, not two different
words we want to search on. The #results: message tells the query object
to ask for 50 results at a time. There are a number of other options
you can set, including #format:, which allows you to select the file
type you're interested in. So, if you were interested in knowing how
many PDF files on the web mention 'star wars', you could use the
following code snippet:
    ( YahooWebSearch new )
        query: 'star wars';
        type: 'phrase';
        format: 'pdf';
        size
I leave it to you to explore the other search types. The Yahoo!
developer web page is a good read if you're trying to figure out what's
what. I also ask you to kindly report bugs via email. Or don't if you
don't want to.
One other thing... Yahoo! limits the interface to 50 results per
query. Now it's certainly possible that a particular search will result
in more than 50 hits, so you simply make multiple queries, passing a
parameter on each subsequent query to tell the interface where in the
sequence of results you want to start. Each query returns a "total
hits" value indicating how many hits there are to a particular query.
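To make the paging arithmetic concrete, here is a small sketch that only computes where each 50-result page would start. It assumes result positions are 1-based; totalHits, pageSize and starts are names made up for this illustration and do not appear in msh-yahoo.st.

```smalltalk
| totalHits pageSize starts |
totalHits _ 150.    "example value; the service reports the real one"
pageSize _ 50.      "the per-query limit mentioned above"
starts _ OrderedCollection new.
"Collect the first result position of every page."
1 to: totalHits by: pageSize do: [ :start | starts add: start ].
Transcript show: starts printString; cr.
"Transcript shows: an OrderedCollection(1 51 101)"
```

Under the same assumptions, result N falls on page ( N - 1 ) // pageSize + 1.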
But the Yahoo! search API can get confused sometimes, and it will
sometimes change the number of "total hits" on subsequent calls to the
API. This leads to a situation where the first call to the API will
tell you there are 150 hits, but subsequent calls tell you there might be
120 or so. This is especially upsetting when you do something like:
    | search |
    search _ ( YahooWebSearch new )
        query: 'matthew s. hamrick';
        type: 'phrase'.
    1 to: ( search size ) do: [ :index |
        Transcript
            nextPutAll: ( index asString );
            nextPutAll: ' - ';
            nextPutAll: ( ( search at: index ) title );
            cr;
            flush.
    ].
When I do this search, the first query the interface makes tells me
there are 82 results. Subsequent calls tell me there are either 68 or
69 results. But the size of the query's result space is calculated from
the first value (82). We return nil if we can't find a particular
result, so the code above should fail with an error message about
sending #title to UndefinedObject. To get around this, try something
more like this:
    | count |
    count _ 1.
    ( ( YahooWebSearch new )
        query: 'matthew s. hamrick';
        type: 'phrase' ) do: [ :value |
            ( value isNil ) ifFalse: [
                Transcript
                    nextPutAll: ( count asString );
                    nextPutAll: ' - ';
                    nextPutAll: ( value title );
                    cr;
                    flush.
                count _ count + 1.
            ].
        ].
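The same nil check works if you would rather keep the titles than print them. This is a rough sketch that relies only on the #do: and #title behavior described above; the titles collection is a local name invented for the example.

```smalltalk
| titles |
titles _ OrderedCollection new.
( ( YahooWebSearch new )
    query: 'matthew s. hamrick';
    type: 'phrase' ) do: [ :value |
        "Results the service failed to deliver arrive as nil; skip them."
        value isNil ifFalse: [ titles add: value title ] ].
Transcript show: titles size printString, ' titles collected'; cr.
```

Because the missing results are dropped, titles size may well be smaller than the "total hits" figure reported by the first query.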
Anyway... hope this is of interest to someone else out there. Over the
weekend, I'm going to try to find the time to hack together a Yahoo!
web search browser, so I'll probably try to add a few SUnit tests
before going down that path.
-Cheers,
-Matt H.