Yahoo! Search API on top of VWXML
stéphane ducasse
ducasse at iam.unibe.ch
Sat Apr 16 12:13:50 UTC 2005
this is cool
I will keep that for my lectures ;)
On 16 Apr. 05, at 0:35, Matthew S. Hamrick wrote:
> FYI..
> So I spent a little time looking for tools to automate some simple
> web searches and alert me if certain conditions were met. For various
> reasons none of them seemed to work out. Then I noticed a blog entry
> about Yahoo!'s new search API web service (more info at
> http://developer.yahoo.net/ ). And then I thought... "hey... I just
> spent a fair amount of time creating web service debugging tools for
> my last job on top of the VisualWorks VWXML SAX Driver..." so... I
> whipped up a few classes to encapsulate the logic of querying the
> yahoo interface. Yahoo! is offering five types of searches: generic
> web search, news search, image search, local search, and video search.
>
> In what has become my style, I spent 2 hours on the first
> implementation before completely canning it. Then I spent 2 hours
> working on the Web Search followed by an hour of refactoring and new
> coding to create the News Search. But then the third search, the Image
> Search, took 20 minutes. The last two searches, local and video, took
> about 10 minutes each. On top of that I put an extra 30 minutes of
> debugging, and voila! a "reasonable" interface to the web service.
>
> You can find the classes at:
> http://www.cryptonomicon.net/msh/squeak/msh-yahoo.st . They require
> the VWXML (and related) change sets be loaded (VWXML.1.cs, OX.1.cs,
> VWXMLSaxDriver-fix.st, and VWXMLTweaks.1.cs). So if you have these
> loaded, it should be pretty straightforward to load the msh-yahoo.st
> classes.
>
> The current implementation is rather slow and does not demonstrate
> the best Smalltalk style, but one of the great joys of OO is that it
> tends to be easier to replace what's under the hood after the car is
> driving.
>
> Assuming you're interested, here are a few notes about using the
> msh-yahoo.st classes.
>
> I totally violated one of my fundamental rules with this small
> project: I didn't write the test code first. Had I whipped up a
> few SUnit tests, I believe I could probably have finished much faster
> than the 5 1/2 hours it took me to do this rev. Or rather... it
> probably would have taken me 1 hour of SUnit development, and 4 1/2
> hours of function implementation, but I would have wound up with a
> series of tests that would help track down bugs in the future.
>
> There are two sets of classes you have to worry about: YahooSearch
> (and its subclasses) and YahooSearchResult (and its subclasses). You
> use one of the subclasses of YahooSearch to represent a query (there's
> a subclass for each of the query types: YahooWebSearch,
> YahooImageSearch, YahooLocalSearch, YahooVideoSearch, and
> YahooNewsSearch). For each of the search types, there's a
> corresponding result type: YahooWebSearchResult,
> YahooImageSearchResult, etc. I probably should have represented
> results as dictionaries and allowed users to access the results with
> the #at: message, but I didn't. So sue me.
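>
> A minimal sketch of that dictionary-style alternative, not part of
> msh-yahoo.st: the keys #title, #url and #summary and the values are
> made up for illustration, and only standard Dictionary and Transcript
> messages are used. It is written with := for assignment, which Squeak
> accepts alongside _.
>
> ```smalltalk
> "Hypothetical: one search result held as a Dictionary instead of a
>  YahooWebSearchResult. Keys and values are illustrative only."
> | result |
> result := Dictionary new.
> result
>     at: #title put: 'Star Wars';
>     at: #url put: 'http://www.example.com/'.
> Transcript
>     show: ( result at: #title );
>     cr;
>     "A missing field falls back to a default instead of raising an error."
>     show: ( result at: #summary ifAbsent: [ '(no summary)' ] );
>     cr.
> ```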
>
> YahooSearch is a subclass of VWXMLSaxDriver because I wanted to limit
> the number of classes hanging around that have to have knowledge of
> the Yahoo! web service interface. I personally find this mildly ugly,
> but hey... I didn't want to spend forever working on this project.
>
> YahooSearch and its subclasses also exhibit behaviors of
> collections. For instance, after setting up a query, you can use the
> #size and #at: messages to find out how many responses there are to
> your query, and to retrieve information about the Nth response. I even
> added a #do: message to iterate over all the results. I beg you to use
> this feature with caution as you can frequently find search terms that
> return millions of hits. I'm not responsible if you're silly enough to
> want to iterate over EVERY response for the query: 'star wars'.
>
> So here's a code sample that retrieves the first 100 results for
> "star wars" and prints their titles to the Transcript.
>
> | search |
> search _ ( YahooWebSearch new )
>     query: 'star wars';
>     type: 'phrase';
>     results: 50.
> 1 to: 100 do: [ :index |
>     Transcript
>         nextPutAll: ( index asString );
>         nextPutAll: ' - ';
>         nextPutAll: ( ( search at: index ) title );
>         cr;
>         flush.
> ].
>
> And examining this code... the search object is instantiated with the
> #new message sent to YahooWebSearch. We tell that object that its
> query is 'star wars' with the #query: message. The #type: message tells
> the Yahoo WS interface that 'star wars' is a phrase, not two different
> words we want to search on. The #results: message tells the query
> object to ask for 50 results at a time. There are a number of other
> options you can set, including #format:, which allows you to select
> the file type you're interested in. So, if you were interested in
> knowing how many PDF files on the web mention 'star wars', you could
> use the following code snippet:
>
> ( YahooWebSearch new )
>     query: 'star wars';
>     type: 'phrase';
>     format: 'pdf';
>     size
>
> I leave it to you to explore the other search types. The Yahoo!
> developer web page is a good read if you're trying to figure out
> what's what. I also ask you to kindly report bugs via email. Or don't
> if you don't want to.
>
> One other thing... Yahoo! limits the interface to 50 results per
> query. Now it's certainly possible that a particular search will
> result in more than 50 hits, so you simply make multiple queries with
> parameters on subsequent queries to tell the interface where in the
> sequence of results you want to start. Each query returns a "total
> hits" value indicating how many hits there are to a particular query.
> But the Yahoo! search API can get confused sometimes, and it will
> sometimes change the number of "total hits" on subsequent calls to the
> API. This leads to a situation where the first call to the API will
> tell you there are 150 hits, but subsequent calls tell you there might
> be 120 or so. This is especially upsetting when you do something like:
>
> | search |
> search _ ( YahooWebSearch new )
>     query: 'matthew s. hamrick';
>     type: 'phrase'.
> 1 to: ( search size ) do: [ :index |
>     Transcript
>         nextPutAll: ( index asString );
>         nextPutAll: ' - ';
>         nextPutAll: ( ( search at: index ) title );
>         cr;
>         flush.
> ].
>
> When I do this search, the first query the interface makes tells me
> there are 82 results. Subsequent calls tell me there are either 68 or
> 69 results. But the size of the query's result space is calculated
> from the first value (82). We return nil if we can't find a
> particular result, so the code above should fail with an error message
> about sending #title to UndefinedObject (the nil that #at: returned).
> To get around this, try something more like this:
>
> | count |
> count _ 1.
> ( ( YahooWebSearch new )
>     query: 'matthew s. hamrick';
>     type: 'phrase' ) do: [ :value |
>     ( value isNil ) ifFalse: [
>         Transcript
>             nextPutAll: ( count asString );
>             nextPutAll: ' - ';
>             nextPutAll: ( value title );
>             cr;
>             flush.
>         count _ count + 1.
>     ].
> ].
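>
> The paging described earlier (at most 50 results per query, with later
> queries told where in the sequence to start) comes down to simple
> arithmetic. A rough sketch, assuming 150 total hits; the numbers and
> variable names are illustrative, and no actual Yahoo! request
> parameters are shown:
>
> ```smalltalk
> "Assumed numbers: 150 total hits, 50 results per query."
> | pageSize totalHits starts |
> pageSize := 50.
> totalHits := 150.
> "First result index of each query: 1, 51, 101."
> starts := ( 1 to: totalHits by: pageSize ) asOrderedCollection.
> starts do: [ :start |
>     Transcript
>         show: 'query starting at ', start printString,
>             ' through ', ( ( start + pageSize - 1 ) min: totalHits ) printString;
>         cr ].
> ```
>
> If a later response lowers the total (say to 120), recomputing the
> ranges from the new total shortens the last one to 101 through 120
> rather than leaving indexes that come back nil.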
>
> Anyway... hope this is of interest to someone else out there. Over the
> weekend, I'm going to try to find the time to hack together a Yahoo!
> web search browser, so I'll probably try to add a few SUnit tests
> before going down that path.
>
> -Cheers,
> -Matt H.
>
>
More information about the Squeak-dev
mailing list