Yahoo! Search API on top of VWXML

stéphane ducasse ducasse at iam.unibe.ch
Sat Apr 16 12:13:50 UTC 2005


this is cool
I will keep that for my lectures ;)

On 16 Apr 05, at 0:35, Matthew S. Hamrick wrote:

> FYI..
> 	So I spent a little time looking for tools to automate some simple 
> web searches and alert me if certain conditions were met. For various 
> reasons none of them seemed to work out. Then I noticed a blog entry 
> about Yahoo!'s new search API web service ( more info at 
> http://developer.yahoo.net/ . ) And then I thought... "hey... I just 
> spent a fair amount of time creating web service debugging tools for 
> my last job on top of the VisualWorks VWXML SAX Driver..." so... I 
> whipped up a few classes to encapsulate the logic of querying the 
> yahoo interface. Yahoo! is offering five types of searches: generic 
> web search, news search, image search, local search, and video search.
>
> 	In what has become my style, I spent 2 hours on the first 
> implementation before completely canning it. Then I spent 2 hours 
> working on the Web Search followed by an hour of refactoring and new 
> coding to create the News Search. But then the third search, the Image 
> Search, took 20 minutes. The last two searches, local and video, took 
> about 10 minutes each. On top of that I put an extra 30 minutes of 
> debugging, and voila! a "reasonable" interface to the web service.
>
> 	You can find the classes at: 
> http://www.cryptonomicon.net/msh/squeak/msh-yahoo.st . They require 
> the VWXML (and related) change sets be loaded (VWXML.1.cs, OX.1.cs, 
> VWXMLSaxDriver-fix.st, and VWXMLTweaks.1.cs). So if you have these 
> loaded, it should be pretty straightforward to load the msh-yahoo.st 
> classes.
>
> 	The current implementation is rather slow and does not demonstrate 
> the best Smalltalk style, but one of the great joys of OO is that it 
> tends to be easier to replace what's under the hood after the car is 
> driving.
>
> 	Assuming you're interested, here are a few notes about using the 
> msh-yahoo.st classes.
>
> 	I totally violated one of my fundamental rules with this small 
> project: I didn't write the test code first. Had I whipped up a few 
> SUnit tests, I believe I could have finished much quicker than the 
> 5 1/2 hours it took me to do this rev. Or rather... it probably would 
> have taken me 1 hour of SUnit development, and 4 1/2 hours of function 
> implementation, but I would have wound up with a series of tests that 
> would help track down bugs in the future.
>
> 	There are two sets of classes you have to worry about: YahooSearch 
> (and its subclasses) and YahooSearchResult (and its subclasses.) You 
> use one of the subclasses of YahooSearch to represent a query (there's 
> a subclass for each of the query types: YahooWebSearch, 
> YahooImageSearch, YahooLocalSearch, YahooVideoSearch, and 
> YahooNewsSearch.) For each of the search types, there's a 
> corresponding result type: YahooWebSearchResult, 
> YahooImageSearchResult, etc. I probably should have represented 
> results as dictionaries and allowed users to access the results with 
> the #at: message, but I didn't. So sue me.
>
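> 	For what it's worth, a dictionary-style result might look roughly 
> like the sketch below. It is not what msh-yahoo.st does: the #title 
> and #url keys and their values are made up for illustration, and only 
> the stock Dictionary class is used.
>
> | result |
> "Hypothetical result: one entry per field instead of one accessor per field."
> result _ Dictionary new.
> result
> 	at: #title put: 'Star Wars';
> 	at: #url put: 'http://www.example.com/'.
> Transcript
> 	nextPutAll: ( result at: #title );
> 	cr;
> 	flush.
>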
> 	YahooSearch is a subclass of VWXMLSaxDriver because I wanted to limit 
> the number of classes hanging around that have to have knowledge of 
> the Yahoo! web service interface. I personally find this mildly ugly, 
> but hey... I didn't want to spend forever working on this project.
>
> 	YahooSearch and its subclasses also exhibit behaviors of 
> collections. For instance, after setting up a query, you can use the 
> #at: and #size messages to find out how many responses there are to 
> your query, and to retrieve information about the Nth response. I even 
> added a #do: message to iterate over all the results. I beg you to use 
> this feature with caution as you can frequently find search terms that 
> return millions of hits. I'm not responsible if you're silly enough to 
> want to iterate over EVERY response for the query: 'star wars'.
>
> 	So here's a code sample that retrieves the first 100 results for 
> "star wars" and prints their titles to the Transcript.
>
> | search |
> search _  ( YahooWebSearch new )
> 			query: 'star wars';
> 			type: 'phrase';
> 			results: 50.
> 1 to: 100 do: [ :index |
> 	Transcript
> 		nextPutAll: ( index asString );
> 		nextPutAll: ' - ';
> 		nextPutAll: ( ( search at: index ) title );
> 		cr;
> 		flush.
> ].
>
> 	And examining this code... the search object is instantiated with the 
> #new message sent to YahooWebSearch. We tell that object that its 
> query is 'star wars' with the #query: message. The #type: message tells 
> the Yahoo WS interface that 'star wars' is a phrase, not two different 
> words we want to search on. The #results: message tells the query 
> object to ask for 50 results at a time. There are a number of other 
> options you can set, including #format:, which allows you to select 
> the file type you're interested in. So, if you were interested in 
> knowing how many PDF files on the web mention 'star wars', you could 
> use the following code snippet:
>
> ( YahooWebSearch new )
> 	query: 'star wars';
> 	type: 'phrase';
> 	format: 'pdf';
> 	size
>
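> 	Coming back to the warning about iterating over every hit: a capped 
> loop is one way to keep things bounded. This is only a sketch. It 
> sticks to the #query:, #type:, #size, #at: and #title messages used in 
> the samples above, plus the standard #min: and #yourself, and the cap 
> of 20 is an arbitrary number picked for illustration.
>
> | search limit |
> search _ ( YahooWebSearch new )
> 			query: 'star wars';
> 			type: 'phrase';
> 			yourself.
> "Never look at more than 20 results, however many hits are reported."
> limit _ ( search size ) min: 20.
> 1 to: limit do: [ :index |
> 	Transcript
> 		nextPutAll: ( ( search at: index ) title );
> 		cr;
> 		flush.
> ].
>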
> 	I leave it to you to explore the other search types. The Yahoo! 
> developer web page is a good read if you're trying to figure out 
> what's what. I also ask you to kindly report bugs via email. Or don't 
> if you don't want to.
>
> 	One other thing... Yahoo! limits the interface to 50 results per 
> query. Now it's certainly possible that a particular search will 
> result in more than 50 hits, so you simply make multiple queries with 
> parameters on subsequent queries to tell the interface where in the 
> sequence of results you want to start. Each query returns a "total 
> hits" value indicating how many hits there are to a particular query. 
> But the Yahoo! search API can get confused sometimes, and it will 
> sometimes change the number of "total hits" on subsequent calls to the 
> API. This leads to a situation where the first call to the API will 
> tell you there's 150 hits, but subsequent calls tell you there might 
> be 120 or so. This is especially upsetting when you do something like:
>
> | search |
> search _  ( YahooWebSearch new )
> 			query: 'matthew s. hamrick';
> 			type: 'phrase'.
> 1 to: ( search size ) do: [ :index |
> 	Transcript
> 		nextPutAll: ( index asString );
> 		nextPutAll: ' - ';
> 		nextPutAll: ( ( search at: index ) title );
> 		cr;
> 		flush.
> ].
>
> When I do this search, the first query the interface makes tells me 
> there are 82 results. Subsequent calls tell me there are either 68 or 
> 69 results. But the size of the query's result space is calculated 
> from the first value (82). We return nil if we can't find a 
> particular result, so the code above should fail with an error about 
> sending #title to nil (an UndefinedObject). To get around this, try 
> something more like this:
>
> | count |
> count _ 1.
> ( ( YahooWebSearch new )
> 	query: 'matthew s. hamrick';
> 	type: 'phrase' ) do: [ :value |
> 		( value isNil ) ifFalse: [
> 			Transcript
> 				nextPutAll: ( count asString );
> 				nextPutAll: ' - ';
> 				nextPutAll: ( value title );
> 				cr;
> 				flush.
> 			count _ count + 1.
> 		].
> 	].
>
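> 	If you would rather print the real result index than a running 
> count, the index-based loop can probably be kept with a nil check 
> added. Again just a sketch: it assumes #at: answers nil for a missing 
> result, as described above, and uses #yourself only to make sure the 
> cascade answers the search object.
>
> | search result |
> search _ ( YahooWebSearch new )
> 			query: 'matthew s. hamrick';
> 			type: 'phrase';
> 			yourself.
> 1 to: ( search size ) do: [ :index |
> 	result _ search at: index.
> 	"Skip the slots that the later, smaller totals left empty."
> 	( result isNil ) ifFalse: [
> 		Transcript
> 			nextPutAll: ( index asString );
> 			nextPutAll: ' - ';
> 			nextPutAll: ( result title );
> 			cr;
> 			flush.
> 	].
> ].
>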
> Anyway... hope this is of interest to someone else out there. Over the 
> weekend, I'm going to try to find the time to hack together a Yahoo! 
> web search browser, so I'll probably try to add a few SUnit tests 
> before going down that path.
>
> -Cheers,
> -Matt H.
>
>




More information about the Squeak-dev mailing list