Monday, May 01, 2006

scraping Amazon's statistically improbable phrases

My favorite sources for finding books include the bibliographies I find in books and journal articles; the Library of Congress catalog; JSTOR, DNB, OED, and the other proprietary databases which my library website makes available; Google Book Search; Google Scholar; and just plain Google. Each requires its own little skill set to avoid getting either an overabundance of results or nothing at all.

I've been intrigued by the bibliographic details that Amazon gives, but haven't found a way to use that information to find books. William J. Turkel, who writes a blog called Digital History Hacks, seems to have accomplished this trick. Take a look at his post called SIP Mapping.

He says: "Amazon keeps track of phrases that are distinctive to a small set of books. These SIPs (statistically improbable phrases) can be used to get some idea of the conceptual landscape in and around particular works, and thus can be used to generate bibliographies." And, having recognized this possibility, he wrote a Perl script to pull together statistically improbable phrases in books. I haven't had a chance to play with it. Seems to me it could be a valuable tool.
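
To make the idea concrete, here is a minimal Python sketch of how shared SIPs could link one book to others. It is not Turkel's script and it does not fetch anything from Amazon. The book titles and phrases are made-up placeholders, and `related_books` is a name I invented for illustration. It only assumes you already have a list of phrases for each book.

```python
from collections import defaultdict


def related_books(sips_by_book, seed):
    """Rank other books by how many improbable phrases they share with `seed`.

    `sips_by_book` maps a book title to its list of phrases (hypothetical input;
    how the phrases are collected is outside the scope of this sketch).
    """
    # Invert the mapping: phrase -> set of books that contain it.
    books_by_phrase = defaultdict(set)
    for book, phrases in sips_by_book.items():
        for phrase in phrases:
            books_by_phrase[phrase.lower()].add(book)

    # Collect, for every other book, the phrases it has in common with the seed.
    shared = defaultdict(set)
    for phrase in sips_by_book[seed]:
        key = phrase.lower()
        for other in books_by_phrase[key]:
            if other != seed:
                shared[other].add(key)

    # Most shared phrases first; ties broken alphabetically by title.
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))


# Placeholder data, not real Amazon SIPs.
sample = {
    "Book A": ["alpha phrase", "beta phrase", "gamma phrase"],
    "Book B": ["beta phrase", "gamma phrase"],
    "Book C": ["gamma phrase", "delta phrase"],
    "Book D": ["epsilon phrase"],
}

ranking = related_books(sample, "Book A")
for title, phrases in ranking:
    print(title, sorted(phrases))
```

Starting from one book you already know, the books at the top of the ranking are candidates for a reading list, and repeating the step from each of them would grow the bibliography outward.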


GobberGo said...

Wow. Neat. What's next? The Improbability Drive??

Jeff said...

More likely the Library of Babel.
