Google Books is a project sponsored by search engine giant Google to scan the pages of every book available, convert the scans to text using OCR, and make the resulting text corpus searchable. Notwithstanding any remaining copyright disputes surrounding the project, Google has reached an agreement with most of the major copyright holders (authors and the publishers that represent them) around the world. So far, Google has scanned over 15 million books, most of which are no longer in print or commercially available. This database is a treasure trove for history scholars.

Last week, Google released a new statistics tool for the Google Books project called the Ngram Viewer. It is simple to use. You enter a comma-separated list of words or phrases (each is called an n-gram, and matching is case-sensitive), the language (American English and British English can be searched separately), the date range (anywhere from 1500 to 2008, though books are sparse before 1780, which makes the early data very spiky), and the amount of moving-average smoothing to apply (long trends are easier to see with smoothing, but the individual yearly data is lost).
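To make the smoothing trade-off concrete, here is a minimal sketch of a moving average in Python. It assumes a centered window, where a smoothing value of n averages each year with up to n years on either side; that is my reading of how the viewer behaves, not something confirmed here. The function name and the sample numbers are made up for illustration.

```python
def smooth(values, smoothing):
    """Centered moving average.

    Each point is averaged with up to `smoothing` neighbors on each side.
    Near the edges the window simply shrinks (an assumption of this sketch).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up yearly frequencies with a single spike, for illustration only.
raw = [1.0, 1.0, 10.0, 1.0, 1.0]
print(smooth(raw, 0))  # no smoothing: the spike is fully visible
print(smooth(raw, 1))  # the spike is spread across its neighbors
```

With a smoothing of 0 the one-year spike survives untouched; with a smoothing of 1 it is flattened and smeared over three years, which is exactly the loss of individual yearly data mentioned above.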

The Ngram Viewer is the best time series graphing toy I’ve seen.

There have been lots of stories in the press showing interesting trends in the popularity of certain words and phrases in books. For instance, Jennifer Valentino-DeVries of the Wall Street Journal shows that Merry Christmas beats Happy Holidays by a big margin.


Merry Christmas vs Happy Holidays, Image from Google labs

Slate’s Tom Scocca has been posting an Ngram of the Day on the site comparing the frequency of words like shopping vs salvation and television vs the Bible. He even does a comparison between words and shows the year in which the two cross in popularity. For instance, anxiety passes shame in 1942.


anxiety vs. shame. Image from Google labs
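Finding the year in which one word overtakes another is easy to do by hand once you have two yearly series. The sketch below is only an illustration: the function name is hypothetical and the numbers are invented placeholders, not real Ngram data.

```python
def crossover_years(years, a, b):
    """Return each year in which series `a` rises above series `b`
    after being at or below it the year before."""
    crossings = []
    for i in range(1, len(years)):
        if a[i - 1] <= b[i - 1] and a[i] > b[i]:
            crossings.append(years[i])
    return crossings


# Invented placeholder values: one rising series, one falling series.
years = [2001, 2002, 2003, 2004]
rising = [2.0, 2.4, 3.1, 3.5]
falling = [3.0, 2.9, 2.8, 2.7]
print(crossover_years(years, rising, falling))
```

On spiky, unsmoothed data two series can cross back and forth several times, so it probably makes sense to smooth first and then look for the crossing.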

Here are a couple of charts I created for the n-grams “independence” and “rebellion” in U.S. and British English. I have no idea what conclusions to draw from this data, but it just begs for an explanation based on unfounded speculation. There is a spike in both words in U.S. books in the 1770s, but no spike in British literature. Independence becomes more popular than rebellion in books from both countries after about 1820. The word rebellion has a spike again in the U.S. from 1860 to 1870. The absolute occurrence of both words is similar in the two countries starting around 1900. Independence shows a spike from 1940 to 1942 and another spike around 1968 to 1970.



independence vs. rebellion, British (top) vs. American (bottom). Images from Google labs.

For the truly hardcore programmer, the n-gram datasets are available for download from Google. Their use is covered under the Creative Commons Attribution 3.0 Unported license.
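If you do download the raw files, a few lines of Python are enough to pull out the yearly counts for one n-gram. This is a rough sketch that assumes each line is tab-separated and starts with the n-gram, the year, and the match count, with any further columns ignored; check the format notes that come with the download before relying on it. The function name and the sample rows are made up.

```python
import csv
from collections import defaultdict


def yearly_counts(lines, target):
    """Sum match counts per year for one exact (case-sensitive) n-gram.

    Assumed layout per line: ngram <TAB> year <TAB> match_count [<TAB> ...]
    """
    totals = defaultdict(int)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ngram, year, match_count = row[0], row[1], row[2]
        if ngram == target:
            totals[int(year)] += int(match_count)
    return dict(totals)


# Invented sample rows, for illustration only.
sample = [
    "rebellion\t1861\t120\t90\t40\n",
    "Rebellion\t1861\t30\t25\t12\n",
    "rebellion\t1862\t150\t110\t45\n",
]
print(yearly_counts(sample, "rebellion"))
```

Because matching is case-sensitive, "Rebellion" and "rebellion" are counted separately here, just as they are in the viewer. For a real file you would pass an open file handle instead of the sample list.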

The next step is for Google to add its Ngram Viewer toolkit to its Public Data Explorer visualization tool (see Mar 2010 blog post) to allow animations and drill down. I can hardly wait.

[Update: I rescaled the first two graphs to normalize the time spans.]