Vocabulogic: Google Ngrams for All: Exploring Word Use in Current and Historic Publications

Sunday, December 26, 2010

Christmas Greetings! I hope you are making merry, because according to the Ngrams graphed below, the English culture as a whole may be losing its grasp on joy, at least as evidenced by how often, relative to other words, the words happy, glad, and merry have appeared in print in the last 200 years.

Needless to say, I am making a gross assumption, doing exactly what we must not do when interpreting these data, but perhaps I have caught your attention?

I created this graph with a wondrous new analytic tool: Google Books Ngram Viewer. Google has amassed the world’s largest digitized collection of books, and the Ngram dataset draws on almost 5.2 million of those scanned books (still only about 4% of all books ever published). Google Labs, working with Erez Lieberman Aiden and Jean-Baptiste Michel, doctoral students from Harvard University, made the dataset freely accessible to the public (only the dataset, not the books). Google continues to scan books to add to the corpus.

In a recent issue of Science (2010), Jean-Baptiste Michel et al. describe their work as "culturomics." They state, "Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of 'culturomics,' focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000." But in a recent Language Log post, Berkeley linguist Geoffrey Nunberg discusses both the constraints and the strengths of what he calls "the largest corpus ever assembled for social science and humanities research." (See also his article in the Chronicle of Higher Education.) The Ngram Viewer is a powerful tool, but without access to supporting information about the underlying books, results should be interpreted with caution.

This tool certainly sparks curiosity. For instance, I was curious about the evolving forms of the compound word songbird, so I created the graph below, plugging in the comma-separated terms songbird and song bird (my first attempt also included the hyphenated form song-bird, but hyphens are problematic). I set the smoothing value to 10, so each plotted point is a moving average over 21 years (the year itself plus 10 years on either side), and the date range to 1800 through 2000. The graph shows that the closed form, songbird, and the open form, song bird, competed for favor, with the closed form eventually taking a sweeping lead.
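For readers who like to see the arithmetic, here is a minimal sketch of how a smoothing value turns into a moving average. It is my own illustration in Python, not Google's code: the function names are hypothetical, the yearly values are made up, and I am assuming that years near either end of the range are simply averaged over whatever neighbors exist.

```python
def window_size(smoothing):
    """Number of years averaged per point: the year itself plus
    `smoothing` years on each side."""
    return 2 * smoothing + 1


def smooth(values, smoothing):
    """Centered moving average over a list of yearly values.

    Assumption: near the ends of the series, where fewer neighbors
    exist, the average is taken over the years that are available.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)
        end = min(len(values), i + smoothing + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up yearly frequencies, purely for illustration.
raw = [1, 2, 3, 4, 5]
print(window_size(10))   # 21
print(smooth(raw, 1))    # [1.5, 2.0, 3.0, 4.0, 4.5]
```

A larger smoothing value flattens year-to-year spikes, which makes long trends easier to see but can hide short-lived events.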

Teachers might use this tool occasionally. Consider the term imperialism. It often occurs with colonialism. These concepts are distinct yet overlapping. The graph below depicts a relationship between colony and empire (smoothing is set at 7). Students could annotate graphs with pertinent events and related words, possibly using a Smart Board or Promethean board. I created the colonialism Wordle (full size here) with text from the Stanford Encyclopedia of Philosophy. (See prior post describing Wordle.)

By clicking the hyperlinks listed below the graph, one can see snippets from the actual scanned books, sorted by date. Below is a sample snippet. Notice how the word says is spelled fays. In older printed documents, an s was often set as a long s, which looks like an f without the full horizontal stroke. Google's character-recognition program read the long s as an f rather than translating it to an s.

The history of the colony of Massachuset's Bay: from the first ...

He gave the name also to Martha's Vineyard. t This I suppose is what Josselyn, and no other author, calls the £rst colony of Newr-Plimouth, for he fays it...

Back to the graph: the percentage on the y-axis is the relative frequency of the word, that is, how often it appears relative to the total number of words published that year. I use the term words loosely. Words are case sensitive in this corpus, so meal and Meal are not the same (but then again, how often would a sentence begin with the singular, Meal?).
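To make the y-axis concrete, here is a small Python sketch of the calculation as I understand it: a word's count for a year divided by the total word count for that year, expressed as a percentage. The function names, the sample sentence, and the counts are all hypothetical, and the real corpus is tokenized more carefully than a simple split on spaces.

```python
from collections import Counter


def relative_frequency_percent(word_count, total_words):
    """Share of all words in a given year accounted for by one word,
    as a percentage."""
    return 100.0 * word_count / total_words


def count_words(text):
    """Case-sensitive word counts, so 'meal' and 'Meal' stay separate.
    Assumption: splitting on whitespace is a good enough stand-in for
    the corpus's real tokenization."""
    return Counter(text.split())


# Hypothetical sentence, purely for illustration.
counts = count_words("Meal time is near and the meal is hot")
print(counts["meal"], counts["Meal"])   # 1 1

# Hypothetical counts: 50 occurrences out of 1,000,000 words in one year.
print(relative_frequency_percent(50, 1_000_000))
```

Because the counts are case sensitive, a capitalized form and a lowercase form each get their own line on a graph unless you enter and compare both.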

There is more to learn: Google Labs describes how to interpret the output and the smoothing function. View nine descriptive sample graphs. Scroll through a collection of public-made Ngrams (you could add your own to the site). Read the coverage in the New York Times and Scientific American. Read the post and comments at Languagehat (dot com). Explore culturomics.org.

Create your own graph. Type in a single word, or several words separated by commas. It might be interesting to explore the shifting forms or spellings of your own name, or any other famous personage. One can also compare phrases, for example middle of the night with midnight, or House of Lords with House of Commons. To share your discovery, right-click on your graph and click "copy image location." Then paste the copied link into an email or website, or save the URL so you can find your graph again on the Web. Enjoy!

Warm wishes for a peaceful and productive new year.
Susan