Sunday, December 26, 2010

Google Ngrams for All: Exploring Word Use in Current and Historic Publications

Christmas Greetings! I hope you are making merry, because according to the Ngrams graphed below, the English culture as a whole may be losing its grasp on joy, at least as evidenced by how often, relative to other words, the words happy, glad, and merry have appeared in print in the last 200 years.

Needless to say, I am making a gross assumption, doing exactly what we must not do when interpreting these data, but perhaps I have caught your attention? 

I created this graph with a wondrous new analytic tool: Google Books Ngram Viewer. Google has amassed the world’s largest digitized collection of books, and the Ngram dataset draws on almost 5.2 million of them (still only about 4% of all books ever published). Google Labs, working with Erez Lieberman Aiden and Jean-Baptiste Michel, doctoral students from Harvard University, made the dataset freely accessible to the public (only the dataset, not the books). Google continues to scan books to add to the corpus.

In a recent issue of Science (2010), Jean-Baptiste Michel et al. describe their work as "culturomics." Michel et al. state, "Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of 'culturomics,' focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000." But in a recent Language Log post, Berkeley linguist Geoffrey Nunberg discusses both the constraints and the strengths of what he calls "the largest corpus ever assembled for social science and humanities research." (See also his article in the Chronicle of Higher Education.) The Ngram Viewer is a powerful tool, but because it gives no access to the surrounding context or supporting information, results should be interpreted cautiously.

This tool certainly sparks curiosity. For instance, I was curious about the evolving forms of the compound word songbird, so I created the graph below, plugging in the comma-separated terms songbird and song bird (my first attempt also included the hyphenated form song-bird, but hyphens are problematic). I set the smoothing value to 10, so each plotted point is a moving average over 21 years (the year itself plus 10 years on either side), and the date range to 1800-2000. The graph shows that the closed form, songbird, and the open form, song bird, competed for favor, with the closed form eventually taking a sweeping lead.
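For readers who like to see the arithmetic, here is a minimal sketch of that kind of smoothing: a centered moving average in which a smoothing value of n averages each year with up to n years on either side. The function name, the sample numbers, and the way the edges are handled (simply averaging whatever years are available) are my own assumptions for illustration, not Google's code.

```python
def smooth(values, smoothing):
    """Centered moving average.

    Each point becomes the mean of itself and up to `smoothing`
    neighbors on each side. Near the edges, fewer neighbors exist,
    so the window shrinks (an assumption made for this sketch).
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)
        end = min(len(values), i + smoothing + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical yearly values, invented purely for illustration.
raw = [0.0, 0.0, 3.0, 0.0, 0.0]
print(smooth(raw, 1))
```

With a smoothing value of 10, a point in the middle of the range is averaged over 21 years, which is why a single-year spike flattens into a gentle bump on the graph.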

Teachers might use this tool occasionally. Consider the term imperialism. It often occurs with colonialism. These concepts are distinct yet overlapping. The graph below depicts a relationship between colony and empire (smoothing is set at 7). Students could annotate graphs with pertinent events and related words, possibly using a Smart Board or Promethean board. I created the colonialism Wordle (full size here) with text from the Stanford Encyclopedia of Philosophy. (See prior post describing Wordle.)

By clicking on the hyperlinks listed below the graph, one can see snippets from actual scanned books, sorted by date. Below is a sample snippet. Notice how the word says is spelled fays. In older printed books, an s (called the long s) often looked like an f without the full horizontal stroke. Google's software did not translate the long s to s, reading it as an f instead.
He gave the name also to Martha's Vineyard. t This I suppose is what Josselyn, and no other author, calls the £rst colony of Newr-Plimouth, for he fays it...

Back to the graph: the percentage on the y-axis is the relative frequency of the word, that is, its count relative to the total number of words in the corpus for that year. I use the term words loosely. Searches are case sensitive in this corpus, so meal and Meal are not the same (but then again, how often would a sentence begin with the singular, Meal?).
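To make the idea of relative frequency concrete, here is a small sketch that computes a word's share of all the words in a text, matching case exactly. The function name and the sample sentence are hypothetical, and splitting on whitespace is a deliberate simplification; it is not how Google tokenizes its corpus.

```python
from collections import Counter


def relative_frequency(text, word):
    """Return the percentage of tokens in `text` that exactly match `word`.

    Matching is case-sensitive, so "meal" and "Meal" are counted separately.
    Tokens are split on whitespace only (a simplification for this sketch).
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return 100.0 * counts[word] / len(tokens)


# Hypothetical sample text, invented for illustration.
sample = "Meal times vary but a meal at noon is still a meal"
print(relative_frequency(sample, "meal"))
print(relative_frequency(sample, "Meal"))
```

Because the two spellings are tallied separately, a capitalized form at the start of a sentence does not add to the count for the lowercase form.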

There is more to learn:
Google Labs describes how to interpret the output and the smoothing function.  View nine descriptive sample graphs.  Scroll through a collection of public-made Ngrams (you could add your own to the site). Read the New York Times. See Scientific American. Read the post and comments at Language hat (dot com). Explore 

Create your own graph. Type in a single word or several words, separated by commas. It might be interesting to explore the shifting forms or spellings of your own name, or the name of a famous personage. One can also search for phrases, such as middle of the night, midnight or House of Lords, House of Commons. To share your discovery, right-click on your graph and click "copy image location." Then paste the copied link into an email or website, or save the URL so you can find your graph again on the Web. Enjoy!

Warm wishes for a peaceful and productive new year.


  1. I just got an email from a colleague, pointing us to WolframAlpha Computational Knowledge Engine. This site has many possible applications to vocab instruction. For example, it provides the word frequency ranking (useful for prioritizing which words to teach).

  2. Wow, Ngrams is fascinating . . . I'll try it out, thanks for this tip!!

  3. Hi again, so I tried it . . . what I did is put in two different terms:
    climate change
    global warming

    Both terms made their brief debut in 1900 when the World Bank mentioned it in a proceeding re: greenhouse gasses, then nothing until the 1980s, even though there were studies done in the late 1930s on the issue of a global rise in temperature. I have these studies, which do not mention these terms at all. A great tool for discussing the cultural trends of terms like these, and for pondering or speculating about why terms may be used, supplanted or possibly censored.

  4. Interesting search, Diana! I am surprised to learn that greenhouse gasses was even around in 1900. To appear on radar, a search must yield a minimum (40 occurrences a year, if I recall correctly). Maybe that explains the "empty" years. Also, it would be worthwhile to check back after Google adds newspaper articles to the corpus. (Pretty sure I read that they plan to do so, eventually...the project is huge.)


  5. I did a Google search today for "Marsha Henry" because I am serving on the Literacy Review committee for my school district, Howard Winneshiek. I have read Unlocking Literacy several times, and I also wanted to review the PowerPoint presentations you had uploaded for your classes at Stanford University. I am looking for your suggested presentation order of Anglo-Saxon derived morphemes (by grade level).

    I found, instead, the answer to another concern! I have been concerned with how commonly I am seeing compound words spelled as two separate words, so I did an analysis on livingroom, and living room. Even the dictionary on my Droid indicated it was spelled incorrectly as livingroom!

    Marsha, you are an amazing woman, and you have contributed greatly to the world of reading! I met you in Cedar Rapids when you presented at an IDA Conference, and if you still speak in schools, I'd love to visit about this!

    Brenda Knobloch

  6. Glad you found Ngrams useful, Brenda! PS. Marcia Henry is a guest author (see her post in archives) and a friend, but I wrote this post and maintain this blog.

