Numerical Analysis of Literature


Can the technologies of Big Data, which are transforming so many areas of life, change our understanding of American novels? After conducting research with Google's Ngram database, which tabulates the frequency of words used in more than five million books, I believe the answer is yes.


Consider the question of which themes and books characterize a literary era. The time-honored approach to this problem has been for an august critic or group of distinguished scholars to select and analyze key novels. That methodology, however, has its flaws. No one person or team of readers can do more than dip their toes into the vast sea of literary works. By the 1840s Americans wrote more than 100 novels annually; by the 1880s, more than 1,000; by the early 21st century, more than 10,000. In addition, there is the threat of subjective bias. Not long ago, for example, critics focused their attention almost exclusively on white male authors.


The Ngram database offers an alternative approach. As I demonstrate in the latest issue of the journal Social Science History, by examining the changing frequencies of key words in books published in the United States, researchers can gain new perspectives on America and its novels.
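For readers who want to see what "examining the changing frequencies of key words" might look like in practice, here is a minimal sketch. It assumes you already have, for each year, a count of how often a term appears and the total number of words published that year. The function names and the toy numbers are hypothetical illustrations, not the Ngram database's own interface or real Ngram figures.

```python
from typing import Dict


def relative_frequency(counts: Dict[int, int], totals: Dict[int, int]) -> Dict[int, float]:
    """Return the term's share of all words for each year.

    Years with a missing or zero total are skipped to avoid dividing by zero.
    """
    return {
        year: counts[year] / totals[year]
        for year in counts
        if totals.get(year)
    }


def peak_year(freqs: Dict[int, float]) -> int:
    """Return the year in which the relative frequency is highest."""
    return max(freqs, key=freqs.get)


# Hypothetical toy data: occurrences of one term, and total words, per year.
counts = {1800: 12, 1810: 30, 1820: 25, 1830: 18}
totals = {1800: 1_000_000, 1810: 1_500_000, 1820: 2_500_000, 1830: 3_000_000}

freqs = relative_frequency(counts, totals)
print(peak_year(freqs))
```

Dividing by the yearly total matters because the number of books published grows enormously over time; a raw count can rise even while a word's share of the language is falling.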


There are important caveats in using this source. The "American English" subset of the Ngram database includes a broad selection of books published in the United States - not just fiction or writings by American authors. It excludes the dime novels favored by the lower class, and so has a middle-class bias. But as a guide to the works that middle-class Americans read, it is a fruitful source of hypotheses and a healthy check on subjective opinion. In a number of instances, Ngram data suggest challenges to common assumptions about American literature.


Take the role of women in mid-19th-century American novels. Scholars have long argued that domesticity shaped the world of middle-class women and that novels relegated them to the home and restricted their activities. Women had influence only when they persuaded men to act, as in Harriet Beecher Stowe's 1852 novel, "Uncle Tom's Cabin." Women were supposed to be submissive, pious, domestic and pure. But Ngram data indicate that the use of those words peaked, respectively, in 1807, 1814, 1835 and 1847. All fell off before midcentury. By contrast, striking gains were recorded during these years in the usage of woman's rights. Virtually unknown before the 1840s, the term soared in frequency after the Seneca Falls Convention in 1848 and did not peak until 1884. Perhaps we need to invert the conventional wisdom and declare as "representative" those midcentury novels criticizing domesticity and celebrating independent women - books like Fanny Fern's "Ruth Hall," published in 1854, and E.D.E.N. Southworth's "Hidden Hand," which first appeared in serial form in 1859.


Ngram data also provide a new perspective on the novels of the 1930s. These years are traditionally viewed as the heyday of the proletarian novel, a time of gloom and a period when business leaders were despised. John Steinbeck's 1939 novel, "The Grapes of Wrath," is considered a quintessential novel of the decade. But according to Ngram data, the use of businessman, a term virtually unknown before 1930, surged during the decade. Of course, you might guess that those citations were negative, but trends in other terms point to a more positive reading. References to optimism rose throughout the decade, while pessimism declined. Mentions of the American dream, a term rarely seen before 1930, also climbed steeply. So instead of Steinbeck's novel, works highlighting scrappy, successful entrepreneurs may best mark this decade. In Zora Neale Hurston's "Their Eyes Were Watching God," published in 1937, for example, the heroine's first two husbands were successful businessmen who overcame tough times and racial prejudice. Similarly, Margaret Mitchell's "Gone With the Wind" (1936) details Scarlett O'Hara's campaign to regain the affluence she once enjoyed.


Our view of postmodern fiction might also need adjusting. Chaos, conspiracy and nihilism are thought to reign in this literary world, as in the unsettling early works of Thomas Pynchon and Bret Easton Ellis. Word usage, however, indicates a more positive dynamic: the growing attention paid to children. Among the terms whose frequency escalates after 1960 are caring, nurturing, infant, toddler and childhood. It could be that the truly representative works of this era are novels like Toni Morrison's "Beloved," Philip Roth's "American Pastoral" and Cormac McCarthy's "The Road," all of which feature deep parent-child bonds.


Such hypotheses are merely suggestive, but as tools like the Ngram database continue to improve, the insights they make possible should encourage scholars to revisit longstanding assumptions with a critical eye.


Marc Egnal, a professor of history at York University in Toronto, is the author of "Clash of Extremes: The Economic Origins of the Civil War."


Crunching Literary Numbers

Published: July 12, 2013