Topic Modeling: What Humanists Actually Do With It

This post originally appeared on the Digital Humanities at Berkeley blog. It is the second in what became an informal series. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

One of the hardest questions we can pose to a computer is asking what a human-language text is about. Given an article, what are its keywords or subjects? What are some other texts on the same subjects? For us as human readers, these kinds of tasks may seem inseparable from the very act of reading: we direct our attention over a sequence of words in order to connect them to one another syntactically and interpret their semantic meanings. Reading a text, for us, is a process of unfolding its subject matter.

Computer reading, by contrast, seems hopelessly naive. The computer is well able to recognize unique strings of characters like words and can perform tasks like locating or counting these strings throughout a document. For instance, by pressing Control-F in my word processor, I can tell it to search for the string of letters “reading,” which reveals that so far I have used the word three times and highlights each instance. But that’s about it. The computer doesn’t know that the word is part of the English language, much less that I am referring to its practice as a central method in the humanities.
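
The kind of string counting described above can be sketched in a few lines of Python. This is only an illustration: the function name count_word and the sample sentence are invented here, and a real word processor's search may treat case and word boundaries differently.

```python
import re

def count_word(text: str, word: str) -> int:
    """Count case-insensitive, whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sample text, written for this example.
sample = (
    "Reading a text is a process of unfolding its subject matter. "
    "Computer reading, by contrast, seems naive; the act of reading is hard."
)

print(count_word(sample, "reading"))
```

The function finds every instance of the string, but nothing in it records what the word means or how it relates to the words around it.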

To their credit, however, computers make excellent statisticians, and this can be leveraged toward the kind of textual synthesis that initiates higher-order inquiry. If a computer were shown many academic articles, it might discover that articles containing the word “reading” frequently include others like “interpretation,” “criticism,” and “discourse.” Without foreknowledge of these words’ meanings, it could statistically learn that there is a useful relationship between them. In turn, the computer would be able to identify articles in which this cluster of words seems to be prominent, corresponding to humanist methods.
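
A rough sketch of that statistical intuition, assuming nothing more sophisticated than counting which words share a document with a target word. The five "articles" below are invented for illustration, and the helper cooccurring_words is a hypothetical name, not part of any library.

```python
from collections import Counter

# Hypothetical toy "articles", invented for illustration only.
articles = [
    "close reading grounds interpretation and criticism",
    "reading discourse as a mode of criticism",
    "protein folding and enzyme kinetics",
    "interpretation shapes discourse through reading",
    "enzyme assays measure protein activity",
]

def cooccurring_words(docs, target):
    """For each other word, count how many documents it shares with `target`."""
    counts = Counter()
    for doc in docs:
        words = set(doc.split())
        if target in words:
            counts.update(words - {target})
    return counts

counts = cooccurring_words(articles, "reading")
for word in ("interpretation", "criticism", "discourse", "protein"):
    print(word, counts[word])
```

Even in this tiny example, the words that travel with “reading” separate cleanly from the vocabulary of the two biochemistry sentences, without the program knowing what any of the words mean.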

This process is popularly referred to as topic modeling, since it attempts to capture a list of many topics (that is, statistical word clusters) that would describe a given set of texts. The most commonly used implementation of a topic modeling algorithm is MALLET, which is written and maintained by Andrew McCallum. It is also distributed in the form of an easy-to-use R package, “mallet,” by David Mimno.

Since there are already several excellent introductions to topic modeling for humanists, I won’t go further into the mathematical details here. For those looking for an intuitive introduction to topic modeling, I would point out Matt Jockers’ fable of the “LDA Buffet.” LDA is the most popular algorithm for topic modeling. For those who are curious about the math behind it but aren’t interested in deriving any equations, I highly recommend Ted Underwood’s “Topic Modeling Made Just Simple Enough” and David Blei’s “Probabilistic Topic Models.”
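
For readers who would rather see the idea in code than in equations, here is a toy sketch of one common way to fit LDA, a collapsed Gibbs sampler, written in plain Python. It is not MALLET's implementation and is far too slow for a real corpus. The function name fit_lda, the parameter values, and the four tiny "documents" are all assumptions made for illustration; the words are borrowed from the kind of vocabulary a newspaper topic model might surface.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document uses it
                # and how strongly the topic favors this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Most frequent words per topic, and each document's topic proportions.
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -topic_word[k][v])[:4]]
        for k in range(num_topics)
    ]
    proportions = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return top_words, proportions

# Hypothetical miniature corpus, invented for this example.
docs = [
    "silk cotton linen cloth silk thread".split(),
    "cotton cloth linen worsted thread silk".split(),
    "government law constitution congress power".split(),
    "law congress government citizen power constitution".split(),
]

top_words, proportions = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {' '.join(words)}")
for d, row in enumerate(proportions):
    print(f"doc {d}: {[round(p, 2) for p in row]}")
```

Notice how many choices sit outside the algorithm itself: the number of topics, the smoothing values, the random seed, and above all which texts go in. Changing any of them can change the topics that come out.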

Despite its algorithmic nature, it would be a gross mischaracterization to claim that topic modeling is somehow objective or absent interpretation. I will simply emphasize that human evaluative decisions and textual assumptions are encoded in each step of the process, including text selection and topic scope. In light of this, I will focus on how topic modeling has been used critically to work on humanistic research questions.

Topic modeling’s use in humanistic research might be thought of in terms of three broad approaches: as a tool to guide our close readings, as a technique for capturing the social conditions of texts, and as a literary method that defamiliarizes texts and language.

Topic Modeling as Exploratory Archival Tool

Early examples of topic modeling in the humanities emphasize its ability to help scholars navigate large archives, in order to find useful texts for close reading.

Describing her work on the Pennsylvania Gazette, an American colonial newspaper spanning nearly a century, Sharon Block frames topic modeling as a “promising way to move beyond keyword searching.” Instead of relying on individual words to identify articles relevant to our research questions, we can watch how the “entire contents of an eighteenth-century newspaper change over time.”

To make this concrete, Block reports some of the most common topics that appeared across Gazette articles, including the words that were found to cluster and a label reflecting her own after-the-fact interpretation of those words and articles in which they appear.

% of Gazette | Most likely words in a topic, in order of likelihood | Human-added topic label
5.6 | away reward servant named feet jacket high paid hair coat run inches master… | Runaways
5.1 | state government constitution law united power citizen people public congress… | Government
4.6 | good house acre sold land meadow mile premise plantation stone mill dwelling… | Real Estate
3.9 | silk cotton ditto white black linen cloth women blue worsted men fine thread… | Cloth

Prevalent Topics in Pennsylvania Gazette; source: Sharon Block in Common-Place

If we were searching through an archive for articles on colonial textiles by keyword alone, we might think to look for articles including words like “silk,” “cotton,” and “cloth,” but a word like “fine” would be trickier to use since it has multiple, common meanings, not to mention the multivalence of gendered words like “women” and “men.”

Beyond simply guiding us to articles of interest, Block suggests that we can use topic modeling to inform our close readings by tracking topic prevalence over time and especially the relationships among topics. For example, she notes that articles relating to Cloth peak in the 1750s at the very moment the Religion topic is at its lowest, and wonders aloud whether we can see “colonists (or at least Gazette editors) choosing consumption over spirituality during those years.” This observation compels further close readings of articles from that decade in order to understand better why and how consumption and spirituality competed on the eve of the American Revolution.
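
To make the idea of tracking topics over time concrete, here is a minimal sketch that averages a topic's share of each article by decade and then measures how two topics move against each other. The years and proportions are invented for illustration and are not Block's data; prevalence_by_decade and pearson are hypothetical helper names.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-article output of a topic model: (year, {label: proportion}).
# These numbers are made up for the example; they are not from the Gazette.
articles = [
    (1731, {"Cloth": 0.02, "Religion": 0.10}),
    (1738, {"Cloth": 0.04, "Religion": 0.08}),
    (1744, {"Cloth": 0.05, "Religion": 0.07}),
    (1752, {"Cloth": 0.09, "Religion": 0.02}),
    (1757, {"Cloth": 0.11, "Religion": 0.02}),
    (1763, {"Cloth": 0.06, "Religion": 0.05}),
]

def prevalence_by_decade(rows, topic):
    """Average a topic's proportion over the articles in each decade."""
    buckets = defaultdict(list)
    for year, topics in rows:
        buckets[year // 10 * 10].append(topics[topic])
    return {decade: mean(vals) for decade, vals in sorted(buckets.items())}

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

cloth = prevalence_by_decade(articles, "Cloth")
religion = prevalence_by_decade(articles, "Religion")
r = pearson(list(cloth.values()), list(religion.values()))

print("Cloth peaks in the", max(cloth, key=cloth.get))
print("Religion is lowest in the", min(religion, key=religion.get))
print("correlation is negative:", r < 0)
```

A negative correlation like this one is only a prompt. It says where to look, and the articles from that decade still have to be read to learn what the pattern means.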

A similar project that makes the same call for topic modeling in conjunction with close reading is Cameron Blevins’ work on the diary of Martha Ballard.

Topic Modeling as Qualitative Social Evidence

Following Block’s suggestion, several humanists since have tracked topics over time in different corpora in order to interpret underlying social conditions.

Robert K. Nelson’s project Mining the Dispatch topic models articles from the Richmond Daily Dispatch, the paper of record of the Confederacy, over the course of the American Civil War. In a series of short pieces on the project website and that of the New York Times, Nelson does precisely the kind of guided close reading that Block indicates.

Topic Prevalence over time in Richmond Daily Dispatch; source: Robert K. Nelson in New York Times, Opinionator

Following two topics that seem to rise and fall in tandem, Anti-Northern Diatribes and Poetry and Patriotism, Nelson identifies them as two sides of the same coin in the war effort. Taken together, they reveal how the Confederacy understood itself in relation to the war, and their simultaneous spikes and drops offer what he refers to as “a cardiogram of the Confederate nation.”

Andrew Goldstone and Ted Underwood similarly use readings of individual articles to ground and illustrate the trends they discover in their topic model of 30,000 articles in literary studies spanning the twentieth century. Their initial goal is to test the conventional wisdom of literary studies – for example, the mid-century rise of New Criticism that is supplanted by theory during the 1970s-80s – which their study confirms in broad strokes.

However, they also find that there are other kinds of changes that occur at a longer scale regarding an “underlying shift in the justification for literary study.” Whereas the early part of the century had tended to emphasize “literature’s aesthetically uplifting character,” contemporary scholars have refocused attention on “topics that are ethically provocative,” such as violence and power. Questions of how and why to study literature appear deeply intertwined with broader changes in the academy and society.

Matt Jockers has also used topic modeling to study the social conditions of novelistic production; however, he has placed greater emphasis on the relationship between authorial identity (especially gender and nationality) and subject matter. For example, in an article written with David Mimno, he looks not only at whether topics are used more frequently by women than men, but also at how the same topic may be used differently based on authorial gender. (See also Macroanalysis, Ch. 8, “Theme.”)

Topic Modeling as Literary Theoretical Springboard

The above-mentioned projects are primarily historical in nature. Recently, literary scholars have used topic modeling to ask more aesthetically oriented questions regarding poetics and theory of the novel.

Studying poetry, Lisa Rhody uses topic modeling as an entry point on figurative language. Looking at the topics generated from a set of more than 4000 poems, Rhody notes that many are semantically opaque. It would be difficult to assign labels to them in the way that Block had for the Pennsylvania Gazette topics; however, she does not treat this as a failure on the computer’s part.

In Rhody’s words, “Determining a pithy label for a topic with the keywords death, life, heart, dead, long, world, blood, earth… is virtually impossible until you return to the data, read the poems most closely associated with the topic, and infer the commonalities among them.”

So she does just that. As might be expected from the keywords she names, many of the poems in which the topic is most prominent are elegies. However, she admits that a “pithy label” like “Death, Loss, and Inner Turmoil” fails to account for the range of attitudes and problems these poems consider, since this kind of figurative language necessarily broadens a poem’s scope. Rhody closes by noting that several of these prominently elegiac poems are by African-American poets meditating on race and identity. Figurative language serves not only as an abstraction but as a dialogue among poets and traditions.

Most recently, Rachel Sagner Buurma has framed topic modeling as a tool that can productively defamiliarize a text and uses this to explore novelistic genre. Taking Anthony Trollope’s six Barsetshire novels as her object of study, Buurma suggests that we should read the series not as a formal totality (as we might do for a novel with a single, omniscient narrator) but in terms of its partial and uneven nature. The prominence of particular topics across disparate chapters offers alternate traversals through the books and across the series.

As Buurma finds, the topic model reveals the “layered histories of the novel’s many attempts to capture social relations and social worlds through testing out different genres.” In particular, the periodic trickle of one topic (letter, write, read, written, letters, note, wrote, writing…) captures not only the subject matter of correspondence; reading those chapters also reveals “the ghost of the epistolary novel” haunting Trollope long after its demise. Genres and genealogies that only show themselves partially may be recovered through this kind of method.

Closing Thought

What exactly topic modeling captures about a set of texts is an open debate. Among humanists, words like “theme” and “discourse” have been used to describe the statistically derived topics. Buurma frames them as fictions we construct to explain the production of texts. For their part, computer scientists don’t really claim to know what they are either. But as it turns out, this kind of interpretive fuzziness is entirely useful.

Humanists are using topic modeling to reimagine relationships among texts and keywords. This allows us to chart new paths through familiar terrain by drawing ideas together in unexpected or challenging ways. Yet the findings produced by topic modeling consistently call us back to close reading. The hardest work, as always, is making sense of what we’ve found.

 

References

Blei, David. “Probabilistic Topic Models.” Communications of the ACM 55.4 (2012): 77-84.

Blevins, Cameron. “Topic Modeling Martha Ballard’s Diary.” http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/ (2010).

Block, Sharon. “Doing More with Digitization.” Common-place 6.2 (2006).

Buurma, Rachel Sagner. “The fictionality of topic modeling: Machine reading Anthony Trollope’s Barsetshire series.” Big Data & Society 2.2 (2015): 1-6.

Goldstone, Andrew and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45.3 (2014): 359-384.

Jockers, Matthew. “The LDA Buffet is Now Open; or Latent Dirichlet Allocation for English Majors.” http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/ (2011).

Jockers, Matthew. “Theme.” Macroanalysis: Digital Methods & Literary History. Urbana: University of Illinois Press, 2013. 118-153.

Jockers, Matthew and David Mimno. “Significant Themes in 19th-Century Literature.” Poetics 41.6 (2013): 750-769.

Nelson, Robert K. Mining the Dispatch. http://dsl.richmond.edu/dispatch/

Nelson, Robert K. “Of Monsters, Men – And Topic Modeling.” New York Times, Opinionator (blog). http://opinionator.blogs.nytimes.com/2011/05/29/of-monsters-men-and-topic-modeling/ (2011).

Rhody, Lisa. “Topic Modeling and Figurative Language.” Journal of Digital Humanities 2.1 (2012).

Underwood, Ted. “Topic Modeling Made Just Simple Enough.” http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/ (2012).
