Chicago Corpus Word Embeddings

A quick post to announce the public distribution of word embeddings trained from the Chicago Text Lab’s corpus of US novels. They will be hosted by this blog and can be downloaded from this link (download) or through the static Open Code & Data page.

According to the Chicago Text Lab’s description, the corpus contains

American fiction spanning the period 1880-2000. The corpus contains nearly 9,000 novels which were selected based on the number of library holdings as recorded in WorldCat. They represent a diverse array of authors and genres, including both highly canonical and mass-market works. There are about 7,000 authors represented in the corpus, with peak holdings around 1900 and the 1980s.

In total, the corpus consists of over 700M words and the embeddings’ vocabulary contains 250K unique terms.

The embeddings are learned by the word2vec algorithm distributed in the Python package gensim, version 4.0.1, which implements the skip-gram model described in Mikolov et al, 2013a and Mikolov et al, 2013b. Parameters include

  • Vector Size: 300 dimensions
  • Window Size: 5 words
  • Training Epochs: 3 iterations

All other parameters are default values in gensim (see documentation).

The embeddings are distributed as word vectors in a plain-text file, according to the original word2vec format: one vector per line; initial keyword; values separated by whitespace.
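As a rough illustration of the format described above, here is a minimal sketch of a reader for such a file. The function name `read_word2vec_text` and the three-dimensional sample are made up for the example. The sketch also assumes, without having checked the distributed file, that the file may open with the header line (vocabulary size and dimensionality) that the original word2vec tool writes.

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec-style plain text into a {word: [floats]} dictionary."""
    vectors = {}
    for line_number, line in enumerate(handle):
        parts = line.rstrip().split()
        if not parts:
            continue  # skip blank lines
        # Assumption: an optional first line holds two integers
        # (vocabulary size, vector dimensionality), as word2vec writes it.
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # First token is the keyword; the rest are the vector's values.
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors

# Toy stand-in for the distributed file (3 dimensions instead of 300).
sample = io.StringIO("2 3\nmodern 0.1 0.2 0.3\nmedieval -0.4 0.5 0.6\n")
embeddings = read_word2vec_text(sample)
print(embeddings["modern"])
```

With gensim installed, `KeyedVectors.load_word2vec_format(path, binary=False)` should read the same format and is probably the more convenient route for a file with 250K terms.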

Enjoy a visualization of words similar to the keyword “modern” in the model.

Most similar words: postmodern, modernist, minimalist, kitsch, nonobjective, bastardization, medieval, expressionists
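Lists like the one above are usually ranked by cosine similarity between word vectors. I am assuming that is the measure at work here, since it is what gensim's `most_similar` uses. The sketch below shows the idea on invented two-dimensional vectors; the function names and the numbers are illustrative and are not taken from the model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(keyword, vectors, topn=3):
    """Rank every other word by cosine similarity to the keyword."""
    target = vectors[keyword]
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w != keyword]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "modern": [1.0, 0.0],
    "modernist": [0.9, 0.1],
    "medieval": [0.5, 0.5],
    "carriage": [-1.0, 0.2],
}
neighbors = most_similar("modern", toy_vectors)
print([word for word, _ in neighbors])
```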

Distant Reading: An Exam List

As a resource to future graduate students, I am sharing the reading list I compiled for my qualifying exam on Distant Reading. Below the list, you will find a user manual of sorts that explains the rationale for each of the selections.

My goal for posting is by no means to assert an authoritative list, but to offer a provisional set of principles and touchpoints for ongoing conversations in the field. I hope that many more such lists will eventually be posted, as the field matures and as new students and perspectives contribute to the project of Distant Reading.

Good luck on your exam!

Qualifying Exam List in Distant Reading

1. Intellectual & Institutional History

A. Literary Study
James Turner, Philology
John Guillory, Cultural Capital
René Wellek, The Rise of English Literary History
William K. Wimsatt & Cleanth Brooks, Literary Criticism: A Short History
Gerald Graff, Professing Literature
Gauri Viswanathan, Masks of Conquest

B. Statistics
Alain Desrosières, The Politics of Large Numbers
Ian Hacking, The Emergence of Probability
Stephen Stigler, The History of Statistics
Theodore M. Porter, Trust in Numbers
Margo J. Anderson, The American Census
Stephen Jay Gould, The Mismeasure of Man

C. Computer Science
Martin Campbell-Kelly et al, Computer
William Kneale & Martha Kneale, The Development of Logic
Michael R. Williams, A History of Computing Technology
Paul N. Edwards, A Vast Machine
JoAnne Yates, Control through Communication
Janet Abbate, Recoding Gender

2. Quantitative Literary Study

A. Human Sciences
L. A. Sherman, Analytics of Literature
Caroline Spurgeon, Shakespeare’s Imagery and What It Tells Us
Josephine Miles, Wordsworth and the Vocabulary of Emotion
Janice Radway, Reading the Romance
Franco Moretti*, Atlas of the European Novel, 1800-1900
—, “Conjectures on World Literature”

B. Stylometry
T. C. Mendenhall, “The Characteristic Curves of Composition”
G. K. Zipf, “Selected Studies of the Principle of Relative Frequency in Language”
G. Udny Yule, The Statistical Study of Literary Vocabulary (Ch 1-3)
Frederick Mosteller & David L. Wallace, “Inference and Disputed Authorship: The Federalist”
Stanley Fish, “What is Stylistics, and Why Are They Saying Such Terrible Things about It?”
Mark Olsen, “Signs, Symbols, and Discourses”
J. F. Burrows, “Not Unless You Ask Nicely”

C. Humanities Computing
Roberto Busa, Varia Specimina Concordantiarum
Jacob Leed (ed.), The Computer & Literary Style
Henry Kučera & W. Nelson Francis, Brown Corpus
Susan Hockey, Oxford Concordance Program
Michael Sperberg-McQueen & Lou Burnard, TEI P3
Jerome McGann, The Rossetti Archive
Susan Schreibman et al (eds.), A Companion to Digital Humanities (2004)
Andrew McCallum, MALLET
Roy Rosenzweig & Tom Scheinfeldt, Omeka
NEH, Office of Digital Humanities
CIC & UC Libraries, HathiTrust Digital Library
Matthew K. Gold (ed.), Debates in the Digital Humanities (2012; online)

* Note
As a community, we are reckoning with Moretti’s influential role for Distant Reading alongside the revelation of sexual assault against a graduate student. Personally, I am uneasy about including his work here. Lauren F. Klein has called on us to imagine “Distant Reading after Moretti,” and if this reading list makes any contribution to our collective imagination, it is to show that the field of Distant Reading has a history that long precedes him as well. To be up front, there are several other problematic texts and figures in the reading list, but I have also found that they are met at every turn by scholars whose goal is justice. We must commit ourselves to the same.

A Resource

As mentioned above, I’m sharing this list as a resource for graduate students preparing to do research in Distant Reading. Broadly speaking, Distant Reading is a body of scholarship that shares a general goal to produce new interpretive knowledge about literature and culture through measurement and computation. It is necessarily interdisciplinary. In that sense, I see Distant Reading as a branch of the Data Science movement in the academy, and I hope that this resource will be useful to students in a variety of departments.

If there is an argument or polemic in the list, it is this: for Distant Reading to succeed as a research program, it is not enough to simply use statistics and computing to answer conventional literary questions. There must be a reciprocal move, in which literary study shows that it too is an essential part of Data Science, just as much as statistical modeling and computer engineering. The structure of this reading list is designed to make both moves possible.

[Inside Baseball: Although I am not explicitly laying out the relationship between Distant Reading and Digital Humanities, suffice it to say that I think it is a close one. The polemical stance taken above is deeply informed by discussions about Humanities Computing and New Media Studies in the 2000s, and I understand Digital Humanities to have directly contributed to Distant Reading scholarship. These ideas have implicitly guided my discussion below, as well as the selections in the reading list.]

The Classics

The list is broken into two parts. The second part is easier to explain since it contains the “classics” of quantitative literary study. They are the common touchpoints across a range of conversations that enable newcomers to participate and contribute. Having been a graduate student for a while, I can say that the inverse is also true; being unfamiliar with these texts makes it hard to participate in current conversations.

To my eye, there are three major branches of quantitative literary study, as it has been practiced historically: stylometry, humanities computing, and scholarship that approaches literature as a human science. In brief, stylometry produces statistical measurements from literary and linguistic texts; humanities computing works on the logical formalisms that organize language and text; literature-as-human-science tries to systematize knowledge about an entire discourse, such as an author’s oeuvre, a genre, or a period.

The three branches are not entirely separate; they overlap at various points in their histories. For example, Josephine Miles makes important contributions to two or all three branches, depending on how you count them. However, I have found it useful to trace each conversation individually, and afterwards to identify points of contact.

Disciplinary History

The first part of the exam list is intended to historicize quantitative literary study. How did literary study take its current shape? What made quantitative methods feel timely at a few key historical moments, including the present? What are the shared intellectual roots between aesthetics and computation? (I’m looking at you, Aristotle.) Answering these questions means tracing the intellectual and institutional histories of literary study, as well as computer science and statistics.

Each discipline is considered individually but in a way that should draw out their parallels. There are six texts listed for each discipline, in the following order:

  • an historical overview of the discipline,
  • three aspects of its domain knowledge,
  • an instance of its institutionalization, and
  • a critique of power in its institutions.

Again, I have attempted to select “classics” in each field, to facilitate participation in their respective conversations.

For example, the standard text Computer, whose subtitle reads A History of the Information Machine, offers a general account of just that. Getting into the weeds, some of the animating tensions in Computer Science come from its roots in several prior disciplines: mathematics, engineering, and physical science. Each of their contributions is considered in turn in the texts The Development of Logic, A History of Computing Technology, and A Vast Machine. As an historical formation, computers were institutionalized first as part of business practices, described in Control Through Communication. However, sexism is a well-known problem in computing culture. The process by which computing became “coded” as masculine, thereby putting up barriers to women’s participation, is recounted in Recoding Gender.

Data Science

Putting both halves of the list together shows my understanding of Distant Reading as part of the larger program of Data Science. In the literature, Data Science is generally framed as the application of data management (Computer Science) and analysis (Statistics) to problems in a given domain (in this case, Literary Study). This framing more-or-less corresponds to the scholarship found in the second half of the list.

But the relationship of method and problem domain can be reversed to useful effect as well. Telling histories of Computer Science and Statistics recasts their central problems as ones in the written record. They become domain problems to which the methods of the Humanities are applied. This reversal of perspective corresponds to the readings in the first half of the exam list.

The virtues of the “bi-directional” approach manifest at two levels. Institutionally: if Literary Study is to participate in Data Science as a full partner, then we will need to express our concerns in the language of Computer Science and Statistics, and vice-versa. Mutual intelligibility is a minimum requirement. Intellectually: if Distant Reading is to draw from the full resources of both Humanities and Data Science, then it must be articulated from both sides of the divide; each approach supplements the other.

Caveats & Parameters

To be sure, the list in its current form is organized around my own research needs. For one, it emphasizes American cultural study. If the exam were focused on British computing culture, it would be appropriate to switch out JoAnne Yates’s Control Through Communication with Jon Agar’s The Government Machine. The other obvious priority for the list is literary study. A student of Art History, for example, will hopefully find it easy to slot out the history of literary study wherever it appears. The list is designed as a series of modules to facilitate this kind of replacement.

Why is the exam list so short? This is the “theory” section of my full list, which includes American literature and criticism. If you would like additional resources for your own exam, I suggest starting with the UCSB English Department’s exam list in Literature and Theory of Technology. The department requires two rounds of exams, and I took that list during my first round. It offers a theoretical complement to this largely historical list.

It is also worth being explicit about the historical orientation. In both parts, I have constrained readings to those published before 2012 (with the exception of James Turner’s Philology). There are a few motives for this, the most important being to emphasize lines of historical development. There has been an explosion of scholarship in Distant Reading — and in the Digital Humanities generally — since 2012, so it is valuable to have a sense of its historicity. The restriction also guarantees that the list’s expiration date will not come too soon. As time passes, that restriction may appear less congenial.

A Request

I too dislike canons. It is a truism that Distant Reading needs to identify some common discursive touchpoints in order to sustain its conversation, and I have used my exam list to try to name some of them. But I am not certain that they will be useful for everybody. And, to sustain a conversation, more important than texts we agree about are the texts we disagree about! I hope that this list will be the beginning of a conversation rather than its premature closure.

To any student planning your own qualifying exam: please take the useful parts and throw away the rest!

To students who have already taken your exams, I have a request: please share them. I’d love to see what we think Distant Reading is.

A Naive Empirical Post about DTM Weighting

In light of word embeddings’ recent popularity, I’ve been playing around with a version called Latent Semantic Analysis (LSA). Admittedly, LSA has fallen out of favor with the rise of neural embeddings like Word2Vec, but there are several virtues to LSA, including decades of study by linguists and computer scientists. (For an introduction to LSA for humanists, I highly recommend Ted Underwood’s post “LSA is a marvelous tool, but…”.) In reality, though, this blog post is less about LSA and more about tinkering with it and using it for parts.

Like other word embeddings, LSA seeks to learn about words’ semantics by way of context. I’ll sidestep discussion of LSA’s specific mechanics by saying that it uses techniques that are closely related to ones commonly used in distant reading. Broadly, LSA constructs something like a document-term matrix (DTM) and then performs something like Principal Component Analysis (PCA) on it.1 (I’ll be using those later in the post.) The art of LSA, however, lies in between these steps.

Typically, after constructing a corpus matrix, LSA involves some kind of weighting of the raw word counts. The most familiar weight scheme is (l1) normalization: sum up the number of words in a document and divide the individual word counts, so that each cell in the matrix represents a word’s relative frequency. This is something distant readers do all the time. However, there is an extensive literature on LSA devoted to alternate weights that improve performance on certain tasks, such as analogies or document retrieval, and on different types of documents.
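For concreteness, the (l1) normalization just described can be sketched in a few lines. The function name and the two-document toy matrix are invented for illustration, with rows standing for documents and columns for terms.

```python
def l1_normalize(counts):
    """Divide each word count by the document's total word count."""
    total = sum(counts)
    # An empty document is returned unchanged to avoid dividing by zero.
    return [c / total for c in counts] if total else list(counts)

# Toy DTM: two documents (rows), three terms (columns).
dtm = [[4, 1, 0], [10, 5, 5]]
relative = [l1_normalize(row) for row in dtm]
print(relative)
```

Each row of `relative` now sums to one, so a long novel and a short one can be compared on the same footing.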

This is the point that piques my curiosity. Can we use different weightings strategically to capture valuable features of a textual corpus? How might we use a semantic model like LSA in existing distant reading practices? The similarity of LSA to a common technique (i.e. PCA) for pattern finding and featurization in distant reading suggests that we can profitably apply its weight schemes to work that we are already doing.

What follows is naive empirical observation. No hypotheses will be tested! No texts will be interpreted! But I will demonstrate the impact that alternate matrix weightings have on the patterns we identify in a test corpus. Specifically, I find that applying LSA-style weights has the effect of emphasizing textual subject matter which correlates strongly with diachronic trends in the corpus.

I resist making strong claims about the role that matrix weighting plays in the kinds of arguments distant readers have made previously — after all, that would require hypothesis testing — however, I hope to shed some light on this under-explored piece of the interpretive puzzle.

Matrix Weighting

Matrix weighting for LSA gets broken out into two components: local and global weights. The local weight measures the importance of a word within a given text, while the global weight measures the importance of that word across texts. These are treated as independent functions that multiply one another.

W_{i,j} = L_{i,j}\cdot G_{i}

where i refers to the row corresponding to a given word and j refers to the column corresponding to a given text.2 One common example of such a weight scheme is tf-idf, or term frequency-inverse document frequency. (Tf-idf has been discussed for its application to literary study at length by Stephen Ramsay and others.)

Research on LSA weighting generally favors measurements of information and entropy. Typically, term frequencies are locally weighted using a logarithm and globally weighted by measuring the term’s entropy across documents. How much information does a term contribute to a document? How much does the corpus attribute to that term?

While there are many variations of these weights, I have found one of the most common formulations to be the most useful on my toy corpus.3

L_{i,j} = \log(tf_{i,j} + 1)
G_{i} = 1 + \frac{\sum_{j=1}^{N} p_{i,j}\cdot \log(p_{i,j})}{\log(N)}

where tf_{i,j} is the raw count of word i in document j, p_{i,j} is the conditional probability of document j given word i, and N is the total number of documents in the corpus. For the global weights, I used normalized (l1) frequencies for each term when calculating the probabilities p_{i,j}. This prevents longer texts from having disproportionate influence on the term’s entropy. Vector (l2) normalization is applied over each document, after weights have been applied.

To be sure, this is not the only weight and normalization procedure one might wish to use. The formulas above merely describe the configuration I have found most satisfying (so far) on my toy corpus. Applying variants of those formulas on new data sets is highly encouraged!
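To make the procedure concrete, here is a minimal sketch of the log-entropy weighting as I read it from the formulas and description above. It is not the code behind the figures in this post. It works on a DTM with documents as rows and terms as columns (the transpose of the term-context indexing in the formulas), and it assumes at least two documents, none of them empty. The function name and toy matrix are invented for illustration.

```python
import math

def log_entropy(dtm):
    """Log-entropy weight a DTM (rows = documents, columns = terms)."""
    n_docs, n_terms = len(dtm), len(dtm[0])
    # l1-normalize each document so long texts do not dominate the entropy.
    rel = []
    for row in dtm:
        total = sum(row)
        rel.append([c / total for c in row])
    # Global weight per term: 1 + sum_j p * log(p) / log(N).
    global_weights = []
    for i in range(n_terms):
        column_total = sum(rel[j][i] for j in range(n_docs))
        entropy_sum = 0.0
        for j in range(n_docs):
            p = rel[j][i] / column_total if column_total else 0.0
            if p > 0:  # treat 0 * log(0) as 0
                entropy_sum += p * math.log(p)
        global_weights.append(1 + entropy_sum / math.log(n_docs))
    # Local weight log(tf + 1), multiplied by the term's global weight.
    weighted = [[math.log(c + 1) * g for c, g in zip(row, global_weights)]
                for row in dtm]
    # l2-normalize each document vector after weighting.
    normalized = []
    for row in weighted:
        norm = math.sqrt(sum(w * w for w in row))
        normalized.append([w / norm if norm else 0.0 for w in row])
    return normalized, global_weights

# Toy DTM: term 0 is spread evenly over both documents,
# terms 1 and 2 each occur in a single document.
toy_dtm = [[5, 5, 0], [5, 0, 5]]
weighted, global_weights = log_entropy(toy_dtm)
print(global_weights)
```

In the toy case the evenly spread term receives a global weight of zero while the two document-specific terms receive a weight of one, which is the behavior that pushes ubiquitous words such as stop words toward the background.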

Log-Entropy vs. Relative Frequency

My immediate goal for this post is to ask what patterns get emphasized by log-entropy weighting a DTM. Specifically, I’m asking what gets emphasized when we perform a Principal Component Analysis, although the fact that PCA models feature variance suggests that the results may have bearing on further statistical models. We can do this at a first approximation by comparing the results from log-entropy to those of vanilla normalization. To this end, I have pulled together a toy corpus of 400 twentieth-century American novels. (This sample approximates WorldCat holdings, distributed evenly across the century.)

PCA itself is a pretty useful model for asking intuitive questions about a corpus. The basic idea is that it looks for covariance among features: words that appear in roughly the same proportions to one another (or inversely) across texts probably indicate some kind of interesting subject matter. Similarly, texts that contain a substantive share of these terms may signal some kind of discourse. Naturally, there will be many such clusters of covarying words, and PCA will rank each of these by the magnitude of variance they account for in the DTM.
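The ranking of components by variance can be sketched with a Singular Value Decomposition of the mean-centered matrix, which is one standard way to compute PCA. This is a sketch assuming NumPy is available; the function name `pca` and the four-document toy matrix are invented for illustration and are not the corpus or code used for the figures.

```python
import numpy as np

def pca(matrix, n_components=2):
    """PCA via SVD of the mean-centered matrix (rows = documents)."""
    X = np.asarray(matrix, dtype=float)
    X_centered = X - X.mean(axis=0)  # center each term (column)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # document coordinates
    loadings = Vt[:n_components]                     # word weights per component
    explained = (S ** 2) / np.sum(S ** 2)            # share of total variance
    return scores, loadings, explained[:n_components]

# Toy normalized DTM: four documents, three terms.
toy_dtm = np.array([
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.3, 0.5],
    [0.1, 0.4, 0.5],
])
scores, loadings, explained = pca(toy_dtm)
print(explained)
```

Here `scores` gives each text's position along the components (what gets plotted), while each row of `loadings` ranks words by how strongly they participate in that component. The signs of both are arbitrary.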

Below are two graphs that represent the corpus of twentieth-century novels. Each novel appears as a circle, while the color indicates its publication date: the lightest blue corresponds to 1901 and the brightest purple to 2000. Note that positive/negative signs are assigned arbitrarily, so that, in Fig. 1, time appears to move from the lower-right hand corner of the graph to the upper-left.

To produce each graph, PCA was performed on the corpus DTM. The difference between graphs consists in the matrices’ weighting and normalization. To produce Fig. 2, the DTM of raw word counts was (l1) normalized by dividing all terms by the sum of the row. On the other hand, Fig. 1 was produced by weighting the DTM according to the log-entropy formula described earlier. Note that stop words were not removed from either DTM.4

PCA: 400 20th Century American Novels
Figure 1. Log-Entropy Weighted DTM.
Explained Variance: PC 1 (X) 2.8%, PC 2 (Y) 1.8%
Figure 2. Normalized DTM.
Explained Variance: PC 1 (X) 25%, PC 2 (Y) 10%

The differentiation between earlier and later novels is visibly greater in Fig. 1 than in Fig. 2. We can gauge that intuition formally, as well. In Fig. 1, the correlation between the first two principal components and publication year is substantial, r² = 0.49. Compare that to Fig. 2, where r² = 0.07. For the log-entropy matrix, the first two principal components encode a good deal of information about the publication year, while those of the other matrix appear to capture something more synchronic.5
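One plausible way to arrive at a single r² for two components at once is the coefficient of determination from a least-squares regression of publication year on both component scores. The post does not spell out the calculation, so treat the following as an assumption rather than a record of the method. The function name and the toy numbers are invented, and NumPy is assumed to be available.

```python
import numpy as np

def r_squared(features, target):
    """Coefficient of determination for a least-squares linear fit."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    # Add an intercept column, then solve the least-squares problem.
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    total = target - target.mean()
    return 1 - (residuals @ residuals) / (total @ total)

# Toy scores on two components for four novels, with years constructed
# as an exact linear function of them (so the fit is perfect).
toy_scores = np.array([[-1.0, 0.5], [0.0, -1.0], [1.0, 0.0], [2.0, 1.0]])
toy_years = 1950 + 10 * toy_scores[:, 0] - 5 * toy_scores[:, 1]
toy_r2 = r_squared(toy_scores, toy_years)
print(round(toy_r2, 6))
```

On real data the value falls between 0 and 1, and it reads as the share of variance in publication year that the two components jointly account for.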

Looking at the most important words for the first principal component of each graph (visualized as the x-axis; accounting for the greatest variance among texts in the corpus) is pretty revealing about what’s happening under the hood. Table 1 and Table 2 show the top positive and negative words for the first principal component from each analysis. Bear in mind that these signs are arbitrary and solely indicate that words are inversely proportional to one another.

Top 10 ranked positive and negative words in first principal component
Table 1: Log-Weighted DTM
Positive     Negative
upon         guy
morrow       guys
presently    gonna
colour       yeah
thus         cop
carriage     cops
madame       phone
ah           uh
exclaimed    okay
_i_          jack

Table 2: Normalized DTM
Positive     Negative
the          she
of           you
his          her
and          it
he           said
in           me
from         to
their        don
by           what
they         know

The co-varying terms in the log-entropy DTM in Table 1 potentially indicate high-level differences in social milieu and genre. Among the positive terms, we find British spelling and self-consciously literary diction. (Perhaps it is not surprising that many of the most highly ranked novels on this metric belong to Henry James.) The negative terms, conversely, include informal spelling and word choice. Similarly, the transition from words like carriage and madame to phone and cops gestures toward changes that we might expect in fiction and society during the twentieth century.

Although I won’t describe them at length, I would like to point out that the second principal component (y-axis) is equally interesting. The positive terms refer to planetary conflict, and the cluster of novels at the top of the graph consists of science fiction. The inverse terms include familial relationships, common first names, and phonetic renderings of African American Vernacular English (AAVE). (It is telling that some of the novels located furthest in that direction are those of Paul Laurence Dunbar.) This potentially speaks to ongoing conversations about race and science fiction.

On the other hand, the top ranking terms in Table 2 for the vanilla-normalized DTM are nearly all stop words. This makes sense when we remember that they are the most frequent words by far in a typical text. By their sheer magnitude, any variation in these common words will constitute a strong signal for PCA. Granted, stop words can be easily removed during pre-processing of texts. However, this means a trade-off for the researcher, since they are known to encode valuable textual features like authorship or basic structure.

The most notable feature of the normalized DTM’s highly ranked words is the inverse relationship between gendered pronouns. Novels about she and her tend to speak less about he and his, and vice versa. The other words in the list don’t give us much of an interpretive foothold on the subject matter of such novels; however, we can find a pattern in terms of the metadata: the far left side of the graph is populated mostly by women-authored novels and the far right by men-authored ones. PCA seems to be performing stylometry and finding substantial gendered difference. This potentially gestures toward the role that novels play as social actors, part of a larger cultural system that reproduces gendered categories.

This is a useful finding in and of itself; however, looking at further principal components of the normalized DTM reveals the limitation of simple normalization. The top-ranking words for the next several PCs comprise the same stop words but in different proportions. PCA is simply accounting for stylometric variation. Again, it is easy enough to remove stop words from the corpus before processing. Indeed, removing stop words reveals an inverse relationship between dialogue and moralizing narrative.6 Yet, even with a different headline pattern, one finds oneself in roughly the same position. The first several PCs articulate variations of the most common words. The computer is more sensitive to patterns among words with relatively higher frequency, and Zipf’s Law indicates that some cluster of words will always rise to the top.


From this pair of analyses, we can see that our results were explicitly shaped by the matrix weighting. Log-entropy got us closer to the subject matter of the corpus, while typical normalization captures more about authorship (or basic textual structure, when stop words are removed). Each of these is attuned to different but equally valuable research questions, and the selection of matrix weights will depend on the category of phenomena one hopes to explore.

I would like to emphasize that it is entirely possible that the validity of these findings is limited to applications of PCA. Log-entropy was chosen here because it optimizes semantic representation on a PCA-like model. Minimally, we may find it useful when we are looking specifically at patterns of covariance or even when using the principal components as features in other models.

However, this points us to a larger discussion about the role of feature representation in our modeling. As a conceptual underpinning, diction takes us pretty far toward justifying the raw counts of words that make up a DTM. (How often did the author choose this word?) But it is likely that we wish to find patterns that do not reduce to diction alone. The basis of a weight scheme like log-entropy in information theory moves us to a different set of questions about representing a text to the computer. (How many bits does this term contribute to the novel?)

The major obstacle then is not a technical one but interpretive. Information theory frames the text as a message that has been stored and transmitted. The question this raises, then: what else is a novel, if not a form of communication?


1. In reality, LSA employs a term-context matrix that turns the typical DTM on its side: rows correspond to unique terms in the corpus and columns to documents (or another unit that counts as “context,” such as a window of adjacent words). After weighting and normalizing the matrix, LSA performs a Singular Value Decomposition (SVD). PCA is an application of SVD.

2. Subscripts i and j correspond to the rows and columns of the term-context matrix described in fn. 1, rather than the document-term matrix that I will use later.

3. This formulation appears, for example, in Nakov et al (2001) and Pincombe (2004). I have tested variations of these formulas semi-systematically, given the corpus and pre-processing. This pair of local and global weights was chosen because it resulted in principal components that correlated most strongly with publication date, as described in the next section. It was also the case that weight schemes tended to minimize the role of stop words in rough proportion to that correlation with date. The value of this optimization is admittedly more intuitive than formal.

For the results of other local and global weights, see the Jupyter Notebook in the GitHub repo for this post.

4. This is an idiosyncratic choice. It is much more common to remove stop words in both LSA specifically and distant reading generally. At bottom, the motivation to keep them is to identify baseline performance for log-entropy weighting, before varying pre-processing procedures. In any case, I have limited my conclusions to ones that are valid regardless of stop word removal.

Note also that roughly the 12,000 most common tokens were used in this model, accounting for 95% of the total words in the corpus. Reducing that number to the 3,000 most common tokens, or about 80% of total words, did not appreciably change the results reported in this post.

5. When stop words were removed, log-entropy performed slightly worse on this metric (r² = 0.48), while typical normalization performed slightly better (r² = 0.10). The latter result is consistent with the overall trend that minimizing stop words improves correlation with publication date. However, the negative effect on log-entropy suggests that it is able to leverage valuable information encoded in stop words, even while the model is not dominated by them.

6. See Table 3 below for a list of top ranked words in the first principal component when stop words are removed from the normalized DTM. Note that, for log-entropy, removing stop words does not substantially change its list of top words.

Top 10 ranked positive and negative words in first principal component
Table 3: Normalized DTM, stopwords removed
Positive     Negative
said         great
don          life
got          world
didn         men
know         shall
ll           young
just         day
asked        new
looked       long
right        moment

Operationalizing The Urn: Part 3

This post is the third in a series on operationalizing the method of close reading in Cleanth Brooks’s The Well Wrought Urn. The first post had laid out the rationale and stakes for such a method of reading, and the second post had performed that distant reading in order to test Brooks’s literary historical claims. This final post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Irony (& Paradox & Ambiguity)

For Cleanth Brooks, the word is the site where irony is produced. Individual words do not carry meanings with them a priori, but instead their meanings are constructed dynamically and contingently through their use in the text: “[T]he word as the poet uses it, has to be conceived of, not as a discrete particle of meaning, but as a potential of meaning, a nexus or cluster of meanings” (210). This means that words are deployed as flexible semantic objects that are neither predetermined nor circumscribed by a dictionary entry. In fact, he refuses to settle on a particular name for this phenomenon of semantic construction, saying that further work must be done in order to better understand it. Brooks uses the terms paradox and ambiguity at several points; however, as a shorthand, I will simply use the term irony to refer to the co-presence of multiple, unstable, or incommensurate meanings.

This commitment to the word as a discrete-yet-contextualized unit is already encoded into the distant reading of the previous post. We had found provisional evidence for Brooks’s empirical claims about literary history, based on patterns across words that traverse multiple textual scales. The bag-of-words (BOW) model used to represent texts in our model had counted the frequencies of words as individual units, while the logistic regression had looked at trends across these frequencies, including co-occurrence. (Indeed, Brooks’s own interpretive commitment had quietly guided the selection of the logistic regression model.)

Previously, I described the process of learning the difference between initial and final BOWs in terms of geometry; now, however, I will point us to the only-slightly grittier algebra behind the spatial intuition. When determining where to draw the line between categories of BOW, logistic regression learns how much to weight each word in the BOW when making its decision. For example, the model may have found that some words appear systematically in a single category of BOW; these have received larger weights. Other words will occur equally in both initial and final BOWs, making them unreliable predictors of the BOW’s category. As a result, these words receive very little weight. Similarly, some words are too infrequent to give evidence one way or the other.
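The algebra in question is compact enough to sketch. A logistic regression multiplies each word's count by its learned weight, sums the results, and passes the total through the logistic function to get a probability. The weights, the intercept, and the two toy BOWs below are invented for illustration and are not values from the fitted model. The sketch also assumes that positive weights point toward the final class.

```python
import math

def probability_final(bow, weights, intercept=0.0):
    """Probability that a bag of words belongs to the final class."""
    # Weighted sum of word counts; words without a weight contribute nothing.
    score = intercept + sum(weights.get(word, 0.0) * count
                            for word, count in bow.items())
    # The logistic (sigmoid) function maps the score into (0, 1).
    return 1 / (1 + math.exp(-score))

# Hypothetical weights: one strong initial word, one strong final word,
# and one word that occurs about equally in both categories.
toy_weights = {"chapter": -2.0, "forever": 1.5, "the": 0.01}
opening = {"chapter": 1, "the": 12}
closing = {"forever": 2, "the": 10}

p_opening = probability_final(opening, toy_weights)
p_closing = probability_final(closing, toy_weights)
print(round(p_opening, 3), round(p_closing, 3))
```

Notice that the many occurrences of a near-zero-weight word barely move either probability, while a single strongly weighted word is enough to tip the balance.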

Word       Weight     Word       Weight
chapter    -7.65      asked      4.76
oh         -5.98      away       4.33
yes        -5.40      happy      3.62
took       -4.67      lose       3.51
thank      -4.57      forever    3.50
tall       -4.33      rest       3.48
does       -3.74      tomorrow   3.21
sat        -3.51      kill       3.20
let        -3.12      cheek      3.16
built      -3.10      help       3.12

Table 1. Top 10 weighted initial and final words in the model. Weights reported in standard units (z-score) to facilitate comparison.
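The algebra behind these weights can be sketched in a few lines. This is a minimal illustration and not the fitted model: the function name, the toy coefficients, and the toy frequencies are all hypothetical, and the sketch assumes that positive weights lean toward the final class, as the signs in Table 1 suggest.

```python
import math

def probability_of_final(bow, weights, bias=0.0):
    """Probability that a BOW is 'final' under a logistic model.

    bow maps words to standardized frequencies; weights maps words to
    learned coefficients. Words absent from weights contribute nothing.
    """
    score = bias + sum(weights.get(word, 0.0) * value
                       for word, value in bow.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, not the fitted values behind Table 1.
# Assumption: negative leans initial, positive leans final.
toy_weights = {"took": -1.2, "let": -0.8, "asked": 1.1, "forever": 0.9}

# Hypothetical standardized frequencies for two BOWs.
initial_like = {"took": 1.5, "let": 1.0, "asked": -0.5}
final_like = {"took": -0.7, "asked": 1.4, "forever": 1.2}

print(probability_of_final(initial_like, toy_weights))  # well below 0.5
print(probability_of_final(final_like, toy_weights))    # well above 0.5
```

Each word nudges the score in proportion to its weight and its frequency, which is why a word that appears equally in both classes, and so earns a weight near zero, barely moves the prediction.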

We can build an intuition for what our model has found by circling back to the human-language text.1 Weights have been assigned to individual words (excerpted in Table 1), which convey whether and how much their presence indicates the likelihood of a given category. It is a little more complicated than this, since words do not appear in isolation but often in groups, and the weights for the whole grouping get distributed over the individual words. This makes it difficult to separate out the role of any particular word in the assignment of a BOW to a particular category. That said, looking at where these highly-weighted words aggregate and play off one another may gesture toward the textual structure that Brooks had theorized. When looking at the texts themselves, I will highlight any words whose weights lean strongly toward the initial (blue) or the final (red) class.2

Let us turn to a well-structured paragraph in a well-structured novel: Agatha Christie’s The A.B.C. Murders. In this early passage, Hercule Poirot takes a statement from Mrs. Fowler, the neighbor of a murder victim, Mrs. Ascher. Poirot asks first whether the deceased had received any strange letters recently. Fowler guesses such a letter may have come from Ascher’s estranged husband, Franz.

I know the kind of thing you mean—anonymous letters they call them—mostly full of words you’d blush to say out loud. Well, I don’t know, I’m sure, if Franz Ascher ever took to writing those. Mrs. Ascher never let on to me if he did. What’s that? A railway guide, an A B C? No, I never saw such a thing about—and I’m sure if Mrs. Ascher had been sent one I’d have heard about it. I declare you could have knocked me down with a feather when I heard about this whole business. It was my girl Edie what came to me. ‘Mum,’ she says, ‘there’s ever so many policemen next door.’ Gave me quite a turn, it did. ‘Well,’ I said, when I heard about it, ‘it does show that she ought never to have been alone in the house—that niece of hers ought to have been with her. A man in drink can be like a ravening wolf,’ I said, ‘and in my opinion a wild beast is neither more nor less than what that old devil of a husband of hers is. I’ve warned her,’ I said, ‘many times and now my words have come true. He’ll do for you,’ I said. And he has done for her! You can’t rightly estimate what a man will do when he’s in drink and this murder’s a proof of it.

There are several turns in the paragraph, and we find that Mrs. Fowler’s train of thought (quietly guided by Poirot’s questioning) is paralleled by the color of the highlighted words. The largest turn occurs about midway through the paragraph when the topic changes from clues to the murder itself. Where initially Mrs. Fowler had been sure to have no knowledge of the clues, she confidently furnishes the murder’s suspect, opportunity, and motive. Structurally, we find that the balance of initial and final words flips at this point as well. The first several sentences rest on hearsay — what she has heard, what has been let on or uttered out loud — while the latter rests on Fowler’s self-authorization — what she herself has previously said. By moving into the position of the author of her own story, she overwrites her previously admitted absence of knowledge and validates her claims about the murder.

The irony of Fowler’s claiming to know (the circumstances of murder) despite not knowing (the clues) does not, in fact, invalidate her knowledge. Her very misunderstandings reveal a great deal about the milieu in which the murder took place. For example, it is a world where anonymous letters are steamy romances rather than death threats. (Poirot himself had recently received an anonymous letter regarding the murder.) More importantly, Fowler had earlier revealed that it is a world of door-to-door salesmen, when she had mistaken Poirot for one. This becomes an important clue toward solving the case, but only much later once Poirot learns to recognize it.

Zooming our attention to the scale of the sentence, however, leads us to a different kind of tension than the one that animates Poirot. At the scale of the paragraph, the acquisition and transmission of knowledge are the central problems: what is known and what is not yet known. At the scale of the sentence, the question becomes knowability.

In the final sentence of her statement, Mrs. Fowler makes a fine epistemological point. Characterizing the estranged husband, Franz:

You can’t rightly estimate what a man will do when he’s in drink and this murder’s a proof of it.

Not simply a recoiling from gendered violence here, Fowler expresses the unpredictability of violence in a man under the influence. The potential for violence is recognizable, yet its particular, material manifestation is not. Paradoxically, the evidence before Fowler confirms that very unknowability.

Again, we find that the structure of the sentence mirrors this progression of potential-for-action and eventual materialization. A man is a multivalent locus of potential; drink intervenes by narrowing his potential for action, while disrupting its predictability; and the proof adds another layer by making the unknowable known. Among highlighted words earlier in the paragraph, we observe this pattern to an extent as well. Looking to a different character, the moment when Mrs. Fowler’s girl, Edie, came into the house (both initial words) sets up the delivery of unexpected news. And what had Fowler heard (initial)? This whole business (final). The final words seem to materialize or delimit the possibilities made available by the initial words. Yet the paths not taken constitute the significance of the ones that are finally selected.

To be totally clear, I am in no way arguing that the close readings I have offered are fully articulated by the distant reading. The patterns to which the initial and final words belong were found at the scale of more than one thousand texts. As a result, no individual instance of a word is strictly interpretable by way of the model: we would entirely expect by chance to find, say, a final word in the beginning of a sentence, at the start of a paragraph, in the first half of a book. (This is the case for letters in the paragraph above.) This touches on an outstanding problem in distant reading: how far to interpret our model?

I have attempted to sidestep this question by oscillating between close and distant readings. My method here has been to offer a close reading of the passage, highlighting its constitutive irony, and to show how the model’s features follow the contour of that reading. The paragraph turns from not-knowing to claiming knowledge; several sentences turn from unknowability to the horror of knowing. Although it is true that the sequential distribution of the model’s features parallels these interpretive shifts, it does not tell us what the irony actually consists of in either case. That is, heard-heard-heard-said-said-said or man-drink-proof do not produce semantic meaning, even while they trace its arc. If, as in Brooks’s framework, the structure of a text constitutes the solution to problems raised by its subject matter, then the distant reading seems to offer a solution to a problem it does not know. The model is merely a shadow on the wall cast by the richness of the text.

Let us take one last glance at The ABC Murders’s full shadow at the scale of the novel. In the paragraph and sentences above, I had emphasized the relative proportions of initial and final words in short segments of text. While this same approach could be taken with the full text, we will add a small twist. Rather than simply scanning for locations where one or the other type of word appears with high density, we will observe the sequential accumulation of such words. That is, we will pass through the text word-by-word, while keeping a running tally of the difference between feature categories: when we see an initial word, we will add one to the tally and when we see a final word, we will subtract one.3
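The running tally can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, tokens are matched after simple lowercasing, and the two small word sets are stand-ins loosely drawn from the passage discussed above, not the model's actual highlighted vocabulary.

```python
from itertools import accumulate

def running_tally(tokens, initial_words, final_words):
    """Cumulative sum over a text: +1 per initial word, -1 per final word."""
    steps = []
    for token in tokens:
        word = token.lower()
        if word in initial_words:
            steps.append(1)
        elif word in final_words:
            steps.append(-1)
        else:
            steps.append(0)  # unhighlighted words leave the tally unchanged
    return list(accumulate(steps))

def net_density(tally, start, end):
    """Slope of the tally between two positions (net highlighted words per token)."""
    return (tally[end] - tally[start]) / (end - start)

# Stand-in word sets for illustration only.
initial_words = {"heard", "came", "house"}
final_words = {"said", "business"}

tokens = "I heard about this whole business she said".split()
tally = running_tally(tokens, initial_words, final_words)
print(tally)  # [0, 1, 1, 1, 1, 0, 0, -1]
```

The height of the tally at any point records the accumulated balance of the two classes, while the slope over a stretch of text, computed here by `net_density`, indicates which class dominates locally.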

Figure 3. Cumulative sum of initial (+1) and final (-1) words over The ABC Murders. X-axis indicates position in text by word count. The tally rises early in the book and remains high until near the middle; in the last part of the book, it moves downward rapidly and ends low.

In Figure 3, we find something tantalizingly close to plot arc. If initial words had opened up spaces of possibility and final words had manifested or intervened on these, then perhaps we can see the rise and fall of tension (even suspense?) in this mystery novel. To orient ourselves, the paragraph we had attended to closely appears around the 9000th word in the text. This puts the first murder (out of four) and speculations like those of Mrs. Fowler at the initial rise in the tally that dominates the first half of the book. At the other end, the book closes per convention with Poirot’s unmasking of the murderer and extended explanation, during which ambiguity is systematically eliminated. This is paralleled by the final precipitous drop in the tally.

I’ll close by asking not what questions our model has answered about the text (these are few), but what the text raises about the model (these are many more). What exactly has our distant reading found? Analysis began with BOWs, yet can we say that the logistic regression has found patterns that exceed mere word frequencies?

Brooks had indicated what he believed we might find through these reading practices: irony, paradox, ambiguity. Since Brooks himself expressed ambivalence as to their distinctions and called for clarification of terms, I have primarily used the word irony as a catch-all. However, irony alone did not fully seem to capture the interpretive phenomenon articulated by the model. While observing the actual distribution of the model features in the text, I found it expedient to use the terms paradox and ambiguity at certain points as well. Perhaps this is a sign that our distant reading has picked up precisely the phenomenon Brooks was attending to. If that is the case, then distant reading is well positioned to extend Brooks’s own close reading project.


1. This method has been used, for example, by Underwood & Sellers in “The Longue Durée of Literary Prestige.” Goldstone’s insightful response to a preprint of that paper, “Of Literary Standards and Logistic Regression: A Reproduction,” describes some of the interpretive limits to “word lists,” which are used in this study as well. For a broader discussion of the relationship between machine classification and close reading, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.”

2. Not all words from the model have been highlighted. In order to increase our confidence in their validity, only those words whose weight is at least one standard deviation from the mean of all weights have been highlighted. As a result, we have a highlighted vocabulary of about 800 words, accounting for about 10% of all words in the corpus. In the passage we examine, just over one-in-ten words is highlighted.

The full 812-word vocabulary is available in plaintext files as initial and final lists.
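The selection rule in this note can be sketched as below. This is an illustration only: the function name and the toy weights are hypothetical, the population standard deviation is an assumption (the note does not say which estimator was used), and the sketch assumes that negative weights lean initial and positive weights lean final.

```python
from statistics import mean, pstdev

def highlighted_words(weights, threshold=1.0):
    """Split words into (initial, final) sets, keeping only those whose
    weight lies at least `threshold` standard deviations from the mean."""
    values = list(weights.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    initial, final = set(), set()
    if sigma == 0:
        return initial, final  # all weights identical: nothing stands out
    for word, weight in weights.items():
        z = (weight - mu) / sigma
        if z <= -threshold:
            initial.add(word)  # assumption: negative weights lean initial
        elif z >= threshold:
            final.add(word)    # assumption: positive weights lean final
    return initial, final

# Hypothetical weights for illustration.
toy_weights = {"chapter": -4.0, "took": -0.2, "door": 0.0,
               "window": 0.2, "forever": 4.0}
initial, final = highlighted_words(toy_weights)
print(initial, final)  # {'chapter'} {'forever'}
```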

3. Note that it is a short jump from this accumulative method to identifying the relative densities of these words at any point in the text. We would simply look at the slope of the line, rather than its height.


Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich. 1975.

Christie, Agatha. The ABC Murders. New York: Harper Collins. 2011 (1936).

Goldstone, Andrew. “Of Literary Standards and Logistic Regression: A Reproduction.” Personal blog. 2016. Accessed March 2, 2017.

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.

Operationalizing The Urn: Part 2

This post is the second in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. The first post had laid out the rationale and stakes for such a method of reading. This post will perform that distant reading in order to test Brooks’s literary historical claims. The third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Distant Reading a Century of Structure

By Brooks’s account, close reading enjoys universal application over Anglo-American poems produced “since Shakespeare” because these employ the same broad formal structure. Let us test this hypothesis. In order to do so we need to operationalize his textual model and use this to read a fairly substantial corpus.

Under Brooks’s model, structure is a matter of both sequence and scale. In order to evaluate sequence in its most rudimentary form, we will look at halves: the second half of a text sequentially follows the first. The matter of scale indicates for us what is to be halved: sentences, paragraphs, full texts. As a goal for our model, we will need to identify how second halves of things differentiate themselves from first halves. Moreover, this must be done in such a way that the differentiation of sentences’ halves occurs in dialogue with the differentiation of paragraphs’ halves and of books’ halves, and vice versa.

Regarding corpus, we will deviate from Brooks’s own study, by turning from poetry to fiction. This brings our study closer into line with current digital humanities scholarship, speaking to issues that have recently been raised regarding narrative and scale. We also need a very large corpus and access to full texts. To this end, we will draw from the Chicago Text Lab’s collection of twentieth-century novels.1 Because we hope to observe historical trends, we require balance across publication dates: twelve texts were randomly sampled from each year of the twentieth century.

Each text in our corpus is divided into an initial set of words and a final set of words; however, what constitutes initial-ness and final-ness for each text will be scale-dependent and randomly assigned. We will break our corpus into three groups of four hundred novels, associated with each level of scale discussed. For example, from the first group of novels, we will collect the words belonging to the first half of each sentence into an initial bag of words (BOW) and those belonging to the second half of each sentence into a final BOW. To be clear, a BOW is simply a list of words and the frequencies with which they appear. In essence, we are asking whether certain words (or, especially, groups of words) conventionally indicate different structural positions. What are the semantics of qualification?

For example, Edna Ferber’s Come and Get It (1935) was included among the sentence-level texts. The novel begins:

DOWN the stairway of his house came Barney Glasgow on his way to breakfast. A fine stairway, black walnut and white walnut. A fine house. A fine figure of a man, Barney Glasgow himself, at fifty-three. And so he thought as he descended with his light quick step. He was aware of these things this morning. He savored them as for the first time.

The words highlighted in blue are considered to be the novel’s initial words and those in red are its final words. Although this is only a snippet, we may note Ferber’s repeated use of sentence fragments, where each refers to “[a] fine” aspect of Barney Glasgow’s self-reflected life. That is, fine occurs three times as an initial word here and not at all as a final word. Do we expect certain narrative turns to follow this sentence-level set up? (Perhaps, things will not be so fine after all!)

This process is employed at each of the other scales as well. From the second group of novels, we do the same with paragraphs: words belonging to the first half of each paragraph are collected into the initial BOW and from the second half into the final BOW. For the third group of novels, the same process is performed over the entire body of the text. In sum, each novel is represented by two BOWs, and we will search for patterns that distinguish all initial BOWs from all final BOWs simultaneously. That is, we hope to find patterns that operate across scales.
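The halving procedure can be sketched as follows. This is a minimal illustration, not the code used for the experiment: the function names are hypothetical, tokenization is assumed to have happened already, and the choice to place the middle word of an odd-length unit in the first half is an assumption. The same function applies at any scale, since a "unit" may be a sentence, a paragraph, or a whole text.

```python
from collections import Counter

def split_halves(tokens):
    """Split a token list into first and second halves.

    Assumption: for an odd number of tokens, the middle token is
    counted with the first half.
    """
    midpoint = (len(tokens) + 1) // 2
    return tokens[:midpoint], tokens[midpoint:]

def halves_to_bows(units):
    """Build one initial BOW and one final BOW from a list of units."""
    initial, final = Counter(), Counter()
    for tokens in units:
        first, second = split_halves(tokens)
        initial.update(first)
        final.update(second)
    return initial, final

# Three sentence fragments from the Ferber excerpt, lowercased and
# stripped of punctuation by hand.
sentences = [
    "a fine stairway black walnut and white walnut".split(),
    "a fine house".split(),
    "a fine figure of a man barney glasgow himself at fifty-three".split(),
]
initial_bow, final_bow = halves_to_bows(sentences)
print(initial_bow["fine"], final_bow["fine"])  # 3 0
```

To model a paragraph or a full text instead, one would pass a list whose units are whole paragraphs or a single list containing every token in the novel.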

The comparison of textual objects belonging to binary categories has been performed in ways congenial to humanistic interpretation by using logistic regression.2 One way to think about this method is in terms of geometry. Imagining our texts as points floating in space, classification would consist of drawing a line that separates the categories at hand: initial BOWs, say, above the line and final BOWs below it. Logistic regression is a technique for choosing where to draw that line, based on the frequencies of the words it observes in each BOW.

The patterns that it identifies are not necessarily ones that are expressed in any given text, but that become visible at scale. There are several statistical virtues to this method that I will elide here, but I will mention that humanists have found it valuable for the fact that it returns a probability of membership in a given class. Its predictions are not hard-and-fast but rather fuzzy; these allow us to approach categorization as a problem of legibility and instability.3

The distant reading will consist of this: we will perform a logistic regression over all of our initial and final BOWs (1200 of each).4 In this way, the computer will “learn” by attempting to draw a line that will put sentence-initial BOWs, paragraph-initial and narrative-initial BOWs on the same side, but none of the final BOWs. The most robust interpretive use of such a model is to make predictions about whether it thinks new, unseen BOWs belong to the initial or final category, and we will make use of this shortly.
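To make the idea of "learning where to draw the line" concrete, here is a small, dependency-free sketch of logistic regression fit by gradient descent with an L2 penalty. It is a stand-in for illustration, not the scikit-learn implementation used in the experiment; the function names, learning rate, penalty strength, and toy data are all hypothetical, and coding initial as 1 and final as 0 is an arbitrary choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, l2=0.1, rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent.

    rows: list of feature vectors (standardized word frequencies).
    labels: 1 for an initial BOW, 0 for a final BOW (arbitrary coding).
    l2: strength of the penalty that shrinks weights toward zero.
    """
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        # Average gradient plus the L2 shrinkage term on each weight.
        weights = [w - rate * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias

# Toy data: three word features per BOW. The first word favors initial
# BOWs, the second favors final BOWs, the third is uninformative.
rows = [[1.0, -1.0, 0.1], [0.8, -0.9, -0.1],
        [-1.0, 1.0, 0.1], [-0.9, 0.8, -0.1]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(rows, labels)
print(weights)  # first weight positive, second negative, third near zero
```

The learned weights define the separating line: a new BOW falls on the initial side when its weighted sum is positive and on the final side when it is negative.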

Before proceeding, however, we may wish to know how well our statistical model had learned to separate these categories of BOWs in the first place. We can do this using a technique called Leave-One-Out Cross Validation. Basically, we set aside the novels belonging to a given author at training time, when the logistic regression learns where to draw its line. We then make a prediction for the initial and final BOWs (regardless of scale) for that particular author’s texts. By doing this for every author in the corpus, we can get a sense of its performance.5
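The author-wise hold-out can be sketched as a generator of training and test splits. This is a minimal illustration with a hypothetical function name and placeholder author labels; it only produces the index splits and leaves model fitting aside.

```python
from collections import defaultdict

def leave_one_author_out(authors):
    """Yield (author, train_indices, test_indices), one fold per author.

    authors: a list giving the author of each BOW, in corpus order.
    All BOWs by the held-out author go to the test set together.
    """
    by_author = defaultdict(list)
    for index, author in enumerate(authors):
        by_author[author].append(index)
    for author, test_indices in by_author.items():
        held_out = set(test_indices)
        train_indices = [i for i in range(len(authors)) if i not in held_out]
        yield author, train_indices, test_indices

# Placeholder labels: the first and third BOWs share an author.
authors = ["author_a", "author_b", "author_a", "author_c"]
folds = list(leave_one_author_out(authors))
print(folds[0])  # ('author_a', [1, 3], [0, 2])
```

Because both of an author's BOWs leave the training set at once, nothing the model learns about that author's habits can inform the prediction made for that author's texts.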

Such a method results in 87% accuracy.6 This is a good result, since it indicates that substantial and generalizable patterns have been found. Looking under the hood, we can learn a bit more about the patterns the model had identified. Among BOWs constructed from the sentence-scale, the model predicts initial and final classes with 99% accuracy; at the paragraph-scale, accuracy is 95%; and at the full-text-scale, it is 68%.7 The textual structure that we have modeled makes itself most felt in the unfolding of the sentences, followed by that of paragraphs, and only just noticeably in the move across halves of the novel. The grand unity of the text has lower resolution than its fine details.

We have now arrived at the point where we can test our hypothesis. The model had learned how to categorize BOWs as initial and final according to the method described above. We can now ask it to predict the categories of an entirely new set of texts: a control set of 400 previously unseen novels from the Chicago Text Lab corpus. These texts will not have been divided in half according to the protocols described above. Instead, we will select half of their words randomly.
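Constructing a control BOW from a random half of a text can be sketched as below. The function name is hypothetical, and sampling word positions without replacement is an assumption about how "half of their words" were selected.

```python
import random
from collections import Counter

def random_half_bow(tokens, seed=None):
    """Return a BOW built from a random half of a text's tokens.

    Assumption: tokens are sampled by position, without replacement,
    so repeated words can be drawn as often as they occur.
    """
    rng = random.Random(seed)
    chosen = rng.sample(tokens, len(tokens) // 2)
    return Counter(chosen)

tokens = "he was aware of these things this morning he savored them".split()
control_bow = random_half_bow(tokens, seed=0)
print(sum(control_bow.values()))  # 5
```

Since position in the text plays no role in the draw, such a BOW carries no built-in initial or final structure, which is what makes it usable as a control.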

To review, Brooks had claimed that the textual structures he had identified were universally present across modernity. If the model’s predictions for these new texts skew toward either initial or final, and especially if we find movement in one direction or the other over time, then we will have preliminary evidence to the contrary of Brooks’s claim. That is, we will have observed a shift in the degree or mode by which textual structure has been made legible linguistically. Do we find such evidence that this structure is changing over a century of novels? In fact, we do not.

Figure 1. Distribution of 400 control texts’ probabilities of initial-ness, by publication date, overlaid with best-fit line. Each point is a BOW drawn randomly from a control text.

The points in Figure 1 represent each control text, while their height indicates the probability that a given text is initial. (Subtract that value from 1 to get the probability it is final.) We find a good deal of variation in the predictions (indeed, we had expected to find such variation, since we had chosen words randomly from the text); however, the important finding is that this variation does not reflect a chronological pattern. The best-fit line through the points is flat, and the correlation between predictions and publication date is virtually zero (r² < 0.001).
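The best-fit line and r² can be computed with ordinary least squares. The sketch below uses a hypothetical function name and invented data points, not the experiment's predictions; it only shows what a flat line with no correlation looks like numerically.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (x, y) points.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With no variation in y, there is nothing for the line to explain.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared

# Invented data: predictions vary, but not with publication year.
years = [1900, 1925, 1950, 1975]
probabilities = [0.6, 0.4, 0.4, 0.6]
slope, intercept, r_squared = fit_line(years, probabilities)
print(slope, r_squared)  # both approximately zero
```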

This indicates that the kinds of words (and clusters of words) that had indexed initial-ness and final-ness remain in balance with one another over the course of twentieth-century fiction. It is true that further tests would need to be performed in order to increase our confidence in this finding.8 However, on the basis of this first experiment, we have tentative evidence that Brooks is correct. The formalism that underpins close reading has a provisional claim to empirical validity.

I would like to reiterate that these findings are far from the final word on operationalizing close reading. More empirical work must be done to validate these findings, and more theoretical work must be done to create more sophisticated models. Bearing that qualification in mind, I will take the opportunity to explore this particular model a bit further. Brooks had claimed that the textual structure that we have operationalized underpins semantic problems of irony, paradox, and ambiguity. Is it possible that this model can point us toward moments like these in a text?


I heartily thank Hoyt Long and Richard So for permission to use the Chicago Text Lab corpus. I would also like to thank Andrew Piper for generously making available the txtLAB_450 corpus.


1. This corpus spans the period 1880-2000 and is designed to reflect WorldCat library holdings of fiction by American authors across that period. Recent projects using this corpus include Piper, “Fictionality” and Underwood, “The Life Cycles of Genres.” For an exploration of the relationship between the Chicago Text Lab corpus and the HathiTrust Digital Library vis-à-vis representations of gender, see Underwood & Bamman, “The Gender Balance of Fiction, 1800-2007.”

2. See, for example: Jurafsky, Chahuneau, Routledge, & Smith, “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus;” Underwood, “The Life Cycles of Genres;” Underwood & Sellers, “The Longue Durée of Literary Prestige.”

3. For an extended theorization of these problems, using computer classification methods, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.”

4. Specifically, this experiment employs a regularized logistic regression as implemented in the Python package scikit-learn. Regularization is a technique that limits the effect that any individual feature is able to have on the model’s predictions. This increases our confidence that the model we develop is generalizable beyond its training corpus. It is also particularly important for literary text analysis, since each word in the corpus’s vocabulary may constitute a feature, which leads to a risk of overfitting the model. Regularization reduces this risk.

When performing regularized logistic regression for text analysis, there are two parameters that must be determined: the regularization constant and the feature set. Regarding the feature set, it is typical to use only the most common words in the corpus. The questions of how large/small to make the regularization and how many words to use can be approached empirically.

The specific values were chosen through a grid search over combinations of parameter values, using ten-fold cross validation on the training set (over authors). This grid search was not exhaustive but found a pair of values that lie within the neighborhood of the optimal pair. The values selected were C = 0.001 and the 3000 most frequent words.

Note also that prior to logistic regression, word frequencies in BOWs were normalized and transformed to standard units. Stop words were not included in the feature set.
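The preprocessing step can be sketched as two small functions. This is an illustration under assumptions: "normalized" is taken here to mean relative frequency within each BOW, the population standard deviation is used for standardization, and the function names and toy counts are hypothetical.

```python
from statistics import mean, pstdev

def relative_frequencies(bow, vocabulary):
    """Turn raw counts into one row of relative frequencies."""
    total = sum(bow.values())
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [bow.get(word, 0) / total for word in vocabulary]

def standardize_columns(rows):
    """Convert each column (one word across all BOWs) to z-scores."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    sds = [pstdev(column) for column in columns]
    return [
        [(value - m) / s if s else 0.0 for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical counts for two BOWs over a two-word vocabulary.
vocabulary = ["fine", "house"]
bows = [{"fine": 3, "house": 1}, {"fine": 1, "house": 3}]
rows = [relative_frequencies(bow, vocabulary) for bow in bows]
print(standardize_columns(rows))  # [[1.0, -1.0], [-1.0, 1.0]]
```

Standardizing puts every word on a common scale, so that a weight attached to a rare word and a weight attached to a common word can be compared directly.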

5. This method is described by Underwood and Sellers in “Longue Durée.” The rationale for setting aside all texts by a particular author, rather than single texts at a time, is that what we think of as authorial style may produce consistent word usages across texts. Our goal is to create an optimally generalizable model, which requires we prevent the “leak” of information from the training set to the test set.

6. This and all accuracies reported are F1-Scores. This value is generally considered more robust than a simple count of correct classifications, since it combines precision and recall, balancing true-positives against both false-positives and false-negatives.
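For reference, the F1-Score is the harmonic mean of precision and recall. A minimal sketch, with a hypothetical function name and invented counts:

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, from raw counts."""
    if true_pos == 0:
        return 0.0  # no correct positive predictions at all
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Invented counts: 90 correct positives, 10 false alarms, 20 misses.
print(f1_score(90, 10, 20))  # approximately 0.857
```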

7. An F1-Score of 99% is extraordinary in literary text analysis, and as such it should be met with increased skepticism. I have taken two preliminary steps in order to convince myself of its validity, but beyond these, I invite readers to experiment with their own texts to find whether these results are consistent across novels and literary corpora. The code has also been posted online for examination.

First, I performed an unsupervised analysis over the sentence BOWs. A quick PCA visualization indicates an almost total separation between sentence-initial and sentence-final BOWs.

Figure 2. Distribution of sentence-initial BOWs (blue) and sentence-final BOWs (red) in the third and fourth principal components of PCA. PCA was performed over sentence BOWs alone. The clusters of points are mostly separate; however, there is some noticeable overlap.

The two PCs that are visualized here account for just 3.5% of the variance in the matrix. As an aside, I would point out that these are not the first two PCs but the third and fourth (ranked by their explained variance). This suggests that the difference between initial and final BOWs is not even the most substantial pattern across them. Perhaps it makes sense that something like chronology of publication dates or genre would dominate. By his own account, Brooks sought to look past these in order to uncover structure.

Second, I performed the same analysis using a different corpus: the 150 English-language novels in the txtLAB450, a multilingual novel corpus distributed by McGill’s txtLab. Although only 50 novels were used for sentence-level modeling (compared to 400 from the Chicago corpus), sentence-level accuracy under Leave-One-Out Cross Validation was 98%. Paragraph-level accuracy dropped much further, while text-level accuracy remained about the same.

8. First and foremost, if we hope to test shifts over time, we will have to train on subsets of the corpus, corresponding to shorter chronological periods, and make predictions about other periods. This is an essential methodological point made in Underwood and Sellers’s “Longue Durée.” As such, we can only take the evidence here as preliminary.

In making his own historical argument, Brooks indicates that the methods used to read centuries of English literature were honed on poetry from the earliest (Early Modern) and latest (High Modern) periods. A training set drawn from the beginning and end years of the century should be the first such test. Ideally, one might use precisely the time periods he names, over a corpus of poetry.

Other important tests include building separate models for each scale of text individually and comparing these with the larger scale-simultaneous model. Preliminary tests on a smaller corpus had shown differences in predictive accuracy between these types of models, suggesting that they were identifying different patterns, which I took to license using the scale-simultaneous model. This would need to be repeated with the larger corpus.

We may also wish to tweak the model as it stands. For example, we have treated single-sentence paragraphs as full paragraphs. The motivation is to see how their words perform double duty at both scales, yet it is conceivable that we would wish to remove this redundancy.

Or we may wish to build a far more sophisticated model. This one is built on a binary logic of first and second halves, which is self-consciously naive, whereas further articulation of the texts may offer higher resolution. Perhaps an unsupervised learning method would be better since it is not required to find a predetermined set of patterns.

And if one wished to contradict the claims I have made here, one would do well to examine the text-level of the novel. The accuracy of this model is low enough at that scale that we can be certain there are other interesting phenomena at work.

The most important point to be made here is not to claim that we have settled our research question, but to see that our preliminary findings direct us toward an entire program of research.


Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich. 1975.

Jurafsky, Dan, et al. “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus.” Journal of Cultural Analytics. 2016.

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Pedregosa, F, et al. “Scikit-learn: Machine Learning in Python.” JMLR 12 (2011). 2825-2830.

Underwood, Ted. “The Life Cycles of Genres.” Journal of Cultural Analytics. 2016.

Underwood, Ted & David Bamman. “The Gender Balance of Fiction, 1800-2007.” The Stone and the Shell. 2016. Accessed March 2, 2017.

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.

Operationalizing The Urn: Part 1

This post is the first in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. This post lays out the rationale and stakes for such a method of reading. The second post will perform that distant reading in order to test Brooks’s literary historical claims, and the third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Meaning & Structure

Cleanth Brooks’s The Well Wrought Urn lays out a program of close reading that continues to enjoy great purchase in literary study more than seventy years on. Each of ten chapters performs a virtuosic reading of an individual, canonic poem, while the eleventh chapter steps back to discuss findings and theorize methods. That theorization is familiar today: to paraphrase is heresy; literary interpretation must be highly sensitive to irony, ambiguity, and paradox; and aesthetic texts are taken as “total patterns” or “unities” that give structure to heterogeneous materials. These are, of course, the tenets of close reading.

For Brooks, it is precisely the structuring of the subject matter which produces a text’s meaning. On the relationship between these, he claims, “The nature of the material sets the problem to be solved, and the solution is the ordering of the material” (194). A poem is not simply a meditation on a theme, but unfolds as a sequence of ideas, emotions, and images. This framework for understanding the textual object is partly reflected by his method of reading texts dramatically. The mode of reading Brooks advocates requires attention to the process of qualification and revision that is brought by each new phrase and stanza.

This textual structure takes the form of a hierarchy of “resolved tensions,” which produce irony. It is telling that Brooks initially casts textual tension and resolution as a synchronic, spatial pattern, akin to architecture or painting. We may zoom in to observe the text’s images in detail to observe how one particular phrase qualifies the previous one, or we may zoom out to reveal how so many micro-tensions are arranged in relation to one another. To perform this kind of reading, “the relation of each item to the whole context is crucial” (207). Often, irony is a way of accounting for the double meanings that context produces, as it traverses multiple scales of the text.

More recently, distant readers have taken up different scales of textual structure as sites of interpretation. The early LitLab pamphlet, “Style at the Scale of the Sentence” by Sarah Allison et al, offers a taxonomy of the literary sentence observed in thousands of novels. At its root, the taxonomy is characterized by the ordering of independent and dependent clauses: what comes first and how it is qualified. A later pamphlet, “On Paragraphs. Scale, Themes, and Narrative Form” by Mark Algee-Hewitt et al, takes up the paragraph as the novel’s structural middle-scale, where themes are constructed and interpenetrate. Moving to the highest structural level, Andrew Piper’s article “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel” examines how the second half of a novel constitutes a transformation away from its first half.1

Each of these distant readings offers a theorization of its scale of analysis: sentence, paragraph, novel. In the first example, Allison et al ask how it is possible that the relationship between the first half of a given sentence and the second encodes so much information about the larger narrative in which it appears. Indeed, all of the instances above take narrative — the grand unity of the text — as an open question, which they attempt to answer by way of patterns revealed at a particular scale. Brooks offers us a challenge, then: to read at these multiple scales simultaneously.

This simultaneity directs us to an ontological question behind the distant readings mentioned above. (Here, I use ontology especially in its taxonomic sense from information science.) Despite the different scales examined, each of those studies takes the word as a baseline feature for representing a text to the computer. That is, all studies take the word as the site in which narrative patterns have been encoded. Yet, paradoxically, each study decodes a pattern that only becomes visible at a particular scale. For example, the first study interprets patterns of words’ organization within sentence-level boundaries. This is not to imply that the models developed in each of those studies are somehow incomplete — after all, each deals with research questions whose terms are properly defined by their scale. However, the fact that multiple studies have found the word-feature to do differently-scaled work indicates an understanding of its ontological plurality.

Although ontology will not be taken up as an explicit question in this blog series, it haunts the distant readings performed in Parts 2 & 3. In brief, when texts are represented to the computer, it will be shown all three scales of groupings at once. (Each text contributes at only one scale, to prevent overfitting.) Reading across ontological difference is partly motivated by Alan Liu’s article, “N+1: A Plea for Cross Domain Data in the Digital Humanities.” There, he calls for distant readings across types of cultural objects, in order to produce unexpected encounters which render them ontologically alien. Liu’s goal is to unsettle familiar disciplinary divisions, so, by that metric, this blog series is the tamest version of such a project. That said, my contention is that irony partly registers the alienness of the cross-scale encounter at the site of the word. Close reading is a method for navigating this all-too-familiar alienness.

While close reading is characterized by its final treatment of texts as closed, organic unities, it is important to remember that this method rests on a fundamentally intertextual and historical claim. Describing the selection of texts he dramatically reads in The Well Wrought Urn, Brooks claims to have represented all important periods “since Shakespeare” and that poems have been chosen in order to demonstrate what they have in common. Rather than content or subject matter, poems from the Early Modern to High Modernism share the structure of meaning described above. Irony would seem to be an invariant feature of modernity.

We can finally raise this as a set of questions. If we perform a distant reading of textual structure at multiple, simultaneous levels, would we find any changes to that structure at the scale of the century? Could such a structure show some of the multiple meanings that traverse individual words in a text? More suggestively: would that be irony?


1. Related work in Schmidt’s “Plot Arceology: a vector-space model of narrative structure.” Similar to the other studies mentioned, Schmidt takes narrative as a driving concern, following the movement of each text through plot space. Methodologically, that article segments texts (in this case, film and TV scripts) into 2-4 minute chunks. This mode of articulation does not square easily with any particular level of novelistic scale, although it speaks to some of the issues of segmentation raised in “On Paragraphs.”


Algee-Hewitt, Mark, et al. “On Paragraphs. Scale, Themes, and Narrative Form.” Literary Lab Pamphlet 10. 2015.

Allison, Sarah, et al. “Style at the Scale of the Sentence.” Literary Lab Pamphlet 5. 2013.

Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York : Harcourt, Brace, Jovanovich. 1975.

Liu, Alan. “N+1: A Plea for Cross Domain Data in the Digital Humanities.” Debates in the Digital Humanities 2016. eds, Matthew K. Gold and Lauren F. Klein. 2016. Accessed March 2, 2017.

Piper, Andrew. “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel.” New Literary History, 46:1 (2015). 63–98.

Schmidt, Ben. “Plot arceology: A vector-space model of narrative structure,” 2015 IEEE International Conference on Big Data (Big Data). Santa Clara, CA. 2015. 1667-1672.

What We Talk About When We Talk About Digital Humanities

The first day of Alan Liu’s Introduction to the Digital Humanities seminar opens with a provocation. At one end of the projection screen is the word DIGITAL and at the other HUMAN. Within the space they circumscribe, we organize and re-organize familiar terms from media studies: media, communication, information, and technology. What happens to these terms when they are modified by DIGITAL or HUMAN? What happens when they modify one another in the presence of those master terms? There are endless iterations of these questions but one effect is clear: the spaces of overlap, contradiction, and possibility that are encoded in the term Digital Humanities.

Pushing off from that exercise, this blog post puts Liu’s question to an extant body of DH scholarship: How does the scholarly discourse of DH organize these media theoretic terms? Indeed, answering that question may shed insight on the fraught relationship between these fields. We can also ask a more fundamental question as well. To what extent does DH discourse differentiate between DIGITAL and HUMAN? Are they the primary framing terms?

Provisional answers to these latter questions could be offered through distant reading of scholarship in the digital humanities. This would give us breadth of scope across time, place, and scholarly commitments. Choosing this approach changes the question we need to ask first: What texts and methods could operationalize the very framework we had employed in the classroom?

For a corpus of texts, this blog post turns to Matthew K. Gold’s Debates in the Digital Humanities (2012). That edited volume has been an important piece of scholarship precisely because it collected essays from a great number of scholars representing just as many perspectives on what DH is, can do, and would become. Its essays (well… articles, keynote speeches, blog posts, and other genres of text) especially deal with problems that were discussed in the period 2008-2011. These include the big tent, tool/archive, and cultural criticism debates, among others, that continue to play out in 2017.

Token Frequency
digital 2729
humanities 2399
work 740
new 691
university 429
research 412
media 373
data 328
social 300
dh 291

Table 1. Top 10 Tokens in Matthew K. Gold’s Debates in the Digital Humanities (2012)
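
A tally like the one in Table 1 takes only a few lines of Python. The sketch below is a minimal illustration, not the code behind the table: the regex tokenizer, the tiny stop list, and the sample sentence are all assumptions made for the example, since the post does not spell out its preprocessing.

```python
import re
from collections import Counter

# Hypothetical stop list; the post does not say which function words were dropped.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "that"}

def top_tokens(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent tokens in text, ignoring stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Invented sample text, standing in for the full volume.
sample = "The digital humanities is a field. Digital work in the humanities is new work."
print(top_tokens(sample, n=3))
```

Changing the stop list or the tokenizing pattern changes the ranking, which is one reason these choices count as interpretive decisions.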

The questions we had been asking in class dealt with the relationships among keywords and especially the ways that they contextualize one another. As an operation, we need some method that will look at each word in a text and collect the words that constitute its context. (q from above: How do these words modify one another?) With these mappings of keyword-to-context in hand, we need another method that will identify which are the most important context words overall and which will separate out their spheres of influence. (q from above: How do DIGITAL and HUMAN overlap and contradict one another? Are DIGITAL and HUMAN the most important context words?)

For this brief study, the set of operations ultimately employed was designed to be the shortest line from A to B. In order to map words to their contexts, the method simply iterated through the text, looking at each word in sequence and cumulatively tallying the three words to the right and left of it.1 This produced a square matrix in which each row is a given unique word in Debates and each column records the number of times that another word appeared within the given window.2
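
The windowed tally can be sketched as follows. This is a minimal reconstruction from the description above rather than the code used for the post: the function names, the toy sentence, and the choice to use the same vocabulary for rows and columns are assumptions for illustration.

```python
from collections import Counter, defaultdict

def context_counts(tokens, keywords, window=3):
    """Tally, for each keyword, the words within `window` positions on either side."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word not in keywords:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        contexts[word].update(left + right)
    return contexts

def to_matrix(contexts, vocabulary):
    """Lay the tallies out as rows (keywords) by columns (context words)."""
    return [[contexts[k][c] for c in vocabulary] for k in vocabulary]

# Toy token stream; the post uses the 300 most common tokens as keywords.
tokens = "digital humanities work is new work in digital media".split()
vocabulary = [w for w, _ in Counter(tokens).most_common(3)]
contexts = context_counts(tokens, set(vocabulary))
matrix = to_matrix(contexts, vocabulary)

for keyword, row in zip(vocabulary, matrix):
    print(keyword, row)
```

Both parameters, the vocabulary size and the window width, are set by hand, and each shapes what counts as a context.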

DTM that records keywords (rows) and their contexts (columns)

Table 2. Selection from Document-Term Matrix, demonstrating relationship between rows and columns. For example, the context word “2011” appears within a 3-word window of the keyword “association” on three separate occasions in Debates in the Digital Humanities. This likely refers to the 2011 Modern Language Association conference, an annual conference for literature scholars that is attended by many digital humanists.

Principal Component Analysis was then used to identify patterns in that matrix. Without going deeply into the details of PCA, the method looks for variables that tend to covary with one another (related to correlation). In theory, PCA can go through a matrix and identify every distinct covariance (This cluster of context words tends to appear with one another, and this other cluster appears with one another, etc.). In practice, researchers typically base their analyses only on the principal components (in this case, context-word clusters) that account for the largest amounts of variance, since these are the most prominent and least subject to noise.
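
For readers who want to see the mechanics, here is a minimal sketch of PCA using NumPy's singular value decomposition. It is an illustration under stated assumptions, not the post's own code: the function name and the toy matrix are invented, and columns are centered but deliberately not rescaled.

```python
import numpy as np

def pca(matrix, n_components=2):
    """Project rows onto the top principal components of the column space."""
    X = np.asarray(matrix, dtype=float)
    X_centered = X - X.mean(axis=0)           # center each column; no rescaling
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]            # weight of each context word
    scores = X_centered @ components.T        # coordinates of each keyword
    explained = (S ** 2) / np.sum(S ** 2)     # share of variance per component
    return scores, components, explained[:n_components]

# Toy keyword-by-context counts (rows are keywords, columns are context words).
toy = [[0, 2, 1], [2, 2, 1], [1, 1, 0], [3, 0, 1]]
scores, components, explained = pca(toy)
print(scores)
print(explained)
```

Plotting the first two columns of `scores` gives the keyword positions, while the columns of `components` give the direction in which each context word pulls.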

Figure 1. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities. (Click for larger image.)

The above visualization was produced using the first two principal components of the context space and projecting the keywords into it. The red arrows represent the loadings, or important context words, and the blue dots represent the keywords which they contextualize. Blue dots that appear near one another can be thought to have relatively similar contexts in Debates.

What we find is that digital and humanities are by far the two most prominent context words. Moreover, they are nearly perpendicular to one another, which means that they constitute very different kinds of contexts. Alan’s provocation turns out to be entirely well-founded in the literature under this set of operations in the sense that digital and humanities are doing categorically different intellectual work. (Are we surprised that a human close reader and scholar in the field should find the very pattern revealed by distant reading?)

Granted, Alan’s further provocation is to conceive of humanities as its root human, which is not the case in the discourse represented here. This lacuna in the 2012 edition of Debates sets the stage for Kim Gallon’s intervention in the 2016 edition, “Making a Case for the Black Digital Humanities.” That article articulates the human as a goal for digital scholarship under social conditions where raced bodies are denied access to “full human status” (Weheliye, qtd in Gallon). In the earlier edition, then, humanities would seem to be doing a different kind of work.

We can begin to outline the intellectual work that is being done by humanities and digital in 2012 by hanging at this bird’s-eye-view for a few moments longer. There are roughly three segments of context-space as separated by the loading arrows: the area to the upper-left of humanities, that below digital, and the space between these.

The words that are contextualized primarily by humanities describe a set of institutional actors: NEH, arts, sciences, colleges, disciplines, scholars, faculty, departments, as well as that previous institutional configuration “humanities computing.” The words contextualized primarily by digital are especially humanities, humanists, and humanist. (This is, after all, the name of the field they are seeking to articulate.) Further down, however, we find fields of research and methods: media, tools, technologies, pedagogy, publishing, learning, archives, resources.

If humanities had described a set of actors, and digital had described a set of research fields, then the space of their overlap is best accounted by one of its prominent keywords, doing. Other words contextualized by both humanities and digital include: centers, research, community, projects, scholarship. These are the things that we, as digital humanists, do.

Returning to our initial research question, it appears that the media theory terms media and technology are prominently conceived as digital in this discourse, whereas information and communication are not pulled strongly toward either context. This leads us to a new set of questions: What does it mean, within this discourse, for the former terms to be conceived as digital? What lacuna exists that neither of the latter terms is conceived digitally nor humanistically?

The answers to these questions call for a turn to close reading.

Postscript: Debates without the Digital Humanities

During the pre-processing, I made a heavy-handed and highly contestable decision. When observing context words, I omitted those appearing in the bi-gram “new york.” That is, I have treated that word pair as noise rather than signal, and the strength of its presence as a distortion of the scholarly discourse.

The reasoning for such a decision is that it may have been an artifact of method. I have taken a unigram approach to the text, such that the new of “New York” is treated the same as in “new media” or “new forms of research.” At the same time, the quick-and-dirty text ingestion had pulled in footnotes and bibliographies along with the bodies of the essays. This also partly explains why the “new” vector acts as context for dots like “university” and “press” as well. (These words continue to cluster near “new” in Figure 1 but much less visibly or strongly.)
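
One way to carry out this kind of omission is to drop both halves of the word pair from the token stream before any contexts are tallied. The sketch below is one possible reading of the step, not the post's code; the function name and sample sentence are hypothetical.

```python
def drop_bigram(tokens, bigram=("new", "york")):
    """Return tokens with every occurrence of the given word pair removed."""
    kept = []
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == bigram:
            i += 2                  # skip both halves of the pair
        else:
            kept.append(tokens[i])
            i += 1
    return kept

# Invented example: "new" survives except where it belongs to "new york".
tokens = "new media scholars in new york use new tools".split()
print(drop_bigram(tokens))
```

Note that removing the pair also brings its former neighbors closer together inside the context window, which is an assumption of this sketch; an alternative would be to keep the positions and simply skip the pair when counting.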

Figure 2. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities, where tokens belonging to the bi-gram “new york” have been included during pre-processing. (Click for larger image.)

If we treat “new york” as textual signal, we may be inclined to draw a few further conclusions. First, as the film cliche goes, “The city is almost another character in the movie.” New York is a synecdoche for an academic institutional configuration that is both experimental and public facing, since the city is its geographic location. Second, the bi-grams “humanities computing” and “digital humanities” are as firmly entrenched in this comparatively new discourse as the largest city in the United States (the nationality of many but not all of the scholars in the volume), which offers a metric for the consistency of their usage.

But we can go in the other direction as well.

As Liu has suggested in his scholarly writing, distant readers may find the practice of “glitching” their texts revealing of institutional and social commitments that animate these. I take one important example of this strategy to be the counterfactual, as has been used by literature scholars in social network analysis. In a sense, this post has given primacy to a glitched/counterfactual version of Debates — from which “new york” has been omitted — and we have begun to recover the text’s conditions of production by moving between versions of the text.

I will close, however, with a final question that results from a further counterfactual. Let’s omit a second bi-gram: “digital humanities.” What do we talk about when we don’t talk about digital humanities?

Figure 3. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities, where tokens belonging to the bi-gram “digital humanities” have been excluded during pre-processing. (Click for larger image.)


1. This context-accumulation method is based on one that was developed by Richard So and myself for our forthcoming article “Whiteness: A Computational Literary History.” The interpretive questions in that article primarily deal with semantics and differences in usage, and therefore the keyword-context matrix is passed through a different set of operations than those seen here. However, the basic goal is the same: to observe the relationships between words that are mediated by their actual usage in context.

Note that two parameters must be used in this method: a minimum frequency to include a token as a keyword and a window-size in which context words are observed. In this case, keywords were considered the 300 most common tokens in the text, since our least common keyword of interest “communication” was about the 270th most common token. Similarly, we would hope to observe conjunctions of our media theoretical terms in the presence of either digital or human, so we give these a relatively wide berth with a three-word window on either side.

2. This matrix is then normalized using a Laplace smooth over each row (as opposed to the more typical method of dividing by the row’s sum). In essence, this smoothing asks about the distance of a keyword’s observed context from a distribution where every context word had been equally likely. This minimizes the influence of keywords that appear comparatively few times and increases our confidence that changes to the corpus will not have a great impact on our findings.

This blog post, however, does not transform column values into standard units. Although this is a standard method when passing data into PCA, it would have the effect of rendering each context word equally influential in our model, eliminating information regarding the strength of the contextual relationships we hope to observe. If we were interested in semantics on the other hand, transformation to standard units would work toward that goal.
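
The row normalization described in this note can be sketched as add-one (Laplace) smoothing. This is a minimal illustration; the smoothing constant of 1 and the toy rows are assumptions, since the note does not give the exact values used.

```python
def laplace_smooth(row, alpha=1.0):
    """Add-alpha smoothing: (count + alpha) / (row total + alpha * row length)."""
    total = sum(row) + alpha * len(row)
    return [(count + alpha) / total for count in row]

sparse_row = [1, 0, 0]      # a keyword observed once
dense_row = [100, 0, 0]     # a keyword observed a hundred times

print(laplace_smooth(sparse_row))
print(laplace_smooth(dense_row))
```

Dividing by the row sum alone would map both rows to the same values. With smoothing, the rarely seen keyword stays close to the uniform distribution, while the frequently seen one moves far from it.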

Update Feb 16, 2017: Code supporting this blog post is available on Github.

Reading Distant Readings

This post offers a brief reflection on the previous three on distant reading, topic modeling, and natural language processing. These were originally posted to the Digital Humanities at Berkeley blog.

When I began writing a short series of blog posts for the Digital Humanities at Berkeley, the task had appeared straightforward: answer a few simple questions for people who were new to DH and curious. Why do distant reading? Why use popular tools like mallet or NLTK? In particular, I would emphasize how these methods had been implemented in existing research because, frankly, it is really hard to imagine what interpretive problems computers can even remotely begin to address. This was the basic format of the posts, but as I finished the last one, it became clear that the posts themselves were a study in contrasts. Teasing out those differences suggests a general model for distant reading.

Whereas the first post was designed as a general introduction to the field, the latter two had been organized around individual tools. Their motivations were something like: “Topic modeling is popular. The NLTK book offers a good introduction to Python.” More pedagogical than theoretical. However, digging into the research for each tool unexpectedly revealed that the problems NLTK and mallet sought to address were nearly orthogonal. It wasn’t simply that they each addressed different problems, but that they addressed different categories of problems.

Perhaps the place where that categorical difference was thrown into starkest relief was Matt Jockers’s note on part-of-speech tags and topic modeling, which was examined in the post on NLTK. The thrust of his chapter’s argument had been that topic modeling is a useful way to get at literary theme. However, in a telling footnote, Jockers makes the observation that the topics produced from his set of novels looked very different when he restricted the texts to their nouns alone versus including all words. As he found, the noun-only topics seemed to get closer to literary theoretical treatments of theme. This enabled him to proceed answering his research questions, but the methodological point itself was profound: modifying the way he processed his texts into the topic model performed interpretively useful work — even while using the same basic statistical model.

The post on topic modeling itself made this kind of argument implicitly, but along even a third axis. Many of the research projects described there use a similar natural language processing workflow (tokenization, stop word removal) and a similar statistical model (the mallet implementation of LDA or a close relative). The primary difference across them is the corpus under observation. A newspaper corpus makes newspaper topics, a novel corpus makes novel topics, etc. Selecting one’s corpus is then a major interpretive move as well, separate from either natural language processing or statistical modeling.

Of course, in any discussion of topic modeling, the question consistently arises of how even to interpret the topics once they had been produced. What actually is the pattern they identify in the texts? Nearly all projects arrived at a slightly different answer.

I’ll move quickly to the punchline. There seem to be four major interpretive moments that can be found across the board in these distant readings: corpus construction, natural language processing, statistical modeling, and linguistic pattern.

The first three are a formalization of one’s research question, in the sense that they capture aspects of an interpretive problem. For example, returning to the introductory post, Ted Underwood and Jordan Sellers ask the question “How quickly do literary standards change?” which we may recast in a naive fashion: “How well can prestigious vs non-prestigious poetry (corpus) be distinguished over time (model) on the basis of diction (natural language features)?” Answering this formal question produces a measurement of a linguistic pattern. In Underwood and Sellers’s case, this is a list of percentage values representing how likely each text is to be prestigious. That output then requires its own interpretation if any substantial claim is to be made.

(I described my rephrasing of their research question as “naive” in the sense that it had divorced the output from what was interpretively at stake. The authors’ discursive account makes this clear.)

Figure. Distant Reading Model.

In terms of workflow, all of these interpretive moments occur sequentially, yet are interrelated. The research question directly informs decisions regarding corpus construction, natural language processing, and the statistical model, while each of the three passes into the next. All of these serve to identify a linguistic pattern, which, if the middle three have been well chosen, allows one to answer that initial question. To illustrate this, I offer the above visualization from Laura K. Nelson’s and my recent workshop on distant reading (literature)/text analysis (social science) at the Digital Humanities at Berkeley Summer Institute.

Although these interpretive moments are designed to account for the particular distant readings which I have written about, there is perhaps even a more general version of this model as well. Replace natural language processing with feature representation and linguistic pattern with simply pattern. In this way, we may also account for sound or image based distant readings alongside those of text.

My aim here is to articulate the process of distant reading, but the more important point is that this is necessarily an interpretive process at every step. Which texts one selects to observe, how one transforms the text into something machine-interpretable, what model one uses to account for a phenomenon of interest: These decisions encode our beliefs about the texts. Perhaps we believe that literary production is organized around novelistic themes or cultural capital. Perhaps those beliefs bear out as a pattern across texts. Or perhaps not — which is potentially just as interesting.

Distant reading has never meant a cold machine evacuating life from literature. It is neither a Faustian bargain, nor is it hopelessly naive. It is just one segment in a slightly enlarged hermeneutic circle.

I continue to believe, however, that computers are basically magic.

A Humanist Apologetic of Natural Language Processing; or A New Introduction to NLTK

This post originally appeared on the Digital Humanities at Berkeley blog. It is the second in what became an informal series. Images have been included in the body of this post, which we were unable to do originally. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

Computer reading can feel like a Faustian bargain. Sure, we can learn about linguistic patterns in literary texts, but it comes at the expense of their richness. At bottom, the computer simply doesn’t know what or how words mean. Instead, it merely recognizes strings of characters and tallies them. Statistical models then try to identify relationships among the tallies. How could this begin to capture anything like irony or affect or subjectivity that we take as our entry point to interpretive study?

I have framed computer reading in this way before – simple counting and statistics – however I should apologize for misleading anyone, since that account gives the computer far too much credit. It might imply that the computer has an easy way to recognize useful strings of characters. (Or to know which statistical models to use for pattern-finding!) Let me be clear: the computer does not even know what constitutes a word or any linguistically meaningful element without direct instruction from a human programmer.

In a sense, this exacerbates the problem the computer had initially posed. The signifier is not merely divorced from the signified but it is not even understood to signify at all. The presence of an aesthetic, interpretable object is entirely unknown to the computer.

Teasing out the depth of the computer’s naivety to language, however, highlights the opportunity for humanists to use computers in research. Simply put, the computer needs a human to tell it what language consists of – that is, which objects to count. Following the description I’ve given so far, one might be inclined to start by telling the computer how to find the boundaries between words and treat those as individual units. On the other hand, any humanist can tell you that equal attention to each word as a separable unit is not the only way to traverse the language of a text.

Generating instructions for how a computer should read requires us to make many decisions about how language should be handled. Some decisions are intuitive, others arbitrary; some have unexpected consequences. Within the messiness of computer reading, we find ourselves encoding an interpretation. What do we take to be the salient features of language in the text? For that matter, how do we generally guide our attention across language when we perform humanistic research?

The instructions we give the computer are part of a field referred to as natural language processing, or NLP. In the parlance, natural languages are ones spoken by humans, as opposed to the formal languages of computers. Most broadly, NLP might be thought of as the translation from one language type to another. In practice, it consists of a set of techniques and conventions that linguists, computer scientists, and now humanists use in the service of that translation.

For the remainder of this blog post, I will offer an introduction to the Natural Language Toolkit, which is a suite of NLP tools available for the programming language Python. Each section will focus on a particular tool or resource in NLTK and connect it to an interpretive research question. The implicit understanding is that NLP is not a set of tools that exists in isolation but necessarily performs part of the work of textual interpretation.

I am highlighting NLTK for several reasons, not the least of which is the free, online textbook describing their implementation (with exercises for practice!). That textbook doubles as a general introduction to Python and assumes no prior knowledge of programming.[1] Beyond pedagogical motivation, however, NLTK contains both tools that are implemented in a great number of digital humanistic projects and others that have not yet been fully explored for their interpretive power.

from nltk import word_tokenize

As described above, the basic entry point into NLP is simply to take a text and split it into a series of words, or tokens. In fact, this can be a tricky task. Even though most words are divided by spaces or line breaks there are many exceptions, especially involving punctuation. Fortunately, NLTK’s tokenizing function, word_tokenize(), is relatively clever about finding word boundaries. One simply places a text of interest inside the parentheses and the function returns an ordered list of the words it had contained.

As it turns out, simply knowing which words appear in a text encodes a great deal of information about higher-order textual features, such as genre. The technique of dividing a text into tokens is so common it would be difficult to offer a representative example, but one might look at Hoyt Long and Richard So’s study of the haiku in modernist poetry, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” They use computational methods to learn the genre’s distinctive vocabulary and think about its dissemination across the literary field.


“A sample list of probability measures generated from a single classification test. In this instance, the word sky was 5.7 times more likely to be associated with nonhaiku (not-haiku) than with haiku. Conversely, the word snow was 3.7 times more likely to be associated with haiku than with nonhaiku (not-haiku).” Long, So 236; Figure 8

I would point out here that tokenization itself requires the programmer to make interpretive decisions. For example, by default when word_tokenize() sees the word “wouldn’t” in a text, it will produce two separate tokens “would” and “n’t”. If one’s research question were to examine ideas of negation in a text, it might serve one well to tokenize in this way, since it would handle all negative contractions as instances of the same phenomenon. That is, “n’t” would be drawn from “shouldn’t” and “hadn’t” as well. On the other hand, these default interpretive assumptions might adversely affect one’s research into a corpus, so NLTK also offers alternative tokenizers that leave contractions intact.
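The effect of this choice can be seen by setting two tokenizers side by side. The sketch below is a minimal illustration rather than a recommendation. It assumes that NLTK’s TreebankWordTokenizer, a rule-based tokenizer that follows the same Penn Treebank convention for contractions and needs no model download, can stand in for word_tokenize(); the example sentence is invented.

```python
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer

# An invented example sentence containing two negative contractions.
sentence = "She wouldn't say, and he hadn't asked."

# Treebank-style rules split each negative contraction into two tokens.
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

# Splitting on whitespace alone leaves contractions (and punctuation) attached.
whitespace_tokens = WhitespaceTokenizer().tokenize(sentence)

# Under the first interpretation, negation becomes a countable unit.
negation_count = treebank_tokens.count("n't")

print(treebank_tokens)
print(whitespace_tokens)
print(negation_count)
```

Which list is the “right” one depends on the research question: the first makes negation visible as its own token, while the second preserves each contraction as a single word form.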

NLTK similarly offers a sent_tokenize() function, if one wishes to divide the text along sentence boundaries. Segmentation at this level underpins the stylistic study by Sarah Allison et al in their pamphlet, “Style at the Scale of the Sentence.”

from nltk.stem import *

When tokens consist of individual words, they contain a semantic meaning, but in most natural languages they carry grammatical inflection as well. For example, love, loves, lovable, and lovely all have the same root word while the ending maps it into a grammatical position. If we wish to shed grammar in order to focus on semantics, there are two major strategies.

The simpler and more flexible method is to artificially re-construct a root word – the word’s stem – by removing common endings. A very popular tool for this is the SnowballStemmer(). For example, love, loves, and lovely are all reduced to a single shared stem. The stem itself is not always a complete word but captures instances of many forms. Snowball is especially powerful in that it is designed to work for many Western languages.
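Because stems depend on the particular algorithm, it is worth inspecting what the stemmer actually returns before building an analysis on it. The following is a minimal sketch that assumes the English Snowball stemmer and uses the example words from above; it prints the stems rather than presuming them.

```python
from nltk.stem import SnowballStemmer

# The stemmer must be told which language's rules to apply.
stemmer = SnowballStemmer("english")

words = ["love", "loves", "lovable", "lovely"]

# Map each word form to the stem the algorithm produces for it.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)

# The languages for which Snowball rules are bundled with NLTK.
print(SnowballStemmer.languages)
```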

If we wish to keep our output in the natural language at hand, we may prefer a more sophisticated but less universally applicable technique that identifies a word’s lemma, essentially its dictionary form. For English nouns, that typically means changing plurals to singular; for verbs it means the infinitive. In NLTK, this is done with WordNetLemmatizer(). Unless told otherwise, that function assumes all words are nouns, and as of now, it is limited to English. (This is just one application of WordNet itself, which I will describe in greater detail below.)

As it happens, Long and So performed lemmatization of nouns during the pre-processing in their study above. The research questions they were asking revolved around vocabulary and imagery, so it proved expedient to collapse, for instance, skies and sky into the same form.
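Collapsing noun forms in this way might look like the sketch below. It is an illustration only, not Long and So’s actual pre-processing; the helper name lemmatize_all is hypothetical, and the function assumes the WordNet data has already been downloaded.

```python
from nltk.stem import WordNetLemmatizer


def lemmatize_all(words, pos="n"):
    """Return the dictionary form of each word (hypothetical helper).

    pos is the part of speech to assume: "n" for nouns (the lemmatizer's
    default) or "v" for verbs. Requires the WordNet data, which can be
    fetched once with nltk.download("wordnet").
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]
```

Called as lemmatize_all(["skies", "sky"]), this maps both forms to sky, so that they are counted as the same vocabulary item; passing pos="v" applies the verb rules instead.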

from nltk import pos_tag

As trained readers, we know that language partly operates according to (or sometimes against!) abstract, underlying structures. For as many cases where we may wish to remove grammatical information from our text by lemmatizing, we can imagine others for which it is essential. Identifying a word’s part of speech, or tagging it, is an extremely sophisticated task that remains an open problem in the NLP world. At this point, state-of-the-art taggers have somewhere in the neighborhood of 98% accuracy. (Be warned that accuracy is typically gauged on non-literary texts.)

NLTK’s default tagger, pos_tag(), has an accuracy just shy of that with the trade-off that it is comparatively fast. Simply place a list of tokens between its parentheses and it returns a new list where each item is the original word alongside its predicted part of speech.

This kind of tool might be used in conjunction with general tokenization. For example, Matt Jockers’s exploration of theme in Macroanalysis relied on word tokens but specifically those the computer had identified as nouns. Doing so, he is sensitive to the interpretive problems this selection raises. Dropping adjectives from his analysis, he reports, loses information about sentiment. “I must offer the caveat […] that the noun-based approach used here is specific to the type of thematic results I wish to derive; I do not suggest this as a blanket approach” (131-133). Part-of-speech tags are used consciously to direct the computer’s attention toward features of the text that are salient to Jockers’s particular research question.
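Selecting tokens by part of speech can be sketched as follows. This is not Jockers’s own pipeline, only an illustration of the idea using NLTK’s default tagger, whose Penn Treebank noun tags all begin with “NN”. The helper names and the hand-tagged sample are hypothetical, and calling nouns_in() assumes the tagger model has been downloaded.

```python
from nltk import pos_tag


def noun_tokens(tagged_tokens):
    """Keep only the words whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


def nouns_in(tokens):
    """Tag a list of tokens with NLTK's default tagger, then keep the nouns."""
    return noun_tokens(pos_tag(tokens))


# A hand-tagged sample in the (word, tag) format that pos_tag() returns.
sample = [
    ("The", "DT"), ("judge", "NN"), ("sentenced", "VBD"),
    ("the", "DT"), ("thieves", "NNS"),
]

print(noun_tokens(sample))
```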


Thematically related nouns on the subject of “Crime and Justice;”
from Jockers’s blog post on methods

Recently, researchers at Stanford’s Literary Lab have used the part-of-speech tags themselves as objects for measurement, since they offer a strategy to abstract from the particulars of a given text while capturing something about the mode of its writing. In the pamphlet “Canon/Archive: Large-scale Dynamics in the Literary Field,” Mark Algee-Hewitt counts part-of-speech-tag pairs to think about different “categories” of stylistic repetition (7-8). As it happens, canonic literary texts have a preference for repetitions that include function words like conjunctions and prepositions, whereas ones from a broader, non-canonic archive lean heavily on proper nouns.
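Counting pairs of adjacent tags takes only a few lines once a text has been tagged. The sketch below is a simplified illustration of the general idea, not the pamphlet’s actual measurement of repetition; the function name and the hand-tagged sample are hypothetical.

```python
from collections import Counter


def tag_pair_counts(tagged_tokens):
    """Count each pair of adjacent part-of-speech tags, ignoring the words."""
    tags = [tag for _, tag in tagged_tokens]
    # zip() pairs every tag with the tag that immediately follows it.
    return Counter(zip(tags, tags[1:]))


# A hand-tagged sample in the (word, tag) format that pos_tag() returns.
sample = [
    ("The", "DT"), ("war", "NN"), ("and", "CC"),
    ("the", "DT"), ("peace", "NN"),
]

counts = tag_pair_counts(sample)
print(counts.most_common())
```

Because only the tags are counted, two passages with entirely different vocabularies can produce the same profile, which is precisely the abstraction from particulars described above.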

from nltk import ne_chunk

Among parts of speech, names and proper nouns are of particular significance, since they are the more-or-less unique keywords that identify phenomena of social relevance (including people, places, and institutions). After all, there is just one World War II, and in a novel, a name like Mr. Darcy typically acts as a more-or-less stable referent over the course of the text. (Or perhaps we are interested in thinking about the degree of stability with which it is used!)

The identification of these kinds of names is referred to as Named Entity Recognition, or NER. The challenge is twofold. First, it has to be determined whether a name spans multiple tokens. (These multi-token grammatical units are referred to as chunks; the process, chunking.) Second, we would ideally distinguish among categories of entity. Is Mr. Darcy a geographic location? Just who is this World War II I hear so much about?

To this end, the function ne_chunk() receives a list of tokens including their parts of speech and returns a nested list where named entities’ tokens are chunked together, along with their category as predicted by the computer.


Log-Scaled Counts of named locations by US State, 1851-1875; Wilkens 6, Figure 4

Similar to the way Jockers had used part of speech to instruct the computer which tokens to count, Matt Wilkens uses NER to direct his study of the “Geographic Imagination of Civil War Era American Fiction.” By simply counting the number of times each unique location was mentioned across many texts (and alternately the number of novels in which it appeared), Wilkens is able to raise questions about the conventional wisdom around the American Renaissance, post-war regionalism, and just how much of a shift in literary attention the war had actually caused. Only chunks of tokens tagged GPE, or Geo-Political Entity, are needed for such a project.
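Pulling place names out of the chunker’s nested output might look like the sketch below. It illustrates the counting step only and is not Wilkens’s actual method. The helper names are hypothetical, the sample tree is built by hand in the shape that ne_chunk() returns, and calling count_places() assumes the tagger and chunker models have been downloaded.

```python
from collections import Counter

from nltk import ne_chunk, pos_tag
from nltk.tree import Tree


def gpe_counts(chunked):
    """Count each chunk labeled GPE in a tree of the kind ne_chunk() returns."""
    counts = Counter()
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "GPE"):
        # Rejoin a multi-token name such as "New York" into a single string.
        name = " ".join(word for word, tag in subtree.leaves())
        counts[name] += 1
    return counts


def count_places(tokens):
    """Tag, chunk, and count the geo-political entities in a list of tokens."""
    return gpe_counts(ne_chunk(pos_tag(tokens)))


# A hand-built sample tree: two GPE chunks nested inside a sentence.
sample = Tree("S", [
    ("She", "PRP"), ("left", "VBD"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("for", "IN"),
    Tree("GPE", [("Ohio", "NNP")]),
    (".", "."),
])

print(gpe_counts(sample))
```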

from nltk.corpus import wordnet

I have spent a good deal of time explaining that the computer definitionally does not know what words mean; however, there are strategies by which we can begin to recover semantics. Once we have tokenized a text, for instance, we might look up those tokens in a dictionary or thesaurus. The latter is potentially of great value, since it creates clusters among words on the basis of meaning (i.e. synonyms). What happens when we start to think about semantics as a network?

WordNet is a resource that organizes language in precisely this way. In its nomenclature, clusters of synonyms around particular meanings are referred to as synsets. WordNet’s power comes from the fact that synsets are arranged hierarchically into hypernyms and hyponyms. Essentially, a synset’s hypernym is a category to which it belongs and its hyponyms are specific instances. Hypernyms for “dog” include “canine” and “domestic animal;” the hyponyms include “poodle” and “dalmatian.”

This kind of “is-a” hierarchical relationship goes all the way up and down a tree of relationships. If one goes directly up the tree, the hypernyms become increasingly abstract until one gets to a root hypernym. These are words like “entity” and “place.” Very abstract.

As an interpretive instrument, one can broadly gauge the abstractness – or rather, the specificity – of a given word by counting the number of steps taken to get from the word to its root hypernym, i.e. the length of the hypernym path. The greater the number of steps, the more specific the word is thought to be. In this case, the computer ultimately reads a number (a word’s specificity score) rather than the token itself.

In her study of feminist movements across cities and over time, “Political Logics as Cultural Memory: Local Continuities and Women’s Organizations in Chicago and New York City”, Laura K. Nelson gauges the abstractness of each movement’s essays and manifestos by measuring the average hypernym path length for each word in a given document. In turn, she finds that movements out of Chicago tended to focus on specific events and political institutions whereas those out of New York situate themselves among broader ideas and concepts.
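A specificity score of this kind can be sketched with NLTK’s WordNet interface. The version below is a simplified illustration and should not be read as Nelson’s implementation. It assumes that a word’s first listed noun sense is the relevant one and that the shortest route to the root is the path of interest; the function names are hypothetical, and the WordNet data must be downloaded before the lookups will work.

```python
from nltk.corpus import wordnet


def specificity(word, pos="n"):
    """Steps from a word's first listed sense up to its root hypernym.

    Returns None for words that WordNet does not contain.
    """
    synsets = wordnet.synsets(word, pos=pos)
    if not synsets:
        return None
    # A synset may have several routes to the root; take the shortest one.
    # Each path lists every synset from the root down to the word itself.
    paths = synsets[0].hypernym_paths()
    return min(len(path) for path in paths) - 1


def mean_score(scores):
    """Average the available scores, skipping words WordNet did not know."""
    known = [score for score in scores if score is not None]
    return sum(known) / len(known) if known else None


def document_specificity(tokens):
    """Average specificity across all of the tokens in a document."""
    return mean_score([specificity(token) for token in tokens])
```

Under these assumptions, “poodle” scores one step higher than “dog”, since the first sense of the former sits directly beneath the first sense of the latter in the hierarchy.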

from nltk.corpus import cmudict

Below semantics, below even the word, is of course phonology. Phonemes lie at a rich intersection of dialect, etymology, and poetics that digital humanists have only just begun to explore. Fortunately, the process of looking up dictionary pronunciations can be automated using a resource like the CMU (Carnegie Mellon University) Pronouncing Dictionary.

In NLTK, this English-language dictionary is distributed as a simple list in which each entry consists of a word and its most common North American pronunciations. The entry includes not only the word’s phonemes but whether syllables are stressed or unstressed. Texts then are no longer processed into semantically identifiable units but into representations of their aurality.
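In the dictionary’s notation, each vowel phoneme ends in a digit marking its stress (0 for unstressed, 1 for primary stress, 2 for secondary stress), so a word’s stress pattern can be read directly off its entry. The sketch below is a minimal illustration with hypothetical function names; the sample pronunciation is written by hand in the dictionary’s format, and looking words up in the real dictionary assumes the cmudict data has been downloaded.

```python
from nltk.corpus import cmudict


def stress_pattern(phonemes):
    """Reduce a pronunciation to its stress digits, one per syllable."""
    return "".join(p[-1] for p in phonemes if p[-1].isdigit())


def pronunciations(word, pron_dict=None):
    """Look up every listed pronunciation of a word (an empty list if none).

    pron_dict maps lowercase words to lists of pronunciations. By default the
    CMU dictionary is loaded, which requires nltk.download("cmudict"); load
    it once and pass it in when looking up many words.
    """
    if pron_dict is None:
        pron_dict = cmudict.dict()
    return pron_dict.get(word.lower(), [])


# A pronunciation written by hand in the dictionary's phoneme notation.
sample = ["T", "EH1", "N", "D", "ER0"]

print(stress_pattern(sample))
```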


Segments of each text colored by their aural affinities to each of the other books under consideration. For example, the window on the left shows the text of Tender Buttons, while the prevalence of fuchsia highlighting indicates its aural similarity to the New England Cook Book; Clement et al, Figure 14

These features, among others, form the basis of a study by Tanya Clement et al on aurality in literature, “Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts”.[2] In the essay, the authors computationally explore the aural affinity between the New England Cook Book and Stein’s poem “Cooking” in Tender Buttons. Their findings offer a tentative confirmation of Margueritte S. Murphy’s previous literary-interpretive claims that Stein “exploits the vocabulary, syntax, rhythms, and cadences of conventional women’s prose and talk” to “[explain] her own idiosyncratic domestic arrangement by using and displacing the authoritative discourse of the conventional woman’s world.”

Closing Thought

Looking closely at NLP – the first step in the computer reading process – we find that our own interpretive assumptions are everywhere present. Our definition of literary theme may compel us to perform part-of-speech tagging; our theorization of gender may move us away from semantics entirely. The processing that occurs is not a simple mapping from natural language to formal, but constructs a new representation. We have already begun the work of interpreting a text once we focus attention on its salient aspects and render them as countable units.

Minimally, NLP is an opportunity for humanists to formalize the assumptions we bring to the table about language and culture. In terms of our research, that degree of explicitness means that we lay bare the humanistic foundations of our arguments each time we code our NLP. And therein lie the beginnings of scholarly critique and discourse.



Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. “Canon/Archive. Large-scale Dynamics in the Literary Field.” Literary Lab Pamphlet. 11 (2016).

Allison, Sarah, Marissa Gemma, Ryan Heuser, Franco Moretti, Amir Tevel, and Irena Yamboliev. “Style at the Scale of the Sentence.” Literary Lab Pamphlet. 5 (2013).

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, CA: O’Reilly Media, Inc. 2009.

Clement, Tanya, David Cheng, Loretta Auvil, Boris Capitanu, and Megan Monroe. “Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts.” Digital Humanities Quarterly. 7:1 (2013).

Jockers, Matthew. “Theme.” Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press, 2013. 118-153.

Jockers, Matthew. “‘Secret’ Recipe for Topic Modeling Themes.”…(2013).

Long, Hoyt and Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:4 (2016): 235-267.

Nelson, Laura K. “Political Logics as Cultural Memory: Local Continuities and Women’s Organizations in Chicago and New York City.” (under review)

Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History. 25:4 (2013): 803-840.

[1] In fact, there is one piece of prior knowledge required: how to open an interface in which to do the programming. This took me an embarrassingly long time to figure out when I first started! I recommend downloading the latest version of Python 3.x through the Anaconda platform and following the instructions to launch the Jupyter Notebook interface.

[2] As the authors note, they experimented with the CMU Pronouncing Dictionary specifically but selected an alternative, OpenMary, for their project. CMU is a simple (albeit very long) list of words whereas OpenMary is a suite of tools that includes the ability to guess pronunciations for words that it does not already know and to identify points of rising and falling intonation over the course of a sentence. Which tool you ultimately use for a research project will depend on the problem you wish to study.