Operationalizing The Urn: Part 3

This post is the third in a series on operationalizing the method of close reading in Cleanth Brooks’s The Well Wrought Urn. The first post laid out the rationale and stakes for such a method of reading, and the second post performed that distant reading in order to test Brooks’s literary historical claims. This final post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Irony (& Paradox & Ambiguity)

For Cleanth Brooks, the word is the site where irony is produced. Individual words do not carry meanings with them a priori, but instead their meanings are constructed dynamically and contingently through their use in the text: “[T]he word as the poet uses it, has to be conceived of, not as a discrete particle of meaning, but as a potential of meaning, a nexus or cluster of meanings” (210). This means that words are deployed as flexible semantic objects that are neither predetermined nor circumscribed by a dictionary entry. In fact, he refuses to settle on a particular name for this phenomenon of semantic construction, saying that further work must be done in order to better understand it. Brooks uses the terms paradox and ambiguity at several points; however, as a shorthand, I will simply use the term irony to refer to the co-presence of multiple, unstable, or incommensurate meanings.

This commitment to the word as a discrete-yet-contextualized unit is already encoded into the distant reading of the previous post. We found provisional evidence for Brooks’s empirical claims about literary history, based on patterns across words that traverse multiple textual scales. The bag-of-words (BOW) model used to represent texts counted the frequencies of words as individual units, while the logistic regression looked at trends across these frequencies, including co-occurrence. (Indeed, Brooks’s own interpretive commitment quietly guided the selection of the logistic regression model.)

Previously, I described the process of learning the difference between initial and final BOWs in terms of geometry; I will now point us to the only-slightly grittier algebra behind that spatial intuition. When determining where to draw the line between categories of BOW, logistic regression learns how much weight to give each word in the BOW. For example, the model may find that some words appear systematically in a single category of BOW; these receive larger weights. Other words occur equally in both initial and final BOWs, making them unreliable predictors of a BOW’s category; as a result, these words receive very little weight. Similarly, some words are too infrequent to give evidence one way or the other, and these likewise receive little weight.
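As a sketch of how such weights can be read off a fitted model, consider a toy example using scikit-learn (the package the study's notes name); the vocabulary and counts below are invented for illustration, not drawn from the post's corpus:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration: a three-word vocabulary in which "chapter" appears
# mostly in initial BOWs and "asked" mostly in final ones, while "the"
# is roughly balanced between the two.
vocab = ["chapter", "asked", "the"]
X = np.array([
    [3, 0, 5],  # initial BOW
    [2, 1, 4],  # initial BOW
    [0, 3, 5],  # final BOW
    [1, 2, 6],  # final BOW
])
y = np.array([0, 0, 1, 1])  # 0 = initial, 1 = final

model = LogisticRegression().fit(X, y)

# One coefficient per word: positive weights push a BOW toward "final",
# negative weights toward "initial"; near-zero weights are uninformative.
for word, weight in zip(vocab, model.coef_[0]):
    print(f"{word:10s} {weight:+.2f}")
```

On this toy data, "chapter" receives a negative weight and "asked" a positive one, mirroring the sign convention in Table 1.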

Initial Word    Weight    Final Word    Weight
chapter         -7.65     asked          4.76
oh              -5.98     away           4.33
yes             -5.40     happy          3.62
took            -4.67     lose           3.51
thank           -4.57     forever        3.50
tall            -4.33     rest           3.48
does            -3.74     tomorrow       3.21
sat             -3.51     kill           3.20
let             -3.12     cheek          3.16
built           -3.10     help           3.12

Table 1. Top 10 weighted initial and final words in the model. Weights are reported in standard units (z-scores) to facilitate comparison.

We can build an intuition for what our model has found by circling back to the human-language text.1 Weights have been assigned to individual words — excerpted in Table 1 — which convey whether and how strongly their presence indicates the likelihood of a given category. It is a little more complicated than this, since words do not appear in isolation but often in groups, and the weight for the whole grouping gets distributed over the individual words. This makes it difficult to separate out the role of any particular word in the assignment of a BOW to a particular category. That said, looking at where these highly weighted words aggregate and play off one another may gesture toward the textual structure that Brooks theorized. When looking at the texts themselves, I will highlight any words whose weights lean strongly toward the initial (blue) or the final (red) class.2

Let us turn to a well-structured paragraph in a well-structured novel: Agatha Christie’s The A.B.C. Murders. In this early passage, Hercule Poirot takes a statement from Mrs. Fowler, the neighbor of a murder victim, Mrs. Ascher. Poirot asks first whether the deceased had received any strange letters recently. Fowler guesses such a letter may have come from Ascher’s estranged husband, Franz.

I know the kind of thing you mean—anonymous letters they call them—mostly full of words you’d blush to say out loud. Well, I don’t know, I’m sure, if Franz Ascher ever took to writing those. Mrs. Ascher never let on to me if he did. What’s that? A railway guide, an A B C? No, I never saw such a thing about—and I’m sure if Mrs. Ascher had been sent one I’d have heard about it. I declare you could have knocked me down with a feather when I heard about this whole business. It was my girl Edie what came to me. ‘Mum,’ she says, ‘there’s ever so many policemen next door.’ Gave me quite a turn, it did. ‘Well,’ I said, when I heard about it, ‘it does show that she ought never to have been alone in the house—that niece of hers ought to have been with her. A man in drink can be like a ravening wolf,’ I said, ‘and in my opinion a wild beast is neither more nor less than what that old devil of a husband of hers is. I’ve warned her,’ I said, ‘many times and now my words have come true. He’ll do for you,’ I said. And he has done for her! You can’t rightly estimate what a man will do when he’s in drink and this murder’s a proof of it.

There are several turns in the paragraph, and we find that Mrs. Fowler’s train of thought (quietly guided by Poirot’s questioning) is paralleled by the color of the highlighted words. The largest turn occurs about midway through the paragraph, when the topic changes from clues to the murder itself. Where initially Mrs. Fowler was sure she had no knowledge of the clues, she now confidently furnishes the murder’s suspect, opportunity, and motive. Structurally, we find that the balance of initial and final words flips at this point as well. The first several sentences rest on hearsay — what she has heard, what has been let on or uttered out loud — while the latter rest on Fowler’s self-authorization — what she herself has previously said. By moving into the position of the author of her own story, she overwrites her previously admitted absence of knowledge and validates her claims about the murder.

The irony of Fowler’s claiming to know (the circumstances of the murder) despite not knowing (the clues) does not, in fact, invalidate her knowledge. Her very misunderstandings reveal a great deal about the milieu in which the murder took place. For example, it is a world where anonymous letters are steamy romances rather than death threats. (Poirot himself had recently received an anonymous letter regarding the murder.) More importantly, Fowler had earlier revealed that it is a world of door-to-door salesmen, when she had mistaken Poirot for one. This becomes an important clue toward solving the case, but only much later, once Poirot learns to recognize it.

Zooming our attention to the scale of the sentence, however, leads us to a different kind of tension than the one that animates Poirot. At the scale of the paragraph, the acquisition and transmission of knowledge are the central problems: what is known and what is not yet known. At the scale of the sentence, the question becomes knowability.

In the final sentence of her statement, Mrs. Fowler makes a fine epistemological point. Characterizing the estranged husband, Franz:

You can’t rightly estimate what a man will do when he’s in drink and this murder’s a proof of it.

Far from simply recoiling from gendered violence here, Fowler expresses the unpredictability of violence in a man under the influence. The potential for violence is recognizable, yet its particular, material manifestation is not. Paradoxically, the evidence before Fowler confirms that very unknowability.

Again, we find that the structure of the sentence mirrors this progression of potential-for-action and eventual materialization. A man is a multivalent locus of potential; drink intervenes by narrowing his potential for action, while disrupting its predictability; and the proof adds another layer by making the unknowable known. Among highlighted words earlier in the paragraph, we observe this pattern to an extent as well. Looking to a different character, the moment when Mrs. Fowler’s girl, Edie, came into the house (both initial words) sets up the delivery of unexpected news. And what had Fowler heard (initial)? This whole business (final). The final words seem to materialize or delimit the possibilities made available by the initial words. Yet the paths not taken constitute the significance of the ones that are finally selected.

To be totally clear, I am in no way arguing that the close readings I have offered are fully articulated by the distant reading. The patterns to which the initial and final words belong were found at the scale of more than one thousand texts. As a result, no individual instance of a word is strictly interpretable by way of the model: we would entirely expect by chance to find, say, a final word in the beginning of a sentence, at the start of a paragraph, in the first half of a book. (This is the case for letters in the paragraph above.) This touches on an outstanding problem in distant reading: how far to interpret our model?

I have attempted to sidestep this question by oscillating between close and distant readings. My method here has been to offer a close reading of the passage, highlighting its constitutive irony, and to show how the model’s features follow the contour of that reading. The paragraph turns from not-knowing to claiming knowledge; several sentences turn from unknowability to the horror of knowing. Although it is true that the sequential distribution of the model’s features parallels these interpretive shifts, it does not tell us what the irony actually consists of in either case. That is, heard-heard-heard-said-said-said or man-drink-proof do not produce semantic meaning, even while they trace its arc. If, as in Brooks’s framework, the structure of a text constitutes the solution to problems raised by its subject matter, then the distant reading seems to offer a solution to a problem it does not know. The model is merely a shadow on the wall cast by the richness of the text.

Let us take one last glance at The ABC Murders’s full shadow at the scale of the novel. In the paragraph and sentences above, I had emphasized the relative proportions of initial and final words in short segments of text. While this same approach could be taken with the full text, we will add a small twist. Rather than simply scanning for locations where one or the other type of word appears with high density, we will observe the sequential accumulation of such words. That is, we will pass through the text word by word, keeping a running tally of the difference between feature categories: when we see an initial word, we will add one to the tally, and when we see a final word, we will subtract one.3
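The tally can be sketched in a few lines of Python; the word lists below are a small sample drawn from Table 1, not the full 812-word vocabulary, and the sentence is invented for illustration:

```python
def running_tally(tokens, initial_words, final_words):
    """Cumulative sum: +1 for each initial word, -1 for each final word,
    and no change for words outside either list (as in Figure 3)."""
    tally, out = 0, []
    for tok in tokens:
        if tok in initial_words:
            tally += 1
        elif tok in final_words:
            tally -= 1
        out.append(tally)
    return out

# Tiny demonstration with words drawn from Table 1.
initial = {"chapter", "oh", "yes", "took"}
final = {"asked", "away", "happy", "forever"}
text = "oh yes she took it away and asked to be happy forever".split()
print(running_tally(text, initial, final))
# The tally climbs while initial words dominate, then falls as final
# words accumulate.
```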

Figure 3. Cumulative sum of initial (+1) and final (-1) words over The ABC Murders. X-axis indicates position in text by word count. The tally rises early in the book and remains high until near the middle; in the last part of the book, it drops rapidly and ends low.

In Figure 3, we find something tantalizingly close to plot arc. If initial words had opened up spaces of possibility and final words had manifested or intervened on these, then perhaps we can see the rise and fall of tension (even, suspense?) in this mystery novel. To orient ourselves, the paragraph we had attended to closely appears around the 9000th word in the text. This puts the first murder (out of four) and speculations like those of Mrs. Fowler’s at the initial rise in the tally that dominates the first half of the book. At the other end, the book closes per convention with Poirot’s unmasking of the murderer and extended explanation, during which ambiguity is systematically eliminated. This is paralleled by the final precipitous drop in the tally.

I’ll close by asking not what questions our model has answered about the text (these are few), but what the text raises about the model (these are many more). What exactly has our distant reading found? Analysis began with BOWs, yet can we say that the logistic regression has found patterns that exceed mere word frequencies?

Brooks had indicated what he believed we might find through these reading practices: irony, paradox, ambiguity. Since Brooks himself expressed ambivalence as to their distinctions and called for clarification of terms, I have primarily used the word irony as a catch-all. However, irony alone did not seem to fully capture the interpretive phenomenon articulated by the model. While observing the actual distribution of the model’s features in the text, I found it expedient to use the terms paradox and ambiguity at certain points as well. Perhaps this is a sign that our distant reading has picked up precisely the phenomenon to which Brooks was attending. If that is the case, then distant reading is well positioned to extend Brooks’s own close reading project.


1. This method has been used, for example, by Underwood & Sellers in “The Longue Durée of Literary Prestige.” Goldstone’s insightful response to a preprint of that paper, “Of Literary Standards and Logistic Regression: A Reproduction,” describes some of the interpretive limits to “word lists,” which are used in this study as well. For a broader discussion of the relationship between machine classification and close reading, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.”

2. Not all words from the model have been highlighted. In order to increase our confidence in their validity, only those whose weight is at least one standard deviation from the mean of all weights are highlighted. As a result, we have a highlighted vocabulary of about 800 words, accounting for about 10% of all words in the corpus. In the passage we examine, just over one in ten words is highlighted.

The full 812-word vocabulary is available in plaintext files as initial and final lists.
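A minimal sketch of this thresholding rule; the weights for "chapter" and "asked" come from Table 1, while those for "the" and "and" are invented stand-ins for uninformative words:

```python
import numpy as np

# Keep only words whose weight lies at least one standard deviation
# from the mean of all weights (the highlighting rule described above).
weights = {"chapter": -7.65, "asked": 4.76, "the": 0.02, "and": -0.10}
values = np.array(list(weights.values()))
mu, sigma = values.mean(), values.std()
highlighted = {w: v for w, v in weights.items() if abs(v - mu) >= sigma}
```

With this toy dictionary, only the two strongly weighted words survive the cut.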

3. Note that it is a short jump from this accumulative method to identifying the relative densities of these words at any point in the text. We would simply look at the slope of the line, rather than its height.


Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich, 1975.

Christie, Agatha. The ABC Murders. New York: Harper Collins. 2011 (1936).

Goldstone, Andrew. “Of Literary Standards and Logistic Regression: A Reproduction.” Personal blog. 2016. Accessed March 2, 2017. https://andrewgoldstone.com/blog/2016/01/04/standards/

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.

Operationalizing The Urn: Part 2

This post is the second in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. The first post laid out the rationale and stakes for such a method of reading. This post will perform that distant reading in order to test Brooks’s literary historical claims. The third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Distant Reading a Century of Structure

By Brooks’s account, close reading enjoys universal application over Anglo-American poems produced “since Shakespeare” because these employ the same broad formal structure. Let us test this hypothesis. In order to do so we need to operationalize his textual model and use this to read a fairly substantial corpus.

Under Brooks’s model, structure is a matter of both sequence and scale. In order to evaluate sequence in its most rudimentary form, we will look at halves: the second half of a text sequentially follows the first. The matter of scale indicates for us what is to be halved: sentences, paragraphs, full texts. As a goal for our model, we will need to identify how second halves of things differentiate themselves from first halves. Moreover, this must be done in such a way that the differentiation of sentences’ halves occurs in dialogue with the differentiation of paragraphs’ halves and of books’ halves, and vice versa.

Regarding corpus, we will deviate from Brooks’s own study by turning from poetry to fiction. This brings our study closer into line with current digital humanities scholarship, speaking to issues that have recently been raised regarding narrative and scale. We also need a very large corpus and access to full texts. To this end, we will draw from the Chicago Text Lab’s collection of twentieth-century novels.1 Because we hope to observe historical trends, we require balance across publication dates: twelve texts were randomly sampled from each year of the twentieth century.

Each text in our corpus is divided into an initial set of words and a final set of words; however, what constitutes initial-ness and final-ness for each text will be scale-dependent and randomly assigned. We will break our corpus into three groups of four hundred novels, one associated with each level of scale discussed. For example, from the first group of novels, we will collect the words belonging to the first half of each sentence into an initial bag of words (BOW) and those belonging to the second half of each sentence into a final BOW. To be sure, a BOW is simply a list of words and the frequencies with which they appear. In essence, we are asking whether certain words (or, especially, groups of words) conventionally indicate different structural positions. What are the semantics of qualification?
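A minimal sketch of the sentence-level halving, assuming naive whitespace tokenization (the study's actual preprocessing may well differ):

```python
from collections import Counter

def sentence_half_bows(sentences):
    """Pool first-half words into an initial BOW and second-half words
    into a final BOW, producing one pair of BOWs per text."""
    initial, final = Counter(), Counter()
    for sent in sentences:
        words = sent.lower().split()
        mid = len(words) // 2
        initial.update(words[:mid])
        final.update(words[mid:])
    return initial, final

# Two sentences from the Ferber passage quoted below, for illustration.
sents = ["A fine house", "He savored them as for the first time"]
ini, fin = sentence_half_bows(sents)
```

Each word's count lands in exactly one of the two BOWs, depending on which half of its sentence it falls in.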

For example, Edna Ferber’s Come and Get It (1935) was included among the sentence-level texts. The novel begins:

DOWN the stairway of his house came Barney Glasgow on his way to breakfast. A fine stairway, black walnut and white walnut. A fine house. A fine figure of a man, Barney Glasgow himself, at fifty-three. And so he thought as he descended with his light quick step. He was aware of these things this morning. He savored them as for the first time.

The words highlighted in blue are considered to be the novel’s initial words and those in red are its final words. Although this is only a snippet, we may note Ferber’s repeated use of sentence fragments, where each refers to “[a] fine” aspect of Barney Glasgow’s self-reflected life. That is, fine occurs three times as an initial word here and not at all as a final word. Do we expect certain narrative turns to follow this sentence-level set up? (Perhaps, things will not be so fine after all!)

This process is employed at each of the other scales as well. From the second group of novels, we do the same with paragraphs: words belonging to the first half of each paragraph are collected into the initial BOW and from the second half into the final BOW. For the third group of novels, the same process is performed over the entire body of the text. In sum, each novel is represented by two BOWs, and we will search for patterns that distinguish all initial BOWs from all final BOWs simultaneously. That is, we hope to find patterns that operate across scales.

The comparison of textual objects belonging to binary categories has been performed in ways congenial to humanistic interpretation by using logistic regression.2 One way to think about this method is in terms of geometry. Imagining our texts as points floating in space, classification would consist of drawing a line that separates the categories at hand: initial BOWs, say, above the line and final BOWs below it. Logistic regression is a technique for choosing where to draw that line, based on the frequencies of the words it observes in each BOW.

The patterns that it identifies are not necessarily ones that are expressed in any given text, but ones that become visible at scale. There are several statistical virtues to this method that I will elide here, but I will mention that humanists have found it valuable for the fact that it returns a probability of membership in a given class. Its predictions are not hard-and-fast but fuzzy; this allows us to approach categorization as a problem of legibility and instability.3

The distant reading will consist of this: we will perform a logistic regression over all of our initial and final BOWs (1200 of each).4 In this way, the computer will “learn” by attempting to draw a line that puts sentence-initial, paragraph-initial, and narrative-initial BOWs on the same side, with none of the final BOWs. The most robust interpretive use of such a model is to make predictions about whether it thinks new, unseen BOWs belong to the initial or final category, and we will make use of this shortly.

Before proceeding, however, we may wish to know how well our statistical model has learned to separate these categories of BOWs in the first place. We can do this using a technique called Leave-One-Out Cross Validation. Basically, we set aside the novels belonging to a given author at training time, when the logistic regression learns where to draw its line. We then make a prediction for the initial and final BOWs (regardless of scale) of that particular author’s texts. By doing this for every author in the corpus, we can get a sense of the model’s performance.5
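This grouped validation scheme can be sketched with scikit-learn's LeaveOneGroupOut; the features, labels, and author ids below are random stand-ins for the post's corpus:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

# Toy stand-ins: feature rows, labels (0 = initial, 1 = final), and an
# author id per row, so that all of an author's BOWs are held out together.
rng = np.random.default_rng(0)
X = rng.random((12, 5))
y = np.tile([0, 1], 6)
authors = np.repeat([0, 1, 2, 3], 3)

# Each fold trains on three authors and predicts the held-out author's BOWs.
preds = cross_val_predict(
    LogisticRegression(), X, y,
    cv=LeaveOneGroupOut(), groups=authors,
)
accuracy = (preds == y).mean()
```

Because the features here are random noise, the accuracy itself is meaningless; the point is the fold structure, which prevents authorial style from leaking between training and test sets.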

Such a method results in 87% accuracy.6 This is a good result, since it indicates that substantial and generalizable patterns have been found. Looking under the hood, we can learn a bit more about the patterns the model had identified. Among BOWs constructed from the sentence-scale, the model predicts initial and final classes with 99% accuracy; at the paragraph-scale, accuracy is 95%; and at the full-text-scale, it is 68%.7 The textual structure that we have modeled makes itself most felt in the unfolding of the sentences, followed by that of paragraphs, and only just noticeably in the move across halves of the novel. The grand unity of the text has lower resolution than its fine details.

We have now arrived at the point where we can test our hypothesis. The model had learned how to categorize BOWs as initial and final according to the method described above. We can now ask it to predict the categories of an entirely new set of texts: a control set of 400 previously unseen novels from the Chicago Text Lab corpus. These texts will not have been divided in half according to the protocols described above. Instead, we will select half of their words randomly.

To review, Brooks had claimed that the textual structures he had identified were universally present across modernity. If the model’s predictions for these new texts skew toward either initial or final, and especially if we find movement in one direction or the other over time, then we will have preliminary evidence against Brooks’s claim. That is, we will have observed a shift in the degree or mode by which textual structure has been made legible linguistically. Do we find such evidence that this structure is changing over a century of novels? In fact, we do not.

Figure 1. Distribution of 400 control texts’ probabilities of initial-ness, by publication date; overlaid with best-fit line. X-axis corresponds to novels’ publication dates and Y-axis to their predicted probability of being initial.

The points in Figure 1 represent each control text, while their height indicates the probability that a given text is initial. (Subtract that value from 1 to get the probability it is final.) We find a good deal of variation in the predictions — indeed, we had expected to find such variation, since we had chosen words randomly from the text — but the important finding is that this variation does not reflect a chronological pattern. The best-fit line through the points is flat, and the correlation between predictions and publication date is virtually zero (r² < 0.001).
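The flatness test can be sketched with NumPy; the yearly predictions below are synthetic noise around 0.5, standing in for the model's actual outputs:

```python
import numpy as np

# Hypothetical predictions by year: no chronological trend by construction.
rng = np.random.default_rng(1)
years = np.arange(1900, 2000)
probs = 0.5 + 0.05 * rng.standard_normal(years.size)

# Pearson correlation between date and predicted probability; an r-squared
# near zero indicates a flat best-fit line, as in Figure 1.
r = np.corrcoef(years, probs)[0, 1]
r_squared = r ** 2
slope, intercept = np.polyfit(years, probs, 1)
```

Since the synthetic probabilities carry no trend, both the correlation and the fitted slope come out near zero.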

This indicates that the kinds of words (and clusters of words) that had indexed initial-ness and final-ness remain in balance with one another over the course of twentieth-century fiction. It is true that further tests would need to be performed in order to increase our confidence in this finding.8 However, on the basis of this first experiment, we have tentative evidence that Brooks is correct. The formalism that underpins close reading has a provisional claim to empirical validity.

I would like to reiterate that these findings are far from the final word on operationalizing close reading. More empirical work must be done to validate these findings, and more theoretical work must be done to create more sophisticated models. Bearing that qualification in mind, I will take the opportunity to explore this particular model a bit further. Brooks had claimed that the textual structure that we have operationalized underpins semantic problems of irony, paradox, and ambiguity. Is it possible that this model can point us toward moments like these in a text?


I heartily thank Hoyt Long and Richard So for permission to use the Chicago Text Lab corpus. I would also like to thank Andrew Piper for generously making available the txtLAB_450 corpus.


1. This corpus spans the period 1880-2000 and is designed to reflect WorldCat library holdings of fiction by American authors across that period. Recent projects using this corpus include Piper, “Fictionality” and Underwood, “The Life Cycles of Genres.” For an exploration of the relationship between the Chicago Text Lab corpus and the HathiTrust Digital Library viz representations of gender, see Underwood & Bamman, “The Gender Balance of Fiction, 1800-2007”

2. See, for example: Jurafsky, Chahuneau, Routledge, & Smith, “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus;” Underwood, “The Life Cycles of Genres;” Underwood & Sellers, “The Longue Durée of Literary Prestige”

3. For an extended theorization of these problems, using computer classification methods, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning ”

4. Specifically, this experiment employs a regularized logistic regression as implemented in the Python package scikit-learn. Regularization is a technique that minimizes the effect that any individual feature is able to have on the model’s predictions. On the one hand, this increases our confidence that the model we develop is generalizable beyond its training corpus. On the other hand, regularization is particularly important for literary text analysis, since each word in the corpus’s vocabulary may constitute a feature, which leads to a risk of overfitting the model. Regularization reduces this risk.

When performing regularized logistic regression for text analysis, there are two parameters that must be determined: the regularization constant and the feature set. Regarding the feature set, it is typical to use only the most common words in the corpus. The questions of how large/small to make the regularization and how many words to use can be approached empirically.

The specific values were chosen through a grid search over combinations of parameter values, using ten-fold cross validation on the training set (over authors). This grid search was not exhaustive but found a pair of values that lie within the neighborhood of the optimal pair. C = 0.001; 3000 Most Frequent Words.
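The search described here might be sketched with scikit-learn's GridSearchCV; the pipeline, parameter values, and cross-validation setting below are placeholders rather than the study's exact configuration (which grouped folds by author):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Jointly search the regularization constant C and the vocabulary size
# (number of most frequent words used as features).
pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression()),
])
grid = GridSearchCV(
    pipe,
    param_grid={
        "bow__max_features": [1000, 3000, 5000],
        "clf__C": [0.0001, 0.001, 0.01],
    },
    cv=10,  # simplified: the post uses ten-fold CV grouped over authors
)
# grid.fit(texts, labels) would then select the best (C, vocabulary) pair.
```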

Note also that prior to logistic regression, word frequencies in BOWs were normalized and transformed to standard units. Stop words were not included in the feature set.

5. This method is described by Underwood and Sellers in “Longue Durée.” The rationale for setting aside all texts by a particular author, rather than single texts at a time, is that what we think of as authorial style may produce consistent word usages across texts. Our goal is to create an optimally generalizable model, which requires we prevent the “leak” of information from the training set to the test set.

6. This and all accuracies reported are F1-scores. This value is generally considered more robust than a simple count of correct classifications, since it balances precision (which penalizes false positives) against recall (which penalizes false negatives).

7. An F1-Score of 99% is extraordinary in literary text analysis, and as such it should be met with increased skepticism. I have taken two preliminary steps in order to convince myself of its validity, but beyond these, I invite readers to experiment with their own texts to find whether these results are consistent across novels and literary corpora. The code has also been posted online for examination.

First, I performed an unsupervised analysis over the sentence BOWs. A quick PCA visualization indicates an almost total separation between sentence-initial and sentence-final BOWs.

Figure 2. Distribution of sentence-initial BOWs (blue) and sentence-final BOWs (red) in the third and fourth principal components of PCA. PCA was performed over sentence BOWs alone. The two clusters are mostly separate, with some noticeable overlap.

The two PCs that are visualized here account for just 3.5% of the variance in the matrix. As an aside, I would point out that these are not the first two PCs but the third and fourth (ranked by their explained variance). This suggests that the difference between initial and final BOWs is not even the most substantial pattern across them. Perhaps it makes sense that something like chronology of publication dates or genre would dominate. By his own account, Brooks sought to look past these in order to uncover structure.
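A sketch of this check, with a small random matrix standing in for the actual sentence BOWs:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the sentence-level BOW matrix (rows = BOWs, columns =
# word features); the real analysis plotted PCs 3 and 4 of the true matrix.
rng = np.random.default_rng(2)
X = rng.random((50, 10))

pca = PCA(n_components=4)
coords = pca.fit_transform(X)
pc3, pc4 = coords[:, 2], coords[:, 3]  # zero-indexed components 3 and 4

# Share of total variance carried by the two plotted components.
plotted_variance = pca.explained_variance_ratio_[2:4].sum()
```

Components are returned in descending order of explained variance, so PCs 3 and 4 necessarily carry less variance than PCs 1 and 2, as the footnote observes.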

Second, I performed the same analysis using a different corpus: the 150 English-language novels in the txtLAB450, a multilingual novel corpus distributed by McGill’s txtLab. Although only 50 novels were used for sentence-level modeling (compared to 400 from the Chicago corpus), sentence-level accuracy under Leave-One-Out Cross Validation was 98%. Paragraph-level accuracy dropped much further, while text-level accuracy remained about the same.

8. First and foremost, if we hope to test shifts over time, we will have to train on subsets of the corpus, corresponding to shorter chronological periods, and make predictions about other periods. This is an essential methodological point made in Underwood and Sellers’s “Longue Durée.” As such, we can only take the evidence here as preliminary.

In making his own historical argument, Brooks indicates that the methods used to read centuries of English literature were honed on poetry from the earliest (Early Modern) and latest (High Modern) periods. A training set drawn from the beginning and end years of the century should be the first such test. Ideally, one might use precisely the time periods he names, over a corpus of poetry.

Other important tests include building separate models for each scale of text individually and comparing these with the larger scale-simultaneous model. Preliminary tests on a smaller corpus had shown differences in predictive accuracy between these types of models, suggesting that they were identifying different patterns, which I took to license using the scale-simultaneous model. This would need to be repeated with the larger corpus.

We may also wish to tweak the model as it stands. For example, we have treated single-sentence paragraphs as full paragraphs. The motivation is to see how their words perform double duty at both scales, yet it is conceivable that we would wish to remove this redundancy.

Or we may wish to build a far more sophisticated model. This one is built on a binary logic of first and second halves, which is self-consciously naive, whereas further articulation of the texts may offer higher resolution. Perhaps an unsupervised learning method would be better since it is not required to find a predetermined set of patterns.

And if one wished to contradict the claims I have made here, one would do well to examine the text level of the novel. The accuracy of this model at that scale is low enough that we can be certain there are other interesting phenomena at work.

The most important point here is not that we have settled our research question, but that our preliminary findings direct us toward an entire program of research.


Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich, 1975.

Jurafsky, Dan, et al. “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus.” Journal of Cultural Analytics. 2016.

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Pedregosa, F, et al. “Scikit-learn: Machine Learning in Python.” JMLR 12 (2011). 2825-2830.

Underwood, Ted. “The Life Cycles of Genres.” Journal of Cultural Analytics. 2016.

Underwood, Ted & David Bamman. “The Gender Balance of Fiction, 1800-2007.” The Stone and the Shell. 2016. Accessed March 2, 2017. https://tedunderwood.com/2016/12/28/the-gender-balance-of-fiction-1800-2007/

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.

Operationalizing The Urn: Part 1

This post is the first in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. This post lays out the rationale and stakes for such a method of reading. The second post will perform that distant reading in order to test Brooks’s literary historical claims, and the third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Meaning & Structure

Cleanth Brooks’s The Well Wrought Urn lays out a program of close reading that continues to enjoy great purchase in literary study more than seventy years on. Each of ten chapters performs a virtuosic reading of an individual, canonic poem, while the eleventh chapter steps back to discuss findings and theorize methods. That theorization is familiar today: to paraphrase is heresy; literary interpretation must be highly sensitive to irony, ambiguity, and paradox; and aesthetic texts are taken as “total patterns” or “unities” that give structure to heterogeneous materials. These are, of course, the tenets of close reading.

For Brooks, it is precisely the structuring of the subject matter which produces a text’s meaning. On the relationship between these, he claims, “The nature of the material sets the problem to be solved, and the solution is the ordering of the material” (194). A poem is not simply a meditation on a theme, but unfolds as a sequence of ideas, emotions, and images. This framework for understanding the textual object is partly reflected by his method of reading texts dramatically. The mode of reading Brooks advocates requires attention to the process of qualification and revision that is brought by each new phrase and stanza.

This textual structure takes the form of a hierarchy of “resolved tensions,” which produce irony. It is telling that Brooks initially casts textual tension and resolution as a synchronic, spatial pattern, akin to architecture or painting. We may zoom in to observe in detail how one particular phrase qualifies the previous one, or we may zoom out to reveal how so many micro-tensions are arranged in relation to one another. To perform this kind of reading, “the relation of each item to the whole context is crucial” (207). Often, irony is a way of accounting for the double meanings that context produces, as it traverses multiple scales of the text.

More recently, distant readers have taken up different scales of textual structure as sites of interpretation. The early LitLab pamphlet, “Style at the Scale of the Sentence” by Sarah Allison et al, offers a taxonomy of the literary sentence observed in thousands of novels. At its root, the taxonomy is characterized by the ordering of independent and dependent clauses — what comes first and how it is qualified. A later pamphlet, “On Paragraphs. Scale, Themes, and Narrative Form” by Mark Algee-Hewitt et al, takes up the paragraph as the novel’s structural middle-scale, where themes are constructed and interpenetrate. Moving to the highest structural level, Andrew Piper’s article “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel” examines how the second half of a novel constitutes a transformation away from its first half.1

Each of these distant readings offers a theorization of its scale of analysis: sentence, paragraph, novel. In the first example, Allison et al ask how it is possible that the relationship between the first half of a given sentence and the second encodes so much information about the larger narrative in which it appears. Indeed, all of the instances above take narrative — the grand unity of the text — as an open question, which they attempt to answer by way of patterns revealed at a particular scale. Brooks offers us a challenge, then: to read at these multiple scales simultaneously.

This simultaneity directs us to an ontological question behind the distant readings mentioned above. (Here, I use ontology especially in its taxonomic sense from information science.) Despite the different scales examined, each of those studies takes the word as a baseline feature for representing a text to the computer. That is, all studies take the word as the site in which narrative patterns have been encoded. Yet, paradoxically, each study decodes a pattern that only becomes visible at a particular scale. For example, the first study interprets patterns of words’ organization within sentence-level boundaries. This is not to imply that the models developed in each of those studies are somehow incomplete — after all, each deals with research questions whose terms are properly defined by their scale. However, the fact that multiple studies have found the word-feature to do differently-scaled work indicates an understanding of its ontological plurality.

Although ontology will not be taken up as an explicit question in this blog series, it haunts the distant readings performed in Parts 2 & 3. In brief, when texts are represented to the computer, it will be shown all three scales of groupings at once. (Only one scale per text to prevent overfitting.) Reading across ontological difference is partly motivated by Alan Liu’s article, “N+1: A Plea for Cross Domain Data in the Digital Humanities.” There, he calls for distant readings across types of cultural objects, in order to produce unexpected encounters which render them ontologically alien. Liu’s goal is to unsettle familiar disciplinary divisions, so, by that metric, this blog series is the tamest version of such a project. That said, my contention is that irony partly registers the alienness of the cross-scale encounter at the site of the word. Close reading is a method for navigating this all-too-familiar alienness.

While close reading is characterized by its final treatment of texts as closed, organic unities, it is important to remember that this method rests on a fundamentally intertextual and historical claim. Describing the selection of texts he dramatically reads in The Well Wrought Urn, Brooks claims to have represented all important periods “since Shakespeare” and that poems have been chosen in order to demonstrate what they have in common. Rather than content or subject matter, poems from the Early Modern to High Modernism share the structure of meaning described above. Irony would seem to be an invariant feature of modernity.

We can finally raise this as a set of questions. If we perform a distant reading of textual structure at multiple, simultaneous levels, would we find any changes to that structure at the scale of the century? Could such a structure show some of the multiple meanings that traverse individual words in a text? More suggestively: would that be irony?


1. Related work appears in Schmidt’s “Plot arceology: a vector-space model of narrative structure.” Similar to the other studies mentioned, Schmidt takes narrative as a driving concern, following the movement of each text through plot space. Methodologically, that article segments texts (in this case, film and TV scripts) into 2-4 minute chunks. This mode of articulation does not square easily with any particular level of novelistic scale, although it speaks to some of the issues of segmentation raised in “On Paragraphs.”


Algee-Hewitt, Mark, et al. “On Paragraphs. Scale, Themes, and Narrative Form.” Literary Lab Pamphlet 10. 2015.

Allison, Sarah, et al. “Style at the Scale of the Sentence.” Literary Lab Pamphlet 5. 2013.

Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich, 1975.

Liu, Alan. “N+1: A Plea for Cross Domain Data in the Digital Humanities.” Debates in the Digital Humanities 2016. eds, Matthew K. Gold and Lauren F. Klein. 2016. Accessed March 2, 2017. http://dhdebates.gc.cuny.edu/debates/text/101

Piper, Andrew. “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel.” New Literary History, 46:1 (2015). 63–98.

Schmidt, Ben. “Plot arceology: A vector-space model of narrative structure,” 2015 IEEE International Conference on Big Data (Big Data). Santa Clara, CA. 2015. 1667-1672.

What We Talk About When We Talk About Digital Humanities

The first day of Alan Liu’s Introduction to the Digital Humanities seminar opens with a provocation. At one end of the projection screen is the word DIGITAL and at the other HUMAN. Within the space they circumscribe, we organize and re-organize familiar terms from media studies: media, communication, information, and technology. What happens to these terms when they are modified by DIGITAL or HUMAN? What happens when they modify one another in the presence of those master terms? There are endless iterations of these questions but one effect is clear: the spaces of overlap, contradiction, and possibility that are encoded in the term Digital Humanities.

Pushing off from that exercise, this blog post puts Liu’s question to an extant body of DH scholarship: How does the scholarly discourse of DH organize these media theoretic terms? Indeed, answering that question may shed light on the fraught relationship between these fields. We can also ask a more fundamental question. To what extent does DH discourse differentiate between DIGITAL and HUMAN? Are they the primary framing terms?

Provisional answers to these latter questions could be offered through distant reading of scholarship in the digital humanities. This would give us breadth of scope across time, place, and scholarly commitments. Choosing this approach changes the question we need to ask first: What texts and methods could operationalize the very framework we had employed in the classroom?

For a corpus of texts, this blog post turns to Matthew K. Gold’s Debates in the Digital Humanities (2012). That edited volume has been an important piece of scholarship precisely because it collected essays from a great number of scholars representing just as many perspectives on what DH is, can do, and would become. Its essays (well… articles, keynote speeches, blog posts, and other genres of text) especially deal with problems that were discussed in the period 2008-2011. These include the big tent, tool/archive, and cultural criticism debates, among others, that continue to play out in 2017.

Token Frequency
digital 2729
humanities 2399
work 740
new 691
university 429
research 412
media 373
data 328
social 300
dh 291

Table 1. Top 10 Tokens in Matthew K. Gold’s Debates in the Digital Humanities (2012)

The questions we had been asking in class dealt with the relationships among keywords and especially the ways that they contextualize one another. As an operation, we need some method that will look at each word in a text and collect the words that constitute its context. (q from above: How do these words modify one another?) With these mappings of keyword-to-context in hand, we need another method that will identify which are the most important context words overall and which will separate out their spheres of influence. (q from above: How do DIGITAL and HUMAN overlap and contradict one another? Are DIGITAL and HUMAN the most important context words?)

For this brief study, the set of operations ultimately employed were designed to be the shortest line from A to B. In order to map words to their contexts, the method simply iterated through the text, looking at each word in sequence and cumulatively tallying the three words to the right and left of it.1 This produced a square matrix in which each row was a given unique word in Debates and each column represents the number of times that another word had appeared within the given window.2
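In outline, that accumulation method looks something like this; a minimal sketch under the post’s stated parameters (a three-word window on each side), not the code actually used for the study.

```python
from collections import Counter, defaultdict

def context_counts(tokens, window=3):
    """For each token, tally the tokens appearing within `window`
    words to its left and right (the keyword-to-context mapping)."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        contexts[word].update(left + right)
    return contexts

tokens = "the digital humanities are the humanities done digitally".split()
ctx = context_counts(tokens)
print(ctx["digital"]["the"])   # "the" appears twice within digital's window
```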

DTM that records keywords (rows) and their contexts (columns)

Table 2. Selection from Document-Term Matrix, demonstrating relationship between rows and columns. For example, the context word “2011” appears within a 3-word window of the keyword “association” on three separate occasions in Debates in the Digital Humanities. This likely refers to the 2011 Modern Language Association conference, an annual conference for literature scholars that is attended by many digital humanists.

Principal Component Analysis was then used to identify patterns in that matrix. Without going deeply into the details of PCA, the method looks for variables that tend to covary with one another (related to correlation). In theory, PCA can go through a matrix and identify every distinct covariance (this cluster of context words tends to appear with one another, and this other cluster appears with one another, etc.). In practice, researchers typically base their analyses only on the principal components (in this case, context-word clusters) that account for the largest amounts of variance, since these are the most prominent and least subject to noise.
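A hedged sketch of this step in scikit-learn, with a random matrix standing in for the real keyword-by-context counts:

```python
import numpy as np
from sklearn.decomposition import PCA

# A random matrix stands in for the real keyword-by-context counts
# (rows: 300 keywords, columns: 300 context words).
rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(300, 300)).astype(float)

pca = PCA(n_components=2).fit(counts)
keyword_coords = pca.transform(counts)   # keyword positions in context space
loadings = pca.components_.T             # per-context-word weights (loadings)

print(keyword_coords.shape, loadings.shape)
```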

Figure 1. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities. (Click for larger image.)

The above visualization was produced using the first two principal components of the context space and projecting the keywords into it. The red arrows represent the loadings, or important context words, and the blue dots represent the keywords which they contextualize. Blue dots that appear near one another can be thought to have relatively similar contexts in Debates.

What we find is that digital and humanities are by far the two most prominent context words. Moreover, they are nearly perpendicular to one another, which means that they constitute very different kinds of contexts. Alan’s provocation turns out to be entirely well-founded in the literature under this set of operations in the sense that digital and humanities are doing categorically different intellectual work. (Are we surprised that a human close reader and scholar in the field should find the very pattern revealed by distant reading?)

Granted, Alan’s further provocation is to conceive of humanities as its root human, which is not the case in the discourse represented here. This lacuna in the 2012 edition of Debates sets the stage for Kim Gallon’s intervention in the 2016 edition, “Making a Case for the Black Digital Humanities.” That article articulates the human as a goal for digital scholarship under social conditions where raced bodies are denied access to “full human status” (Weheliye, qtd in Gallon). In the earlier edition, then, humanities would seem to be doing a different kind of work.

We can begin to outline the intellectual work that is being done by humanities and digital in 2012 by hanging at this bird’s-eye-view for a few moments longer. There are roughly three segments of context-space as separated by the loading arrows: the area to the upper-left of humanities, that below digital, and the space between these.

The words that are contextualized primarily by humanities describe a set of institutional actors: NEH, arts, sciences, colleges, disciplines, scholars, faculty, departments, as well as that previous institutional configuration “humanities computing.” The words contextualized primarily by digital are especially humanities, humanists, and humanist. (This is, after all, the name of the field they are seeking to articulate.) Further down, however, we find fields of research and methods: media, tools, technologies, pedagogy, publishing, learning, archives, resources.

If humanities had described a set of actors, and digital had described a set of research fields, then the space of their overlap is best accounted by one of its prominent keywords, doing. Other words contextualized by both humanities and digital include: centers, research, community, projects, scholarship. These are the things that we, as digital humanists, do.

Returning to our initial research question, it appears that the media theory terms media and technology are prominently conceived as digital in this discourse, whereas information and communication are not pulled strongly toward either context. This leads us to a new set of questions: What does it mean, within this discourse, for the former terms to be conceived as digital? What lacuna exists that neither of the latter terms is conceived digitally nor humanistically?

The answers to these questions call for a turn to close reading.

Postscript: Debates without the Digital Humanities

During the pre-processing, I made a heavy-handed and highly contestable decision. When observing context words, I omitted those appearing in the bi-gram “new york.” That is, I have treated that word pair as noise rather than signal, and the strength of its presence to be a distortion of the scholarly discourse.

The reasoning for such a decision is that the bi-gram’s prominence may have been an artifact of method. I have taken a unigram approach to the text, such that the new of “New York” is treated the same as in “new media” or “new forms of research.” At the same time, the quick-and-dirty text ingestion had pulled in footnotes and bibliographies along with the bodies of the essays, which partly explains why the “new” vector acts as context for dots like “university” and “press.” (These words continue to cluster near “new” in Figure 1, but much less visibly or strongly.)

Figure 2. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities, where tokens belonging to the bi-gram “new york” have been included during pre-processing. (Click for larger image.)

If we treat “new york” as textual signal, we may be inclined to draw a few further conclusions. First, as the film cliche goes, “The city is almost another character in the movie.” New York is a synecdoche for an academic institutional configuration that is both experimental and public facing, since the city is its geographic location. Second, the bi-grams “humanities computing” and “digital humanities” are as firmly entrenched in this comparatively new discourse as the largest city in the United States (the nationality of many but not all of the scholars in the volume), which offers a metric for the consistency of their usage.

But we can go in the other direction as well.

As Liu has suggested in his scholarly writing, distant readers may find the practice of “glitching” their texts revealing of institutional and social commitments that animate these. I take one important example of this strategy to be the counterfactual, as has been used by literature scholars in social network analysis. In a sense, this post has given primacy to a glitched/counterfactual version of Debates — from which “new york” has been omitted — and we have begun to recover the text’s conditions of production by moving between versions of the text.

I will close, however, with a final question that results from a further counterfactual. Let’s omit a second bi-gram: “digital humanities.” What do we talk about when we don’t talk about digital humanities?

Figure 3. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities, where tokens belonging to the bi-gram “digital humanities” have been excluded during pre-processing. (Click for larger image.)


1. This context-accumulation method is based on one that was developed by Richard So and myself for our forthcoming article “Whiteness: A Computational Literary History.” The interpretive questions in that article primarily deal with semantics and differences in usage, and therefore the keyword-context matrix is passed through a different set of operations than those seen here. However, the basic goal is the same: to observe the relationships between words that are mediated by their actual usage in context.

Note that two parameters must be used in this method: a minimum frequency to include a token as a keyword and a window-size in which context words are observed. In this case, keywords were considered the 300 most common tokens in the text, since our least common keyword of interest “communication” was about the 270th most common token. Similarly, we would hope to observe conjunctions of our media theoretical terms in the presence of either digital or human, so we give these a relatively wide berth with a three-word window on either side.

2. This matrix is then normalized using a Laplace smooth over each row (as opposed to the more typical method of dividing by the row’s sum). In essence, this smoothing asks about the distance of a keyword’s observed context from a distribution where every context word had been equally likely. This minimizes the influence of keywords that appear comparatively few times and increases our confidence that changes to the corpus will not have a great impact on our findings.
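The note does not give the exact formula, so the following is one plausible reading of that step: add-one (Laplace) smoothing applied to each row before normalizing it into a probability distribution.

```python
import numpy as np

def laplace_normalize(matrix):
    """Add-one smoothing per row: every context word receives a small
    baseline count, pulling sparse rows toward the uniform distribution."""
    smoothed = matrix + 1.0
    return smoothed / smoothed.sum(axis=1, keepdims=True)

counts = np.array([[3.0, 0.0, 1.0],    # a well-observed keyword
                   [0.0, 0.0, 0.0]])   # a keyword with no observed context
probs = laplace_normalize(counts)
print(probs.round(3))
```

Note how the empty second row becomes uniform rather than undefined, which is the sense in which rarely observed keywords exert less influence.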

This blog post, however, does not transform column values into standard units. Although this is a standard method when passing data into PCA, it would have the effect of rendering each context word equally influential in our model, eliminating information regarding the strength of the contextual relationships we hope to observe. If we were interested in semantics on the other hand, transformation to standard units would work toward that goal.

Update Feb 16, 2017: Code supporting this blog post is available on Github.

Reading Distant Readings

This post offers a brief reflection on the previous three on distant reading, topic modeling, and natural language processing. These were originally posted to the Digital Humanities at Berkeley blog.

When I began writing a short series of blog posts for the Digital Humanities at Berkeley, the task had appeared straightforward: answer a few simple questions for people who were new to DH and curious. Why do distant reading? Why use popular tools like mallet or NLTK? In particular, I would emphasize how these methods had been implemented in existing research because, frankly, it is really hard to imagine what interpretive problems computers can even remotely begin to address. This was the basic format of the posts, but as I finished the last one, it became clear that the posts themselves were a study in contrasts. Teasing out those differences suggests a general model for distant reading.

Whereas the first post was designed as a general introduction to the field, the latter two had been organized around individual tools. Their motivations were something like: “Topic modeling is popular. The NLTK book offers a good introduction to Python.” More pedagogical than theoretical. However, digging into the research for each tool unexpectedly revealed that the problems NLTK and mallet sought to address were nearly orthogonal. It wasn’t simply that they each addressed different problems, but that they addressed different categories of problems.

Perhaps the place where that categorical difference was thrown into starkest relief was Matt Jockers’s note on part-of-speech tags and topic modeling, which was examined in the post on NLTK. The thrust of his chapter’s argument had been that topic modeling is a useful way to get at literary theme. However, in a telling footnote, Jockers makes the observation that the topics produced from his set of novels looked very different when he restricted the texts to their nouns alone versus including all words. As he found, the noun-only topics seemed to get closer to literary theoretical treatments of theme. This enabled him to proceed answering his research questions, but the methodological point itself was profound: modifying the way he processed his texts into the topic model performed interpretively useful work — even while using the same basic statistical model.
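The noun-filtering move can be sketched as follows; the tags are supplied by hand here, whereas in practice one would use a tagger such as nltk.pos_tag (which requires its model data) before passing the filtered tokens to the topic model.

```python
# Sketch of the noun-only move: filter tokens by part-of-speech tag before
# topic modeling. Tags here are supplied by hand; in practice one would use
# a tagger such as nltk.pos_tag, which requires its model data.
tagged = [("the", "DT"), ("dark", "JJ"), ("sea", "NN"),
          ("whales", "NNS"), ("swam", "VBD"), ("slowly", "RB")]

# Penn Treebank noun tags all begin with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```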

The post on topic modeling itself made this kind of argument implicitly, but along yet a third axis. Many of the research projects described there use a similar natural language processing workflow (tokenization, stop word removal) and a similar statistical model (the mallet implementation of LDA or a close relative). The primary difference across them is the corpus under observation. A newspaper corpus makes newspaper topics, a novel corpus makes novel topics, etc. Selecting one’s corpus is then a major interpretive move as well, separate from either natural language processing or statistical modeling.

Of course, in any discussion of topic modeling, the question consistently arises of how even to interpret the topics once they had been produced. What actually is the pattern they identify in the texts? Nearly all projects arrived at a slightly different answer.

I’ll move quickly to the punchline. There seem to be four major interpretive moments that can be found across the board in these distant readings: corpus construction, natural language processing, statistical modeling, and linguistic pattern.

The first three are a formalization of one’s research question, in the sense that they capture aspects of an interpretive problem. For example, returning to the introductory post, Ted Underwood and Jordan Sellers ask the question “How quickly do literary standards change?” which we may recast in a naive fashion: “How well can prestigious vs non-prestigious poetry (corpus) be distinguished over time (model) on the basis of diction (natural language features)?” Answering this formal question produces a measurement of a linguistic pattern. In Underwood and Sellers’s case, this is a list of percentage values representing how likely each text is to be prestigious. That output then requires its own interpretation if any substantial claim is to be made.

(I described my rephrasing of their research question as “naive” in the sense that it had divorced the output from what was interpretively at stake. The authors’ discursive account makes this clear.)

Figure. A general model of distant reading.

In terms of workflow, all of these interpretive moments occur sequentially, yet are interrelated. The research question directly informs decisions regarding corpus construction, natural language processing, and the statistical model, while each of the three passes into the next. All of these serve to identify a linguistic pattern, which — if the middle three have been well chosen — allows one to answer that initial question. To illustrate this, I offer the above visualization from Laura K. Nelson’s and my recent workshop on distant reading (literature)/text analysis (social science) at the Digital Humanities at Berkeley Summer Institute.

Although these interpretive moments are designed to account for the particular distant readings which I have written about, there is perhaps even a more general version of this model as well. Replace natural language processing with feature representation and linguistic pattern with simply pattern. In this way, we may also account for sound or image based distant readings alongside those of text.

My aim here is to articulate the process of distant reading, but the more important point is that this is necessarily an interpretive process at every step. Which texts one selects to observe, how one transforms the text into something machine-interpretable, what model one uses to account for a phenomenon of interest: These decisions encode our beliefs about the texts. Perhaps we believe that literary production is organized around novelistic themes or cultural capital. Perhaps those beliefs bear out as a pattern across texts. Or perhaps not — which is potentially just as interesting.

Distant reading has never meant a cold machine evacuating life from literature. It is neither a Faustian bargain, nor is it hopelessly naive. It is just one segment in a slightly enlarged hermeneutic circle.

I continue to believe, however, that computers are basically magic.

A Humanist Apologetic of Natural Language Processing; or A New Introduction to NLTK

This post originally appeared on the Digital Humanities at Berkeley blog. It is the second in what became an informal series. Images have been included in the body of this post, which we were unable to do originally. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

Computer reading can feel like a Faustian bargain. Sure, we can learn about linguistic patterns in literary texts, but it comes at the expense of their richness. At bottom, the computer simply doesn’t know what or how words mean. Instead, it merely recognizes strings of characters and tallies them. Statistical models then try to identify relationships among the tallies. How could this begin to capture anything like irony or affect or subjectivity that we take as our entry point to interpretive study?

I have framed computer reading in this way before – simple counting and statistics – however I should apologize for misleading anyone, since that account gives the computer far too much credit. It might imply that the computer has an easy way to recognize useful strings of characters. (Or to know which statistical models to use for pattern-finding!) Let me be clear: the computer does not even know what constitutes a word or any linguistically meaningful element without direct instruction from a human programmer.

In a sense, this exacerbates the problem the computer had initially posed. The signifier is not merely divorced from the signified but it is not even understood to signify at all. The presence of an aesthetic, interpretable object is entirely unknown to the computer.

Teasing out the depth of the computer’s naivety to language, however, highlights the opportunity for humanists to use computers in research. Simply put, the computer needs a human to tell it what language consists of – that is, which objects to count. Following the description I’ve given so far, one might be inclined to start by telling the computer how to find the boundaries between words and treat those as individual units. On the other hand, any humanist can tell you that equal attention to each word as a separable unit is not the only way to traverse the language of a text.

Generating instructions for how a computer should read requires us to make many decisions about how language should be handled. Some decisions are intuitive, others arbitrary; some have unexpected consequences. Within the messiness of computer reading, we find ourselves encoding an interpretation. What do we take to be the salient features of language in the text? For that matter, how do we generally guide our attention across language when we perform humanistic research?

The instructions we give the computer are part of a field referred to as natural language processing, or NLP. In the parlance, natural languages are ones spoken by humans, as opposed to the formal languages of computers. Most broadly, NLP might be thought of as the translation from one language type to another. In practice, it consists of a set of techniques and conventions that linguists, computer scientists, and now humanists use in the service of that translation.

For the remainder of this blog post, I will offer an introduction to the Natural Language Toolkit (NLTK), a suite of NLP tools for the programming language Python. Each section will focus on a particular tool or resource in NLTK and connect it to an interpretive research question. The implicit understanding is that NLP is not a set of tools that exists in isolation but one that necessarily performs part of the work of textual interpretation.

I am highlighting NLTK for several reasons, not least of which is the free, online textbook describing its tools (with exercises for practice!). That textbook doubles as a general introduction to Python and assumes no prior knowledge of programming.[1] Beyond the pedagogical motivation, NLTK contains both tools that are implemented in a great number of digital humanities projects and others that have not yet been fully explored for their interpretive power.

from nltk import word_tokenize

As described above, the basic entry point into NLP is simply to take a text and split it into a series of words, or tokens. In fact, this can be a tricky task. Even though most words are divided by spaces or line breaks, there are many exceptions, especially involving punctuation. Fortunately, NLTK’s tokenizing function, word_tokenize(), is relatively clever about finding word boundaries. One simply places a text of interest inside the parentheses, and the function returns an ordered list of the words it contains.

As it turns out, simply knowing which words appear in a text encodes a great deal of information about higher-order textual features, such as genre. The technique of dividing a text into tokens is so common it would be difficult to offer a representative example, but one might look at Hoyt Long and Richard So’s study of the haiku in modernist poetry, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” They use computational methods to learn the genre’s distinctive vocabulary and think about its dissemination across the literary field.


“A sample list of probability measures generated from a single classification test. In this instance, the word sky was 5.7 times more likely to be associated with nonhaiku (not-haiku) than with haiku. Conversely, the word snow was 3.7 times more likely to be associated with haiku than with nonhaiku (not-haiku).” Long, So 236; Figure 8

I would point out here that tokenization itself requires the programmer to make interpretive decisions. For example, by default, when word_tokenize() sees the word “wouldn’t” in a text, it will produce two separate tokens, “would” and “n’t”. If one’s research question were to examine ideas of negation in a text, it might serve one well to tokenize in this way, since it would handle all negative contractions as instances of the same phenomenon. That is, “n’t” would be drawn from “shouldn’t” and “hadn’t” as well. On the other hand, these default interpretive assumptions might adversely affect your research into a corpus, so NLTK also offers alternative tokenizers – including a regular-expression tokenizer whose rules the researcher defines – that leave contractions intact.

NLTK similarly offers a sent_tokenize() function, if one wishes to divide the text along sentence boundaries. Segmentation at this level underpins the stylistic study by Sarah Allison et al. in their pamphlet, “Style at the Scale of the Sentence.”

from nltk.stem import *

When tokens consist of individual words, they carry semantic meaning, but in most natural languages they carry grammatical inflection as well. For example, love, loves, lovable, and lovely all share the same root word, while the endings map it into different grammatical positions. If we wish to shed grammar in order to focus on semantics, there are two major strategies.

The simpler and more flexible method is to artificially reconstruct a root word – the word’s stem – by removing common endings. A very popular function for this task is the SnowballStemmer(). For example, loves and lovely are both stemmed to love, while argue, argues, and arguing are all reduced to argu. As the latter shows, the stem itself need not be a complete word, but it captures instances of all inflected forms. Snowball is especially powerful in that it is designed to work for many Western languages.
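A brief sketch of stemming (the argue family is the canonical Snowball demonstration; no data downloads are required, since the stemmer is purely rule-based):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# All four inflected forms collapse to the same (non-word) stem
for word in ["argue", "argued", "argues", "arguing"]:
    print(word, "->", stemmer.stem(word))  # each prints the stem 'argu'

# The same class supports stemmers for many Western languages
print(SnowballStemmer.languages)
```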

If we wish to keep our output in the natural language at hand, we may prefer a more sophisticated but less universally applicable technique that identifies a word’s lemma, essentially its dictionary form. For English nouns, that typically means changing plurals to singular; for verbs it means the infinitive. In NLTK, this is done with WordNetLemmatizer(). Unless told otherwise, that function assumes all words are nouns, and as of now, it is limited to English. (This is just one application of WordNet itself, which I will describe in greater detail below.)

As it happens, Long and So performed lemmatization of nouns during the pre-processing in their study above. The research questions they were asking revolved around vocabulary and imagery, so it proved expedient to collapse, for instance, skies and sky into the same form.

from nltk import pos_tag

As trained readers, we know that language partly operates according to (or sometimes against!) abstract, underlying structures. For as many cases where we may wish to remove grammatical information from our text by lemmatizing, we can imagine others for which it is essential. Identifying a word’s part of speech, or tagging it, is an extremely sophisticated task that remains an open problem in the NLP world. At this point, state-of-the-art taggers have somewhere in the neighborhood of 98% accuracy. (Be warned that accuracy is typically gauged on non-literary texts.)

NLTK’s default tagger, pos_tag(), has an accuracy just shy of that with the trade-off that it is comparatively fast. Simply place a list of tokens between its parentheses and it returns a new list where each item is the original word alongside its predicted part of speech.

This kind of tool might be used in conjunction with general tokenization. For example, Matt Jockers’s exploration of theme in Macroanalysis relied on word tokens, but specifically those the computer had identified as nouns. In doing so, he is sensitive to the interpretive problems this selection raises. Dropping adjectives from his analysis, he reports, loses information about sentiment. “I must offer the caveat […] that the noun-based approach used here is specific to the type of thematic results I wish to derive; I do not suggest this as a blanket approach” (131-133). Part-of-speech tags are used consciously to direct the computer’s attention toward features of the text that are salient to Jockers’s particular research question.


Thematically related nouns on the subject of “Crime and Justice;”
from Jockers’s blog post on methods

Recently, researchers at Stanford’s Literary Lab have used the part-of-speech tags themselves as objects for measurement, since they offer a strategy to abstract from the particulars of a given text while capturing something about the mode of its writing. In the pamphlet “Canon/Archive: Large-scale Dynamics in the Literary Field,” Mark Algee-Hewitt counts part-of-speech-tag pairs to think about different “categories” of stylistic repetition (7-8). As it happens, canonic literary texts have a preference for repetitions that include function words like conjunctions and prepositions, whereas ones from a broader, non-canonic archive lean heavily on proper nouns.

from nltk import ne_chunk

Among parts of speech, names and proper nouns are of particular significance, since they are the more-or-less unique keywords that identify phenomena of social relevance (including people, places, and institutions). After all, there is just one World War II, and in a novel, a name like Mr. Darcy typically acts as a more-or-less stable referent over the course of the text. (Or perhaps we are interested in thinking about the degree of stability with which it is used!)

The identification of these kinds of names is referred to as Named Entity Recognition, or NER. The challenge is twofold. First, it has to be determined whether a name spans multiple tokens. (These multi-token grammatical units are referred to as chunks; the process, chunking.) Second, we would ideally distinguish among categories of entity. Is Mr. Darcy a geographic location? Just who is this World War II I hear so much about?

To this end, the function ne_chunk() receives a list of tokens including their parts of speech and returns a nested list where named entities’ tokens are chunked together, along with their category as predicted by the computer.


Log-Scaled Counts of named locations by US State, 1851-1875; Wilkens 6, Figure 4

Similar to the way Jockers had used part of speech to instruct the computer which tokens to count, Matt Wilkens uses NER to direct his study of the “Geographic Imagination of Civil War-Era American Fiction.” By simply counting the number of times each unique location was mentioned across many texts (and, alternately, the number of novels in which it appeared), Wilkens is able to raise questions about the conventional wisdom around the American Renaissance, post-war regionalism, and just how much of a shift in literary attention the war had actually caused. Only chunks of tokens tagged GPE, or Geo-Political Entity, are needed for such a project.

from nltk.corpus import wordnet

I have spent a good deal of time explaining that the computer definitionally does not know what words mean; however, there are strategies by which we can begin to recover semantics. Once we have tokenized a text, for instance, we might look up those tokens in a dictionary or thesaurus. The latter is potentially of great value, since it creates clusters among words on the basis of meaning (i.e. synonyms). What happens when we start to think about semantics as a network?

WordNet is a resource that organizes language in precisely this way. In its nomenclature, clusters of synonyms around particular meanings are referred to as synsets. WordNet’s power comes from the fact that synsets are arranged hierarchically into hypernyms and hyponyms. Essentially, a synset’s hypernym is a category to which it belongs and its hyponyms are specific instances. Hypernyms for “dog” include “canine” and “domestic animal;” the hyponyms include “poodle” and “dalmatian.”

This kind of “is-a” hierarchical relationship goes all the way up and down a tree of relationships. If one goes directly up the tree, the hypernyms become increasingly abstract until one gets to a root hypernym. These are words like “entity” and “place.” Very abstract.

As an interpretive instrument, one can broadly gauge the abstractness – or rather, the specificity – of a given word by counting the number of steps taken to get from the word to its root hypernym, i.e. the length of the hypernym path. The greater the number of steps, the more specific the word is thought to be. In this case, the computer ultimately reads a number (a word’s specificity score) rather than the token itself.

In her study of feminist movements across cities and over time, “Political Logics as Cultural Memory: Local Continuities and Women’s Organizations in Chicago and New York City”, Laura K. Nelson gauges the abstractness of each movement’s essays and manifestos by measuring the average hypernym path length for each word in a given document. In turn, she finds that movements out of Chicago tended to focus on specific events and political institutions, whereas those out of New York situate themselves among broader ideas and concepts.

from nltk.corpus import cmudict

Below semantics, below even the word, is of course phonology. Phonemes lie at a rich intersection of dialect, etymology, and poetics that digital humanists have only just begun to explore. Fortunately, the process of looking up dictionary pronunciations can be automated using a resource like the CMU (Carnegie Mellon University) Pronouncing Dictionary.

In NLTK, this English-language dictionary is distributed as a simple list in which each entry consists of a word and its most common North American pronunciations. The entry includes not only the word’s phonemes but also whether syllables are stressed or unstressed. Texts, then, are no longer processed into semantically identifiable units but into representations of their aurality.


Segments of each text colored by their aural affinities to each of the other books under consideration. For example, the window on the left shows the text of Tender Buttons, while the prevalence of fuchsia highlighting indicates its aural similarity to the New England Cook Book; Clement et al, Figure 14

These features, among others, form the basis of a study by Tanya Clement et al on aurality in literature, “Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts”.[2] In the essay, the authors computationally explore the aural affinity between the New England Cookbook and Stein’s poem “Cooking” in Tender Buttons. Their findings offer a tentative confirmation of Margueritte S. Murphy’s previous literary-interpretive claims that Stein “exploits the vocabulary, syntax, rhythms, and cadences of conventional women’s prose and talk” to “[explain] her own idiosyncratic domestic arrangement by using and displacing the authoritative discourse of the conventional woman’s world.”

Closing Thought

Looking closely at NLP – the first step in the computer reading process – we find that our own interpretive assumptions are everywhere present. Our definition of literary theme may compel us to perform part-of-speech tagging; our theorization of gender may move us away from semantics entirely. The processing that occurs is not a simple mapping from natural language to formal, but constructs a new representation. We have already begun the work of interpreting a text once we focus attention on its salient aspects and render them as countable units.

Minimally, NLP is an opportunity for humanists to formalize the assumptions we bring to the table about language and culture. In terms of our research, that degree of explicitness means that we lay bare the humanistic foundations of our arguments each time we code our NLP. And therein lie the beginnings of scholarly critique and discourse.



Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. “Canon/Archive. Large-scale Dynamics in the Literary Field.” Literary Lab Pamphlet. 11 (2016).

Allison, Sarah, Marissa Gemma, Ryan Heuser, Franco Moretti, Amir Tevel, and Irena Yamboliev. “Style at the Scale of the Sentence.” Literary Lab Pamphlet. 5 (2013).

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, CA: O’Reilly Media, Inc. 2009.

Clement, Tanya, David Cheng, Loretta Auvil, Boris Capitanu, and Megan Monroe. “Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts.” Digital Humanities Quarterly. 7:1 (2013).

Jockers, Matthew. “Theme.” Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press, 2013. 118-153.

Jockers, Matthew. “‘Secret’ Recipe for Topic Modeling Themes.” http://www.matthewjockers.net/2013/04/12/secret-recipe-for-topic-modelin…(2013).

Long, Hoyt and Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:4 (2016): 235-267.

Nelson, Laura K. “Political Logics as Cultural Memory: Local Continuities and Women’s Organizations in Chicago and New York City.” (under review)

Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History. 25:4 (2013): 803-840.

[1] In fact, there is one piece of prior knowledge required: how to open an interface in which to do the programming. This took me an embarrassingly long time to figure out when I first started! I recommend downloading the latest version of Python 3.x through the Anaconda platform and following the instructions to launch the Jupyter Notebook interface.

[2] As the authors note, they experimented with the CMU Pronouncing Dictionary specifically but selected an alternative, OpenMary, for their project. CMU is a simple (albeit very long) list of words whereas OpenMary is a suite of tools that includes the ability to guess pronunciations for words that it does not already know and to identify points of rising and falling intonation over the course of a sentence. Which tool you ultimately use for a research project will depend on the problem you wish to study.