Operationalizing The Urn: Part 2

This post is the second in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. The first post laid out the rationale and stakes for such a method of reading. This post performs that distant reading in order to test Brooks’s literary-historical claims. The third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Distant Reading a Century of Structure

By Brooks’s account, close reading enjoys universal application over Anglo-American poems produced “since Shakespeare” because these employ the same broad formal structure. Let us test this hypothesis. To do so, we need to operationalize his textual model and use it to read a fairly substantial corpus.

Under Brooks’s model, structure is a matter of both sequence and scale. To evaluate sequence in its most rudimentary form, we will look at halves: the second half of a text sequentially follows the first. The matter of scale indicates what is to be halved: sentences, paragraphs, full texts. As a goal for our model, we will need to identify how second halves of things differentiate themselves from first halves. Moreover, this must be done in such a way that the differentiation of sentences’ halves occurs in dialogue with the differentiation of paragraphs’ halves and of books’ halves, and vice versa.

Regarding the corpus, we will deviate from Brooks’s own study by turning from poetry to fiction. This brings our study closer into line with current digital humanities scholarship, speaking to issues that have recently been raised regarding narrative and scale. We also need a very large corpus and access to full texts. To this end, we will draw from the Chicago Text Lab’s collection of twentieth-century novels.1 Because we hope to observe historical trends, we require balance across publication dates: twelve texts were randomly sampled from each year of the twentieth century.

Each text in our corpus is divided into an initial set of words and a final set of words; however, what constitutes initial-ness and final-ness for each text will be scale-dependent and randomly assigned. We will break our corpus into three groups of four hundred novels, one for each level of scale discussed. For example, from the first group of novels, we will collect the words belonging to the first half of each sentence into an initial bag of words (BOW) and those belonging to the second half of each sentence into a final BOW. To be sure, a BOW is simply a list of words and the frequencies with which they appear. In essence, we are asking whether certain words (or, especially, groups of words) conventionally indicate different structural positions. What are the semantics of qualification?

For example, Edna Ferber’s Come and Get It (1935) was included among the sentence-level texts. The novel begins:

DOWN the stairway of his house came Barney Glasgow on his way to breakfast. A fine stairway, black walnut and white walnut. A fine house. A fine figure of a man, Barney Glasgow himself, at fifty-three. And so he thought as he descended with his light quick step. He was aware of these things this morning. He savored them as for the first time.

The words highlighted in blue are considered to be the novel’s initial words, and those in red are its final words. Although this is only a snippet, we may note Ferber’s repeated use of sentence fragments, where each refers to “[a] fine” aspect of Barney Glasgow’s self-reflected life. That is, fine occurs three times as an initial word here and not at all as a final word. Do we expect certain narrative turns to follow this sentence-level setup? (Perhaps things will not be so fine after all!)

This process is employed at each of the other scales as well. From the second group of novels, we do the same with paragraphs: words belonging to the first half of each paragraph are collected into the initial BOW and from the second half into the final BOW. For the third group of novels, the same process is performed over the entire body of the text. In sum, each novel is represented by two BOWs, and we will search for patterns that distinguish all initial BOWs from all final BOWs simultaneously. That is, we hope to find patterns that operate across scales.
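The sentence-scale half-splitting described above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the original implementation: the sentence segmentation and tokenization here are deliberately naive, and I assume (without confirmation from the post) that odd-length sentences give their middle word to the initial half.

```python
# Sketch of sentence-scale BOW construction. Assumptions: naive sentence
# splitting on terminal punctuation, lowercase word tokens, and the extra
# token of an odd-length sentence assigned to the initial half.
from collections import Counter
import re

def split_halves(tokens):
    """Split a token list into (initial, final) halves."""
    mid = (len(tokens) + 1) // 2  # initial half keeps the middle token
    return tokens[:mid], tokens[mid:]

def sentence_bows(text):
    """Collect one initial and one final bag of words for a text."""
    initial, final = Counter(), Counter()
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        first, second = split_halves(tokens)
        initial.update(first)
        final.update(second)
    return initial, final

ini, fin = sentence_bows(
    "A fine house. A fine figure of a man, Barney Glasgow himself."
)
```

Under these assumptions, "fine" lands in the initial BOW of both toy sentences, echoing the pattern noted in the Ferber passage. The paragraph- and text-scale BOWs would be built the same way, with paragraphs or the full text in place of sentences.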

The comparison of textual objects belonging to binary categories has been performed in ways congenial to humanistic interpretation by using logistic regression.2 One way to think about this method is in terms of geometry. Imagining our texts as points floating in space, classification would consist of drawing a line that separates the categories at hand: initial BOWs, say, above the line and final BOWs below it. Logistic regression is a technique for choosing where to draw that line, based on the frequencies of the words it observes in each BOW.

The patterns that it identifies are not necessarily ones that are expressed in any given text, but ones that become visible at scale. There are several statistical virtues to this method that I will elide here, but I will mention that humanists have found it valuable for the fact that it returns a probability of membership in a given class. Its predictions are not hard-and-fast but fuzzy; this allows us to approach categorization as a problem of legibility and instability.3
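The geometric picture above translates directly into scikit-learn. The following toy sketch (with made-up counts standing in for real BOW features) shows the key point: the model returns probabilities of class membership rather than hard labels.

```python
# Toy logistic regression over bag-of-words counts. The feature matrix
# and labels are invented placeholders, not corpus data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are BOWs (word counts); label 1 = initial, 0 = final.
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 3, 2],
              [1, 2, 3]])
y = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X, y)

# Probability of membership in the "initial" class: the fuzzy
# prediction discussed above, rather than a hard yes/no.
probs = model.predict_proba(X)[:, 1]
```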

The distant reading will consist of this: we will perform a logistic regression over all of our initial and final BOWs (1,200 of each).4 In this way, the computer will “learn” by attempting to draw a line that will put sentence-initial, paragraph-initial, and narrative-initial BOWs on the same side, but none of the final BOWs. The most robust interpretive use of such a model is to make predictions about whether it thinks new, unseen BOWs belong to the initial or final category, and we will make use of this shortly.

Before proceeding, however, we may wish to know how well our statistical model has learned to separate these categories of BOWs in the first place. We can do this using a technique called Leave-One-Out Cross-Validation. Basically, we set aside the novels belonging to a given author at training time, when the logistic regression learns where to draw its line. We then make predictions for the initial and final BOWs (regardless of scale) of that particular author’s texts. By doing this for every author in the corpus, we can get a sense of the model’s performance.5
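Since whole authors are held out at a time, this is really leave-one-group-out validation, which scikit-learn supports directly. The sketch below uses random placeholder data and invented author groupings; only the cross-validation structure reflects the procedure described above.

```python
# Leave-one-author-out validation with LeaveOneGroupOut. Data and
# author assignments are placeholders, not the actual corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.random((12, 5))                # 12 BOWs, 5 word features
y = np.tile([1, 0], 6)                 # 1 = initial, 0 = final
authors = np.repeat([0, 1, 2, 3], 3)   # each author contributes 3 BOWs

predictions = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=authors):
    # Train with one author's texts entirely withheld...
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # ...then predict only on that author's BOWs.
    predictions[test_idx] = model.predict(X[test_idx])
```

Aggregating the held-out predictions across every author yields the overall accuracy figure.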

Such a method results in 87% accuracy.6 This is a good result, since it indicates that substantial and generalizable patterns have been found. Looking under the hood, we can learn a bit more about the patterns the model has identified. Among BOWs constructed at the sentence scale, the model predicts initial and final classes with 99% accuracy; at the paragraph scale, accuracy is 95%; and at the full-text scale, it is 68%.7 The textual structure that we have modeled makes itself most felt in the unfolding of sentences, followed by that of paragraphs, and only just noticeably in the move across halves of the novel. The grand unity of the text has lower resolution than its fine details.

We have now arrived at the point where we can test our hypothesis. The model has learned how to categorize BOWs as initial or final according to the method described above. We can now ask it to predict the categories of an entirely new set of texts: a control set of 400 previously unseen novels from the Chicago Text Lab corpus. These texts will not be divided in half according to the protocols described above. Instead, we will select half of their words randomly.

To review, Brooks claimed that the textual structures he identified were universally present across modernity. If the model’s predictions for these new texts skew toward either initial or final, and especially if we find movement in one direction or the other over time, then we will have preliminary evidence against Brooks’s claim. That is, we will have observed a shift in the degree or mode by which textual structure has been made legible linguistically. Do we find such evidence that this structure is changing over a century of novels? In fact, we do not.


Figure 1, Distribution of 400 control texts’ probabilities of initial-ness, by publication date; overlaid with best-fit line

The points in Figure 1 represent each control text, while their height indicates the probability that a given text is initial. (Subtract that value from 1 to get the probability that it is final.) We find a good deal of variation in the predictions; indeed, we expected such variation, since we chose words randomly from each text. The important finding, however, is that this variation does not reflect a chronological pattern. The best-fit line through the points is flat, and the correlation between predictions and publication date is virtually zero (r² < 0.001).
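The trend test behind Figure 1 amounts to a simple linear regression of predicted probability on publication year. The sketch below uses synthetic stand-in values (random noise around 0.5), not the model's actual predictions; it shows only the mechanics of the flat best-fit line and the near-zero r².

```python
# Regress predicted probability of initial-ness on publication year.
# The probabilities here are synthetic placeholders around 0.5.
import numpy as np

years = np.arange(1900, 2000)
probs = 0.5 + np.random.default_rng(1).normal(0, 0.1, size=years.size)

slope, intercept = np.polyfit(years, probs, 1)   # best-fit line
r = np.corrcoef(years, probs)[0, 1]              # Pearson correlation
r_squared = r ** 2

# A flat slope and near-zero r_squared indicate no chronological drift.
```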

This indicates that the kinds of words (and clusters of words) that had indexed initial-ness and final-ness remain in balance with one another over the course of twentieth-century fiction. It is true that further tests would need to be performed in order to increase our confidence in this finding.8 However, on the basis of this first experiment, we have tentative evidence that Brooks is correct. The formalism that underpins close reading has a provisional claim to empirical validity.

I would like to reiterate that these findings are far from the final word on operationalizing close reading. More empirical work must be done to validate these findings, and more theoretical work must be done to create more sophisticated models. Bearing that qualification in mind, I will take the opportunity to explore this particular model a bit further. Brooks had claimed that the textual structure that we have operationalized underpins semantic problems of irony, paradox, and ambiguity. Is it possible that this model can point us toward moments like these in a text?


I heartily thank Hoyt Long and Richard So for permission to use the Chicago Text Lab corpus. I would also like to thank Andrew Piper for generously making available the txtLAB_450 corpus.


1. This corpus spans the period 1880-2000 and is designed to reflect WorldCat library holdings of fiction by American authors across that period. Recent projects using this corpus include Piper, “Fictionality” and Underwood, “The Life Cycles of Genres.” For an exploration of the relationship between the Chicago Text Lab corpus and the HathiTrust Digital Library with respect to representations of gender, see Underwood & Bamman, “The Gender Balance of Fiction, 1800-2007.”

2. See, for example: Jurafsky, Chahuneau, Routledge, & Smith, “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus;” Underwood, “The Life Cycles of Genres;” Underwood & Sellers, “The Longue Durée of Literary Prestige”

3. For an extended theorization of these problems, using computer classification methods, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.”

4. Specifically, this experiment employs a regularized logistic regression as implemented in the Python package scikit-learn. Regularization is a technique that limits the effect any individual feature can have on the model’s predictions, which increases our confidence that the model we develop is generalizable beyond its training corpus. This is particularly important for literary text analysis, since each word in the corpus’s vocabulary may constitute a feature, which creates a risk of overfitting the model. Regularization reduces this risk.

When performing regularized logistic regression for text analysis, two parameters must be determined: the regularization constant and the feature set. Regarding the feature set, it is typical to use only the most common words in the corpus. The questions of how strong to make the regularization and how many words to use can be approached empirically.

The specific values were chosen through a grid search over combinations of parameter values, using ten-fold cross-validation on the training set (grouped by author). This grid search was not exhaustive but found a pair of values that lie within the neighborhood of the optimal pair: C = 0.001 and the 3,000 most frequent words.

Note also that prior to logistic regression, word frequencies in BOWs were normalized and transformed to standard units. Stop words were not included in the feature set.
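The configuration described in this footnote can be sketched as a scikit-learn pipeline. This is a simplified stand-in: the texts and labels are toy data, CountVectorizer produces raw counts rather than the normalized frequencies the post describes, and the grid search over C and vocabulary size (which sklearn's GridSearchCV could perform) is omitted.

```python
# Simplified sketch of the stated configuration: top-3000 vocabulary,
# stop words removed, scaling to standard units, and a strongly
# regularized logistic regression (C = 0.001). Toy data throughout.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

texts = ["down the stairway came barney",
         "a fine house a fine man",
         "he savored them this morning",
         "so he thought as he descended"]
labels = [1, 1, 0, 0]  # 1 = initial BOW, 0 = final BOW

pipeline = make_pipeline(
    CountVectorizer(max_features=3000, stop_words="english"),
    StandardScaler(with_mean=False),  # sparse input: scale variance only
    LogisticRegression(C=0.001),      # small C = strong regularization
)
pipeline.fit(texts, labels)
```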

5. This method is described by Underwood and Sellers in “Longue Durée.” The rationale for setting aside all texts by a particular author, rather than single texts at a time, is that what we think of as authorial style may produce consistent word usages across texts. Our goal is to create an optimally generalizable model, which requires we prevent the “leak” of information from the training set to the test set.

6. This and all accuracies reported are F1 scores. This value is generally considered more robust than a simple count of correct classifications, since it balances precision against recall, accounting for both false positives and false negatives.
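For readers unfamiliar with the metric, a toy illustration: the F1 score is the harmonic mean of precision and recall, computable directly with scikit-learn. The labels below are invented for the example.

```python
# F1 = harmonic mean of precision and recall, on invented labels.
from sklearn.metrics import f1_score

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]  # one false negative, no false positives

score = f1_score(y_true, y_pred)
# precision = 2/2 = 1.0, recall = 2/3, so F1 = 2*(1*2/3)/(1+2/3) = 0.8
```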

7. An F1-Score of 99% is extraordinary in literary text analysis, and as such it should be met with increased skepticism. I have taken two preliminary steps in order to convince myself of its validity, but beyond these, I invite readers to experiment with their own texts to find whether these results are consistent across novels and literary corpora. The code has also been posted online for examination.

First, I performed an unsupervised analysis over the sentence BOWs. A quick PCA visualization indicates an almost total separation between sentence-initial and sentence-final BOWs.


Figure 2. Distribution of sentence-initial BOWs (blue) and sentence-final BOWs (red) in the third and fourth principal components of PCA. PCA was performed over sentence BOWs alone.

The two PCs that are visualized here account for just 3.5% of the variance in the matrix. As an aside, I would point out that these are not the first two PCs but the third and fourth (ranked by their explained variance). This suggests that the difference between initial and final BOWs is not even the most substantial pattern across them. Perhaps it makes sense that something like chronology of publication dates or genre would dominate. By his own account, Brooks sought to look past these in order to uncover structure.
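This unsupervised check can be reproduced in outline as follows. The BOW matrix here is random placeholder data; the point is only the mechanics of fitting PCA and selecting the third and fourth components (zero-indexed as 2 and 3) along with their explained-variance share.

```python
# PCA over BOW vectors, inspecting the third and fourth components.
# The matrix is a random stand-in for the sentence BOWs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
bows = rng.random((100, 20))   # 100 BOWs x 20 word features

pca = PCA(n_components=4)
coords = pca.fit_transform(bows)
pc3, pc4 = coords[:, 2], coords[:, 3]        # components are zero-indexed
share = pca.explained_variance_ratio_[2:4].sum()  # variance they explain
```

Plotting `pc3` against `pc4`, colored by class, would give a scatter like Figure 2.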

Second, I performed the same analysis using a different corpus: the 150 English-language novels in the txtLAB450, a multilingual novel corpus distributed by McGill’s txtLab. Although only 50 novels were used for sentence-level modeling (compared to 400 from the Chicago corpus), sentence-level accuracy under Leave-One-Out Cross Validation was 98%. Paragraph-level accuracy dropped much further, while text-level accuracy remained about the same.

8. First and foremost, if we hope to test shifts over time, we will have to train on subsets of the corpus, corresponding to shorter chronological periods, and make predictions about other periods. This is an essential methodological point made in Underwood and Sellers’s “Longue Durée.” As such, we can only take the evidence here as preliminary.

In making his own historical argument, Brooks indicates that the methods used to read centuries of English literature were honed on poetry from the earliest (Early Modern) and latest (High Modern) periods. A training set drawn from the beginning and end years of the century should be the first such test. Ideally, one might use precisely the time periods he names, over a corpus of poetry.

Other important tests include building separate models for each scale of text individually and comparing these with the larger scale-simultaneous model. Preliminary tests on a smaller corpus showed differences in predictive accuracy between these types of models, suggesting that they were identifying different patterns, which I took to license the use of the scale-simultaneous model. This would need to be repeated with the larger corpus.

We may also wish to tweak the model as it stands. For example, we have treated single-sentence paragraphs as full paragraphs. The motivation is to see how their words perform double duty at both scales, yet it is conceivable that we would wish to remove this redundancy.

Or we may wish to build a far more sophisticated model. This one is built on a binary logic of first and second halves, which is self-consciously naive, whereas further articulation of the texts may offer higher resolution. Perhaps an unsupervised learning method would be better since it is not required to find a predetermined set of patterns.

And if one wished to contradict the claims I have made here, one would do well to examine the text level of the novel. The accuracy of the model is low enough at that scale that we can be confident other interesting phenomena are at work.

The most important point to be made here is not to claim that we have settled our research question, but to see that our preliminary findings direct us toward an entire program of research.


Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt Brace Jovanovich, 1975.

Jurafsky, Dan, et al. “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus.” Journal of Cultural Analytics. 2016.

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Pedregosa, F, et al. “Scikit-learn: Machine Learning in Python.” JMLR 12 (2011). 2825-2830.

Underwood, Ted. “The Life Cycles of Genres.” Journal of Cultural Analytics. 2016.

Underwood, Ted & David Bamman. “The Gender Balance of Fiction, 1800-2007.” The Stone and the Shell. 2016. Accessed March 2, 2017. https://tedunderwood.com/2016/12/28/the-gender-balance-of-fiction-1800-2007/

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.
