A Naive Empirical Post about DTM Weighting

In light of word embeddings’ recent popularity, I’ve been playing around with a version called Latent Semantic Analysis (LSA). Admittedly, LSA has fallen out of favor with the rise of neural embeddings like Word2Vec, but there are several virtues to LSA including decades of study by linguists and computer scientists. (For an introduction to LSA for humanists, I highly recommend Ted Underwood’s post “LSA is a marvelous tool, but…”.) In reality, though, this blog post is less about LSA and more about tinkering with it and using it for parts.

Like other word embeddings, LSA seeks to learn about words’ semantics by way of context. I’ll sidestep discussion of LSA’s specific mechanics by saying that it uses techniques that are closely related to ones commonly used in distant reading. Broadly, LSA constructs something like a document-term matrix (DTM) and then performs something like Principal Component Analysis (PCA) on it.1 (I’ll be using those later in the post.) The art of LSA, however, lies in between these steps.

Typically, after constructing a corpus matrix, LSA involves some kind of weighting of the raw word counts. The most familiar weight scheme is (l1) normalization: sum up the number of words in a document and divide the individual word counts, so that each cell in the matrix represents a word’s relative frequency. This is something distant readers do all the time. However, there is an extensive literature on LSA devoted to alternate weights that improve performance on certain tasks, such as analogies or document retrieval, and on different types of documents.

This is the point that piques my curiosity. Can we use different weightings strategically to capture valuable features of a textual corpus? How might we use a semantic model like LSA in existing distant reading practices? The similarity of LSA to a common technique (i.e. PCA) for pattern finding and featurization in distant reading suggests that we can profitably apply its weight schemes to work that we are already doing.

What follows is naive empirical observation. No hypotheses will be tested! No texts will be interpreted! But I will demonstrate the impact that alternate matrix weightings have on the patterns we identify in a test corpus. Specifically, I find that applying LSA-style weights has the effect of emphasizing textual subject matter which correlates strongly with diachronic trends in the corpus.

I resist making strong claims about the role that matrix weighting plays in the kinds of arguments distant readers have made previously (after all, that would require hypothesis testing), but I hope to shed some light on this under-explored piece of the interpretive puzzle.

Matrix Weighting

Matrix weighting for LSA gets broken out into two components: local and global weights. The local weight measures the importance of a word within a given text, while the global weight measures the importance of that word across texts. These are treated as independent functions that multiply one another.

W_{i,j} = L_{i,j}\cdot G_{i}

where i refers to the row corresponding to a given word and j refers to the column corresponding to a given text.2 One common example of such a weight scheme is tf-idf, or term frequency-inverse document frequency. (Tf-idf has been discussed at length for its application to literary study by Stephen Ramsay and others.)

Research on LSA weighting generally favors measurements of information and entropy. Typically, term frequencies are locally weighted using a logarithm and globally weighted by measuring the term’s entropy across documents. How much information does a term contribute to a document? How much does the corpus attribute to that term?

While there are many variations of these weights, I have found one of the most common formulations to be the most useful on my toy corpus.3

L_{i,j} = log(tf_{i,j} + 1)
G_{i} = 1 + \frac{\sum_{j=1}^{N}p_{i,j}\cdot log(p_{i,j})}{log(N)}

where tf_{i,j} is the raw count of a term in a document, p_{i,j} is the conditional probability of a document given a term, and N is the total number of documents in the corpus. For the global weights, I used normalized (l1) frequencies for each term when calculating the probabilities p_{i,j}. This prevents longer texts from having disproportionate influence on the term’s entropy. Vector (l2) normalization is applied over each document after the weights have been applied.

To be sure, this is not the only weight and normalization procedure one might wish to use. The formulas above merely describe the configuration I have found most satisfying (so far) on my toy corpus. Applying variants of those formulas on new data sets is highly encouraged!
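
For concreteness, here is a minimal NumPy sketch of that weighting procedure. It assumes a DTM of raw counts with documents as rows and terms as columns (the orientation I use later in the post), and that every document and every term has at least one count. This is only a sketch of the configuration described above, not the exact code behind the figures below.

```python
import numpy as np

def log_entropy_weight(dtm):
    """Log-entropy weight a document-term matrix of raw counts (documents x terms)."""
    dtm = np.asarray(dtm, dtype=float)
    n_docs = dtm.shape[0]

    # Local weight: log(tf + 1)
    local = np.log(dtm + 1)

    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(N), where p_ij is computed
    # from each document's relative (l1-normalized) frequencies for the term
    rel_freq = dtm / dtm.sum(axis=1, keepdims=True)       # relative frequency per document
    p = rel_freq / rel_freq.sum(axis=0, keepdims=True)    # probability of document given term
    safe_p = np.where(p > 0, p, 1.0)                      # avoid log(0); those cells contribute 0
    global_ = 1 + (p * np.log(safe_p)).sum(axis=0) / np.log(n_docs)

    # Combine, then l2-normalize each document vector
    weighted = local * global_
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
```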

Log-Entropy vs. Relative Frequency

My immediate goal for this post is to ask what patterns get emphasized by log-entropy weighting a DTM. Specifically, I’m asking what gets emphasized when we perform a Principal Component Analysis; however, the fact that PCA models feature variance suggests that the results may have bearing on further statistical models. We can do this at a first approximation by comparing the results from log-entropy to those of vanilla normalization. To this end, I have pulled together a toy corpus of 400 twentieth-century American novels. (This sample approximates WorldCat holdings, distributed evenly across the century.)

PCA itself is a pretty useful model for asking intuitive questions about a corpus. The basic idea is that it looks for covariance among features: words that appear in roughly the same proportions to one another (or inversely) across texts probably indicate some kind of interesting subject matter. Similarly, texts that contain a substantive share of these terms may signal some kind of discourse. Naturally, there will be many such clusters of covarying words, and PCA will rank each of these by the magnitude of variance they account for in the DTM.

Below are two graphs that represent the corpus of twentieth-century novels. Each novel appears as a circle, while the color indicates its publication date: the lightest blue corresponds to 1901 and the brightest purple to 2000. Note that positive/negative signs are assigned arbitrarily, so that, in Fig. 1, time appears to move from the lower-right hand corner of the graph to the upper-left.

To produce each graph, PCA was performed on the corpus DTM. The difference between graphs consists in the matrices’ weighting and normalization. To produce Fig. 2, the DTM of raw word counts was (l1) normalized by dividing all terms by the sum of the row. On the other hand, Fig. 1 was produced by weighting the DTM according to the log-entropy formula described earlier. Note that stop words were not removed from either DTM.4
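
A rough sketch of how the two pipelines differ is below, assuming dtm is a NumPy array of raw counts (documents as rows) and log_entropy_weight() is the sketch given earlier; scikit-learn’s PCA is used here as a stand-in for whatever implementation produced the actual figures.

```python
from sklearn.decomposition import PCA

# Fig. 2 pipeline: plain (l1) normalization to relative frequencies
l1_dtm = dtm / dtm.sum(axis=1, keepdims=True)
pca_l1 = PCA(n_components=2)
coords_l1 = pca_l1.fit_transform(l1_dtm)

# Fig. 1 pipeline: log-entropy weighting (l2 normalization applied inside)
le_dtm = log_entropy_weight(dtm)
pca_le = PCA(n_components=2)
coords_le = pca_le.fit_transform(le_dtm)

# Share of variance captured by the first two components in each case
print(pca_l1.explained_variance_ratio_, pca_le.explained_variance_ratio_)
```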

PCA: 400 20th Century American Novels

Figure 1. Log-Entropy Weighted DTM. Explained Variance: PC 1 (X) 2.8%, PC 2 (Y) 1.8%

Figure 2. Normalized DTM. Explained Variance: PC 1 (X) 25%, PC 2 (Y) 10%

The differentiation between earlier and later novels is visibly greater in Fig. 1 than in Fig. 2. We can gauge that intuition formally, as well. The correlation between these two principal components and publication year is substantial, r2 = 0.49. Compare that to Fig. 2, where r2 = 0.07. For the log-entropy matrix, the first two principal components encode a good deal of information about the publication year, while those of the other matrix appear to capture something more synchronic.5
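
As a sketch of how that statistic might be computed (an assumption about the exact procedure, not a description of it), one can regress publication year on the first two principal components from each pipeline, reusing coords_le and coords_l1 from the earlier sketch and assuming a years array aligned with the corpus.

```python
from sklearn.linear_model import LinearRegression

# years: array of publication dates, one per novel, in DTM row order (assumed)
r2_le = LinearRegression().fit(coords_le, years).score(coords_le, years)   # reported: ~0.49
r2_l1 = LinearRegression().fit(coords_l1, years).score(coords_l1, years)   # reported: ~0.07
```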

Looking at the most important words for the first principal component of each graph (visualized as the x-axis; accounting for the greatest variance among texts in the corpus) is pretty revealing about what’s happening under the hood. Table 1 and Table 2 show the top positive and negative words for the first principal component from each analysis. Bear in mind that these signs are arbitrary and solely indicate that words are inversely proportional to one another.

Top 10 ranked positive and negative words in the first principal component
Table 1: Log-Weighted DTM

Positive     Negative
upon         guy
morrow       guys
presently    gonna
colour       yeah
thus         cop
carriage     cops
madame       phone
ah           uh
exclaimed    okay
_i_          jack

Table 2: Normalized DTM

Positive     Negative
the          she
of           you
his          her
and          it
he           said
in           me
from         to
their        don
by           what
they         know

The co-varying terms in the log-entropy DTM in Table 1 potentially indicate high-level differences in social milieu and genre. Among the positive terms, we find British spelling and self-consciously literary diction. (Perhaps it is not surprising that many of the most highly ranked novels on this metric belong to Henry James.) The negative terms, conversely, include informal spelling and word choice. Similarly, the transition from words like carriage and madame to phone and cops gestures toward changes that we might expect in fiction and society during the twentieth century.

Although I won’t describe them at length, I would like to point out that the second principal component (y-axis) is equally interesting. The positive terms refer to planetary conflict, and the cluster of novels at the top of the graph consists of science fiction. The inverse terms include familial relationships, common first names, and phonetic renderings of African American Vernacular English (AAVE). (It is telling that some of the novels located furthest in that direction are those of Paul Laurence Dunbar.) This potentially speaks to ongoing conversations about race and science fiction.

On the other hand, the top-ranking terms in Table 2 for the vanilla-normalized DTM are nearly all stop words. This makes sense when we remember that they are by far the most frequent words in a typical text. By their sheer magnitude, any variation in these common words will constitute a strong signal for PCA. Granted, stop words can easily be removed during pre-processing of the texts. However, this involves a trade-off for the researcher, since stop words are known to encode valuable textual features like authorship or basic structure.

The most notable feature of the normalized DTM’s highly ranked words is the inverse relationship between gendered pronouns. Novels about she and her tend to speak less about he and his, and vice versa. The other words in the list don’t give us much of an interpretive foothold on the subject matter of such novels; however, we can find a pattern in the metadata: the far left side of the graph is populated mostly by women-authored novels and the far right by men. PCA seems to be performing stylometry and finding substantial gendered difference. This potentially gestures toward the role that novels play as social actors, part of a larger cultural system that reproduces gendered categories.

This is a useful finding in and of itself; however, looking at further principal components of the normalized DTM reveals the limitation of simple normalization. The top-ranking words for the next several PCs comprise the same stop words in different proportions. PCA is simply accounting for stylometric variation. Again, it is easy enough to remove stop words from the corpus before processing. Indeed, removing stop words reveals an inverse relationship between dialogue and moralizing narrative.6 Yet, even with a different headline pattern, one finds oneself in roughly the same position. The first several PCs articulate variations of the most common words. The computer is more sensitive to patterns among words with relatively higher frequency, and Zipf’s Law indicates that some cluster of words will always rise to the top.

Conclusion

From this pair of analyses, we can see that our results were explicitly shaped by the matrix weighting. Log-entropy got us closer to the subject matter of the corpus, while typical normalization captures more about authorship (or basic textual structure, when stop words are removed). Each of these is attuned to different but equally valuable research questions, and the selection of matrix weights will depend on the category of phenomena one hopes to explore.

I would like to emphasize that it is entirely possible that the validity of these findings is limited to applications of PCA. Log-entropy was chosen here because it optimizes semantic representation on a PCA-like model. Minimally, we may find it useful when we are looking specifically at patterns of covariance or even when using the principal components as features in other models.

However, this points us to a larger discussion about the role of feature representation in our modeling. As a conceptual underpinning, diction takes us pretty far toward justifying the raw counts of words that make up a DTM. (How often did the author choose this word?) But it is likely that we wish to find patterns that do not reduce to diction alone. The basis of a weight scheme like log-entropy in information theory moves us to a different set of questions about representing a text to the computer. (How many bits does this term contribute to the novel?)

The major obstacle, then, is not technical but interpretive. Information theory frames the text as a message that has been stored and transmitted. The question this raises: what else is a novel, if not a form of communication?

Footnotes

1. In reality, LSA employs a term-context matrix that turns the typical DTM on its side: rows correspond to unique terms in the corpus and columns to documents (or another unit that counts as “context,” such as a window of adjacent words). After weighting and normalizing the matrix, LSA performs a Singular Value Decomposition (SVD). PCA is an application of SVD.

2. Subscripts i and j correspond to the rows and columns of the term-context matrix described in fn. 1, rather than the document-term matrix that I will use later.

3. This formulation appears, for example, in Nakov et al. (2001) and Pincombe (2004). I have tested variations of these formulas semi-systematically, given the corpus and pre-processing. This pair of local and global weights was chosen because it resulted in principal components that correlated most strongly with publication date, as described in the next section. It was also the case that weight schemes tended to minimize the role of stop words in rough proportion to that correlation with date. The value of this optimization is admittedly more intuitive than formal.

For the results of other local and global weights, see the Jupyter Notebook in the GitHub repo for this post.

4. This is an idiosyncratic choice. It is much more common to remove stop words in both LSA specifically and distant reading generally. At bottom, the motivation to keep them is to identify baseline performance for log-entropy weighting, before varying pre-processing procedures. In any case, I have limited my conclusions to ones that are valid regardless of stop word removal.

Note also that roughly the 12,000 most common tokens were used in this model, accounting for about 95% of the total words in the corpus. Reducing that number to the most common 3,000 tokens, or about 80% of total words, did not appreciably change the results reported in this post.

5. When stop words were removed, log-entropy performed slightly worse on this metric (r2 = 0.48), while typical normalization performed slightly better (r2 = 0.10). The latter result is consistent with the overall trend that minimizing stop words improves correlation with publication date. However, the negative effect on log-entropy suggests that it is able to leverage valuable information encoded in stop words, even while the model is not dominated by them.

6. See Table 3 below for a list of the top-ranked words in the first principal component when stop words are removed from the normalized DTM. Note that, for log-entropy, removing stop words does not substantially change its list of top words.

Top 10 ranked positive and negative words in the first principal component
Table 3: Normalized DTM, stopwords removed

Positive     Negative
said         great
don          life
got          world
didn         men
know         shall
ll           young
just         day
asked        new
looked       long
right        moment

Operationalizing The Urn: Part 3

This post is the third in a series on operationalizing the method of close reading in Cleanth Brooks’s The Well Wrought Urn. The first post had laid out the rationale and stakes for such a method of reading, and the second post had performed that distant reading in order to test Brooks’s literary historical claims. This final post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Irony (& Paradox & Ambiguity)

For Cleanth Brooks, the word is the site where irony is produced. Individual words do not carry meanings with them a priori, but instead their meanings are constructed dynamically and contingently through their use in the text: “[T]he word as the poet uses it, has to be conceived of, not as a discrete particle of meaning, but as a potential of meaning, a nexus or cluster of meanings” (210). This means that words are deployed as flexible semantic objects that are neither predetermined nor circumscribed by a dictionary entry. In fact, he refuses to settle on a particular name for this phenomenon of semantic construction, saying that further work must be done in order to better understand it. Brooks uses the terms paradox and ambiguity at several points; however, as a shorthand, I will simply use the term irony to refer to the co-presence of multiple, unstable, or incommensurate meanings.

This commitment to the word as a discrete-yet-contextualized unit is already encoded into the distant reading of the previous post. We had found provisional evidence for Brooks’s empirical claims about literary history, based on patterns across words that traverse multiple textual scales. The bag-of-words (BOW) model used to represent texts in our model had counted the frequencies of words as individual units, while the logistic regression had looked at trends across these frequencies, including co-occurrence. (Indeed, Brooks’s own interpretive commitment had quietly guided the selection of the logistic regression model.)

Previously, I described the process of learning the difference between initial and final BOWs in terms of geometry; however, I now will point us to the only-slightly grittier algebra behind the spatial intuition. When determining where to draw the line between categories of BOW, logistic regression learns how much to weight each word in the BOW when making its decision. For example, the model may have found that some words appear systematically in a single category of BOW; these have received larger weights. Other words will occur equally in both initial and final BOWs, making them unreliable predictors of the BOW’s category. As a result, these words receive very little weight. Similarly, some words are too infrequent to give evidence one way or the other.

INITIAL
Word      Weight
chapter   -7.65
oh        -5.98
yes       -5.40
took      -4.67
thank     -4.57
tall      -4.33
does      -3.74
sat       -3.51
let       -3.12
built     -3.10

FINAL
Word      Weight
asked     4.76
away      4.33
happy     3.62
lose      3.51
forever   3.50
rest      3.48
tomorrow  3.21
kill      3.20
cheek     3.16
help      3.12

Table 1. Top 10 weighted initial and final words in the model. Weights are reported in standard units (z-scores) to facilitate comparison.

We can build an intuition for what our model has found by circling back to the human-language text.1 Weights have been assigned to individual words — excerpted in Table 1 — which convey whether and how much their presence indicates the likelihood of a given category. It is a little more complicated than this, since words do not appear in isolation but often in groups, and the weights for the whole grouping get distributed over the individual words. This makes it difficult to separate out the role of any particular word in the assignment of a BOW to a particular category. That said, looking at where these highly-weighted words aggregate and play off one another may gesture toward the textual structure that Brooks had theorized. When looking at the texts themselves, I will highlight any words whose weights lean strongly toward the initial (blue) or the final (red) class.2
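
For readers who want to reproduce a table like Table 1, here is a minimal sketch. It assumes a fitted scikit-learn LogisticRegression clf and a list vocab of feature names in the same order as the model’s columns (both names are mine, not objects defined in this post); the sign convention follows Table 1, where initial words carry negative weights and final words positive ones.

```python
import numpy as np

coefs = clf.coef_.ravel()                        # one weight per word
z = (coefs - coefs.mean()) / coefs.std()         # report in standard units (z-scores)

order = np.argsort(z)
initial_top = [(vocab[i], round(z[i], 2)) for i in order[:10]]        # most negative weights
final_top = [(vocab[i], round(z[i], 2)) for i in order[-10:][::-1]]   # most positive weights
```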

Let us turn to a well-structured paragraph in a well-structured novel: Agatha Christie’s The A.B.C. Murders. In this early passage, Hercule Poirot takes a statement from Mrs. Fowler, the neighbor of a murder victim, Mrs. Ascher. Poirot asks first whether the deceased had received any strange letters recently. Fowler guesses such a letter may have come from Ascher’s estranged husband, Franz.

I know the kind of thing you mean—anonymous letters they call them—mostly full of words you’d blush to say out loud. Well, I don’t know, I’m sure, if Franz Ascher ever took to writing those. Mrs. Ascher never let on to me if he did. What’s that? A railway guide, an A B C? No, I never saw such a thing about—and I’m sure if Mrs. Ascher had been sent one I’d have heard about it. I declare you could have knocked me down with a feather when I heard about this whole business. It was my girl Edie what came to me. ‘Mum,’ she says, ‘there’s ever so many policemen next door.’ Gave me quite a turn, it did. ‘Well,’ I said, when I heard about it, ‘it does show that she ought never to have been alone in the house—that niece of hers ought to have been with her. A man in drink can be like a ravening wolf,’ I said, ‘and in my opinion a wild beast is neither more nor less than what that old devil of a husband of hers is. I’ve warned her,’ I said, ‘many times and now my words have come true. He’ll do for you,’ I said. And he has done for her! You can’t rightly estimate what a man will do when he’s in drink and this murder’s a proof of it.

There are several turns in the paragraph, and we find that Mrs. Fowler’s train of thought (quietly guided by Poirot’s questioning) is paralleled by the color of the highlighted words. The largest turn occurs about midway through the paragraph when the topic changes from clues to the murder itself. Where initially Mrs. Fowler had been sure to have no knowledge of the clues, she confidently furnishes the murder’s suspect, opportunity, and motive. Structurally, we find that the balance of initial and final words flips at this point as well. The first several sentences rest on hearsay — what she has heard, what has been let on or uttered out loud — while the latter rests on Fowler’s self-authorization — what she herself has previously said. By moving into the position of the author of her own story, she overwrites her previously admitted absence of knowledge and validates her claims about the murder.

The irony of Fowler’s claiming to know (the circumstances of murder) despite not knowing (the clues), in fact does not invalidate her knowledge. Her very misunderstandings reveal a great deal about the milieu in which the murder took place. For example, it is a world where anonymous letters are steamy romances rather than death threats. (Poirot himself had recently received an anonymous letter regarding the murder.) More importantly, Fowler had earlier revealed that it is a world of door-to-door salesmen, when she had mistaken Poirot for one. This becomes an important clue toward solving the case, but only much later once Poirot learns to recognize it.

Zooming our attention to the scale of the sentence, however, leads us to a different kind of tension than the one that animates Poirot. At the scale of the paragraph, the acquisition and transmission of knowledge are the central problems: what is known and what is not yet known. At the scale of the sentence, the question becomes knowability.

In the final sentence of her statement, Mrs. Fowler makes a fine epistemological point. Characterizing the estranged husband, Franz:

You can’t rightly estimate what a man will do when he’s in drink and this murder’s a proof of it.

Far from simply recoiling from gendered violence here, Fowler expresses the unpredictability of violence in a man under the influence. The potential for violence is recognizable, yet its particular, material manifestation is not. Paradoxically, the evidence before Fowler confirms that very unknowability.

Again, we find that the structure of the sentence mirrors this progression of potential-for-action and eventual materialization. A man is a multivalent locus of potential; drink intervenes by narrowing his potential for action, while disrupting its predictability; and the proof adds another layer by making the unknowable known. Among highlighted words earlier in the paragraph, we observe this pattern to an extent as well. Looking to a different character, the moment when Mrs. Fowler’s girl, Edie, came into the house (both initial words) sets up the delivery of unexpected news. And what had Fowler heard (initial)? This whole business (final). The final words seem to materialize or delimit the possibilities made available by the initial words. Yet the paths not taken constitute the significance of the ones that are finally selected.

To be totally clear, I am in no way arguing that the close readings I have offered are fully articulated by the distant reading. The patterns to which the initial and final words belong were found at the scale of more than one thousand texts. As a result, no individual instance of a word is strictly interpretable by way of the model: we would entirely expect by chance to find, say, a final word in the beginning of a sentence, at the start of a paragraph, in the first half of a book. (This is the case for letters in the paragraph above.) This touches on an outstanding problem in distant reading: how far to interpret our model?

I have attempted to sidestep this question by oscillating between close and distant readings. My method here has been to offer a close reading of the passage, highlighting its constitutive irony, and to show how the model’s features follow the contour of that reading. The paragraph turns from not-knowing to claiming knowledge; several sentences turn from unknowability to the horror of knowing. Although it is true that the model’s features, in their sequential distribution, parallel these interpretive shifts, they do not tell us what the irony actually consists of in either case. That is, heard-heard-heard-said-said-said or man-drink-proof do not produce semantic meaning, even while they trace its arc. If, as in Brooks’s framework, the structure of a text constitutes the solution to problems raised by its subject matter, then the distant reading seems to offer a solution to a problem it does not know. The model is merely a shadow on the wall cast by the richness of the text.

Let us take one last glance at The ABC Murders’s full shadow at the scale of the novel. In the paragraph and sentences above, I had emphasized the relative proportions of initial and final words in short segments of text. While this same approach could be taken with the full text, we will add a small twist. Rather than simply scanning for locations where one or the other type of word appears with high density, we will observe the sequential accumulation of such words. That is, we will pass through the text word-by-word, while keeping a running tally of the difference between feature categories: when we see an initial word, we will add one to the tally and when we see a final word, we will subtract one.3
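
A sketch of the tally, assuming tokens is the novel as a list of lowercase words and initial_words / final_words are sets holding the model’s highlighted vocabulary (the names are mine, for illustration):

```python
import numpy as np

# +1 for each initial word, -1 for each final word, 0 otherwise
steps = [1 if w in initial_words else -1 if w in final_words else 0 for w in tokens]
tally = np.cumsum(steps)    # the height of the line in Figure 3 at each word position
```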

Figure 3. Cumulative sum of initial (+1) and final (-1) words over The ABC Murders. X-axis indicates position in text by word count. The tally rises early in the book and remains high until near the middle; in the last part of the book it drops rapidly and ends low.

In Figure 3, we find something tantalizingly close to a plot arc. If initial words had opened up spaces of possibility and final words had manifested or intervened on these, then perhaps we can see the rise and fall of tension (even, suspense?) in this mystery novel. To orient ourselves, the paragraph we attended to closely appears around the 9000th word in the text. This puts the first murder (out of four) and speculations like those of Mrs. Fowler at the initial rise in the tally that dominates the first half of the book. At the other end, the book closes per convention with Poirot’s unmasking of the murderer and extended explanation, during which ambiguity is systematically eliminated. This is paralleled by the final precipitous drop in the tally.

I’ll close by asking not what questions our model has answered about the text (these are few), but what the text raises about the model (these are many more). What exactly has our distant reading found? Analysis began with BOWs, yet can we say that the logistic regression has found patterns that exceed mere word frequencies?

Brooks had indicated what he believed we might find through these reading practices: irony, paradox, ambiguity. Since Brooks himself expressed ambivalence as to their distinctions and called for clarification of terms, I have primarily used the word irony as a catch-all. However, irony alone did not seem to fully capture the interpretive phenomenon articulated by the model. While observing the actual distribution of the model features in the text, I found it expedient to use the terms paradox and ambiguity at certain points, as well. Perhaps this is a sign that our distant reading has picked up precisely the phenomenon Brooks was attending to. If that is the case, then distant reading is well positioned to extend Brooks’s own close reading project.

Notes

1. This method has been used, for example, by Underwood & Sellers in “The Longue Durée of Literary Prestige.” Goldstone’s insightful response to a preprint of that paper, “Of Literary Standards and Logistic Regression: A Reproduction,” describes some of the interpretive limits to “word lists,” which are used in this study as well. For a broader discussion of the relationship between machine classification and close reading, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.”

2. Not all words from the model have been highlighted. In order to increase our confidence in their validity, only those words whose weight is at least one standard deviation from the mean of all weights have been highlighted. As a result, we have a highlighted vocabulary of about 800 words, accounting for about 10% of all words in the corpus. In the passage we examine, just over one-in-ten words is highlighted.

The full 812-word vocabulary is available in plaintext files as initial and final lists.

3. Note that it is a short jump from this accumulative method to identifying the relative densities of these words at any point in the text. We would simply look at the slope of the line, rather than its height.

Bibliography

Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich. 1975.

Christie, Agatha. The ABC Murders. New York: Harper Collins. 2011 (1936).

Goldstone, Andrew. “Of Literary Standards and Logistic Regression: A Reproduction.” Personal blog. 2016. Accessed March 2, 2017. https://andrewgoldstone.com/blog/2016/01/04/standards/

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.

Operationalizing The Urn: Part 2

This post is the second in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. The first post had laid out the rationale and stakes for such a method of reading. This post will perform that distant reading in order to test Brooks’s literary historical claims. The third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Distant Reading a Century of Structure

By Brooks’s account, close reading enjoys universal application over Anglo-American poems produced “since Shakespeare” because these employ the same broad formal structure. Let us test this hypothesis. In order to do so we need to operationalize his textual model and use this to read a fairly substantial corpus.

Under Brooks’s model, structure is a matter of both sequence and scale. In order to evaluate sequence in its most rudimentary form, we will look at halves: the second half of a text sequentially follows the first. The matter of scale indicates for us what is to be halved: sentences, paragraphs, full texts. As a goal for our model, we will need to identify how the second halves of things differentiate themselves from first halves. Moreover, this must be done in such a way that the differentiation of sentences’ halves occurs in dialogue with the differentiation of paragraphs’ halves and of books’ halves, and vice versa.

Regarding corpus, we will deviate from Brooks’s own study by turning from poetry to fiction. This brings our study closer into line with current digital humanities scholarship, speaking to issues that have recently been raised regarding narrative and scale. We also need a very large corpus and access to full texts. To this end, we will draw from the Chicago Text Lab’s collection of twentieth-century novels.1 Because we hope to observe historical trends, we require balance across publication dates: twelve texts were randomly sampled from each year of the twentieth century.

Each text in our corpus is divided into an initial set of words and a final set of words; however, what constitutes initial-ness and final-ness for each text will be scale-dependent and randomly assigned. We will break our corpus into three groups of four hundred novels, associated with each level of scale discussed. For example, from the first group of novels, we will collect the words belonging to the first half of each sentence into an initial bag of words (BOW) and those belonging to the second half of each sentence into a final BOW. To be sure, a BOW is simply a list of words and the frequencies with which they appear. In essence, we are asking whether certain words (or, especially, groups of words) conventionally indicate different structural positions. What are the semantics of qualification?
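
Here is a minimal sketch of the sentence-scale version of this procedure; the sentence splitter and tokenizer are crude stand-ins, not the pre-processing actually used, and text is assumed to hold one novel as a string.

```python
import re
from collections import Counter

initial_bow, final_bow = Counter(), Counter()

for sentence in re.split(r'(?<=[.!?])\s+', text):       # rough sentence splitting
    words = re.findall(r"[a-z']+", sentence.lower())    # rough tokenization
    half = len(words) // 2
    initial_bow.update(words[:half])                    # first half of the sentence
    final_bow.update(words[half:])                      # second half of the sentence
```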

For example, Edna Ferber’s Come and Get It (1935) was included among the sentence-level texts. The novel begins:

DOWN the stairway of his house came Barney Glasgow on his way to breakfast. A fine stairway, black walnut and white walnut. A fine house. A fine figure of a man, Barney Glasgow himself, at fifty-three. And so he thought as he descended with his light quick step. He was aware of these things this morning. He savored them as for the first time.

The words highlighted in blue are considered to be the novel’s initial words and those in red are its final words. Although this is only a snippet, we may note Ferber’s repeated use of sentence fragments, where each refers to “[a] fine” aspect of Barney Glasgow’s self-reflected life. That is, fine occurs three times as an initial word here and not at all as a final word. Do we expect certain narrative turns to follow this sentence-level set up? (Perhaps, things will not be so fine after all!)

This process is employed at each of the other scales as well. From the second group of novels, we do the same with paragraphs: words belonging to the first half of each paragraph are collected into the initial BOW and from the second half into the final BOW. For the third group of novels, the same process is performed over the entire body of the text. In sum, each novel is represented by two BOWs, and we will search for patterns that distinguish all initial BOWs from all final BOWs simultaneously. That is, we hope to find patterns that operate across scales.

The comparison of textual objects belonging to binary categories has been performed in ways congenial to humanistic interpretation by using logistic regression.2 One way to think about this method is in terms of geometry. Imagining our texts as points floating in space, classification would consist of drawing a line that separates the categories at hand: initial BOWs, say, above the line and final BOWs below it. Logistic regression is a technique for choosing where to draw that line, based on the frequencies of the words it observes in each BOW.

The patterns that it identifies are not necessarily ones that are expressed in any given text, but ones that become visible at scale. There are several statistical virtues to this method that I will elide here, but I will mention that humanists have found it valuable for the fact that it returns a probability of membership in a given class. Its predictions are not hard-and-fast but rather fuzzy; these allow us to approach categorization as a problem of legibility and instability.3

The distant reading will consist of this: we will perform a logistic regression over all of our initial and final BOWs (1200 of each).4 In this way, the computer will “learn” by attempting to draw a line that puts sentence-initial, paragraph-initial, and narrative-initial BOWs on the same side, but none of the final BOWs. The most robust interpretive use of such a model is to make predictions about whether it thinks new, unseen BOWs belong to the initial or final category, and we will make use of this shortly.

Before proceeding, however, we may wish to know how well our statistical model had learned to separate these categories of BOWs in the first place. We can do this using a technique called Leave-One-Out Cross Validation. Basically, we set aside the novels belonging to a given author at training time, when the logistic regression learns where to draw its line. We then make a prediction for the initial and final BOWs (regardless of scale) for that particular author’s texts. By doing this for every author in the corpus, we can get a sense of its performance.5
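
A minimal sketch of that procedure with scikit-learn is below, assuming X holds the BOW feature matrix, y the labels (0 = initial, 1 = final), and authors one author label per BOW; all three are assumptions about data already prepared, and the C value follows note 4 below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score

preds = np.zeros(len(y), dtype=int)

# Hold out all BOWs by one author at a time, train on the rest, predict the held-out BOWs
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=authors):
    clf = LogisticRegression(C=0.001, max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    preds[test_idx] = clf.predict(X[test_idx])

print(f1_score(y, preds))    # overall accuracy reported below is an F1-score of 0.87
```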

Such a method results in 87% accuracy.6 This is a good result, since it indicates that substantial and generalizable patterns have been found. Looking under the hood, we can learn a bit more about the patterns the model had identified. Among BOWs constructed from the sentence-scale, the model predicts initial and final classes with 99% accuracy; at the paragraph-scale, accuracy is 95%; and at the full-text-scale, it is 68%.7 The textual structure that we have modeled makes itself most felt in the unfolding of the sentences, followed by that of paragraphs, and only just noticeably in the move across halves of the novel. The grand unity of the text has lower resolution than its fine details.

We have now arrived at the point where we can test our hypothesis. The model had learned how to categorize BOWs as initial and final according to the method described above. We can now ask it to predict the categories of an entirely new set of texts: a control set of 400 previously unseen novels from the Chicago Text Lab corpus. These texts will not have been divided in half according to the protocols described above. Instead, we will select half of their words randomly.

To review, Brooks had claimed that the textual structures he had identified were universally present across modernity. If the model’s predictions for these new texts skew toward either initial or final, and especially if we find movement in one direction or the other over time, then we will have preliminary evidence against Brooks’s claim. That is, we will have observed a shift in the degree or mode by which textual structure has been made legible linguistically. Do we find such evidence that this structure is changing over a century of novels? In fact, we do not.

Figure 1. Distribution of 400 control texts’ probabilities of initial-ness, by publication date; overlaid with best-fit line. X-axis corresponds to novels’ publication dates and Y-axis to their predicted probability of being initial.

The points in Figure 1 represent each control text, while their height indicates the probability that a given text is initial. (Subtract that value from 1 to get the probability it is final.) We find a good deal of variation in the predictions — indeed, we had expected to find such variation, since we had chosen words randomly from the text — however the important finding is that this variation does not reflect a chronological pattern. The best-fit line through the points is flat, and the correlation between predictions and publication date is virtually zero (r2 < 0.001).
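
A sketch of that check, assuming probs holds the predicted probabilities of initial-ness for the 400 control texts and dates their publication years (both assumed to be prepared already):

```python
import numpy as np

r = np.corrcoef(probs, dates)[0, 1]   # Pearson correlation between predictions and dates
print(r ** 2)                         # reported above as r2 < 0.001
```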

This indicates that the kinds of words (and clusters of words) that had indexed initial-ness and final-ness remain in balance with one another over the course of twentieth-century fiction. It is true that further tests would need to be performed in order to increase our confidence in this finding.8 However, on the basis of this first experiment, we have tentative evidence that Brooks is correct. The formalism that underpins close reading has a provisional claim to empirical validity.

I would like to reiterate that these findings are far from the final word on operationalizing close reading. More empirical work must be done to validate these findings, and more theoretical work must be done to create more sophisticated models. Bearing that qualification in mind, I will take the opportunity to explore this particular model a bit further. Brooks had claimed that the textual structure that we have operationalized underpins semantic problems of irony, paradox, and ambiguity. Is it possible that this model can point us toward moments like these in a text?

Acknowledgement

I heartily thank Hoyt Long and Richard So for permission to use the Chicago Text Lab corpus. I would also like to thank Andrew Piper for generously making available the txtLAB_450 corpus.

Notes

1. This corpus spans the period 1880-2000 and is designed to reflect WorldCat library holdings of fiction by American authors across that period. Recent projects using this corpus include Piper, “Fictionality” and Underwood, “The Life Cycles of Genres.” For an exploration of the relationship between the Chicago Text Lab corpus and the HathiTrust Digital Library with respect to representations of gender, see Underwood & Bamman, “The Gender Balance of Fiction, 1800-2007.”

2. See, for example: Jurafsky, Chahuneau, Routledge, & Smith, “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus;” Underwood, “The Life Cycles of Genres;” Underwood & Sellers, “The Longue Durée of Literary Prestige”

3. For an extended theorization of these problems, using computer classification methods, see Long & So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning ”

4. Specifically, this experiment employs a regularized logistic regression as implemented in the Python package scikit-learn. Regularization is a technique that minimizes the effect that any individual feature is able to have on the model’s predictions. On one hand, this increases our confidence that the model we develop is generalizable beyond its training corpus. On the other hand, this is particularly important for literary text analysis, since each word in the corpus’s vocabulary may constitute a feature, which leads to a risk of overfitting the model. Regularization reduces this risk.

When performing regularized logistic regression for text analysis, there are two parameters that must be determined: the regularization constant and the feature set. Regarding the feature set, it is typical to use only the most common words in the corpus. The questions of how large/small to make the regularization and how many words to use can be approached empirically.

The specific values were chosen through a grid search over combinations of parameter values, using ten-fold cross validation on the training set (grouped by author). This grid search was not exhaustive but found a pair of values that lie within the neighborhood of the optimal pair: C = 0.001 and the 3,000 most frequent words.
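
A sketch of such a search, assuming X_train, y_train, and authors are available; the candidate values for C are illustrative, and the vocabulary-size dimension of the search (how many most frequent words to keep) is omitted here for brevity.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1.0]},   # illustrative grid, not the one used
    cv=GroupKFold(n_splits=10),                          # ten folds, grouped by author
    scoring='f1',
)
search.fit(X_train, y_train, groups=authors)
print(search.best_params_)                               # the post settles on C = 0.001
```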

Note also that prior to logistic regression, word frequencies in BOWs were normalized and transformed to standard units. Stop words were not included in the feature set.

5. This method is described by Underwood and Sellers in “Longue Durée.” The rationale for setting aside all texts by a particular author, rather than single texts at a time, is that what we think of as authorial style may produce consistent word usages across texts. Our goal is to create an optimally generalizable model, which requires we prevent the “leak” of information from the training set to the test set.

6. This and all accuracies reported are F1-scores. This value is generally considered more robust than a simple count of correct classifications, since it balances precision (which penalizes false positives) against recall (which penalizes false negatives).

7. An F1-Score of 99% is extraordinary in literary text analysis, and as such it should be met with increased skepticism. I have taken two preliminary steps in order to convince myself of its validity, but beyond these, I invite readers to experiment with their own texts to find whether these results are consistent across novels and literary corpora. The code has also been posted online for examination.

First, I performed an unsupervised analysis over the sentence BOWs. A quick PCA visualization indicates an almost total separation between sentence-initial and sentence-final BOWs.

Figure 2. Distribution of sentence-initial BOWs (blue) and sentence-final BOWs (red) in the third and fourth principal components of PCA; PCA was performed over sentence BOWs alone. The two clusters of points are mostly separate, with some noticeable overlap.

The two PCs that are visualized here account for just 3.5% of the variance in the matrix. As an aside, I would point out that these are not the first two PCs but the third and fourth (ranked by their explained variance). This suggests that the difference between initial and final BOWs is not even the most substantial pattern across them. Perhaps it makes sense that something like chronology of publication dates or genre would dominate. By his own account, Brooks sought to look past these in order to uncover structure.

Second, I performed the same analysis using a different corpus: the 150 English-language novels in the txtLAB450, a multilingual novel corpus distributed by McGill’s txtLab. Although only 50 novels were used for sentence-level modeling (compared to 400 from the Chicago corpus), sentence-level accuracy under Leave-One-Out Cross Validation was 98%. Paragraph-level accuracy dropped much further, while text-level accuracy remained about the same.

8. First and foremost, if we hope to test shifts over time, we will have to train on subsets of the corpus, corresponding to shorter chronological periods, and make predictions about other periods. This is an essential methodological point made in Underwood and Sellers’s “Longue Durée.” As such, we can only take the evidence here as preliminary.

In making his own historical argument, Brooks indicates that the methods used to read centuries of English literature were honed on poetry from the earliest (Early Modern) and latest (High Modern) periods. A training set drawn from the beginning and end years of the century should be the first such test. Ideally, one might use precisely the time periods he names, over a corpus of poetry.

Other important tests include building separate models for each scale of text individually and comparing these with the larger scale-simultaneous model. Preliminary tests on a smaller corpus had shown differences in predictive accuracy between these types of models, suggesting that they were identifying different patterns, which I took to license using the scale-simultaneous model. This would need to be repeated with the larger corpus.

We may also wish to tweak the model as it stands. For example, we have treated single-sentence paragraphs as full paragraphs. The motivation is to see how their words perform double duty at both scales, yet it is conceivable that we would wish to remove this redundancy.

Or we may wish to build a far more sophisticated model. This one is built on a binary logic of first and second halves, which is self-consciously naive, whereas further articulation of the texts may offer higher resolution. Perhaps an unsupervised learning method would be better since it is not required to find a predetermined set of patterns.

And if one wished to contradict the claims I have made here, one would do well to examine the text-level of the novel. The accuracy of this model is low enough at that scale that we can be certain there are other interesting phenomena at work.

The most important point to be made here is not to claim that we have settled our research question, but to see that our preliminary findings direct us toward an entire program of research.

Bibliography

Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt, Brace, Jovanovich. 1975.

Jurafsky, Dan, et al. “Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus.” Journal of Cultural Analytics. 2016.

Long, Hoyt & Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:2 (2016). 235-267.

Pedregosa, F, et al. “Scikit-learn: Machine Learning in Python.” JMLR 12 (2011). 2825-2830.

Underwood, Ted. “The Life Cycles of Genres.” Journal of Cultural Analytics. 2016.

Underwood, Ted & David Bamman. “The Gender Balance of Fiction, 1800-2007.” The Stone and the Shell. 2016. Accessed March 2, 2017. https://tedunderwood.com/2016/12/28/the-gender-balance-of-fiction-1800-2007/

Underwood, Ted & Jordan Sellers. “The Longue Durée of Literary Prestige.” Modern Language Quarterly. 77:3 (2016). 321-344.

Operationalizing The Urn: Part 1

This post is the first in a series on operationalizing the close reading method in Cleanth Brooks’s The Well Wrought Urn. This post lays out the rationale and stakes for such a method of reading. The second post will perform that distant reading in order to test Brooks’s literary historical claims, and the third post will explore the statistical model in order to ask whether it has captured Brooks’s definition of irony.

Meaning & Structure

Cleanth Brooks’s The Well Wrought Urn lays out a program of close reading that continues to enjoy great purchase in literary study more than seventy years on. Each of ten chapters performs a virtuosic reading of an individual, canonic poem, while the eleventh chapter steps back to discuss findings and theorize methods. That theorization is familiar today: to paraphrase is heresy; literary interpretation must be highly sensitive to irony, ambiguity, and paradox; and aesthetic texts are taken as “total patterns” or “unities” that give structure to heterogeneous materials. These are, of course, the tenets of close reading.

For Brooks, it is precisely the structuring of the subject matter which produces a text’s meaning. On the relationship between these, he claims, “The nature of the material sets the problem to be solved, and the solution is the ordering of the material” (194). A poem is not simply a meditation on a theme, but unfolds as a sequence of ideas, emotions, and images. This framework for understanding the textual object is partly reflected by his method of reading texts dramatically. The mode of reading Brooks advocates requires attention to the process of qualification and revision that is brought by each new phrase and stanza.

This textual structure takes the form of a hierarchy of “resolved tensions,” which produce irony. It is telling that Brooks initially casts textual tension and resolution as a synchronic, spatial pattern, akin to architecture or painting. We may zoom in to observe the text’s images in detail, noting how one particular phrase qualifies the previous one, or we may zoom out to reveal how so many micro-tensions are arranged in relation to one another. To perform this kind of reading, “the relation of each item to the whole context is crucial” (207). Often, irony is a way of accounting for the double meanings that context produces, as it traverses multiple scales of the text.

More recently, distant readers have taken up different scales of textual structure as sites of interpretation. The early LitLab pamphlet, “Style at the Scale of the Sentence” by Sarah Allison et al, offers a taxonomy of the literary sentence observed in thousands of novels. At its root, the taxonomy is characterized by the ordering of independent and dependent clauses — what comes first and how it is qualified. A later pamphlet, “On Paragraphs. Scale, Themes, and Narrative Form” by Mark Algee-Hewitt et al, takes up the paragraph as the novel’s structural middle-scale, where themes are constructed and interpenetrate. Moving to the highest structural level, Andrew Piper’s article “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel” examines how the second half of a novel constitutes a transformation away from its first half.1

Each of these distant readings offers a theorization of its scale of analysis: sentence, paragraph, novel. In the first example, Allison et al ask how it is possible that the relationship between the first half of a given sentence and the second encodes so much information about the larger narrative in which it appears. Indeed, all of the instances above take narrative — the grand unity of the text — as an open question, which they attempt to answer by way of patterns revealed at a particular scale. Brooks offers us a challenge, then: to read at these multiple scales simultaneously.

This simultaneity directs us to an ontological question behind the distant readings mentioned above. (Here, I use ontology especially in its taxonomic sense from information science.) Despite the different scales examined, each of those studies takes the word as a baseline feature for representing a text to the computer. That is, all studies take the word as the site in which narrative patterns have been encoded. Yet, paradoxically, each study decodes a pattern that only becomes visible at a particular scale. For example, the first study interprets patterns of words’ organization within sentence-level boundaries. This is not to imply that the models developed in each of those studies are somehow incomplete — after all, each deals with research questions whose terms are properly defined by their scale. However, the fact that multiple studies have found the word-feature to do differently-scaled work indicates an understanding of its ontological plurality.

Although ontology will not be taken up as an explicit question in this blog series, it haunts the distant readings performed in Parts 2 & 3. In brief, when texts are represented to the computer, it will be shown all three scales of groupings at once. (Only one scale per text to prevent overfitting.) Reading across ontological difference is partly motivated by Alan Liu’s article, “N+1: A Plea for Cross Domain Data in the Digital Humanities.” There, he calls for distant readings across types of cultural objects, in order to produce unexpected encounters which render them ontologically alien. Liu’s goal is to unsettle familiar disciplinary divisions, so, by that metric, this blog series is the tamest version of such a project. That said, my contention is that irony partly registers the alienness of the cross-scale encounter at the site of the word. Close reading is a method for navigating this all-too-familiar alienness.

While close reading is characterized by its final treatment of texts as closed, organic unities, it is important to remember that this method rests on a fundamentally intertextual and historical claim. Describing the selection of texts he dramatically reads in The Well Wrought Urn, Brooks claims to have represented all important periods “since Shakespeare” and that poems have been chosen in order to demonstrate what they have in common. Rather than content or subject matter, poems from the Early Modern to High Modernism share the structure of meaning described above. Irony would seem to be an invariant feature of modernity.

We can finally raise this as a set of questions. If we perform a distant reading of textual structure at multiple, simultaneous levels, would we find any changes to that structure at the scale of the century? Could such a structure show some of the multiple meanings that traverse individual words in a text? More suggestively: would that be irony?

Notes

1. Related work appears in Schmidt’s “Plot Arceology: a vector-space model of narrative structure.” Similar to the other studies mentioned, Schmidt takes narrative as a driving concern, following the movement of each text through plot space. Methodologically, that article segments texts (in this case, film and TV scripts) into 2-4 minute chunks. This mode of articulation does not square easily with any particular level of novelistic scale, although it speaks to some of the issues of segmentation raised in “On Paragraphs.”

Bibliography

Algee-Hewitt, Mark, et al. “On Paragraphs. Scale, Themes, and Narrative Form.” Literary Lab Pamphlet 10. 2015.

Allison, Sarah, et al. “Style at the Scale of the Sentence.” Literary Lab Pamphlet 5. 2013.

Brooks, Cleanth. The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt Brace Jovanovich, 1975.

Liu, Alan. “N+1: A Plea for Cross-Domain Data in the Digital Humanities.” Debates in the Digital Humanities 2016. Eds. Matthew K. Gold and Lauren F. Klein. 2016. Accessed March 2, 2017. http://dhdebates.gc.cuny.edu/debates/text/101

Piper, Andrew. “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel.” New Literary History, 46:1 (2015). 63–98.

Schmidt, Ben. “Plot arceology: A vector-space model of narrative structure,” 2015 IEEE International Conference on Big Data (Big Data). Santa Clara, CA. 2015. 1667-1672.

What We Talk About When We Talk About Digital Humanities

The first day of Alan Liu’s Introduction to the Digital Humanities seminar opens with a provocation. At one end of the projection screen is the word DIGITAL and at the other HUMAN. Within the space they circumscribe, we organize and re-organize familiar terms from media studies: media, communication, information, and technology. What happens to these terms when they are modified by DIGITAL or HUMAN? What happens when they modify one another in the presence of those master terms? There are endless iterations of these questions, but one effect is clear: the exercise maps the spaces of overlap, contradiction, and possibility encoded in the term Digital Humanities.

Pushing off from that exercise, this blog post puts Liu’s question to an extant body of DH scholarship: How does the scholarly discourse of DH organize these media theoretic terms? Indeed, answering that question may shed light on the fraught relationship between these fields. We can also ask a more fundamental question: To what extent does DH discourse differentiate between DIGITAL and HUMAN? Are they the primary framing terms?

Provisional answers to these latter questions could be offered through distant reading of scholarship in the digital humanities. This would give us breadth of scope across time, place, and scholarly commitments. Choosing this approach changes the question we need to ask first: What texts and methods could operationalize the very framework we had employed in the classroom?

For a corpus of texts, this blog post turns to Matthew K. Gold’s Debates in the Digital Humanities (2012). That edited volume has been an important piece of scholarship precisely because it collected essays from a great number of scholars representing just as many perspectives on what DH is, can do, and would become. Its essays (well… articles, keynote speeches, blog posts, and other genres of text) especially deal with problems that were discussed in the period 2008-2011. These include the big tent, tool/archive, and cultural criticism debates, among others, that continue to play out in 2017.

Token        Frequency
digital           2729
humanities        2399
work               740
new                691
university         429
research           412
media              373
data               328
social             300
dh                 291

Table 1. Top 10 Tokens in Matthew K. Gold’s Debates in the Digital Humanities (2012)
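For readers who want to reproduce something like Table 1, here is a minimal sketch. It assumes the volume’s essays have been gathered into a single plain-text file (debates.txt is a hypothetical filename) and that a handful of stop words are removed before ranking; the exact stop list behind Table 1 is my own guess.

```python
import re
from collections import Counter

with open('debates.txt', encoding='utf-8') as f:
    text = f.read().lower()

tokens = re.findall(r"[a-z]+", text)   # crude unigram tokenization
counts = Counter(tokens)

# Drop common function words before ranking; this particular list is an assumption.
stopwords = {'the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'as', 'for', 'it', 'on'}
for word in stopwords:
    counts.pop(word, None)

for word, n in counts.most_common(10):
    print(word, n)
```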

The questions we had been asking in class dealt with the relationships among keywords and especially the ways that they contextualize one another. As an operation, we need some method that will look at each word in a text and collect the words that constitute its context. (Question from above: How do these words modify one another?) With these mappings of keyword-to-context in hand, we need another method that will identify which are the most important context words overall and which will separate out their spheres of influence. (Questions from above: How do DIGITAL and HUMAN overlap and contradict one another? Are DIGITAL and HUMAN the most important context words?)

For this brief study, the set of operations ultimately employed was designed to be the shortest line from A to B. In order to map words to their contexts, the method simply iterated through the text, looking at each word in sequence and cumulatively tallying the three words to the right and left of it.1 This produced a square matrix in which each row corresponds to a unique word in Debates and each column records the number of times another word appeared within the given window.2
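A minimal sketch of that tallying step, assuming `tokens` is the lowercased token list (as in the sketch above). For simplicity it restricts both rows and columns to the 300 most frequent tokens described in footnote 1; the variable names are my own.

```python
from collections import Counter

import pandas as pd

WINDOW = 3         # three words to the left and right (footnote 1)
N_KEYWORDS = 300   # most frequent tokens treated as keywords (footnote 1)

keywords = [w for w, _ in Counter(tokens).most_common(N_KEYWORDS)]
keyword_set = set(keywords)

# cooccur[keyword][context_word] = times context_word falls within the window
cooccur = {kw: Counter() for kw in keywords}

for i, word in enumerate(tokens):
    if word not in keyword_set:
        continue
    window = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
    cooccur[word].update(window)

# Rows: keywords; columns: context words. Restricting columns to the same
# 300 tokens makes the matrix square, as described above.
dtm = pd.DataFrame(cooccur).T.reindex(columns=keywords).fillna(0)
```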

[Image: DTM recording keywords (rows) and their context words (columns).]

Table 2. Selection from Document-Term Matrix, demonstrating relationship between rows and columns. For example, the context word “2011” appears within a 3-word window of the keyword “association” on three separate occasions in Debates in the Digital Humanities. This likely refers to the 2011 Modern Language Association conference, an annual conference for literature scholars that is attended by many digital humanists.

Principal Component Analysis was then used to identify patterns in that matrix. Without going deeply into the details of PCA, the method looks for variables that tend to covary with one another (closely related to correlation). In theory, PCA can go through a matrix and identify every distinct pattern of covariance (this cluster of context words tends to appear together, that other cluster appears together, and so on). In practice, researchers typically base their analyses only on the principal components (in this case, context-word clusters) that account for the largest amounts of variance, since these are the most prominent and least subject to noise.

Figure 1. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities.

The above visualization was produced by taking the first two principal components of the context space and projecting the keywords into it. The red arrows represent the loadings, or important context words, and the blue dots represent the keywords which they contextualize. Blue dots that appear near one another can be taken to have relatively similar contexts in Debates.
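For the curious, a minimal sketch of that projection, assuming scikit-learn (the post does not name a library) and the `dtm` matrix from the earlier sketch, and ignoring for the moment the row normalization described in footnote 2:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

# Blue dots in Figure 1: keywords projected into the first two components.
keyword_coords = pca.fit_transform(dtm.values)

# Red arrows in Figure 1: loadings, i.e. each context word's contribution
# to those two components.
loadings = pca.components_.T    # shape: (n_context_words, 2)

print(pca.explained_variance_ratio_)   # share of variance each component captures
```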

What we find is that digital and humanities are by far the two most prominent context words. Moreover, they are nearly perpendicular to one another, which means that they constitute very different kinds of contexts. Under this set of operations, Alan’s provocation turns out to be entirely well-founded in the literature: digital and humanities are doing categorically different intellectual work. (Are we surprised that a human close reader and scholar in the field should find the very pattern revealed by distant reading?)

Granted, Alan’s further provocation is to conceive of humanities as its root human, which is not the case in the discourse represented here. This lacuna in the 2012 edition of Debates sets the stage for Kim Gallon’s intervention in the 2016 edition, “Making a Case for the Black Digital Humanities.” Gallon articulates the human as a goal for digital scholarship under social conditions where raced bodies are denied access to “full human status” (Weheliye, qtd. in Gallon). In the earlier edition, then, humanities would seem to be doing a different kind of work.

We can begin to outline the intellectual work that is being done by humanities and digital in 2012 by lingering at this bird’s-eye view for a few moments longer. There are roughly three segments of context space as separated by the loading arrows: the area to the upper left of humanities, that below digital, and the space between these.

The words that are contextualized primarily by humanities describe a set of institutional actors: NEH, arts, sciences, colleges, disciplines, scholars, faculty, departments, as well as that previous institutional configuration “humanities computing.” The words contextualized primarily by digital are especially humanities, humanists, and humanist. (This is, after all, the name of the field they are seeking to articulate.) Further down, however, we find fields of research and methods: media, tools, technologies, pedagogy, publishing, learning, archives, resources.

If humanities described a set of actors, and digital a set of research fields, then the space of their overlap is best accounted for by one of its prominent keywords: doing. Other words contextualized by both humanities and digital include centers, research, community, projects, and scholarship. These are the things that we, as digital humanists, do.

Returning to our initial research question, it appears that the media theory terms media and technology are prominently conceived as digital in this discourse, whereas information and communication are not pulled strongly toward either context. This leads us to a new set of questions: What does it mean, within this discourse, for the former terms to be conceived as digital? What lacuna exists such that neither of the latter terms is conceived digitally or humanistically?

The answers to these questions call for a turn to close reading.

Postscript: Debates without the Digital Humanities

During pre-processing, I made a heavy-handed and highly contestable decision. When observing context words, I omitted those appearing in the bi-gram “new york.” That is, I have treated that word pair as noise rather than signal, and taken the strength of its presence to be a distortion of the scholarly discourse.

The reasoning for such a decision is that the bi-gram’s prominence may have been an artifact of method. I have taken a unigram approach to the text, such that the new of “New York” is treated the same as the new of “new media” or “new forms of research.” At the same time, the quick-and-dirty text ingestion had pulled in footnotes and bibliographies along with the bodies of the essays. This partly explains why the “new” vector also acts as context for dots like “university” and “press.” (These words continue to cluster near “new” in Figure 1, but much less visibly or strongly.)
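As a concrete illustration of that pre-processing step, a helper like the following would drop both halves of the bi-gram before any windows are tallied. The function name and the `tokens` list are my own assumptions, not the post’s actual code.

```python
def drop_bigram(tokens, first, second):
    """Remove every adjacent pair (first, second) from the token list."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
            i += 2   # skip both members of the bi-gram
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens_glitched = drop_bigram(tokens, 'new', 'york')
# The same helper supports the counterfactual in Figure 3:
# tokens_no_dh = drop_bigram(tokens, 'digital', 'humanities')
```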

Figure 2. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities, where tokens belonging to the bi-gram “new york” have been included during pre-processing.

If we treat “new york” as textual signal, we may be inclined to draw a few further conclusions. First, as the film cliché goes, “The city is almost another character in the movie.” New York is a synecdoche for an academic institutional configuration that is both experimental and public facing, since the city is its geographic location. Second, the bi-grams “humanities computing” and “digital humanities” are as firmly entrenched in this comparatively new discourse as the largest city in the United States (the home country of many, though not all, of the scholars in the volume), which offers a metric for the consistency of their usage.

But we can go in the other direction as well.

As Liu has suggested in his scholarly writing, distant readers may find the practice of “glitching” their texts revealing of the institutional and social commitments that animate them. I take one important example of this strategy to be the counterfactual, as it has been used by literature scholars in social network analysis. In a sense, this post has given primacy to a glitched/counterfactual version of Debates — from which “new york” has been omitted — and we have begun to recover the text’s conditions of production by moving between versions of the text.

I will close, however, with a final question that results from a further counterfactual. Let’s omit a second bi-gram: “digital humanities.” What do we talk about when we don’t talk about digital humanities?

Figure 3. PCA over 300 most frequent keywords and their contexts in Debates in the Digital Humanities, where tokens belonging to the bi-gram “digital humanities” have been excluded during pre-processing.

Footnotes

1. This context-accumulation method is based on one that Richard So and I developed for our forthcoming article “Whiteness: A Computational Literary History.” The interpretive questions in that article primarily deal with semantics and differences in usage, and therefore the keyword-context matrix is passed through a different set of operations than those seen here. However, the basic goal is the same: to observe the relationships between words as mediated by their actual usage in context.

Note that two parameters must be set in this method: a minimum frequency for including a token as a keyword and a window size in which context words are observed. In this case, the 300 most common tokens in the text were treated as keywords, since our least common keyword of interest, “communication,” was roughly the 270th most common token. Similarly, we would hope to observe conjunctions of our media theoretical terms in the presence of either digital or human, so we give these a relatively wide berth with a three-word window on either side.

2. This matrix is then normalized using a Laplace smooth over each row (as opposed to the more typical method of dividing by the row’s sum). In essence, this smoothing asks about the distance of a keyword’s observed context from a distribution where every context word had been equally likely. This minimizes the influence of keywords that appear comparatively few times and increases our confidence that changes to the corpus will not have a great impact on our findings.

This blog post, however, does not transform column values into standard units. Although this is a standard step when passing data into PCA, it would have the effect of rendering each context word equally influential in our model, eliminating information about the strength of the contextual relationships we hope to observe. If we were interested in semantics, on the other hand, transformation to standard units would work toward that goal.
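A minimal sketch of this footnote’s normalization, assuming the `dtm` matrix from the earlier sketches; the add-one pseudo-count is my assumption, since the footnote does not state the exact value.

```python
import numpy as np

X = dtm.values.astype(float)   # rows: keywords, columns: context words
alpha = 1.0                    # Laplace pseudo-count (assumed value)

# Add the pseudo-count, then divide each row by its smoothed sum,
# rather than dividing raw counts by the raw row sum.
X_smoothed = (X + alpha) / (X + alpha).sum(axis=1, keepdims=True)

# The column standardization this post deliberately skips would look like:
# from sklearn.preprocessing import StandardScaler
# X_standardized = StandardScaler().fit_transform(X_smoothed)
```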

Update Feb 16, 2017: Code supporting this blog post is available on Github.

Reading Distant Readings

This post offers a brief reflection on the previous three on distant reading, topic modeling, and natural language processing. These were originally posted to the Digital Humanities at Berkeley blog.

When I began writing a short series of blog posts for the Digital Humanities at Berkeley, the task had appeared straightforward: answer a few simple questions for people who were new to DH and curious. Why do distant reading? Why use popular tools like mallet or NLTK? In particular, I would emphasize how these methods had been implemented in existing research because, frankly, it is really hard to imagine what interpretive problems computers can even remotely begin to address. This was the basic format of the posts, but as I finished the last one, it became clear that the posts themselves were a study in contrasts. Teasing out those differences suggests a general model for distant reading.

Whereas the first post was designed as a general introduction to the field, the latter two had been organized around individual tools. Their motivations were something like: “Topic modeling is popular. The NLTK book offers a good introduction to Python.” More pedagogical than theoretical. However, digging into the research for each tool unexpectedly revealed that the problems NLTK and mallet sought to address were nearly orthogonal. It wasn’t simply that they each addressed different problems, but that they addressed different categories of problems.

Perhaps the place where that categorical difference was thrown into starkest relief was Matt Jockers’s note on part-of-speech tags and topic modeling, which was examined in the post on NLTK. The thrust of his chapter’s argument had been that topic modeling is a useful way to get at literary theme. However, in a telling footnote, Jockers observes that the topics produced from his set of novels looked very different when he restricted the texts to their nouns alone versus including all words. As he found, the noun-only topics seemed to get closer to literary theoretical treatments of theme. This enabled him to proceed with answering his research questions, but the methodological point itself was profound: modifying the way he processed his texts into the topic model performed interpretively useful work — even while using the same basic statistical model.

The post on topic modeling made this kind of argument implicitly, but along yet a third axis. Many of the research projects described there use a similar natural language processing workflow (tokenization, stop word removal) and a similar statistical model (the mallet implementation of LDA or a close relative). The primary difference across them is the corpus under observation. A newspaper corpus makes newspaper topics, a novel corpus makes novel topics, etc. Selecting one’s corpus is then a major interpretive move as well, separate from either natural language processing or statistical modeling.
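A rough stand-in for that workflow is sketched below. The projects in question use mallet (a Java tool); scikit-learn’s LatentDirichletAllocation appears here only as an illustrative substitute, and `documents` (a list of strings, one per text) is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words='english')   # tokenize and drop stop words
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(doc_term)             # per-document topic mixtures

# Top words per topic: the "newspaper topics" or "novel topics" the corpus yields.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[-10:][::-1]]
    print(k, ' '.join(top))
```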

Of course, in any discussion of topic modeling, the question consistently arises of how to interpret the topics once they have been produced. What actually is the pattern they identify in the texts? Nearly every project arrived at a slightly different answer.

I’ll move quickly to the punchline. There seem to be four major interpretive moments that can be found across the board in these distant readings: corpus construction, natural language processing, statistical modeling, and linguistic pattern.

The first three are a formalization of one’s research question, in the sense that they capture aspects of an interpretive problem. For example, returning to the introductory post, Ted Underwood and Jordan Sellers ask the question “How quickly do literary standards change?” which we may recast in a naive fashion: “How well can prestigious vs non-prestigious poetry (corpus) be distinguished over time (model) on the basis of diction (natural language features)?” Answering this formal question produces a measurement of a linguistic pattern. In Underwood and Sellers’s case, this is a list of percentage values representing how likely each text is to be prestigious. That output then requires its own interpretation if any substantial claim is to be made.

(I described my rephrasing of their research question as “naive” in the sense that it had divorced the output from what was interpretively at stake. The authors’ discursive account makes this clear.)
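To make that formalization concrete, here is a naive sketch in the spirit of the recasting above (emphatically not Underwood and Sellers’s actual pipeline). It assumes a document-term matrix `X` of relative word frequencies (diction) and binary labels `y` marking prestigious texts, and it produces the kind of per-text probability described above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(max_iter=1000)

# Probability that each text belongs to the "prestigious" class,
# estimated out-of-sample via cross-validation.
probs = cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
```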

[Figure: Distant Reading Model]

In terms of workflow, all of these interpretive moments occur sequentially, yet they are interrelated. The research question directly informs decisions regarding corpus construction, natural language processing, and the statistical model, while each of the three passes into the next. All of these serve to identify a linguistic pattern, which — if the middle three have been well chosen — allows one to answer that initial question. To illustrate this, I offer the above visualization from Laura K. Nelson’s and my recent workshop on distant reading (literature)/text analysis (social science) at the Digital Humanities at Berkeley Summer Institute.

Although these interpretive moments are designed to account for the particular distant readings which I have written about, there is perhaps even a more general version of this model as well. Replace natural language processing with feature representation and linguistic pattern with simply pattern. In this way, we may also account for sound or image based distant readings alongside those of text.

My aim here is to articulate the process of distant reading, but the more important point is that this is necessarily an interpretive process at every step. Which texts one selects to observe, how one transforms the text into something machine-interpretable, what model one uses to account for a phenomenon of interest: These decisions encode our beliefs about the texts. Perhaps we believe that literary production is organized around novelistic themes or cultural capital. Perhaps those beliefs bear out as a pattern across texts. Or perhaps not — which is potentially just as interesting.

Distant reading has never meant a cold machine evacuating life from literature. It is neither a Faustian bargain, nor is it hopelessly naive. It is just one segment in a slightly enlarged hermeneutic circle.

I continue to believe, however, that computers are basically magic.