A Humanist Apologetic of Natural Language Processing; or A New Introduction to NLTK

This post originally appeared on the Digital Humanities at Berkeley blog. It is the second in what became an informal series. Images have been included in the body of this post, which we were unable to originally. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

Computer reading can feel like a Faustian bargain. Sure, we can learn about linguistic patterns in literary texts, but it comes at the expense of their richness. At bottom, the computer simply doesn’t know what or how words mean. Instead, it merely recognizes strings of characters and tallies them. Statistical models then try to identify relationships among the tallies. How could this begin to capture anything like irony or affect or subjectivity that we take as our entry point to interpretive study?

I have framed computer reading in this way before – simple counting and statistics – however I should apologize for misleading anyone, since that account gives the computer far too much credit. It might imply that the computer has an easy way to recognize useful strings of characters. (Or to know which statistical models to use for pattern-finding!) Let me be clear: the computer does not even know what constitutes a word or any linguistically meaningful element without direct instruction from a human programmer.

In a sense, this exacerbates the problem the computer had initially posed. The signifier is not merely divorced from the signified but it is not even understood to signify at all. The presence of an aesthetic, interpretable object is entirely unknown to the computer.

Teasing out the depth of the computer’s naivety to language, however, highlights the opportunity for humanists to use computers in research. Simply put, the computer needs a human to tell it what language consists of – that is, which objects to count. Following the description I’ve given so far, one might be inclined to start by telling the computer how to find the boundaries between words and treat those as individual units. On the other hand, any humanist can tell you that equal attention to each word as a separable unit is not the only way to traverse the language of a text.

Generating instructions for how a computer should read requires us to make many decisions about how language should be handled. Some decisions are intuitive, others arbitrary; some have unexpected consequences. Within the messiness of computer reading, we find ourselves encoding an interpretation. What do we take to be the salient features of language in the text? For that matter, how do we generally guide our attention across language when we perform humanistic research?

The instructions we give the computer are part of a field referred to as natural language processing, or NLP. In the parlance, natural languages are ones spoken by humans, as opposed to the formal languages of computers. Most broadly, NLP might be thought of as the translation from one language type to another. In practice, it consists of a set of techniques and conventions that linguists, computer scientists, and now humanists use in the service of that translation.

For the remainder of this blog post, I will offer an introduction to the Natural Language Toolkit, which is a suite of NLP tools available for the programming language Python. Each section will focus on a particular tool or resource in NLTK and connect it to an interpretive research question. The implicit understanding is that NLP is not a set of tools that exists in isolation but necessarily perform part of the work of textual interpretation.

I am highlighting NLTK for several reasons, not the least of which is the free, online textbook describing their implementation (with exercises for practice!). That textbook doubles as a general introduction to Python and assumes no prior knowledge of programming.[1] Beyond pedagogical motivation, however, NLTK contains both tools that are implemented in a great number of digital humanistic projects and others that have not yet been fully explored for their interpretive power.

from nltk import word_tokenize

As described above, the basic entry point into NLP is simply to take a text and split it into a series of words, or tokens. In fact, this can be a tricky task. Even though most words are divided by spaces or line breaks there are many exceptions, especially involving punctuation. Fortunately, NLTK’s tokenizing function, word_tokenize(), is relatively clever about finding word boundaries. One simply places a text of interest inside the parentheses and the function returns an ordered list of the words it had contained.

As it turns out, simply knowing which words appear in a text encodes a great deal of information about higher-order textual features, such as genre. The technique of dividing a text into tokens is so common it would be difficult to offer a representative example, but one might look at Hoyt Long and Richard So’s study of the haiku in modernist poetry, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” They use computational methods to learn the genre’s distinctive vocabulary and think about its dissemination across the literary field.


“A sample list of probability measures generated from a single classification test. In this instance, the word sky was 5.7 times more likely to be associated with nonhaiku (not-haiku) than with haiku. Conversely, the word snow was 3.7 times more likely to be associated with haiku than with nonhaiku (not-haiku).” Long, So 236; Figure 8

I would point out here that tokenization itself requires interpretive decisions be made on the part of the programmer. For example, by default when word_tokenize() sees the word “wouldn’t” in a text, it will produce two separate tokens “would” and “n’t”. If one’s research question were to examine ideas of negation in a text, it might serve one well to tokenize in this way, since it would handle all negative contractions as instances of the same phenomenon. That is, “n’t” would be drawn from “shouldn’t” and “hadn’t” as well. On the other hand, these default interpretive assumptions might adversely affect your research into a corpus, so NLTK offers the capability to turn that aspect of its tokenizer off.

NLTK similarly offers a sent_tokenize() function, if one wishes to divide the text along sentence boundaries. Segmentation at this level underpins the stylistic study by Sarah Allison et al in their pamphlet, “Style at the Scale of the Sentence.”

from nltk.stem import *

When tokens consist of individual words, they contain a semantic meaning but in most natural languages they carry grammatical inflection as well. For example, loveloveslovable, and lovely all have the same root word while the ending maps it into a grammatical position. If we wish to shed grammar in order to focus on semantics, there are two major strategies.

The simpler and more flexible method is to artificially re-construct a root word – the word’s stem – by removing common endings. A very popular function that gets used for this is the SnowballStemmer(). For example, all of the words listed above are stemmed to lov. The stem itself is not a complete word but captures instances of all forms. Snowball is especially powerful in that it is designed to work for many Western languages.

If we wish to keep our output in the natural language at hand, we may prefer a more sophisticated but less universally applicable technique that identifies a word’s lemma, essentially its dictionary form. For English nouns, that typically means changing plurals to singular; for verbs it means the infinitive. In NLTK, this is done with WordNetLemmatizer(). Unless told otherwise, that function assumes all words are nouns, and as of now, it is limited to English. (This is just one application of WordNet itself, which I will describe in greater detail below.)

As it happens, Long and So performed lemmatization of nouns during the pre-processing in their study above. The research questions they were asking revolved around vocabulary and imagery, so it proved expedient to collapse, for instance, skies and sky into the same form.

from nltk import pos_tag

As trained readers, we know that language partly operates according to (or sometimes against!) abstract, underlying structures. For as many cases where we may wish to remove grammatical information from our text by lemmatizing, we can imagine others for which it is essential. Identifying a word’s part of speech, or tagging it, is an extremely sophisticated task that remains an open problem in the NLP world. At this point, state-of-the-art taggers have somewhere in the neighborhood of 98% accuracy. (Be warned that accuracy is typically gauged on non-literary texts.)

NLTK’s default tagger, pos_tag(), has an accuracy just shy of that with the trade-off that it is comparatively fast. Simply place a list of tokens between its parentheses and it returns a new list where each item is the original word alongside its predicted part of speech.

This kind of tool might be used in conjunction with general tokenization. For example, Matt Jockers’s exploration of theme in Macroanalysis relied on word tokens but specifically those the computer had identified as nouns. Doing so, he is sensitive to the interpretive problems this selection raises. Dropping adjectives from his analysis, he reports, loses information about sentiment. “I must offer the caveat […] that the noun-based approach used here is specific to the type of thematic results I wish to derive; I do not suggest this as a blanket approach” (131-133). Part-of-speech tags are used consciously to direct the computer’s attention toward features of the text that are salient to Jockers’ particular research question.


Thematically related nouns on the subject of “Crime and Justice;”
from Jockers’s blog post on methods

Recently, researchers at Stanford’s Literary Lab have used the part-of-speech tags themselves as objects for measurement, since they offer a strategy to abstract from the particulars of a given text while capturing something about the mode of its writing. In the pamphlet “Canon/Archive: Large-scale Dynamics in the Literary Field,” Mark Algee-Hewitt counts part-of-speech-tag pairs to think about different “categories” of stylistic repetition (7-8). As it happens, canonic literary texts have a preference for repetitions that include function words like conjunctions and prepositions, whereas ones from a broader, non-canonic archive lean heavily on proper nouns.

from nltk import ne_chunk

Among parts of speech, names and proper nouns are of particular significance, since they are the more-or-less unique keywords that identify phenomena of social relevance (including people, places, and institutions). After all, there is just one World War II, and in a novel, a name like Mr. Darcy typically acts as a more-or-less stable referent over the course of the text. (Or perhaps we are interested in thinking about the degree of stability with which it is used!)

The identification of these kinds of names is referred to as Named Entity Recognition, or NER. The challenge is twofold. First, it has to be determined whether a name spans multiple tokens. (These multi-token grammatical units are referred to as chunks; the process, chunking.) Second, we would ideally distinguish among categories of entity. Is Mr. Darcy a geographic location? Just who is this World War II I hear so much about?

To this end, the function ne_chunk() receives a list of tokens including their parts of speech and returns a nested list where named entities’ tokens are chunked together, along with their category as predicted by the computer.


Log-Scaled Counts of named locations by US State, 1851-1875; Wilkens 6, Figure 4

Similar to the way Jockers had used part of speech to instruct the computer which tokens to count, Matt Wilkens uses NER to direct his study of the “Geographic Imagination of Civil War Era American Fiction.” By simply counting the number of times each unique location was mentioned across many text (and alternately the number of novels in which it appeared), Wilkens is able to raise questions about the conventional wisdom around the American Renaissance, post-war regionalism, and just how much of a shift in literary attention the war had actually caused. Only chunks of tokens tagged GPE, or Geo-Political Entity, are needed for such a project.

from nltk.corpus import wordnet

I have spent a good deal of time explaining that the computer definitionally does not know what words mean, however there are strategies by which we can begin to recover semantics. Once we have tokenized a text, for instance, we might look up those tokens in a dictionary or thesaurus. The latter is potentially of great value, since it creates clusters among words on the basis of meaning (i.e. synonyms). What happens when we start to think about semantics as a network?

WordNet is a resource that organizes language in precisely this way. In its nomenclature, clusters of synonyms around particular meanings are referred to as synsets. WordNet’s power comes from the fact that synsets are arranged hierarchically into hypernyms and hyponyms. Essentially, a synset’s hypernym is a category to which it belongs and its hyponyms are specific instances. Hypernyms for “dog” include “canine” and “domestic animal;” the hyponyms include “poodle” and “dalmatian.”

This kind of “is-a” hierarchical relationship goes all the way up and down a tree of relationships. If one goes directly up the tree, the hypernyms become increasingly abstract until one gets to a root hypernym. These are words like “entity” and “place.” Very abstract.

As an interpretive instrument, one can broadly gauge the abstractness – or rather, the specificity – of a given word by counting the number of steps taken to get from the word to its root hypernym, i.e. the length of the hypernym path. The greater the number of steps, the more specific the word is thought to be. In this case, the computer ultimately reads a number (a word’s specificity score) rather than the token itself.

In her study of feminist movements across cities and over time, “Political Logics as Cultural Memory: Local Continuities and Women’s Organizations in Chicago and New York City”, Laura K. Nelson gauges the abstractness of each movement’s essays and manifestos by measuring the average hypernym path length for each word in a given document. In turn, she finds that movements out of Chicago had tended to focus on specific events and political institutions whereas those out of New York situate themselves among broader ideas and concepts.

from nltk.corpus import cmudict

Below semantics, below even the word, is of course phonology. Phonemes lie at a rich intersection of dialect, etymology, and poetics that digital humanists have only just begun to explore. Fortunately, the process of looking up dictionary pronunciations can be automated using a resource like the CMU (Carnegie Mellon University) Pronouncing Dictionary.

In NLTK, this English-language dictionary is distributed as a simple list in which each entry consists of a word and its most common North American pronunciations. The entry includes not only the word’s phonemes but whether syllables are stressed or unstressed. Texts then are no longer processed into semantically identifiable units but into representations of its aurality.

Clement et al.png

Segments of each text colored by their aural affinities to each of the other books under consideration. For example, the window on the left shows the text of Tender Buttons, while the prevalence of fuchsia highlighting indicates its aural similarity to the New England Cook Book; Clement et al, Figure 14

These features, among others, form the basis of a study by Tanya Clement et al on aurality in literature, “Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts”.[2] In the essay, the authors computationally explore the aural affinity between the New England Cookbook and Stein’s poem “Cooking” in Tender Buttons. Their findings offer a tentative confirmation of Margueritte S. Murphy‘s previous literary-interpretive claims that Stein “exploits the vocabulary, syntax, rhythms, and cadences of conventional women’s prose and talk” to “[explain] her own idiosyncratic domestic arrangement by using and displacing the authoritative discourse of the conventional woman’s world.”

Closing Thought

Looking closely at NLP – the first step in the computer reading process – we find that our own interpretive assumptions are everywhere present. Our definition of literary theme may compel us to perform part-of-speech tagging; our theorization of gender may move us away from semantics entirely. The processing that occurs is not a simple mapping from natural language to formal, but constructs a new representation. We have already begun the work of interpreting a text once we focus attention on its salient aspects and render them as countable units.

Minimally, NLP is an opportunity for humanists to formalize the assumptions we bring to the table about language and culture. In terms of our research, that degree of explicitness means that we lay bare the humanistic foundations of our arguments each time we code our NLP. And therein lie the beginnings of scholarly critique and discourse.



Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. “Canon/Archive. Large-scale Dynamics in the Literary Field.” Literary Lab Pamphlet. 11 (2016).

Allison, Sarah, Marissa Gemma, Ryan Heuser, Franco Moretti, Amir Tevel, and Irena Yamboliev. “Style at the Scale of the Sentence.” Literary Lab Pamphlet. 5 (2013).

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastapol, CA: O’Reilly Media, Inc. 2009.

Clement, Tanya, David Cheng, Loretta Auvil, Boris Capitanu, and Megan Monroe. “Sounding for Meaning: Using Theories of Knowledge Representation to Analyze Aural Patterns in Texts.” Digital Humanities Quarterly. 7:1 (2013).

Jockers, Matthew. “Theme.” Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press, 2013. 118-153.

Jockers, Matthew. “‘Secret’ Recipe for Topic Modeling Themes.” http://www.matthewjockers.net/2013/04/12/secret-recipe-for-topic-modelin…(2013).

Long, Hoyt and Richard So. “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning.” Critical Inquiry. 42:4 (2016): 235-267.

Nelson, Laura K. “Political Logics as Cultural Memory: Local Continuities and Women’s Organizations in Chicago and New York City.” (under review)

Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History. 25:4 (2013): 803-840.

[1] In fact, there is one piece of prior knowledge required: how to open an interface in which to do the programming. This took me an embarrassingly long time to figure out when I first started! I recommend downloading the latest version of Python 3.x through the Anaconda platform and following the instructions to launch the Jupyter Notebook interface.

[2] As the authors note, they experimented with the CMU Pronouncing Dictionary specifically but selected an alternative, OpenMary, for their project. CMU is a simple (albeit very long) list of words whereas OpenMary is a suite of tools that includes the ability to guess pronunciations for words that it does not already know and to identify points of rising and falling intonation over the course of a sentence. Which tool you ultimately use for a research project will depend on the problem you wish to study.

Topic Modeling: What Humanists Actually Do With It

This post originally appeared on the Digital Humanities at Berkeley blog. It is the second in what became an informal series. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

One of the hardest questions we can pose to a computer is asking what a human-language text is about. Given an article, what are its keywords or subjects? What are some other texts on the same subjects? For us as human readers, these kinds of tasks may seem inseparable from the very act of reading: we direct our attention over a sequence of words in order to connect them to one another syntactically and interpret their semantic meanings. Reading a text, for us, is a process of unfolding its subject matter.

Computer reading, by contrast, seems hopelessly naive. The computer is well able to recognize unique strings of characters like words and can perform tasks like locating or counting these strings throughout a document. For instance, by pressing Control-F in my word processor, I can tell it to search for the string of letters reading which reveals that so far I have used the word three times and highlights each instance. But that’s about it. The computer doesn’t know that the word is part of the English language, much less that I am referring to its practice as a central method in the humanities.

To their credit, however, computers make excellent statisticians and this can be leveraged toward the kind of textual synthesis that initiates higher-order inquiry. If a computer were shown many academic articles, it might discover that articles containing the word reading frequently include others like interpretationcriticismdiscourse. Without foreknowledge of these words’ meanings, it could statistically learn that there is a useful relationship between them. In turn, the computer would be able to identify articles in which this cluster of words seems to be prominent, corresponding to humanist methods.

This process is popularly referred to as topic modeling, since it attempts to capture a list of many topics (that is, statistical word clusters) that would describe a given set of texts. The most commonly used implementation of a topic modeling algorithm is MALLET, which is written an maintained by Andrew McCallum. It is distributed as well in the form of an easy-to-use R package, ‘mallet‘, by David Mimno.

Since there are already several excellent introductions to topic modeling for humanists, I won’t go further into the mathematical details here. For those looking for an intuitive introduction to topic modeling, I would point out Matt Jockers’ fable of the “LDA Buffet.” LDA is the most popular algorithm for topic modeling. For those curious about the math behind it, but aren’t interested in deriving any equations, I highly recommend Ted Underwood’s “Topic Modeling Made Just Simple Enough” and David Blei’s “Probabilistic Topic Models.”

Despite its algorithmic nature, it would be a gross mischaracterization to claim that topic modeling is somehow objective or absent interpretation. I will simply emphasize that human evaluative decisions and textual assumptions are encoded in each step of the process, including text selection and topic scope. In light of this, I will focus on how topic modeling has been used critically to work on humanistic research questions.

Topic modeling’s use in humanistic research might be thought of in terms of three broad approaches: as a tool to guide our close readings, as a technique for capturing the social conditions of texts, and as a literary method that defamiliarizes texts and language.

Topic Modeling as Exploratory Archival Tool

Early examples of topic modeling in the humanities emphasize its ability to help scholars navigate large archives, in order to find useful texts for close reading.

Describing her work on the Pennsylvania Gazette, an American colonial newspaper spanning nearly a century, Sharon Block frames topic modeling as a “promising way to move beyond keyword searching.” Instead of relying on individual words to identify articles relevant to our research questions, we can watch how the “entire contents of an eighteenth-century newspaper change over time.”

To make this concrete, Block reports some of the most common topics that appeared across Gazette articles, including the words that were found to cluster and a label reflecting her own after-the-fact interpretation of those words and articles in which they appear.

% of Gazette Most likely words in a topic in order of likelihood Human-added topic label
5.6 away reward servant named feet jacket high paid hair coat run inches master… Runaways
5.1 state government constitution law united power citizen people public congress… Government
4.6 good house acre sold land meadow mile premise plantation stone mill dwelling… Real Estate
3.9 silk cotton ditto white black linen cloth women blue worsted men fine thread… Cloth

Prevalent Topics in Pennsylvania Gazette; source: Sharon Block in Common-Place

If we were searching through an archive for articles on colonial textiles by keyword alone, we might think to look for articles including words like silkcottoncloth but a word like fine would be trickier to use since it has multiple, common meanings, not to mention the multivalence of gendered words like women and men.

Beyond simply guiding us to articles of interest, Block suggests that we can use topic modeling to inform our close readings by tracking topic prevalence over time and especially the relationships among topics. For example, she notes that articles relating to Cloth peak in the 1750s at the very moment the Religion topic is at its lowest, and wonders aloud whether we can see “colonists (or at least Gazette editors) choosing consumption over spirituality during those years.” This observation compels further close readings of articles from that decade in order to understand better why and how consumption and spirituality competed on the eve of the American Revolution.

A similar project that makes the same call for topic modeling in conjunction with close reading is Cameron Blevins’ work on the diary of Martha Ballard.

Topic Modeling as Qualitative Social Evidence

Following Block’s suggestion, several humanists since have tracked topics over time in different corpora in order to interpret underlying social conditions.

Robert K. Nelson’s project Mining the Dispatch topic models articles from the Richmond Daily Dispatch, the paper of record of the Confederacy, over the course of the American Civil War. In a series of short pieces on the project website and that of the New York Times, Nelson does precisely the kind of guided close reading that Block indicates.

Topic Prevalence over time in Richmond Daily Dispatch; source: Robert K. Nelson in New York Times, Opinionator

Following two topics that seem to rise and fall in tandem, Anti-Northern Diatribes and Poetry and Patriotism, Nelson identifies them as two sides of the same coin in the war effort. Taken together, they not only reveal how the Confederacy understood itself in relation to the war, but the simultaneous spikes and drops of these topics offer what he refers to as “a cardiogram of the Confederate nation.”

Andrew Goldstone and Ted Underwood similarly use readings of individual articles to ground and illustrate the trends they discover in their topic model of 30,000 articles in literary studies spanning the twentieth century. Their initial goal is to test the conventional wisdom of literary studies – for example, the mid-century rise of New Criticism that is supplanted by theory during the 1970s-80s – which their study confirms in broad strokes.

However, they also find that there are other kinds of changes that occur at a longer scale regarding an “underlying shift in the justification for literary study.” Whereas the early part of the century had tended to emphasize “literature’s aesthetically uplifting character,” contemporary scholars have refocused attention on “topics that are ethically provocative,” such as violence and power. Questions of how and why to study literature appear deeply intwined with broader changes in the academy and society.

Matt Jockers has used topic modeling to study the social conditions of novelistic production, however he has placed greater emphasis on the relationship between authorial identity – especially gender and nationality – and subject matter. For example, in an article with David Mimno, they look not only at whether topics are used more frequently by women than men, but also how the same topic may be used differently based on authorial gender. (See also Macroanalysis, Ch. 8, “Theme”)

Topic Modeling as Literary Theoretical Springboard

The above-mentioned projects are primarily historical in nature. Recently, literary scholars have used topic modeling to ask more aesthetically oriented questions regarding poetics and theory of the novel.

Studying poetry, Lisa Rhody uses topic modeling as an entry point on figurative language. Looking at the topics generated from a set of more than 4000 poems, Rhody notes that many are semantically opaque. It would be difficult to assign labels to them in the way that Block had for the Pennsylvania Gazette topics, however she does not treat this as a failure on the computer’s part.

In Rhody’s words “Determining a pithy label for a topic with the keywords death, life, heart, dead, long, world, blood, earth… is virtually impossible until you return to the data, read the poems most closely associated with the topic, and infer the commonalities among them.”

So she does just that. As might be expected from the keywords she names, many of the poems in which the topic is most prominent are elegies. However, she admits that a “pithy label” like “Death, Loss, and Inner Turmoil” fails to account for the range of attitudes and problems these poems consider, since this kind of figurative language necessarily broadens a poem’s scope. Rhody closes by noting that several of these prominently elegiac poems are by African-American poets meditating on race and identity. Figurative language serves not only as an abstraction but as a dialogue among poets and traditions.

Most recently, Rachel Sagner Buurma has framed topic modeling as a tool that can productively defamiliarize a text and uses this to explore novelistic genre. Taking Anthony Trollope’s six Barsetshire novels as her object of study, Buurma suggests that we should read the series not as a formal totality – as we might do for a novel with a single, omniscient narrator – but in terms of its partial and uneven nature. The prominence of particular topics across disparate chapters offer alternate traversals through the books and across the series.

As Buurma finds, the topic model reveals the “layered histories of the novel’s many attempts to capture social relations and social worlds through testing out different genres.” In particular, the periodic trickle of a topic letter, write, read, written, letters, note, wrote, writing… captures not only the subject matter of correspondence, but reading those chapters finds “the ghost of the epistolary novel” haunting Trollope long after its demise. Genres and genealogies that only show themselves partially may be recovered through this kind of method.

Closing Thought

What exactly topic modeling captures about a set of texts is an open debate. Among humanists, words like theme and discourse have been used to describe the statistically-derived topics. Buurma frames them as fictions we construct to explain the production of texts. For their part, computer scientists don’t really claim to know what they are either. But as it turns out, this kind of interpretive fuzziness is entirely useful.

Humanists are using topic modeling to reimagine relationships among texts and keywords. This allows us to chart new paths through familiar terrain by drawing ideas together in unexpected or challenging ways. Yet the findings produced by topic modeling consistently call us back to close reading. The hardest work, as always, is making sense of what we’ve found.



Blei, David. “Probabilisitic Topic Models.” Communications of the ACM 55.4 (2012): 77-84.

Blevins, Cameron. “Topic Modeling Martha Ballard’s Diary.” http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/ (2010).

Block, Sharon. “Doing More with Digitization.” Common-place 6.2 (2006).

Buurma, Rachel Sagner. “The fictionality of topic modeling: Machine reading Anthony Trollope’s Barsetshire series.” Big Data & Society 2.2 (2015): 1-6

Goldstone, Andrew and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45.3 (2014): 359-384.

Jockers, Matthew. “The LDA Buffet is Now Open; or Latent Dirichlet Allocation for English Majors.” http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/ (2011).

Jockers, Matthew. “Theme.” Macroanalysis: Digital Methods & Literary History. Urbana: University of Illinois Press, 2013. 118-153.

Jockers, Matthew and David Mimno. “Significant Themes in 19th-Century Literature.” Poetics 41.6 (2013): 750-769.

Nelson, Robert K. Mining the Dispatchhttp://dsl.richmond.edu/dispatch/

Nelson, Robert K. “Of Monsters, Men – And Topic Modeling.” New York Times, Opinionator (blog)http://opinionator.blogs.nytimes.com/2011/05/29/of-monsters-men-and-topic-modeling/ (2011).

Rhody, Lisa. “Topic Modeling and Figurative Language.” Journal of Digital Humanities. 2.1 (2012).

Underwood, Ted. “Topic Modeling Made Just Simple Enough.” http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/(2012).

Ghost in the Machine

This post originally appeared on the Digital Humanities at Berkeley blog. It is the first in what became an informal series. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.
Text Analysis Demystified: It's just counting.

Computers are basically magic. We turn them on and (mostly!) they do the things we tell them: open a new text document and record my grand ruminations as I type; open a web browser and help me navigate an unprecedented volume of information. Even though we tend to take them for granted, programs like Word or Firefox are extremely sophisticated in their design and implementation. Maybe we have some distant knowledge that computers store and process information as a series of zeros and ones called “binary” – you know, the numbers that stream across the screen in hacker movies – but modern computers have developed enough that the typical user has no need to understand these underlying mechanics in order to perform high level operations. Somehow, rapid computation of numbers in binary has managed to simulate the human-familar typewriter interface that I am using to compose this blog post.

There is a strange disconnect, then, when literature scholars begin to talk about using computers to “read” books. People – myself included – are often surprised or slightly confused when they first hear about these kinds of methods. When we talk about humans reading books, we refer to interpretive processes. For example, words are understood as signifiers that give access to an abstract meaning, with subtle connotations and historical contingencies. An essay makes an argument by presenting evidence and drawing conclusions, while we evaluate it as critical thinkers. These seem to rely on cognitive functions that are the nearly exclusive domain of humans, and that we, in the humanities, have spent a great deal of effort refining. Despite the magic behind our normative experience of computers, we suspect that these high-level interpretive operations lie beyond the ken of machines. We are suddenly ready to reduce computers to simple adding machines.

In fact, this reduction is entirely valid – since counting is really all that’s happening under the hood – but the scholars who are working on these computational methods are increasingly finding clever ways to leverage counting processes toward the kinds of cultural interpretation that interest us as humanists and that help us to rethink our assumptions about language.


Demystifying computational text analysis has to begin with an account of its reading practices, since this says a great deal about how computers interpret language and what research questions they make possible. At the moment, there are three popular reading methods used by humanists: bag of words, dictionary look-up, and word embeddings.

Far and away, the most common of these is the bag of words. This is the tongue-in-cheek name for the process of counting how many times each word in a text actually appears. By this approach Moby Dick consists of the word the 13,721 times, harpoons 30 times, etc. All information about word-order and usage have been stripped away, and the novel itself is no longer human readable. Yet these word frequencies encode a surprising degree of information about authorship, genre, or even theme. If humanists take it as an article of faith that words are densely complex tools for constructing and representing culture, then the simple measurement of a word’s presence in a text appears to capture a great deal of that.

While the vanilla bag-of-words approach eschews any prior knowledge of words’ semantic meanings, dictionary look-ups offer a strategy to re-incorporate that to an extent. In this context, a “dictionary” refers to a set of words that have been agreed before hand to have some particular valence – sometimes with varying degrees. For example, corpus linguists have drawn up lists of English words that indicate reference to concrete objects versus abstract concepts or positive versus negative emotions. Although the computer is not interpreting words semantically, it is able to scan a text to find words that belong to dictionaries of interest to the researcher. This can be a relatively blunt instrument – which justifiably leads humanists to treat it with suspicion – yet it also offers an opportunity to bring our own interpretative assumptions to a text explicitly through dictionary construction and selection. We are not looking at the signifiers themselves, so much as some facet(s) of their abstract signification.


And whereas bag of words had eliminated information about context, word embedding inverts this approach by considering how words relate to one another precisely through shared context. Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick‘s first sentence, “me” is paired on either side by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice a pattern in which “me” falls between similar pairs of words to “her,” “him,” or “them.” Of course, the computer had gone through a similar process over the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists’ most sophisticated interpretative frameworks of language.

Word embedding has only just recently begun to be used by humanists, however they are worth mentioning because their early results promise to move the field ahead greatly.


The difficult and exciting part of any research project is framing a question that is both answerable and broadly meaningful. To put a fine point on it, our counting machines can answer how many happy and sad words occur in a given novel, but it is not obvious why that would be a meaningful contribution to literary scholarship. Indeed, a common pitfall in computational text analysis is to start with the tool and point it at different texts in search of an interesting problem. Instead, to borrow Franco Moretti’s term, digital humanists operationalize theoretical concepts – such as genre or plot or even critical reception – by thinking explicitly about how they are constructed in (or operate on) a text and then measuring elements of that construction.

For instance, in the study of literary history, there is an open question regarding the extent of the “great divide” between something like an elite versus a mainstream literature. To what extent do these constitute separate modes of cultural production and how might they intervene on one another? Ted Underwood and Jordan Sellers tried to answer one version of this question by seeing whether there were differences in the bag of words belong to books of poetry that were reviewed in prominent literary periodicals versus those that were not. After all, the bag of words is a multivalent thing capturing information about style and subject matter, and it seems intuitive that critics might be drawn to certain elements of these.

In fact, this turned out to be the case, but even more compelling was a trend over the course of the nineteenth century, in which literature overall – both elite and mainstream – tended to look more and more like the kinds of books that got reviews earlier in the century. This, in turn, raises further questions about how literary production changes over time. Perhaps most importantly, the new questions that Underwood and Sellers raise do not have to be pursued necessarily by computational methods but are available to traditional methods as well. Their computationally grounded findings contribute meaningfully to a broader humanistic discourse and may be useful to scholars using a variety of methods themselves. Indeed, close reading and archival research will almost certainly be necessary to account for the ways literary production changed over the nineteenth century.

Of course, the measurement of similarity and difference among bags of words that underpins Underwood and Sellers’ findings requires its own statistical footwork. In fact, one way to think about a good deal of research in computational text analysis right now is that it consists of finding alternate statistical methods to explore or frame bags of words in order to operationalize different concepts. On the other hand, no particular measurement constitutes an authoritative operationalization of a concept but is conditioned by its own interpretive assumptions. For instance, the question of elite versus mainstream literature was partly taken up by Mark Algee-Hewitt, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser in their pamphlet, “Canon/ Archive. Large-Scale Dynamics in the Literary Field.” (See the summary and relevant graph in this pamphlet.) However they take a radically different approach from Underwood and Sellers, by framing the problem as one of linguistic complexity rather than subject matter.

Perhaps, then, we can take this as an invitation to collaborate with colleagues across the disciplinary divide. As humanists we have intensely trained to think about interpretive frameworks and their consequences for cultural objects. And in statistics and computer science departments are experts who have trained in powerful methods of measurement. Operationalizing our most closely-held theoretical concepts like plot or style does not have to be reductive or dry in the way that computers can appear at a distance. Instead, this digital framework can open new routes of inquiry toward long-standing problems and recapture some of the magic in computation.

Attributing Authorship to “Iterating Grace,” or The Smell Test of Style

Author attribution, as a sub-field of stylometry, is well suited to a relatively small set of circumstances: an unsigned letter sent by one of a handful of correspondents, an act of a play written by one of the supposed author’s colleagues, a novel in a series penned by a ghostwriter. The cases where author attribution shines are ones in which there exists (1) a finite list of potential authors (2) for whom we have writing samples in the same genre, and (3) the unknown text is itself long enough to have a clear style. If either of the latter conditions is unmet, the findings start getting fuzzy but are still salvageable. Failing the first condition, all bets are off.

And “Iterating Grace,” fails all three.

When print copies of “IG” — an anonymous short story satirizing tech culture — appeared mysteriously at some very private addresses of San Francisco authors, tech writers, and VCs a couple weeks back, its authorship was almost a challenge posed to its readers. As a journalist at the center of the mystery,  Alexis Madrigal, put it,

A work can be detached from its creator. I get that. But keep in mind: the person who wrote this book is either a friend of mine or someone who has kept an incredibly close watch on both a bunch of my friends and the broader startup scene. Books were sent to our homes. Our names were used to aid the distribution of the text.

The conspirators responsible for the book have been alternately playful and reticent regarding their own identities — even as they stole Madrigal’s. And although it appears they plan not to reveal themselves, there has been a great deal of speculation about the author. Clever distribution? Dave Eggers. Intimate knowledge of San Francisco? Robin Sloan. Conspiracy and paranoia? Thomas Pynchon. And these are just the headline names. Some seem more likely, some less.

It is entirely possible that the actual author has not yet been fingered — which we would have no way of knowing — but we could try using some of the established author attribution methods to see if the current suspects will yield any clues about the true author’s style. However, a difficult problem looms over the methods themselves: when we hone in on authors whose style is closer to that of “IG,” how would we even know how close we’ve gotten?

The First Obstacle: Size Matters

The methods I’m using in this blog post — and in a series of graphs tweeted at Madrigal1 — come out of Maciej Eder’s paper “Does size matter? Authorship attribution, short samples, big problem.” (For those who want to follow along at home without too much codework, Eder is a member of the team responsible for the R package stylo, which implements these methods off-the-shelf, and I’m told there’s even a GUI.)2

In the paper, Eder empirically tests a large number of variables among popular attribution methods for their accuracy. He ultimately favors representing the text with a Most Frequent Words vector and running it through a classifier using either the Burrows Delta or a Support Vector Machine algorithm. I used the SVM classifier because, bottom line, there isn’t a good Delta implementation in Python.3 But also, conceptually, using SVM on MFW treats author attribution as a special case of supervised bag-of-words classification, which is a relatively comfortable technique in computational literary analysis at this point. What makes this classifier a special case of bag-of-words is that we want most but not all stopwords. Frequencies of function words like “the,” “of,” and “to” are the authorial fingerprint we’re looking for, but personal pronouns encode too much information about the text from which they are drawn — for instance, whether a novel is first- or third-person. Authorship gets diluted by textual specificity.4

It’s worth a short digression here to think out loud about the non-trivial literary ramification of the assumptions we are making. As Madrigal had said, we often take it as an article of faith that “a work can be detached from its author” but these attribution methods reinscribe authorial presence, albeit as something that hovers outside of and may compete with textual specificity. In particular, pronouns do much of the work marking out the space of characters and objects — if we want to call up Latour, perhaps the super-category of actors — at a level that is abstracted from the act of naming them. In precisely the words through which characters and objects manifest their continuity in a text, the author drops away. Not to mention that we are very consciously searching for features of unique authorship in the space of textual overlap, since words are selected for their frequency across the corpus. The fact that this idea of authorship is present in bag-of-words methods that are already popular suggests that it may be useful to engage these problems in projects beyond attribution.

The “size” referred to in the title of Eder’s paper is that of the writing samples in question. We have a 2000-word5 story that, for the sake of apples-to-apples comparison, we will classify against 2000-word samples of our suspects. But what if the story had been 1000 words or 10,000? How would that change our confidence in this algorithmic classification?

Eder - Fig 1 - Sample Length vs Predicted Authorship

Eder’s Fig. 1: 63 English novels, 200 MFWs (Eder, 4)

Accuracy increases overall as the size of textual samples increases, but there is an elbow around 5000 words where adding words to authors’ writing samples results in diminishing returns. Eder makes a thoughtful note about the nature of this accuracy:

[F]or all the corpora, no matter which method of sampling is used, the rankings are considerably stable: if a given text is correctly recognized using an excerpt of 2,000 words, it will be also ‘guessed’ in most of the remaining iterations; if a text is successfully assigned using 4,000 words, it will be usually attributable above this word limit, and so on. (14)

The classifier accuracy reported on Fig 1’s y-axis is more like the percent of authors for whom their style is recognizable in a bag-of-words of a given size. Also note in the legend that the two different sets of points in the graph represent classification based on randomly selected words (black) vs words drawn from a continuous passage (gray). It comes as a sobering thought that there is at best a 50% chance that “IG” even represents its author’s style, and that chance is probably closer to 30%, since its 2000 words come from a single passage.

I tried to split the difference between those two arcs by using random selections of words from the known suspects’ texts in conjunction with the single passage of “IG.” This is not ideal, so we will need to be up front that we are judging “IG” not against a total accounting of suspects’ style but a representative sample. This situation is begging for a creative solution: one that identifies passages in a novel that are written in a similar mode to the unknown text (e.g. a dialogue, character description, etc) and compare those, but without overfitting or introducing bias into the results. Sure, Eggers’ dialogue won’t look like “IG” — because the story has no dialogue — but what about the moments when he reports a character’s backstory? My hasty solution was simply to take many random selections of words from suspects’ much larger works since, in aggregate, they will account for the texts’ normative degrees of variation; authors are spread out over stylistic space, weighted by their different modes of writing. In the event that “IG” is largely unrepresentative of its author’s style, it may still fall within their stylistic field near similar slices of text, if only at the periphery.

Intertwined Obstacles: Unusual Suspects

Now that the table has been set, let’s take a look at some of the names that have been thrown around. (For convenience, I will also list their most recent work, where it was included in the corpus.)

Each of these suspects was named by Madrigal, including admittedly himself. But I’d like to point out that many of these names are not fiction writers: Schreier is a performance artist, most known for his work in the 70s; Leong has primarily published infographics recently; Ford, Madrigal, and Honan are all journalists. That is, we have a suspect for whom we have no known published material and four journalists who write on sliding scale of narrativity. (One other suspect I looked into had published almost exclusively Buzzfeed listicles.) Honan and Ford have both published long-form pieces recently, and those have been included in the corpus, but the others’ absence must be noted.

Other names that have come up more than once include:

  • Robin Sloan — Mr. Penumbra’s 24-hour Bookstore
  • Dave Eggers — The Circle
  • Thomas Pynchon — Bleeding Edge
  • Susan Orlean — New Yorker narrative non-fiction, The Orchid Thief
  • the horse_ebooks/Pronunciation Book guys

In evaluating the list of suspects, the nature of detective work encroaches on the role of the literary historian. Sure, those two always overlap, but our cases are usually a century or two cold. To put a very fine point on it, I actually staked out a cafe in San Francisco and met Schreier last week, so that I could ask him, among other things, whether he had published any writing during his career. He had not. But the sleuthing has a degree of frankness that we pretend to disavow as historians. (Do we really think Thomas Pynchon, a hulking literary giant of our time, organized this essentially provincial gag? Maybe I’m wrong. Just a hunch.) To wit, we have to remain skeptical of Madrigal’s reportage, since it may all be a ruse. We may be his dupes, complicit in his misdirection. As he put it to another journalist on the case, Dan Raile, “Of course the most obvious answer would be that it is me and Mat Honan.” Even the assumption that there is a list of suspects beyond Madrigal (and Honan) remains provisional. But it’s the best we’ve got.

At a more practical level, the uneven coverage of our most likely suspects points out the complexity of what is referred to as author attribution “in the wild.” The assumption that the true author is already a member of the list of suspects — called the closed-world assumption — has come under scrutiny recently in stylometry.7 This partly kicks the question of attribution over to verification, a measure of confidence in any given attribution. But the case of “IG” intersects with other open problems of corpus size, as well as cross-genre attribution that makes some of the proposed open-world solutions difficult to apply. Knowing this, and with our eyes wide open, let’s see just how far the closed-world methods can take us.

Interpreting Attribution

Using the version of supervised learning that Eder had recommended (SVM classifier, MFW vector minus personal pronouns), I ran it over the corpus of suspects. Each suspect’s text was represented by 100 different bags of 2000 randomly selected words. Under cross-validation, the classifier easily had 100% accuracy among the suspects. Once the classifier had been trained on these, it was shown “IG” in order to predict its author.

I’ll cut to the chase: if the author is already in our list of suspects, the classifier is confident that she is Susan Orlean, based on the style her recent work in the New Yorker. Under one hundred iterations of four-fold cross-validation — in which three fourths of suspects’ texts were randomly chosen for training in each iteration rather than the entire set — the classifier selected Orlean as the author of “IG” 100% of the time. I also tried replacing her New Yorker pieces with The Orchid Thief, since it is a longer, unified text, to the same result. (When including both in the corpus simultaneously, the classifier leaned toward the New Yorker, with about a 90%-10% split.8) The fact that this kind of cross-validated prediction is proposed as one kind of verification method — albeit one that is more robust with a longer list of suspects — indicates that something important might be happening stylistically in Orlean. In order to explore what some of those features are, there are a few methods closely related to Eder’s that we can use to visualize stylistic relationships.

Two popular unsupervised methods include visualizing texts’ MFW vectors using  Cosine Similarity and Principle Component Analysis (PCA), projected into 2-D space. These appear in Fig 2. (Note that Orlean is represented only by her recent New Yorker work.) In these graphs, each of the Xs represents a 2000-word slice of its text. I won’t go into detail here about Cosine Similarity and PCA as methods, except to say that Cosine Similarity simply measures the overlap between texts on a feature-by-feature basis and PCA attempts to find features that do the most to distinguish among groups of texts. In both graphs, the distance between Xs approximates their similarity on the basis of the different measurements used. Overlaid on the PCA graph are the top loadings — words that help distinguish among groups, along with the direction and magnitude of their influence on the graph.

Cosine Similarity (MDS)

Fig 2a. Cosine Similarity over suspects’ MFW vectors

PCA (with loadings)

Fig 2b. PCA over suspects’ MFW vectors, with top 10 loadings

The most striking feature of these graphs — beyond the fact that IG resides well within the spaces circumscribed by Orlean’s style — is the gulf between most of the writers and a group of three: Robin Sloan, Thomas Pynchon, and Paul Ford. I had been unsure about what to make of their grouping, until I looked at the loadings prominent in the first principle component (on the x-axis) in Fig 2b. Is and are stretch out to the left, indicating those are prominent words shared among the three, and those loadings are diametrically opposed to said, had, and was. The gulf appears to be an issue of present tense versus past and glancing over those texts confirms this to be the case. There is, however, one loading that troubles the easiness of verb tense as our explanation: of. It appears to be strongly tied to verbs in the present tense, although it does a very different kind of conceptual work.

Attribution methods purport to measure something about an authorial fingerprint, but the question of what precisely gets measured remains unsettled. The MFW vector doesn’t quite capture grammar but neither does it reduce to word choice. Counting how often an author uses “the” tells us something about how often nouns show up, how they relate to one another, whether nouns tend to be abstract or concrete, unique or one of a set. Many such high-level concepts intersect on each of the high-frequency words we count. It would be impossible to account entirely for the way(s) in which an author orients the reader to the content of the text based on a single such word frequency, but hundreds of them might. This partly sounds like a Jacobsonian problem in which we try to learn parameters for the axes of selection and combination, but I think our goals are closer to those of generative grammar: What are the habits of the author’s cognition? How does the text think its content? And how do we interpret that thinking through these most frequent words?

The findings in Fig 2 are far too provisional to determine whether of is a universally important feature to present tense constructions, but the idea that it is important to Ford, Pynchon, and Sloan’s constructions of the present is certainly compelling.

The phenomenon captured by the second principle component (the y-axis in Fig 2b.) is less obvious at first glance, but looking back over the texts included for the corpus, there is a clear division between fiction and non-fiction. Xs belonging to Ford, Honan, and Orlean tend to have more negative values. It is well known that author attribution methods are sensitive to genre, even among writings by the same author, so it is satisfying to have found that difference among texts without having made it an a priori assumption. In light of this, however, the loadings are all the more curious: the is a very strong indicator of fiction, whereas that and to construct the conceptual landscape of non-fiction reportage. Does this conform to our understanding of non-fiction as characterized by specification or causation? Or of fiction by the image?

The other work of non-fiction, Lost Cat by Caroline Paul, hovers around zero on the y-axis, but the text opens with a suddenly prescient note:

This is a true story. We didn’t record the precise dialogue and exact order of events at the time, but we have re-created this period of our lives to the best of our mortal ability. Please take into account, however: (1) painkillers, (2) elapsed time, (3) normal confusion for people our age.

Paul’s humorous memoir about a traumatic, confusing period (involving a major injury, hence the painkillers) straddles the line we’ve found between long-form reportage and fictitious-world building. Maybe this is exactly what we mean by the conjunction inherent to narrative non-fiction.

The style of “IG,” then, leans slightly toward non-fiction. That’s not entirely surprising, since the story is framed as a book’s preface. The fictitious book Iterating Grace is supposed to have been written (or rather compiled from Twitter) by Koons Crooks — an out-of-work programmer cum spiritualist. The “short story” we have examined is Iterating Grace‘s anonymous introduction describing what little we know about Crooks’ biography. Action is reported after-the-fact rather than ongoing; the story is centered on a single protagonist whom we don’t see interacting much with other characters; the narrator reflects periodically, trying to understand the motivations of the protagonist. This sounds like a New Yorker article. And if we interpret “IG” to lie within Orlean’s stylistic field in the PCA graph, it lies among the slices of her articles that lean toward fiction.

Despite these similarities in features of generic importance, the fact that IG lies on the edge of her field in the scaled Cosine Similarity graph qualifies the strength of the PCA finding. Without going into the multidimensional scaling formula, IG may be artificially close to Orleans’ work in the Cosine Similarity graph simply because hers is the least dissimilar. This is precisely the kind of uncertainty that verification hopes to resolve. I will, however, sidestep the largest question of whether Orlean is the author of “IG,” in order to ask one about her style. Since we have established that “IG” is similar to Orlean’s non-fiction at its most fictional, I want to know how similar. What degree of resemblance does “IG” bear to the normative range of her work?

To answer this question, I propose not a sophisticated verification method, but a naive one: the smell test.

Naive Attribution: The Smell Test

One of the limitations of using MDS over many different authors, as in Fig 2a, is that it has to negotiate all of their relationships to one another, in addition to their degree of internal differentiation. Therefore, I would suggest that we simply observe the scaled Cosine graphs for each author individually to a single dimension and, then, locate “IG” on it. This has the virtue of capturing information about the normative co-occurance of features in the author’s stylistic field and using that to gauge the textual features of “IG.”

For the sake of comparison, I will show, along with Orlean, the authors who are the next most similar to and the most different from “IG” by this method, Po Bronson and Robin Sloan respectively.


Fig 3a. Stylistic variation in Susan Orlean’s New Yorker non-fiction over 2000-word slices


Fig 3b. Stylistic variation in Po Bronson’s fiction over 2000-word slices


Fig 3c. Stylistic variation in Robin Sloan’s Mr. Penumbra over 2000-word slices

Taking the author’s style as a field circumscribed by so many slices of their text, “IG” lies fully outside of our visualization of that field for both Bronson and Sloan, and just within the periphery for Orlean. Another way to put it is that most of the slices of Orlean’s style are on average more similar to “IG” than they are to her most dissimilar slice. This is certainly not a resounding endorsement for her authorship of “IG,” but it helps us to understand where “IG” falls within her field. The story is just at the outer limit of what we would expect to see from her under normal circumstances. Should she reveal herself to be the true author of “IG” at some point in the future, we would know that it had been a stylistic stretch or an experiment for her. Should either Bronson or Sloan do the same, then we might be compelled to interpret IG as a stylistic break.

I would like to emphasize, however, that the smell test is just that: a non-rigorous test to evaluate our hunches. (Perhaps we can start to imagine a more rigorous version of it based not on MDS but on taking each slice’s median similarity to the others.) I do not want to overstate its value as a verification method, but instead guide our attention toward the idea of an expected degree of stylistic variation within a text and the question of how we account for that, especially in our distant reading practices. The decision to include stopwords (or not) is already understood to be non-trivial, but here we can see that their variations within the text are potentially interpretable for us. One of the risks of modeling an entire novel as a single bag of words is that it may be rendered artificially stable viz the relative frequencies of words and the conceptual frameworks they convey.

Reiterating Grace

So what do we make of the idea that Susan Orlean might be the author of “Iterating Grace?” Her name was probably the most out of place on our initial list of suspects (along with Pynchon, in my opinion). But with the door open on Orlean’s potential authorship, I’d direct attention to Steve DeLong’s hypothesis that she penned it, so it could be distributed by Jacob Bakkila and Thomas Bender, the masterminds behind horse_ebooks and Pronunciation Book. For the uninitiated, these were internet performance art pieces that involved Tweeting and speaking aloud borderline intelligible or unexpected phrases. One of the Internet’s longer-running mysteries was whether horse_ebooks had been run by a human or a poorly-coded Twitterbot. If you would like to know more about Bakkila and Bender, perhaps Orlean’s profile for the New Yorker would interest you.

I’ll refrain from speculative literary history, but DeLong’s approach should remind us to look outside the text for our theorization of its authorship, even as we dig in with computational tools. If we understand a text’s most frequent words to convey something about its conceptual geography, then we might reframe the story as a kind of problem posed to Orlean by Bakkila and Bender, or that it lies at the intersection of problems they work on in their writing and performance art respectively. This suggests that evidence of a potential collaboration may be visible in the text itself. Unfortunately for our methods, even if we wanted to suss out the intellectual contributions of those three maybe-conspirators, most of Bakkila’s published material is the almost-but-never-quite gibberish of horse_ebooks. This returns us to the fundamental problem of attribution in the wild.

In case I have overstated the likelihood of Orlean’s authorship, I offer this qualification. I reran this experiment with one parameter changed. Instead of randomly sampling words from suspects’ texts for the MFW vector, I selected continuous passages, as represented by the lower arc in Eder’s Fig 1. In fact, the overall results were very similar to the ones I have discussed but fuzzier. For instance, even though nine of the top ten loadings from the PCA were the same, the boundaries between author-spaces were less well-defined in that graph, as well as among their Cosine Similarities.

Bottom line, the SVM classifier still typically attributed authorship to Orlean, but depending on the particular passages and training set also often chose Joshua Cohen, whose new novel, Book of Numbers, was published suspiciously close to the first appearance of “IG.” And every now and then, when passages lined up just right, the classifier chose Thomas Pynchon or Po Bronson. (Although it was mostly consistent with the SVM predictions, the smell test was similarly sensitive to the selected passages.) In passages of 2000 words, authorial style is far less differentiable than under samples of the same size: word frequencies reduce quickly to artifacts of their passages. I don’t say this to invalidate the findings I’ve shared at all. Comparing “IG” to the random samples of other authors gave us stylistic points of reference at a high resolution. Instead I hope to make clear that when we try to determine its authorship, we need to decide whether we interpret “IG” as something representative or artifactual. Is the text an avatar for an author we probably will never know or is it simply the product of strange circumstances?

That is perhaps the most important question to ask of a story that begins:

You don’t need to know my name. What’s important is that I recently got a phone call from a young man outside Florence named Luca Albanese. (That’s not his name either.) … He’d had an extraordinary experience, he said, and was in possession of “some unusual materials” that he thought I should see.

Framed by its own disavowel of authorship and identity, “Iterating Grace” is a set of unusual materials — not quite a story — asking only that we bear witness.

1Also tweeting graphs at Madrigal was Anouk Lang, who very generously corresponded with me about methods and corpora. Any intuition that I have gained about stylometric methods has to be credited to her. And reflecting on digital collaboration for a moment, Anouk’s reaching out initially via tweet and the back-and-forth that followed demonstrates for me the power of Twitter as a platform, should we all be so generous.

2Hat tip to Camille Villa for passing along Eder’s work. The DHSI 2015 organizers can rest easy knowing that workshop knowledge is being shared widely.

3The Burrows’s Delta has enjoyed popularity among stylometrists but is somewhat provincial to that field. It is essentially a distance measure between vectors, like Cosine Similarity, but that normalizes feature distances based on their standard deviation across the corpus. See Argamon, “Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations”

4David Hoover recommends further minimizing textual specificity — relative to other texts in the corpus as well as to the author’s personal style — by including in the MFW vector only those words that appear in at least 60-80% of all texts in the corpus, as a rule of thumb. This is one major parameter that Eder does not explore systematically, suggesting that he may not use a minimum document frequency. In his test of different MFW vector sizes, Eder confirms Hoover’s general recommendation that larger MFW vectors perform better (up to about 15,000 words), but adds the qualification that optimal vector size will vary with writing sample size. Since IG is 2000 words and my corpus is relatively small, I will use a 500-word vector without a minimum document frequency.

5Madrigal puts the word count at 2001 — which feels symbolic for a story about the end of the dot-com bubble — although my (imperfect) tokenizer lands at 1970. For convenience, I have referred throughout the blog post to sets of 2000 words, but in practice, I have used 1970 tokens.

6Eder validates the practice of stitching together multiple pieces in the same genre by an author, and I have done this with Bronson, Honan, and Orlean.

7Stolerman, et al. “Classify, but Verify: Breaking the Closed-World Assumption in Stylometric Authorship Attribution.” The Tenth Annual IFIP WG 11.9 International Conference on Digital Forensics. January 2014. Vienna, Austria.; Koppel et al. “The Fundamental Problem of Authorship Attribution.” English Studies. 93.3 (2012): 284-291.

8Readers who follow the link to Steve DeLong’s blog post that is embedded toward the end of this one, will find there an excerpt of an email I had sent him with some of my findings. In that email I had reported the opposite effect: when the classifier examined both Orlean’s New Yorker pieces and The Orchid Thief, it leaned toward the latter. At the time, I had been experimenting with different parameters and had followed Hoover’s suggestion to include only words with a minimum document frequency of 60%. In fact, this had reduced the size of the MFW vector to under 200. Since then, I have preferred to not to use a minimum document frequency and simply to use the 500 most frequent words, due to the smallness of the corpus.

Hello world!

Setting out on a new blog is an exciting thing. In its infancy, it is polymorphous, changing form at will without the inhibition of an established identity. As of this writing, my immediate goal for the blog is to share code and interesting findings from projects as they come up.  (In fact, this post should be followed in the next day or so by one describing an author attribution problem I’ve been working on recently.) At some point, however, I’m sure I will feel compelled to weigh in on problems raised elsewhere in the DH interwebs, and then this blog may move into a more dialogical mode.

Maybe I will stick to the technical, computational side of things, or maybe I will dive headlong into humanistic questions or the academic institutional problems that congeal in DH. I have plans for the shape I would like this to take, but those will almost certainly change as I engage with new problems and need new things from the platform. At about a year into DH scholarship, the research I plan to share here has only just recently emerged from its own mirror stage. I hope that you will get as much out of reading this blog as I get from writing it!