Introduction

I am Teddy Roland, a PhD candidate in English at the University of California, Santa Barbara, where I study American Literature in the Twentieth and Twenty-First Centuries.

My research is characterized as “distant reading,” which theorizes the use of computers and statistics for literary interpretation. Although these methods may appear strange at first glance, the questions I hope to answer are familiar ones in literary history: How have the conditions of reading and writing literature changed over the past century? How does the arrival of now-ubiquitous computer technology shape them?

My teaching similarly prepares students for the new literacies demanded of them by computers and AI.

For more information see the About page, as well as summaries of my Research and Teaching. Below you will find my occasional blog on the Digital Humanities.

Chicago Corpus Word Embeddings

A quick post to announce the public distribution of word embeddings trained on the Chicago Text Lab’s corpus of US novels. They will be hosted on this blog and can be downloaded from this link (download) or through the static Open Code & Data page.

According to the Chicago Text Lab’s description, the corpus contains

American fiction spanning the period 1880-2000. The corpus contains nearly 9,000 novels which were selected based on the number of library holdings as recorded in WorldCat. They represent a diverse array of authors and genres, including both highly canonical and mass-market works. There are about 7,000 authors represented in the corpus, with peak holdings around 1900 and the 1980s.
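
For anyone who wants to experiment right away, here is a minimal sketch of loading the vectors in Python with gensim. The file name and format below are assumptions for illustration (word2vec-style text vectors); check the download itself for the actual format.

```python
# Minimal sketch: load word2vec-style text vectors with gensim.
# "chicago_corpus_embeddings.txt" is a hypothetical file name.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "chicago_corpus_embeddings.txt", binary=False)

# Query the embedding space: nearest neighbors of a word
print(vectors.most_similar("novel", topn=5))
```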

Continue reading

Distant Reading: An Exam List

As a resource to future graduate students, I am sharing the reading list I compiled for my qualifying exam on Distant Reading. Below the list, you will find a user manual of sorts that explains the rationale for each of the selections.

My goal for posting is by no means to assert an authoritative list, but to offer a provisional set of principles and touchpoints for ongoing conversations in the field. I hope that many more such lists will eventually be posted, as the field matures and as new students and perspectives contribute to the project of Distant Reading.

Good luck on your exam!

Continue reading

A Naive Empirical Post about DTM Weighting

In light of word embeddings’ recent popularity, I’ve been playing around with a version called Latent Semantic Analysis (LSA). Admittedly, LSA has fallen out of favor with the rise of neural embeddings like Word2Vec, but it has several virtues, including decades of study by linguists and computer scientists. (For an introduction to LSA for humanists, I highly recommend Ted Underwood’s post “LSA is a marvelous tool, but…”.) In reality, though, this blog post is less about LSA and more about tinkering with it and using it for parts.
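
As a point of reference for readers following along, here is a minimal sketch of the LSA pipeline, not the post’s own code: weight a document-term matrix (DTM) with tf-idf, then reduce it with truncated SVD. The toy documents are made up for illustration.

```python
# Minimal LSA sketch: tf-idf weighting of a DTM, then truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the reader interprets the novel",
    "the critic interprets the poem",
    "the model counts the words",
    "the computer counts the strings",
]

# Weighted document-term matrix (rows: documents, columns: word types)
dtm = TfidfVectorizer().fit_transform(docs)

# Project documents into a low-dimensional "semantic" space
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(dtm)
print(doc_vectors)  # one row per document, one column per latent dimension
```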

Continue reading

What We Talk About When We Talk About Digital Humanities

The first day of Alan Liu’s Introduction to the Digital Humanities seminar opens with a provocation. At one end of the projection screen is the word DIGITAL and at the other HUMAN. Within the space they circumscribe, we organize and re-organize familiar terms from media studies: media, communication, information, and technology. What happens to these terms when they are modified by DIGITAL or HUMAN? What happens when they modify one another in the presence of those master terms? There are endless iterations of these questions, but one effect is clear: they open up the spaces of overlap, contradiction, and possibility encoded in the term Digital Humanities.

Pushing off from that exercise, this blog post puts Liu’s question to an extant body of DH scholarship: How does the scholarly discourse of DH organize these media-theoretic terms? Indeed, answering that question may shed light on the fraught relationship between these fields. We can ask a more fundamental question as well. To what extent does DH discourse differentiate between DIGITAL and HUMAN? Are they the primary framing terms?

Provisional answers to these latter questions could be offered through distant reading of scholarship in the digital humanities. This would give us breadth of scope across time, place, and scholarly commitments. Choosing this approach changes the question we need to ask first: What texts and methods could operationalize the very framework we had employed in the classroom?

Continue reading

Reading Distant Readings

This post offers a brief reflection on the previous three posts on distant reading, topic modeling, and natural language processing. These were originally posted to the Digital Humanities at Berkeley blog.

When I began writing a short series of blog posts for the Digital Humanities at Berkeley, the task had appeared straightforward: answer a few simple questions for people who were new to DH and curious. Why do distant reading? Why use popular tools like MALLET or NLTK? In particular, I would emphasize how these methods had been implemented in existing research because, frankly, it is really hard to imagine what interpretive problems computers can even remotely begin to address. This was the basic format of the posts, but as I finished the last one, it became clear that the posts themselves were a study in contrasts. Teasing out those differences suggests a general model for distant reading.

Continue reading

A Humanist Apologetic of Natural Language Processing; or A New Introduction to NLTK

This post originally appeared on the Digital Humanities at Berkeley blog. It is the third in what became an informal series. Images have been included in the body of this post, which we were unable to do originally. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

Computer reading can feel like a Faustian bargain. Sure, we can learn about linguistic patterns in literary texts, but it comes at the expense of their richness. At bottom, the computer simply doesn’t know what or how words mean. Instead, it merely recognizes strings of characters and tallies them. Statistical models then try to identify relationships among the tallies. How could this begin to capture anything like irony or affect or subjectivity that we take as our entry point to interpretive study?

I have framed computer reading in this way before – simple counting and statistics – but I should apologize for misleading anyone, since that account gives the computer far too much credit. It might imply that the computer has an easy way to recognize useful strings of characters. (Or to know which statistical models to use for pattern-finding!) Let me be clear: the computer does not even know what constitutes a word or any linguistically meaningful element without direct instruction from a human programmer.

In a sense, this exacerbates the problem the computer had initially posed. The signifier is not merely divorced from the signified but it is not even understood to signify at all. The presence of an aesthetic, interpretable object is entirely unknown to the computer.

Teasing out the depth of the computer’s naivety about language, however, highlights the opportunity for humanists to use computers in research. Simply put, the computer needs a human to tell it what language consists of – that is, which objects to count. Following the description I’ve given so far, one might be inclined to start by telling the computer how to find the boundaries between words and treat those as individual units. On the other hand, any humanist can tell you that equal attention to each word as a separable unit is not the only way to traverse the language of a text.
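
As a concrete (and hedged) illustration of that first instruction, here is a minimal sketch of word tokenization with NLTK’s Treebank tokenizer; the example sentence is my own, and word_tokenize, the toolkit’s usual entry point, wraps a similar procedure.

```python
# Minimal sketch: telling the computer where one word ends and the next begins.
from nltk.tokenize import TreebankWordTokenizer

sentence = "The computer doesn't know what a word is until we tell it."
tokens = TreebankWordTokenizer().tokenize(sentence)
print(tokens)
# ['The', 'computer', 'does', "n't", 'know', ...]
# Splitting the contraction into "does" and "n't" is already an
# interpretive choice about what counts as a unit of language.
```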

Generating instructions for how a computer should read requires us to make many decisions about how language should be handled. Some decisions are intuitive, others arbitrary; some have unexpected consequences. Within the messiness of computer reading, we find ourselves encoding an interpretation. What do we take to be the salient features of language in the text? For that matter, how do we generally guide our attention across language when we perform humanistic research?

The instructions we give the computer are part of a field referred to as natural language processing, or NLP. In the parlance, natural languages are ones spoken by humans, as opposed to the formal languages of computers. Most broadly, NLP might be thought of as the translation from one language type to another. In practice, it consists of a set of techniques and conventions that linguists, computer scientists, and now humanists use in the service of that translation.

For the remainder of this blog post, I will offer an introduction to the Natural Language Toolkit, which is a suite of NLP tools available for the programming language Python. Each section will focus on a particular tool or resource in NLTK and connect it to an interpretive research question. The implicit understanding is that NLP is not a set of tools that exists in isolation but one that necessarily performs part of the work of textual interpretation.

I am highlighting NLTK for several reasons, not the least of which is the free, online textbook describing their implementation (with exercises for practice!). That textbook doubles as a general introduction to Python and assumes no prior knowledge of programming.[1] Beyond pedagogical motivation, however, NLTK contains both tools that are implemented in a great number of digital humanistic projects and others that have not yet been fully explored for their interpretive power.

Continue reading

Topic Modeling: What Humanists Actually Do With It

Pennsylvania Gazette

This post originally appeared on the Digital Humanities at Berkeley blog. It is the second in what became an informal series. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

One of the hardest questions we can pose to a computer is asking what a human-language text is about. Given an article, what are its keywords or subjects? What are some other texts on the same subjects? For us as human readers, these kinds of tasks may seem inseparable from the very act of reading: we direct our attention over a sequence of words in order to connect them to one another syntactically and interpret their semantic meanings. Reading a text, for us, is a process of unfolding its subject matter.

Computer reading, by contrast, seems hopelessly naive. The computer is well able to recognize unique strings of characters like words and can perform tasks like locating or counting these strings throughout a document. For instance, by pressing Control-F in my word processor, I can tell it to search for the string of letters “reading,” which reveals that so far I have used the word three times and highlights each instance. But that’s about it. The computer doesn’t know that the word is part of the English language, much less that I am referring to its practice as a central method in the humanities.
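
For the sake of concreteness, here is a minimal sketch of that same search-and-count operation in Python; the passage simply reuses sentences from this post.

```python
# Minimal sketch: the computer finds and counts the character string
# "reading" without knowing it is an English word.
passage = ("Reading a text, for us, is a process of unfolding its "
           "subject matter. Computer reading, by contrast, seems naive.")
print(passage.lower().count("reading"))  # prints 2
```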

To their credit, however, computers make excellent statisticians and this can be leveraged toward the kind of textual synthesis that initiates higher-order inquiry. If a computer were shown many academic articles, it might discover that articles containing the word “reading” frequently include others like “interpretation,” “criticism,” and “discourse.” Without foreknowledge of these words’ meanings, it could statistically learn that there is a useful relationship between them. In turn, the computer would be able to identify articles in which this cluster of words seems to be prominent, corresponding to humanist methods.

This process is popularly referred to as topic modeling, since it attempts to capture a list of many topics (that is, statistical word clusters) that would describe a given set of texts. The most commonly used implementation of a topic modeling algorithm is MALLET, which is written and maintained by Andrew McCallum. It is also distributed in the form of an easy-to-use R package, ‘mallet’, by David Mimno.
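
For readers who want to see the shape of a topic model in code, here is a toy sketch using gensim’s LDA implementation rather than MALLET itself; the tiny hand-made “documents” stand in for real articles.

```python
# Toy topic-modeling sketch with gensim's LDA (not MALLET).
from gensim import corpora, models

docs = [
    ["reading", "interpretation", "criticism", "discourse"],
    ["gene", "protein", "expression", "sequence"],
    ["reading", "criticism", "novel", "discourse"],
    ["protein", "sequence", "cell", "expression"],
]

dictionary = corpora.Dictionary(docs)              # word <-> id mapping
bow = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words counts

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=50, random_state=42)

# Each "topic" is a statistical cluster of words, weighted by prominence
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```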

Continue reading

Ghost in the Machine

Text Analysis Demystified: It's just counting.

This post originally appeared on the Digital Humanities at Berkeley blog. It is the first in what became an informal series. For a brief reflection on the development of that project, see the more recent post, Reading Distant Readings.

Computers are basically magic. We turn them on and (mostly!) they do the things we tell them: open a new text document and record my grand ruminations as I type; open a web browser and help me navigate an unprecedented volume of information. Even though we tend to take them for granted, programs like Word or Firefox are extremely sophisticated in their design and implementation. Maybe we have some distant knowledge that computers store and process information as a series of zeros and ones called “binary” – you know, the numbers that stream across the screen in hacker movies – but modern computers have developed enough that the typical user has no need to understand these underlying mechanics in order to perform high level operations. Somehow, rapid computation of numbers in binary has managed to simulate the human-familiar typewriter interface that I am using to compose this blog post.

There is a strange disconnect, then, when literature scholars begin to talk about using computers to “read” books. People – myself included – are often surprised or slightly confused when they first hear about these kinds of methods. When we talk about humans reading books, we refer to interpretive processes. For example, words are understood as signifiers that give access to an abstract meaning, with subtle connotations and historical contingencies. An essay makes an argument by presenting evidence and drawing conclusions, while we evaluate it as critical thinkers. These seem to rely on cognitive functions that are the nearly exclusive domain of humans, and that we, in the humanities, have spent a great deal of effort refining. Despite the magic behind our normative experience of computers, we suspect that these high-level interpretive operations lie beyond the ken of machines. We are suddenly ready to reduce computers to simple adding machines.

In fact, this reduction is entirely valid – since counting is really all that’s happening under the hood – but the scholars who are working on these computational methods are increasingly finding clever ways to leverage counting processes toward the kinds of cultural interpretation that interest us as humanists and that help us to rethink our assumptions about language.
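
To make “just counting” literal, here is a minimal sketch of the kind of tally that sits under the hood; the sample sentence comes from the opening of this post.

```python
# Minimal sketch of "just counting": tally word frequencies in a passage.
from collections import Counter

passage = ("Computers are basically magic. We turn them on and (mostly!) "
           "they do the things we tell them.")
words = passage.lower().split()   # naive whitespace splitting; punctuation stays attached
counts = Counter(words)
print(counts.most_common(3))      # the most frequent strings and their tallies
```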

Continue reading

Attributing Authorship to “Iterating Grace,” or The Smell Test of Style

Author attribution, as a sub-field of stylometry, is well suited to a relatively small set of circumstances: an unsigned letter sent by one of a handful of correspondents, an act of a play written by one of the supposed author’s colleagues, a novel in a series penned by a ghostwriter. The cases where author attribution shines are ones in which there exists (1) a finite list of potential authors (2) for whom we have writing samples in the same genre, and (3) the unknown text is itself long enough to have a clear style. If either of the latter conditions is unmet, the findings start getting fuzzy but are still salvageable. Failing the first condition, all bets are off.

And “Iterating Grace” fails all three.
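
For readers unfamiliar with how attribution tests usually work in practice, here is a toy sketch (not the analysis behind this post): compare relative frequencies of very common function words, which tend to mark an author’s unconscious habits. The file names are hypothetical placeholders.

```python
# Toy stylometry sketch: compare function-word frequency profiles.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def profile(text):
    """Relative frequency of each function word in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in FUNCTION_WORDS]

def distance(p, q):
    """Mean absolute difference between two stylistic profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

# Hypothetical writing samples for two candidate authors and the unknown text
candidates = {"Author A": open("author_a.txt").read(),
              "Author B": open("author_b.txt").read()}
unknown = open("iterating_grace.txt").read()

for name, sample in candidates.items():
    print(name, round(distance(profile(sample), profile(unknown)), 4))
```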

Continue reading