I will store on this page permanent links to datasets and code that I use for projects on this website, and I invite you to join me in collaborating on them and learning through them. I also post code to GitHub.
Chicago Corpus Word Embeddings
The Chicago Corpus of US Novels is a widely used resource for distant reading projects on twentieth-century American literature: it contains full-text editions of more than 9,000 novels published between 1880 and 2000. Due to copyright, only part of the corpus can be made publicly available. However, derived datasets such as word embeddings trained on the novels can be freely shared. Word embeddings are numerical representations of words that encode information about meaning and usage.
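As a minimal illustration of how such embeddings can be queried, nearest neighbors can be ranked by cosine similarity. The vocabulary and vectors below are toy values for demonstration, not drawn from the Chicago Corpus embeddings themselves:

```python
import numpy as np

# Toy embedding table standing in for vectors trained on the corpus;
# the words and values here are illustrative only.
embeddings = {
    "novel":    np.array([0.9, 0.1, 0.3]),
    "story":    np.array([0.8, 0.2, 0.4]),
    "railroad": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, table):
    """Rank every other word in the table by similarity to `word`."""
    target = table[word]
    return sorted(
        ((other, cosine_similarity(target, vec))
         for other, vec in table.items() if other != word),
        key=lambda pair: pair[1], reverse=True)

print(most_similar("novel", embeddings))
```

With real trained vectors, the same two functions suffice to explore which words the novels use in similar ways.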
Keywords in Context
We can learn a great deal about the meanings of words from their usage in context. This is, of course, the intuition behind recent techniques like word embedding, but there are shorter routes from A to B if we wish to interpret our findings. Mapping keywords to their contexts (within an adjacency window) and performing Principal Component Analysis over the context space can help us to locate similar keywords near one another.
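A sketch of the idea, using a toy corpus and an assumed window of two words on each side (the notebook linked below may make different choices):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy corpus; in practice the sentences would come from the texts under study.
sentences = [
    "the old house stood on the hill",
    "the old mansion stood on the cliff",
    "the train moved along the track",
    "the engine moved along the rail",
]
window = 2  # adjacency window on each side (an illustrative choice)

tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Keyword-by-context count matrix: one row per keyword, one column per context word.
keywords = ["house", "mansion", "train", "engine"]
counts = np.zeros((len(keywords), len(vocab)))
for sent in tokens:
    for pos, word in enumerate(sent):
        if word in keywords:
            row = keywords.index(word)
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for ctx in sent[lo:pos] + sent[pos + 1:hi]:
                counts[row, index[ctx]] += 1

# Project the context space onto its first two principal components;
# keywords used in similar contexts land near one another.
coords = PCA(n_components=2).fit_transform(counts)
for word, (x, y) in zip(keywords, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

Here "house" and "mansion" share their contexts and so collapse to nearly the same point, well away from "train" and "engine".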
The code is available as a Jupyter Notebook here.
Replicating R Functions in Python
I prefer programming in Python to R for many reasons, but one of the major drawbacks is that Python simply does not have the robust and well-tailored packages that R does for statistics and visualization. Perhaps out of stubbornness or simply a compulsion to tinker, I often find myself building the most useful R functions not-quite-from-scratch in Python using a few of the latter’s statistical and machine learning packages (numpy, pandas, scikit-learn).
Visualizing Principal Component Analysis: biplot()
This is a simple script for Python that aims to replicate the most basic function, and the ease, of the biplot() function in R. It is not meant to be a comprehensive tool, but a shortcut for quick visualization of PCA. Feel free to change the script in any way; it is heavily commented to guide any edits you may need to make. The script relies on three standard scientific Python packages: pandas, scikit-learn, and matplotlib.
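A minimal sketch in the same spirit (the function body, defaults, and example data below are my own illustration, not the script itself): PCA scores plotted as points, variable loadings as labeled arrows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def biplot(df, out_path="biplot.png"):
    """Minimal stand-in for R's biplot(): observations as points and
    variable loadings as arrows on the first two principal components."""
    X = StandardScaler().fit_transform(df.values)
    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)
    loadings = pca.components_.T  # one row per original variable

    fig, ax = plt.subplots()
    ax.scatter(scores[:, 0], scores[:, 1], s=10)
    for name, (dx, dy) in zip(df.columns, loadings):
        ax.arrow(0, 0, dx, dy, color="red", head_width=0.03)
        ax.annotate(name, (dx, dy), color="red")
    ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%})")
    ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%})")
    fig.savefig(out_path)
    return scores, loadings

# Example on random data (the DataFrame here is purely illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("ABCD"))
scores, loadings = biplot(df)
```

The full script linked above adds the comments and options this sketch leaves out.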
Determining a “Correct” K for K-Means Clustering: Hartigan’s Rule
PYTHON-HARTIGAN is a Python script that provides an unsupervised method to determine an optimal value K in K-Means clustering. The script is an implementation of Hartigan’s Rule: essentially, it measures the change in goodness-of-fit as the number of clusters increases. This implementation is based on Chiang & Mirkin (2009), §3.1(B). Feel free to change the script in any way; it is commented to guide any edits you may need to make.
A similar function in R is FitKMeans() in the ‘useful’ package, which also implements Hartigan’s Rule.
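As a sketch of the rule itself (this is an illustration of the statistic as commonly stated, not the PYTHON-HARTIGAN script): increase K until the statistic H(K) = (W_K / W_{K+1} - 1)(n - K - 1) drops to 10 or below, where W_K is the within-cluster sum of squares with K clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def hartigan_k(X, max_k=10, threshold=10.0, random_state=0):
    """Pick K by Hartigan's Rule: return the smallest K whose statistic
    H(K) = (W_K / W_{K+1} - 1) * (n - K - 1) is <= threshold, where W_K
    is the within-cluster sum of squares (KMeans inertia) for K clusters."""
    n = X.shape[0]
    wss = {k: KMeans(n_clusters=k, n_init=10, random_state=random_state)
              .fit(X).inertia_
           for k in range(1, max_k + 2)}
    for k in range(1, max_k + 1):
        h = (wss[k] / wss[k + 1] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return max_k

# Demo on synthetic data with three well-separated clusters.
X, _ = make_blobs(n_samples=45, centers=3, cluster_std=0.5, random_state=42)
print(hartigan_k(X))
```

The conventional cutoff of 10 is a heuristic; it is worth experimenting with the threshold on your own data.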
Author Attribution “In The Wild”
Authorship attribution methods excel when the true author of a text is already in the list of suspects. However, this is often not the case when researchers perform attribution “in the wild.” A post on this blog regarding the authorship of a particular short story explores the interpretative assumptions of existing attribution methods, and it proposes a naive but generally applicable method (a “Smell Test”) for evaluating the possibility that a given suspect is the unknown text’s true author. Much of the code implements Eder (2013).
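For readers unfamiliar with the machinery such methods build on, here is a sketch of Burrows's Delta, the most-frequent-word distance measure at the heart of many attribution pipelines, including those evaluated in Eder (2013). This illustrates the general approach only; it is not the blog post's "Smell Test", whose details are in the post itself.

```python
from collections import Counter
import numpy as np

def mfw_frequencies(texts, n_words=5):
    """Relative frequencies of the n most frequent words across all texts."""
    tokens = [t.lower().split() for t in texts]
    totals = Counter(w for doc in tokens for w in doc)
    mfw = [w for w, _ in totals.most_common(n_words)]
    freqs = np.array([[doc.count(w) / len(doc) for w in mfw] for doc in tokens])
    return mfw, freqs

def delta(freqs, i, j):
    """Burrows's Delta: mean absolute difference of z-scored MFW frequencies."""
    z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)
    return float(np.mean(np.abs(z[i] - z[j])))

# Toy texts: the first two share a style (frequent "the"); the third does not.
texts = [
    "the cat sat on the mat and the dog sat on the rug",
    "the sun rose over the hill and the mist lay on the field",
    "a stranger walks into a town where a storm is brewing now",
]
mfw, freqs = mfw_frequencies(texts)
print(delta(freqs, 0, 1), delta(freqs, 0, 2))
```

In a real attribution setting the texts would be full novels and n_words would run into the hundreds; the point is only that stylistically similar texts yield a smaller Delta.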
The “Smell Test” as a self-contained function: smell-test()
Unfortunately, most of the texts in this project’s corpus are under copyright, and I cannot freely distribute them. I have included within the blog post, however, links to materials that exist online. For convenience, I have also included Iterating Grace.txt in the project’s GitHub repository.