A quick post to announce the public distribution of word embeddings trained on the Chicago Text Lab’s corpus of US novels. They will be hosted on this blog and can be downloaded from this link (download) or through the static Open Code & Data page.
According to the Chicago Text Lab’s description, the corpus contains
American fiction spanning the period 1880-2000. The corpus contains nearly 9,000 novels, selected based on the number of library holdings recorded in WorldCat. They represent a diverse array of authors and genres, including both highly canonical and mass-market works. About 7,000 authors are represented in the corpus, with peak holdings around 1900 and the 1980s.
In total, the corpus consists of over 700M words, and the embeddings’ vocabulary contains 250K unique terms.
The embeddings are learned with the word2vec implementation distributed in the Python package gensim (version 4.0.1), using the skip-gram model described in Mikolov et al. (2013a) and Mikolov et al. (2013b). Training parameters include:
- Vector Size: 300 dimensions
- Window Size: 5 words
- Training Epochs: 3 iterations
All other parameters are left at gensim’s default values (see the documentation).
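For anyone who wants to train comparable embeddings on another corpus, a minimal sketch of the gensim 4.x call with these settings might look as follows (the `sentences` iterable and the output filename are placeholders, not the actual training pipeline):

```python
from gensim.models import Word2Vec

# Placeholder: an iterable of tokenized sentences, e.g. a list of token lists.
sentences = [["the", "modern", "novel"], ["a", "quick", "example"]]

model = Word2Vec(
    sentences=sentences,
    vector_size=300,  # 300-dimensional vectors
    window=5,         # 5-word context window
    epochs=3,         # 3 training iterations
    sg=1,             # skip-gram (rather than CBOW)
)

# Export in the plain-text word2vec format described below (filename is illustrative).
model.wv.save_word2vec_format("us_novels_word2vec.txt", binary=False)
```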
The embeddings are distributed as a plain-text file in the original word2vec format: one vector per line, beginning with the word itself and followed by its whitespace-separated values.
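That format loads directly with gensim’s KeyedVectors; a sketch, with the filename again standing in for whatever the downloaded file is called:

```python
from gensim.models import KeyedVectors

# Load the plain-text vectors (placeholder filename).
kv = KeyedVectors.load_word2vec_format("us_novels_word2vec.txt", binary=False)

print(kv.vector_size)  # 300
print(len(kv))         # vocabulary size, roughly 250K terms
```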
Enjoy a visualization of words similar to the keyword “modern” in the model.
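The nearest neighbors underlying such a visualization can be retrieved from the loaded vectors, for example:

```python
# Continuing from the loading snippet above: ten nearest neighbors of "modern"
# by cosine similarity (the actual neighbors depend on the model).
for word, similarity in kv.most_similar("modern", topn=10):
    print(f"{word}\t{similarity:.3f}")
```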