The following is a summary of my dissertation project, written for a general audience. Consider this me dipping my toe into public writing. If you want to read or hear more, please be in touch!
The new wave of AI — powered by Large Language Models — has a love affair with fiction.
The models need vast training datasets of text, since they learn about language as a matter of sheer statistical regularity. Companies like OpenAI and Meta are secretive about their datasets but, from what researchers have gathered, virtually any model you’ve heard of was trained partly on novels.
What are they doing there?
Computer scientists sometimes point to the varied nature of literary writing and its encompassing generality of language. Other times, they highlight features of novels that are uniquely fictional, like an omniscient narrator.
While there is no agreement about the reason for training AI models on fiction, its effects have certainly already been felt. Large Language Models have seen industrial application since at least 2019, when a model called BERT was implemented under the hood of the Google search engine.
What difference does fiction make in a Large Language Model? And what difference has it already made for users?
I address these questions using a technique that I call “critical modeling.” I study one particular dataset by identifying key historical moments in its circulation and developing computational models that reflect researchers’ and engineers’ goals at the time.
Specifically, I study the BookCorpus, a collection of ten thousand commercial novels compiled in 2014 by a group of computer scientists. The dataset circulated widely among researchers and eventually became training data for the first-generation GPT model and for BERT.
I find that models trained on BookCorpus learn how to be social. Like characters in a novel, they learn the language of motive, conflict, and pragmatism. And the effect is sustained in user-facing applications as well. We meet these characters each time we converse with ChatGPT or submit a query to Google.
By taking a “critical modeling” approach, I show that different elements of the data become salient in each application. For app users — as for novel readers — fiction gives us the chance to think about different ways we connect to one another in the real world.
There is something to love, after all, about fiction in the age of Large Language Models.