Text Modelling and the Anglo-Saxon Poetic Corpus
"Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte."
"I have made this letter longer than usual because I lack the time to make it shorter."
--Blaise Pascal, Letters
Although I am a scholar of Anglo-Saxon language and literature, the purpose of this article is not the examination of Old English poetry, or at least that is not the primary goal. Instead, I will be describing my process for subjecting the Anglo-Saxon Poetic Corpus to computer-aided analysis; in effect, I will be describing all the preliminary work necessary for engaging in this kind of study, rather than the findings themselves. I hope that this demonstration proves useful.
All software used for this project can be obtained free of charge through the appropriate links provided throughout the article. Although I have used Windows 8.1 as my primary operating system throughout this process, I have also made use of Linux (in this case, both Ubuntu 14.04 and Porteus 3.0.1) and the various tools that come with them.
What is Text Modelling?
In a nutshell, Text Modelling uses a massive number of sample texts to draw general conclusions about the ways those texts operate. This extends both to the usage of words on their own and their use in context. Although there is nothing inherently digital about this process, the sheer number of texts required for a meaningful sample and the repetitive nature of the tasks required for analysis make this work ideal for computer automation.
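To make the distinction between words on their own and words in context concrete, the following minimal Python sketch counts single words and adjacent word pairs in a short sample. It is an illustration only, not part of the workflow described in this article: the opening lines of Beowulf serve as stand-in input, and the names `tokenize` and `ngrams` are hypothetical helpers invented for this example.

```python
import re
from collections import Counter

# Sample input only (an assumption for illustration): the opening lines of Beowulf.
SAMPLE = (
    "Hwæt! We Gardena in geardagum, "
    "þeodcyninga, þrym gefrunon, "
    "hu ða æþelingas ellen fremedon."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w matches Unicode letters, so æ, þ and ð are kept.
    """
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize(SAMPLE)

# Words on their own: how often each word occurs.
unigram_counts = Counter(tokens)

# Words in context: how often each adjacent pair of words occurs.
bigram_counts = Counter(ngrams(tokens, 2))

print(unigram_counts.most_common(5))
print(bigram_counts.most_common(5))
```

With only three lines of verse, every count is 1. Tallies like these become informative only across a full corpus, which is why the work is left to a computer.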
Although Text Modelling as a term can describe any number of processes, the Digital Methods course focused on a specific few. Google's Ngram Viewer, Voyant, and the now-defunct Bookworm are effective tools for visualizing data, while Topic Models produced by tools like MALLET group words into topics based on which words appear together within the same texts.
With these things in mind, I developed a plan that would both require me to demonstrate knowledge of several different techniques and provide a glimpse into their practical use in my dissertation: to scrape, clean, and model the entire surviving corpus of Old English poetry. Admittedly, the entire Anglo-Saxon Poetic Corpus is only roughly 30,000 lines long, which is too little input to truly harness the power of such methods. Nonetheless, it is also a finite set of texts; although manuscripts are occasionally found, what we currently have represents the vast majority of all the poetry we will ever have from this language in this period in history.
As such, subjecting the surviving poetry of the Anglo-Saxons to Topic Modelling and other analytical tools may provide insights that were previously inaccessible. Although the sample may not be large enough for statistically significant patterns to emerge, the analysis itself may suggest other answers or avenues of inquiry worth investigating using traditional scholarly methods.
The following pages record the sometimes quite messy process by which I attempted to accomplish these goals. Again, these pages reflect on the methodology used and not on the results. Should you have any questions about either what I did or how I did it, please feel free to contact me at email@example.com.