Collaborating with HathiTrust


The screenshot above documents an exciting moment for our ongoing collaboration with the HathiTrust Research Center. It is a screenshot of my computer, remotely connected to a HathiTrust machine, running my Program Era Project text-mining tools on sample HathiTrust text data. In short, it’s a proof-of-concept, a confirmation that the PEP tools are ready to begin collecting data on thousands of texts produced by creative writers affiliated with the University of Iowa, and that the PEP team can begin to join that data with the wealth of institutional, biographical, and demographic data they have already collected.

This screenshot is a result of HathiTrust’s selection of the Program Era Project as a 2017 Advanced Collaborative Support award winner. HathiTrust is a partnership of universities and research libraries that maintains an expansive digital library; HTRC is its research arm. As HathiTrust’s site explains, the ACS program is:

a scholarly service at the HathiTrust Research Center (HTRC) offering collaboration between external scholars and HTRC staff to solve challenging problems related to computational analysis. By working together with scholars, we facilitate computational-oriented analytical access to HathiTrust based on individual scholarly or educational need.

For the HathiTrust/PEP collaboration, the approach chosen was to establish a “Data Capsule”: a secure machine maintained by HathiTrust that PEP team members can access remotely to run text-mining experiments on a corpus of texts held in HathiTrust’s collections. The Data Capsule approach is crucial because the texts we need to access remain in copyright; they simply aren’t otherwise accessible in digital form for large-scale data collection. The Data Capsule configuration allows text-mining software to read the full texts of HathiTrust works, but only the metrics collected by the tools can be moved off the Data Capsule machine. In PEP’s case, this means .csv spreadsheets of data on individual texts.
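
To make that export step concrete: the only artifacts that leave the Capsule are rows of derived numbers. Here is a minimal sketch of writing such a per-text .csv with Python’s standard csv module; the field names and values are hypothetical illustrations, not the actual PEP schema.

```python
import csv

# Hypothetical per-text metrics, as a tool might compute them inside the Capsule.
metrics = [
    {"title": "Example Novel", "author": "A. Author",
     "avg_sentence_len": 18.4, "vocab_size": 9120, "male_female_ratio": 2.3},
]

with open("pep_metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(metrics[0]))
    writer.writeheader()        # column names in the first row
    writer.writerows(metrics)   # one row of derived numbers per volume
```

Only a file like this, never the underlying full text, would cross the Capsule boundary.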

Now, thanks to the HathiTrust/PEP collaboration, the tools I created for the Program Era Project (described a bit more here) can be employed on a large volume of digital texts. They can be used not just for experiments, but to begin building a database of metrics on features of creative writing at the University of Iowa. For the Data Capsule collection, the Program Era Project team assembled a list of roughly 400 selected authors associated with the Writers’ Workshop and the Nonfiction Writing Program. Since receiving this list, the HathiTrust team has worked to find all the works held by HTRC associated with these authors. At present, over 2,000 volumes have been connected to the PEP authors list. These items are then made accessible on the Data Capsule and, using the PEP tools, converted into metrics, which are stored in the PEP database.

So, what data is the Program Era Project collecting? Currently, I’ve built two text mining tools for the Program Era Project. Both are written in Python and draw on the Natural Language Toolkit (NLTK). We call them Style Card and LitMap.

Style Card is a text analysis tool that measures features of literary style such as vocabulary size, sentence length, adverb and adjective usage, and the frequency of male and female pronouns. The last metric is particularly interesting, as it can provide a quick impression of gender representation trends in a work or collection of works. Additionally, by collecting the same metrics across multiple authors or multiple works by one author, stylistic comparisons can be made between an author’s earlier and later works, or between the complete corpora of two authors. It is, in short, like creating baseball cards for authors and literary works: snapshots of information that can be used to establish or test hypotheses.
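
To give a feel for the kind of metrics Style Card reports, here is a minimal, self-contained sketch. The actual tools rely on NLTK’s tokenizers and part-of-speech tagger; this version substitutes simple regular-expression splitting and a hand-rolled pronoun list, and the function name and output fields are illustrative rather than the PEP code itself.

```python
import re

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def style_card(text):
    """Compute a few Style Card-like metrics for a raw text string."""
    # Crude sentence split on terminal punctuation; the real tools use NLTK.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "vocab_size": len(set(words)),                          # distinct word forms
        "avg_sentence_len": len(words) / max(len(sentences), 1),  # words per sentence
        "male_pronouns": sum(w in MALE for w in words),
        "female_pronouns": sum(w in FEMALE for w in words),
    }

card = style_card("He saw her. She smiled at him. He laughed.")
```

A full tool would add part-of-speech counts for the adverb and adjective metrics, but the shape of the output, a small dictionary of plain numbers per text, is the same.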

LitMap is a software package that tracks location references in literary corpora, making it possible to analyze regional representation in literary works. This allows us to see the influence of an author’s biography on their literary output, as well as to measure how authors migrating to and from creative writing programs affects the settings of their writing. Using LitMap, we’ve already made some interesting discoveries about the frequency with which works written by authors who taught at and/or attended the University of Iowa mention the state of Iowa and the Midwest region. The team is looking forward to sharing more with you on that topic in the future.
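
One simple way to count location references is gazetteer matching, sketched below with a tiny hand-made place list. To be clear, LitMap’s actual place list and matching method are not shown here; this is my own minimal illustration of the general technique.

```python
import re
from collections import Counter

# Tiny illustrative gazetteer; a real place list would be far larger.
GAZETTEER = {"iowa", "midwest", "chicago", "new york"}

def count_places(text):
    """Count mentions of known place names in a text."""
    lowered = text.lower()
    counts = Counter()
    for place in GAZETTEER:
        counts[place] = len(re.findall(r"\b" + re.escape(place) + r"\b", lowered))
    return +counts  # unary plus drops places with zero mentions

hits = count_places("She left Iowa for New York, but wrote about Iowa still.")
```

Summed across a corpus, counts like these are what a map visualization of regional representation would be built from.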

What’s truly significant (and truly promising) about the data these tools collect is that it will be stored in the PEP team’s database. When the ACS data is incorporated into that database and made available through the PEP web presence, users will be able, at a glance, to rank and compare features themselves.

The images below represent a proposal for the eventual look and feel of the Program Era Project web presence. The numbers used are drawn from data already collected with the PEP text mining tools. As the first figure shows, a user could rank writers by average sentence length, learning, at a glance, which authors typically create sprawling (or terse) sentences. The second image ranks authors based on the ratio of male pronouns to female pronouns. The larger the number, the more often male pronouns appear compared to female pronouns.
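
Once the per-author numbers are stored, a ranking like this reduces to a sort over the chosen metric. A small sketch with made-up figures (not actual PEP data):

```python
# Hypothetical per-author averages, as they might sit in the PEP database.
authors = {
    "Author A": {"avg_sentence_len": 24.1, "male_female_ratio": 3.2},
    "Author B": {"avg_sentence_len": 11.8, "male_female_ratio": 0.9},
    "Author C": {"avg_sentence_len": 17.5, "male_female_ratio": 1.6},
}

def rank_by(metric, descending=True):
    """Return author names ordered by a chosen metric."""
    return sorted(authors, key=lambda name: authors[name][metric], reverse=descending)

sprawling_first = rank_by("avg_sentence_len")  # longest average sentences first
```

The same one-line sort powers either figure: swap in `"male_female_ratio"` to reproduce the pronoun ranking.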

StyleCard Sentences.png


Users could also compare two authors—or an individual author to a control corpus—and look at differences such as first-person and third-person pronoun use (a potential indicator of narrational patterns and preferences) or adverb and adjective ratios (which can index spare or detailed prose). Scholars could see at a glance how an author’s stylistic features might compare to their advisor, how they compare to other writers in the corpus, or to a baseline corpus of writing in English.
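
A comparison view like this amounts to taking differences between two sets of stored metrics, whether the second set belongs to another author or to a control corpus. A toy sketch with invented numbers (the metric names here are illustrative, not the PEP schema):

```python
def compare(card_a, card_b):
    """Difference of one author's metrics relative to another's (or a control corpus)."""
    return {k: card_a[k] - card_b[k] for k in card_a if k in card_b}

student = {"adj_adv_ratio": 0.072, "first_person_rate": 0.031}
advisor = {"adj_adv_ratio": 0.055, "first_person_rate": 0.040}

delta = compare(student, advisor)
# A positive value means the first author uses that feature more often.
```

Because every metric is a simple number, the comparison itself stays transparent: a reader can see exactly what was subtracted from what.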

StyleCard Compare Depiction 2.png

StyleCard Control Compare Depiction.png

In the following image, produced using plotly, the platform we currently use to visualize LitMap data, we see another way the PEP text tools will provide new insights into literary corpora. The image shows the strong representation of Iowa in a literary corpus of 75 novels by Workshop-affiliated writers, documenting how their time at Iowa has left a mark on the places they write about.


The idea behind offering these metrics to future users of the Program Era Project website is that access to this information will prompt curiosity and exploration. Moreover, when users find an interesting pattern or phenomenon in the data, we hope it will prompt a direct investigation of the works included in the data. In short, beyond just presenting this information, we believe that the ability to skim over these metrics will inspire scholars to dive deeper into the texts the data is drawn from.

These objectives of encouraging emergent research and driving curiosity are at the heart of Style Card and LitMap’s other principal innovation: the use of clear, easy-to-understand metrics. The fields of stylometry and text analysis have developed techniques that allow for astounding technical and scholarly achievements, authorship attribution being a notable example. However, for users unfamiliar with the theoretical foundations or technologies employed, it can be difficult to understand how a piece of software or a quantitative approach arrived at its conclusions. The metrics tracked by Style Card were therefore selected to offer users information that is easy to understand and transparent. By using simple numbers, Style Card metrics allow any scholar, whatever their experience and training with quantitative analysis, to benefit from the Program Era Project website, broadening the range of academic projects that might be inspired by quantitative analysis.

Even better, both the Style Card and LitMap tools were developed in such a way that anyone can use them. You simply click the program file, type in the name of the author and work you are scanning, and choose a name for the output file. The tool does everything else. What this means is twofold. First, it allows broader collaboration in building our database of text metrics: if a team member can access the Data Capsule, they can easily run the software to collect metrics. Second, these tools will eventually be made freely available online, so any other project team that wishes to collect the same metrics will have that option. Because the tools will be open source, users will also be able to modify, adjust, and tweak the technology to their own needs. Moreover, any school interested in learning more about its own history of Creative Writing, or any school that wishes to establish a partner project to the one here at the University of Iowa, will have the necessary technology.

All that said, I hope you can see why we are so excited about the screenshot of our text tools running in conjunction with the HathiTrust Data Capsule. The image represents another significant step along the way to our goal of providing students and scholars of literature the ability to explore the history of the Iowa Writers’ Workshop—and the history of Creative Writing at Iowa—in a way that was never before possible.