Breaking down the HTRC data capsule

We’ve been combing through the index of the HathiTrust Data Capsule [more on how it was put together here].  We requested any volumes authored by any of a list of 380 writers and got roughly 2500 volumes back.  After deduping and weeding out false positives, we have about 1500 unique volumes representing 234 Iowa-affiliated authors, with publication dates ranging from 1913 to 2011*.

We’re interested in analysis on a couple of levels – first, generating StyleCard and LitMap analytics for individual authors, and eventually, exploring other trends that can be found in Workshop writing when we look at Workshop texts in aggregate [more information on the tools here].  In light of this, here are some thoughts and questions on my mind as we look over the capsule.

Multiple authors

The co-authorship problem: how do we handle texts that were co-written by multiple authors?  Many volumes are anthologies of works by several authors, sometimes limited to Workshop graduates and sometimes not.  How shall we go about extracting text by just the authors we’re interested in (without having to do it manually)?

Volume Authorship breakdown

We also have some instances of books written by one Workshop writer, with a segment by another Workshop writer – an anthology of short stories by William Kittredge with a foreword by Raymond Carver, for example.  In some contexts this might not be a problem, but when we start comparing author signals with StyleCard analytics, we don’t want to muddy a signal with text by another writer of interest.  That being said, who is including whom in anthologies is an interesting question in and of itself.

Anthology duplicates

Many of the authors we’re interested in wrote primarily poems, short stories, or essays, and they’ve been published in anthologies.  Sometimes the same poem or story will appear in multiple anthologies.  How do we avoid double-dipping, and magnifying the signal of some works out of proportion to others (again, without having to do it manually)?
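One possible way to flag repeated pieces automatically is to compare overlapping word sequences ("shingles") between texts and treat pairs with a high Jaccard similarity as likely duplicates. The sketch below is only an illustration, not something we have run on the capsule: it assumes the anthologies have already been split into individual pieces, and the IDs, the function names, and the 0.8 threshold are all hypothetical.

```python
import re
from itertools import combinations


def shingles(text, n=5):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < n:
        # Very short pieces become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap divided by size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(pieces, threshold=0.8, n=5):
    """pieces maps a (hypothetical) piece ID to its text.

    Returns (id_a, id_b, score) for every pair at or above the threshold.
    The threshold is an assumption and would need tuning on real OCR text.
    """
    sets = {pid: shingles(text, n) for pid, text in pieces.items()}
    flagged = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged
```

Comparing every pair is quadratic in the number of pieces, so this would only be practical within one author's works or on a modest subset; flagged pairs would still want a human glance before anything is dropped.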

Unequal distribution

Some authors we were interested in, of course, are not included in the corpus at all.  Of those that are, some are much more strongly represented than others.

Distribution of works per author.

About half of the authors represented have only one or two volumes in the corpus. Without further scrutiny I can’t say whether this is due more to a real difference in author prolificacy or to their representation and findability in the HT corpus (or, likely, a combination of the two).  For author-specific corpora, this is going to mean a much more reliable/well-sampled author signal for some writers than others.

For any analysis looking at some kind of “Workshop style”, we’re going to have to correct for this somehow.  How should we go about this?  Maybe chopping up works into segments and choosing random samples?  Selecting representative works for each author based on some rationale relating to literary qualities or critical reception?  Generating individual author metrics and combining them instead of running analytics on the entire corpus at once? Or, alternatively, should some authors be weighted more than others, and do we have the numbers in the corpus to do so in a way that we might want?
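To make the first of those options concrete, here is a rough sketch of what chopping works into segments and drawing the same number of random segments per author could look like. It is a sketch under assumptions rather than a settled method: the segment size, the number of segments per author, and the function names are placeholders, and authors with too little text are simply skipped instead of being oversampled.

```python
import random


def chunk_words(text, size=1000):
    """Split text into consecutive segments of exactly `size` words.

    A trailing partial segment is dropped so every segment is the same length.
    """
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]


def balanced_sample(texts_by_author, size=1000, per_author=20, seed=0):
    """Draw the same number of random segments for each author.

    texts_by_author maps an author name to a list of that author's texts.
    Authors with fewer than `per_author` segments are left out (an assumption;
    one could also lower per_author or report them separately).
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = {}
    for author, texts in sorted(texts_by_author.items()):
        chunks = [c for t in texts for c in chunk_words(t, size)]
        if len(chunks) < per_author:
            continue
        sample[author] = rng.sample(chunks, per_author)
    return sample
```

With about half of the authors represented by only one or two volumes, the choice of `per_author` is a real trade-off: set it high and many authors drop out, set it low and the well-represented authors are thinly sampled.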

Derivatives

There are a few dozen translations in the corpus, the majority of them by Robert Bly (accounting for a little under half of his included works).  We will likely not include them in initial stages of analysis, but I’m very interested to see if we notice distinctive differences between his original works and his translations!

An interesting tidbit: we also have a few works in the set that are musical arrangements of poetry by Workshop writers. It seems that we have those particular texts present in other formats, so they likely won’t be included in the analysis, but it was a neat detail.


* The Workshop was not founded until 1936.  Included publications predating the Workshop were written by early faculty, such as Edwin Ford Piper.

Building the Program Era Project Database

We were invited last week to talk briefly about the sources and process of building up the database for the Program Era Project at the Digital Bridges Summer Institute co-hosted by Grinnell and the U of I.  Here I’ll give a more in-depth version of that talk.  If you note problems in our process that we haven’t considered or have suggestions of additional sources, let us know!  Building up and refining this dataset is proving to be a fascinating challenge with a lot of twists and turns.

When we first started out last fall we came up with a list of the sort of information we wanted to gather.  The obvious starting point was a list of Workshop graduates, ideally with their full name and year of graduation.  If possible we wanted to gather demographic information such as gender, race, and where they were from.  We wanted to know their thesis, whether they were Poetry or Fiction, and who their advisor was; and lastly, we wanted to know what they did after graduation.  Did they go on to teach at, direct, or found other writing centers?  Did they publish? Win awards?

So far we have found a variety of data sources that provide answers to some of these questions, with variable coverage and consistency.

sources timeline
Items with a black outline are digital; the lighter colored portions of bars are partial records; the striped bars are patchy, arbitrarily filled records.

 

We initially built up a foundation for the database using digital graduation records of English Master’s degrees from the Office of the Registrar and MFA thesis records from the Libraries catalog.  Together these gave us a list of years, graduation dates, and usually thesis titles running from the beginning of the period we wanted to examine to the present.  Records from the mid-1990s were reasonably complete in terms of providing actual program of study, thesis director, and thesis genre.  However, most of the dataset included students from other writing programs at Iowa, as well as students outside the writing programs – English criticism MAs and theatre MFAs, for example.  For the most part these were unmarked.

To filter out the records we don’t want and fill in the missing information, we’ve been checking this initial dataset against the commencement programs from graduation ceremonies, which are available in the University Archives.  In most cases the commencement programs are more specific about program of study, and usually also list the hometown of the student, which will make geographic analysis possible.
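Our checking is done by hand in the archives, but once the commencement programs are transcribed, the same comparison could in principle be sketched in code. The example below is purely illustrative: the field names (`name`, `year`, `program`, `hometown`) and the matching rule (normalized name plus year) are assumptions, not a description of our actual database.

```python
import re


def normalize(name):
    """Lowercase a name and strip punctuation and extra spaces for matching."""
    return " ".join(re.sub(r"[^\w\s]", "", name.lower()).split())


def cross_check(registrar_rows, commencement_rows):
    """Match registrar records to commencement-program entries.

    Both inputs are lists of dicts with hypothetical keys "name" and "year";
    commencement rows may also carry "program" and "hometown".
    Returns (matched, unmatched): matched rows gain the program and hometown,
    unmatched rows are left for a person to check against the archives.
    """
    lookup = {(normalize(r["name"]), r["year"]): r for r in commencement_rows}
    matched, unmatched = [], []
    for row in registrar_rows:
        hit = lookup.get((normalize(row["name"]), row["year"]))
        if hit is None:
            unmatched.append(row)
        else:
            merged = dict(row)  # copy so the original record is untouched
            merged["program"] = hit.get("program")
            merged["hometown"] = hit.get("hometown")
            matched.append(merged)
    return matched, unmatched
```

Exact matching on name and year will miss name variants, name changes, and off-by-one years, so the unmatched list is where the manual work would concentrate.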

The Graduate College was able to supply us with thesis advisor information from index cards they have on file going back to the 1970s.  To fill in the remaining gaps we will need to consult the hard copies of the MFA theses.  This project is currently on the back burner while the Libraries moves its offline storage materials from one facility to another, but we have plenty to do in the meantime.

Wilbers Survey Response
A survey response from Wilbers’ dissertation work.

Two other Special Collections finds have supplied us with information on accomplishments of Iowa writing grads after graduation.

In the 1970s, as part of his dissertation research, Stephen Wilbers surveyed writing programs to find Iowa grads who were then directing, or had at some point directed, those programs, and the original hand-filled survey responses are held in the Libraries.

Special Collections also holds records of a self-study performed by the Writers’ Workshop in 1992, as it was in the process of separating from the English Department.  The appendices of the report include lists of graduate accomplishments like writing programs founded elsewhere, publications, and writing awards.

The biggest challenge of the data so far has been determining just who should be in the dataset.  The vague and changing administrative status of the various writing programs over time is reflected in the records, and so determining what program (or sometimes programs) an individual graduated from in some cases requires cross-checking multiple sources.

All of this work would not be possible without our student workers, who have put, and continue to put, patient hours into comparing lists of records in the archives.