Breaking down the HTRC data capsule

We’ve been combing through the index of the HathiTrust Data Capsule [more on how it was put together here]. We requested any volumes authored by a list of 380 writers and got roughly 2500 volumes back. After deduping and weeding out false positives, we have about 1500 unique volumes representing 234 Iowa-affiliated authors, with publication dates ranging from 1913 to 2011*.

We’re interested in analysis on a couple of levels: first, generating StyleCard and LitMap analytics for individual authors, and eventually, exploring other trends in Workshop writing that emerge when we look at Workshop texts in aggregate [more information on the tools here]. In light of this, here are some thoughts and questions on my mind as we look over the capsule.

Multiple authors

The co-authorship problem: how do we handle texts that were co-written by multiple authors?  Many volumes are anthologies of works by several authors, sometimes limited to Workshop graduates and sometimes not.  How shall we go about extracting text by just the authors we’re interested in (without having to do it manually)?

Volume Authorship breakdown

We also have some instances of books written by one Workshop writer, with a segment by another Workshop writer – an anthology of short stories by William Kittredge with a foreword by Raymond Carver, for example. In some contexts this might not be a problem, but when we start comparing author signals with StyleCard analytics, we don’t want to muddy a signal with text by another writer of interest. That being said, who is including whom in anthologies is an interesting question in and of itself.

Anthology duplicates

Many of the authors we’re interested in wrote primarily poems, short stories, or essays, and they’ve been published in anthologies.  Sometimes the same poem or story will appear in multiple anthologies.  How do we avoid double-dipping, and magnifying the signal of some works out of proportion to others (again, without having to do it manually)?
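One way to flag repeated pieces automatically is to compare overlapping word sequences ("shingles") between pieces and measure how much they share. The sketch below is only an illustration under some assumptions: that the anthologies have already been split into individual pieces, that each piece is available as plain text, and that a Jaccard similarity of 0.8 is a sensible cutoff. The function names, the identifiers, and the threshold are all hypothetical, and OCR noise in real volumes would probably call for a lower threshold.

```python
import re
from itertools import combinations


def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in a text."""
    # Lowercase and keep only letter runs so punctuation and case don't matter.
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of shingles two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(pieces, threshold=0.8, n=5):
    """Return pairs of piece ids whose texts overlap at or above the threshold.

    `pieces` maps a (hypothetical) piece id to its plain text.
    """
    sets = {key: shingles(text, n) for key, text in pieces.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sets), 2)
        if jaccard(sets[a], sets[b]) >= threshold
    ]
```

Comparing every pair is fine for a few thousand pieces; a much larger set would need something like MinHash to stay tractable. Once pairs are flagged, we could keep one copy of each piece and drop the rest before computing author metrics.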

Unequal distribution

Some authors we were interested in, of course, are not included in the corpus at all.  Of those that are, some are much more strongly represented than others.

Distribution of works per author.

About half of the authors represented have only one or two volumes in the corpus. Without further scrutiny I can’t say whether this is due more to a real difference in author prolificacy or to their representation and findability in the HT corpus (or, likely, a combination of the two). For author-specific corpora, this is going to mean a much more reliable, better-sampled author signal for some writers than for others.

For any analysis looking at some kind of “Workshop style”, we’re going to have to correct for this somehow. How should we go about this? Maybe chopping up works into segments and choosing random samples? Selecting representative works for each author based on some rationale relating to literary qualities or critical reception? Generating individual author metrics and combining them instead of running analytics on the entire corpus at once? Or, alternatively, should some authors be weighted more than others, and do we have the numbers in the corpus to do so in a way that we might want?
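To make the first of those options concrete, here is a minimal sketch of the "chop into segments and sample" idea. It assumes each author's works are available as plain-text strings; the function names, the 1000-word segment size, and the ten-segments-per-author quota are illustrative choices, not settings we have decided on.

```python
import random


def chunk_words(text, size=1000):
    """Split a text into consecutive segments of exactly `size` words.

    A trailing partial segment is dropped so every segment is the same length.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]


def balanced_sample(works_by_author, per_author=10, size=1000, seed=0):
    """Draw the same number of random segments for every author.

    `works_by_author` maps an author name to a list of plain-text works.
    Authors with fewer than `per_author` segments are left out entirely.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = {}
    for author, texts in works_by_author.items():
        segments = [seg for text in texts for seg in chunk_words(text, size)]
        if len(segments) < per_author:
            continue  # not enough text to meet the quota
        sample[author] = rng.sample(segments, per_author)
    return sample
```

The trade-off is visible in the `continue` line: a higher quota gives a steadier signal per author but excludes more of the authors who have only one or two volumes, so the quota itself becomes an editorial decision.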

Derivatives

There are a few dozen translations in the corpus, the majority of them by Robert Bly (accounting for a little under half of his included works).  We will likely not include them in initial stages of analysis, but I’m very interested to see if we notice distinctive differences between his original works and his translations!

An interesting tidbit: we also have a few works in the set that are musical arrangements of poetry by Workshop writers. It seems that we have those particular texts present in other formats, so they likely won’t be included in the analysis, but it was a neat detail.


* The Workshop was not founded until 1936.  Included publications predating the Workshop were written by early faculty, such as Edwin Ford Piper.