Blog

Getting Started with HTRC Corpora: Tips, Tool, and Walkthrough

Before we begin: here is the link to my GitHub page containing the extractor script I’ll be discussing below. This script automates the process of extracting and organizing an HTRC corpus, turning the compressed archive of page files into a collection of assembled full text volumes.

This post follows on the heels of the recent announcement that HathiTrust Research Center has expanded access to its collection of digital texts, making its complete 16-million-item corpus available for computational text analysis work. Given the challenges scholars can face with gaining access to digital texts—particularly texts under copyright—this is exciting news for computational literary scholarship. The Program Era Project’s own efforts to conduct computational text analysis on contemporary literature were only made possible through an earlier collaboration with HTRC, as I have discussed before.

Given that the scope of who can gain access to HTRC texts has greatly expanded, I wanted to offer an overview of the process I went through to prepare the Program Era Project’s HTRC corpus for text analysis. Hopefully, my walkthrough of how I extracted and organized our corpus will offer some ideas or assistance to future HTRC users.

I’ll also tell you about a tool I built that automates this extraction process, one which is now freely available on my GitHub. This tool can be used in conjunction with other Python-based text analysis scripts to automate the process of collecting text data from multiple volumes at once. In a moment, I’ll go into greater detail on how to use the tool and how to pair it with your own text analysis scripts. First, however, let’s see what the process of extracting and organizing a HathiTrust corpus looks like.


Figure 1: The Original Archive and its Contents

In this image, you see the initial archive we received from HTRC, a list of volumes and a compressed file which contains a series of folders. Each folder contains additional files, which will be accessible and useful to us as soon as we extract the initial compressed archive.


Figure 2: The Original File, Extracted


Figure 3: An Individual Folder in the Extracted Archive

The above image shows a folder containing the contents of the extracted file, each volume contained in a separate folder. The second image shows contents of a single volume folder. In each volume folder, we find a ZIP file containing the full text of our volume and a JSON file which contains metadata on the volume in question. There are important things to note about both the ZIP and JSON files, things which will impact how we extract and organize full texts.

First, it’s important to recognize that within each ZIP file the full text for a volume is not collected together as a single document. Instead, each page of a full text is included as a separate .txt file. We need to reassemble these pages to turn our volumes’ pages into a complete work.
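As a rough illustration of that reassembly step, here is a minimal sketch, not the extractor's actual code. It assumes each page file name ends in its page number (for example `00000012.txt`), which you should check against your own archive; `page_number` and `assemble_pages` are names made up for this example.

```python
import re
import zipfile


def page_number(name):
    """Return the last run of digits in a page file name, e.g. '00000012.txt' -> 12."""
    digits = re.findall(r"\d+", name.rsplit("/", 1)[-1])
    return int(digits[-1]) if digits else 0


def assemble_pages(zip_path):
    """Read every .txt page in the ZIP and join them in numerical page order."""
    with zipfile.ZipFile(zip_path) as archive:
        pages = [n for n in archive.namelist() if n.lower().endswith(".txt")]
        # Sort on the number, not the string, so page 10 does not land before page 2.
        pages.sort(key=page_number)
        return "\n".join(
            archive.read(n).decode("utf-8", errors="replace") for n in pages
        )
```

Sorting on the integer rather than the raw file name matters whenever page names are not zero-padded to the same width.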

Second, the JSON file provides a great deal of information on the text in question, information we can use for research purposes and to help organize our texts. Included in the JSON files are things like author names, book titles, and unique HTRC ID numbers for each book. These will be useful for automatically naming and placing files with our extractor tool. The JSON files also often (though not always) contain information like ISBN numbers, publisher names, and publication dates. The PEP team was particularly excited to discover this metadata, as it was information we could incorporate into our database of text metrics. This new information also enabled new ways to look at text data. Publication dates, for instance, make it possible to view trends in text metrics we’ve measured over time.
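Because some of these fields are only sometimes present, any script that reads the metadata should tolerate gaps. The sketch below shows one way to do that with `dict.get`. The key names (`id`, `author`, `title`, `pubDate`, `publisher`, `isbn`) are placeholders I have assumed for illustration, not the documented HTRC schema, so open one of your own JSON files and adjust them.

```python
import json


def read_metadata(json_path):
    """Load a volume's metadata, tolerating fields that are absent."""
    with open(json_path, encoding="utf-8") as handle:
        record = json.load(handle)
    # Key names are assumptions for this sketch; check them against your files.
    return {
        "htrc_id": record.get("id", ""),
        "author": record.get("author", "Unknown Author"),
        "title": record.get("title", "Untitled"),
        "pub_date": record.get("pubDate"),      # often, but not always, present
        "publisher": record.get("publisher"),   # None when missing
        "isbn": record.get("isbn"),             # None when missing
    }
```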


Figure 4: PEP’s Extractor Tool Running on an HTRC Corpus

The above image shows my Program Era Project extractor tool working on our HTRC corpus. It uses the metadata in the JSON to automatically organize volumes by author and title. It also extracts, sorts, and assembles the individual pages in each ZIP into a complete work. While the tech-minded are more than welcome to look directly at the tool’s code, in basic terms, the steps to turn an individual volume’s compressed files into a readable work look like this:

  • The tool pulls author name, work name, and HathiTrust ID data from the associated JSON.
  • It uses the author name to create a folder for that author’s name (if one doesn’t already exist).
  • It creates a sub folder within the author folder named after the volume’s title.
  • It opens the ZIP file and extracts all page files to the book folder.
  • It sorts the page files in numerical order and creates a single text file out of them, named “full.txt,” in the book folder.
  • It places a copy of the metadata file in the folder, renamed with the HTRC ID number (e.g., a volume with the HTRC ID 123456 would be 123456JSON).
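The filing part of the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's own code: `safe_name` and `place_volume` are hypothetical helpers, the assembled text is passed in as a string, and the metadata copy is given a `.json` extension, which is my guess at the naming.

```python
import re
import shutil
from pathlib import Path


def safe_name(text):
    """Replace characters that are awkward in folder and file names."""
    return re.sub(r'[\\/:*?"<>|]+', "_", text).strip() or "unknown"


def place_volume(out_root, author, title, htrc_id, full_text, metadata_path):
    """Create author/title folders, write full.txt, and copy the metadata file."""
    book_dir = Path(out_root) / safe_name(author) / safe_name(title)
    # parents=True creates the author folder if needed; exist_ok avoids an
    # error when that author already has a folder from an earlier volume.
    book_dir.mkdir(parents=True, exist_ok=True)
    (book_dir / "full.txt").write_text(full_text, encoding="utf-8")
    # Assumed naming for the metadata copy: HTRC ID plus a .json extension.
    shutil.copy(metadata_path, book_dir / f"{safe_name(htrc_id)}.json")
    return book_dir
```

Cleaning the author and title before using them as folder names avoids failures on titles that contain slashes, colons, or question marks.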

This gives us a single folder where all the volumes, their page files, and their associated metadata can be found. It is a catch-all repository for our HTRC corpus, which is now extracted and assembled as full texts. That said, this archive can be somewhat large and unwieldy if all we’re interested in is the text files themselves, particularly in cases where the metadata might not be consistent. Discrepancies in author names, for instance, can create multiple folders for a single author, an issue that can be seen in the image below and that can require data cleaning.

Luckily, our tool gathers full texts together in another way. After the pages for a volume are organized and reassembled, a copy of the full text is placed in a folder set to be a complete repository of the full texts. In this folder (I call it “collected” above), each volume is named after its HTRC ID number. For example, a volume with the HTRC ID 7891011 would be 7891011.txt. This gives us a place where we have quick access to all our text files, and each is uniquely, consistently identified.

At this point, the extractor tool has done its work. However, we took a few additional steps to organize our collection and make it easier to navigate. Looking through the list of works provided by HTRC, the PEP team found a number of false positives, works not written by workshop authors that had been inadvertently included. We wanted to remove these works from our list. We also wanted to find a way to relabel files in our collected folder to make it easier to find a specific volume or a specific author’s works.

We turned to the HTRC ID numbers to help us remove false positives and rename the files for easy navigation. The PEP team created a spreadsheet of all the volumes in our collection. On this spreadsheet, author name entries were made consistent and all false positives removed. Each volume entry also listed the title’s HTRC ID. Because we had the unique HTRC ID, I was able to build a script that searched for that unique identifier in each file name and automatically replaced that HTRC ID name with a new naming scheme (authorname+year). Below you can see our collected archive before and immediately after I have run the script.

Using these unique HTRC IDs gave us a way to automatically enter clean, consistent data for our file names. As an added benefit, it made finding and removing all false positives simple. Because the false positives were not included in our spreadsheet, their names were not corrected. Therefore, any file that retained its HTRC ID was not on our master list and should be removed from our archive.
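A renaming pass along these lines might look like the sketch below. It is not the script the PEP team used: it assumes the spreadsheet has been exported as a CSV with columns named `htrc_id`, `author`, and `year` (hypothetical headers), and it adds a numeric suffix when two volumes share an author and year.

```python
import csv
from pathlib import Path


def rename_collected(collected_dir, spreadsheet_csv):
    """Rename <HTRC ID>.txt files from a spreadsheet; return files left unrenamed."""
    with open(spreadsheet_csv, newline="", encoding="utf-8") as handle:
        # Assumed column headers: htrc_id, author, year.
        new_names = {
            row["htrc_id"]: f'{row["author"]}{row["year"]}'
            for row in csv.DictReader(handle)
        }

    leftovers = []
    # sorted() builds the full list first, so renamed files are not revisited.
    for path in sorted(Path(collected_dir).glob("*.txt")):
        new_stem = new_names.get(path.stem)
        if new_stem is None:
            leftovers.append(path.name)  # not on the master list
            continue
        target = path.with_name(new_stem + ".txt")
        counter = 2
        while target.exists():  # same author and year more than once
            target = path.with_name(f"{new_stem}_{counter}.txt")
            counter += 1
        path.rename(target)
    return leftovers
```

The returned list is the set of files that kept their HTRC ID names, which on a first run are the candidates for removal as false positives. On a second run, files already renamed would also appear there, so it is worth running this once on a fresh copy of the folder.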

At this point, the PEP corpus had gone from a single compressed file to a collection of over 1000 full text volumes, assembled from individual pages, organized, and ready for future work. In the particular case of the Program Era Project, I incorporated our text mining scripts into the extraction process, producing datasets for each volume we could gather together into a larger dataset (after removing false positives).

Hopefully this account gives readers curious about working with an HTRC corpus a clearer picture of how to get one organized and readable. However, as I said at the beginning of this post, I’m also happy to note that my GitHub page contains a freely available version of the extractor script I built to automate the extraction process.

The tool is designed to be as intuitive as possible. No programming knowledge should be required, and the interface is almost entirely point and click. I’ll go through the steps to run the script here, but you can also find the readme for the extractor script on my GitHub.

To begin using the tool, your first step is to extract the initial compressed archive provided by HTRC. This should give you a collection of folders, each containing a ZIP file and a JSON file. If you find a folder without a ZIP or JSON file, remove it, because it will crash the tool. If the tool crashes during extraction, check the folder it was working on for a missing ZIP or JSON.
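If the archive is large, checking folders by hand gets tedious. The sketch below is a small standalone check, separate from the extractor itself, that lists any volume folder lacking a ZIP or JSON file; `find_incomplete_folders` is a name invented for this example.

```python
from pathlib import Path


def find_incomplete_folders(archive_root):
    """Return names of volume folders lacking a ZIP file or a JSON file."""
    incomplete = []
    for folder in sorted(Path(archive_root).iterdir()):
        if not folder.is_dir():
            continue  # ignore loose files such as the volume list
        suffixes = {p.suffix.lower() for p in folder.iterdir() if p.is_file()}
        if ".zip" not in suffixes or ".json" not in suffixes:
            incomplete.append(folder.name)
    return incomplete
```

Calling `find_incomplete_folders` on the extracted archive folder before starting the tool gives you the list of folders to remove or repair.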


Figure 5: Opening the terminal

Once the initial archive is extracted, run the script. A simple way to do it can be seen above. First, right click somewhere in the folder where the script is located and select “open in terminal.” You can also navigate to the folder manually if you prefer. After you have done that, type in the following command:

python3 HTRCExtractor-Release.py

At this point, prompts will appear on the screen. They will ask you to select three folders:

  • The first folder you select is the one which contains the subfolders with ZIP and JSON files.
  • The second folder is where the tool will place all the page files, metadata files, and full text files, organized by author folders and work name subfolders.
  • The third folder is where the tool will place all the full text files (labeled by HTRC ID).

The tool should then run, letting you know as it works through the files in your archive. It will also notify you when it has completed. It lists my e-mail address should you encounter any bugs or have any questions, comments, etc…

Note: If you are working with the tool for the first time in an HTRC data capsule, you will want to run it first in maintenance mode, so that it can import any modules it needs to run. Once the initial folder select prompt appears, you can simply close the tool. You can then switch to secure mode and run the tool as I explain above.

For users who’d like to integrate their own text mining scripts into the tool, please examine the code. I’ve marked a place that allows you to insert your own scripts easily, running them on each volume you extract. Inserting text mining scripts there also allows you access to the HTRC metadata for the volume. I’ve also included some sample code to get the assembled file up and running for text-mining, though you may likely have alternative ways you’d like to work with or open the data.
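To give a sense of what a per-volume script could look like, here is a hypothetical example. The function name, its arguments, and the `htrc_id` key are assumptions made for illustration; the marked insertion point in the extractor's code may expose the text and metadata differently, so treat this as a shape to adapt, not as the tool's interface.

```python
import re


def analyze_volume(full_text, metadata):
    """Hypothetical per-volume hook: return one row of simple metrics."""
    words = re.findall(r"[A-Za-z']+", full_text.lower())
    return {
        "htrc_id": metadata.get("htrc_id", ""),  # assumed metadata key
        "word_count": len(words),
        "vocabulary_size": len(set(words)),
    }
```

Returning one dictionary per volume makes it straightforward to collect the rows and write them out as a single spreadsheet once the extraction finishes.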

To those of you interested in getting started with HTRC corpora, I hope this walkthrough and the tool provide some useful starting points. The more we increase the accessibility of computational text analysis tools and corpora (particularly corpora still in copyright) the more we all stand to benefit from greater diversity of computational analysis research projects as well as more experimentation (and, ideally, innovation) within the field of computational text analysis.

Happy mining.

Public Records Part II

Note:  This post follows up on the post of June 28 regarding my ongoing attempts to access the administrative records of the Writers’ Workshop, for which I have had to make a FOIA Public Records Request through the University of Iowa’s Transparency Office:

Public Records Request

Thanks to the generosity of Mark McGurl I have now received a redacted version of the Director’s Files from the Frank Conroy Era currently housed under restriction as part of the Iowa Writers’ Workshop Records in the Special Collections and University Archives of the UI Library.  I received 21 pdf files totaling 626 pages.  They are organized and named alphabetically starting with “A_Redacted.”  According to the Transparency Office these files represent about ¾ of the entire series I requested.  The Workshop was provided with the opportunity to withhold materials prior to the Transparency Office’s redactions.

Before I discuss what’s in the redacted records I’d like to summarize what’s in the Workshop records as a whole (their record-keeping has in fact been quite thorough), how the authority over access to these records has shifted since my FOIA request, and the state and federal laws that will determine access in the future.

The Iowa Writers’ Workshop Records consist of 13 series of varying sizes and provenance.  Here is the content description from the Library Finding Aid:

Series I: Student Coursework, consists of photocopies of students’ works arranged by semester and class section within each semester. It is the largest series in the collection, dating from Fall 1965 to the present. Note that a few of the semesters are filed out of chronological sequence.

Series II, Award Competitions, consists of writing entries from individuals vying for scholarships and other awards.

Series III, Students and Alumni, consists of files containing correspondence, applications, and other material, arranged alphabetically by name of individual within each accrual. Note that accrual dates cover academic years  e.g., 1986-91 covers the 1986-87 to 1991-92 academic years. Restricted access.

Series IV, Faculty, is arranged alphabetically by name of individual. Restricted access

Series V, Director’s Files, consists of correspondence and other material created and received by the Office of the Director. Restricted access.

Series VI, Administrative Files. Restricted access.

Series VII, Accepted  Not Coming. Restricted access

Series VIII, Rejected Applicants’ Evaluation Sheets. Restricted access

Series IX, Applicants’ Letters of Recommendation. Restricted access.

Series X, Ephemera, includes posters and other printed matter, dating from 1982 to present.

Series XI, Stephen Wilbers Project, consists of correspondence and interview notes prepared by an alumnus of the Workshop who prepared a history of the program in 1980.

Series XII, Jean Wylder Project, consists of survey responses obtained from numerous alumni during the early 1970’s as part of a history project. The responses are arranged by era of attendance/graduation.

Series XIII, Newsletters, consists of newsletters released once or twice yearly since 1970 chronicling the publishing activity of Workshop alumni and students, as well as Workshop programs and events.

The original restrictions consisted in consulting with the University Archivist and the Workshop.  Since my request, access to Series I-IX of these records comes under the authority of the Transparency Office. This includes the 380 boxes of student coursework running from 1965-2011, which prior to my request were under no restriction at all.  Anyone wishing access to these records must now file a FOIA request with the UI Transparency Office and pay for its processing (as an example, the charge for reviewing the Director’s Files was $1080; the estimate for processing the remainder of the administrative records is $12,000).

Two laws are used by the University of Iowa Transparency Office in redacting requested records, one federal and one state:

  • The Family Educational Rights and Privacy Act (FERPA) (20 U.S.C. § 1232g; 34 CFR Part 99), that protects the privacy of student education records. The law applies to all schools that receive funds under an applicable program of the U.S. Department of Education.

https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html

And

  • Chapter 22.7 of the Iowa Code concerning the Examination of Public Records.

https://www.legis.iowa.gov/docs/code/22.7.pdf

I anticipate that these laws, and others like them, will restrict access to educational records for future researchers into the history of the Program Era.

 

So what’s in the records that I received and what was redacted?

Selectively and roughly alphabetically here are some examples:

A redacted copy of the AWP’s 1992 Survey of Creative Writing Programs filled out by Conroy.  The redactions are interesting, and indicate some of the difficulties researchers will face in determining the demographic composition of creative writing cohorts.  Since the law apparently dictates that any statistical measure of 6 or fewer risks exposing individual identities (I say apparently because I couldn’t find this specified in either law), we can know that in 1992 the Workshop had 55 men and 45 women enrolled, but we can’t know how many (if any) of them were American Indian, Asian, Black, Hispanic, White, or “Other.”  Nor are we allowed to know the gender or ethnic background of the faculty (though we do know that 7 are Full Professors and 5 are Adjuncts).

A_Excerpts 30

Extensive discussion with the upper administration concerning the awarding of Teaching-Writing Fellowships (TWiFs as they were colloquially known) and their importance in competing with other programs for students.  All student names in these discussions are redacted.

Much discussion with a series of English Department Chairs regarding the administrative autonomy of the Workshop.  Indeed, the process whereby the Workshop achieved this autonomy, ultimately resulting in their move to Dey House, is documented in some detail in these pages.

An itinerary for, though no results from, the External Review conducted of the Workshop by Nicholas Delbanco and Stephen Tatum in 1992.

A series of increasingly testy exchanges about violations of the smoking policy in the English-Philosophy Building (EPB).  Though the violator is never named, it is well-known that Conroy was a chain smoker.

Correspondence with or about Frederick Barthelme, Saul Bellow, Annie Dillard (about the possibility of adding a Literary Nonfiction Track to the WW), Gail Godwin, John Irving, Norman Mailer, Tom McGrath, Joyce Carol Oates, George Plimpton, and Roger Straus.

A Graduate Writing Faculty Assistance Survey issued by the University of Houston and filled out by Conroy which confirms that faculty meet with graduate students a “fair amount” in “bookstores, local bar and local restaurant (Foxhead, The Mill),” in a “miraculously open” climate, and affirms that the criterion for acceptance includes “the pressure of the soul behind the language.”

C 15

Extensive correspondence, internal and external, some redacted, involving funding, for TAships, visiting writers, copy machines (they made 100,000 copies a month), and more.

Numerous redacted nominations for prizes and awards.

A letter to all Workshop Faculty discouraging them from conducting workshops in their homes.

 

This is only a selection; there is more in these records, far more than I have any use for (and none of it violating any privacy or confidentiality laws). Unfortunately, I’m not sure where the funding will come from to redact the remaining administrative records, nor where they would be housed if the funding is secured.  At this point, what I’ve received is housed in a folder in my Dropbox account.  Do let me know if you’d like access.

 

Breaking down the HTRC data capsule

We’ve been combing through the index of the HathiTrust Data Capsule [more on how it was put together here].  We requested any volumes authored by a list of 380 writers and got roughly 2500 volumes back.  After deduping and weeding out false positives, we have about 1500 unique volumes representing 234 Iowa-affiliated authors, with publication dates ranging from 1913-2011*.

We’re interested in analysis on a couple of levels – first, generating StyleCard and LitMap analytics for individual authors, and eventually, exploring other trends that can be found in Workshop writing  when we look at workshop texts in aggregate [more information on the tools here].  In light of this, here are some thoughts and questions on my mind as we look over the capsule.

Multiple authors

The co-authorship problem: how do we handle texts that were co-written by multiple authors?  Many volumes are anthologies of works by several authors, sometimes limited to Workshop graduates and sometimes not.  How shall we go about extracting text by just the authors we’re interested in (without having to do it manually)?

Volume Authorship breakdown

We also have some instances of books written by one Workshop writer, with a segment by another Workshop writer – an anthology of short stories by William Kittredge with a foreword by Raymond Carver, for example.  In some contexts this might not be a problem, but when we start comparing author signals with StyleCard analytics, we don’t want to muddy a signal with text by another writer of interest.  That being said, who is including who in anthologies is an interesting question in and of itself.

Anthology duplicates

Many of the authors we’re interested in wrote primarily poems, short stories, or essays, and they’ve been published in anthologies.  Sometimes the same poem or story will appear in multiple anthologies.  How do we avoid double-dipping, and magnifying the signal of some works out of proportion to others (again, without having to do it manually)?
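One possible starting point, offered only as a sketch, is to fingerprint each piece after stripping case, punctuation, and spacing, and to keep the first copy of each fingerprint. This assumes the anthologies have already been split into individual poems or stories, which is its own unsolved step, and it only catches exact repeats: OCR differences between two scans of the same poem would defeat it.

```python
import hashlib
import re


def fingerprint(text):
    """Hash a piece after lowercasing and collapsing punctuation and whitespace."""
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_repeats(pieces):
    """Keep the first occurrence of each piece; pieces is a list of (label, text)."""
    seen, unique = set(), []
    for label, text in pieces:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append((label, text))
    return unique
```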

Unequal distribution

Some authors we were interested in, of course, are not included in the corpus at all.  Of those that are, some are much more strongly represented than others.

Distribution of works per author.

About half of the authors represented have only one or two volumes in the corpus. Without further scrutiny I can’t say whether this is due more to a real difference in author prolificacy or to their representation and findability in the HT corpus (or, likely, a combination of the two).  For author-specific corpora, this is going to mean a much more reliable/well-sampled author signal for some writers than others.

For any analysis looking at some kind of “Workshop style”, we’re going to have to correct for this somehow.  How should we go about this?  Maybe chopping up works into segments and choosing random samples?  Selecting representative works for each author based on some rationale relating to literary qualities or critical reception?  Generating individual author metrics and combining them instead of running analytics on the entire corpus at once? Or, alternatively, should some authors be weighted more than others, and do we have the numbers in the corpus to do so in a way that we might want?
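To make the first of those options concrete, here is a minimal sketch of segment-and-sample balancing. The segment length, the number of segments per author, and the decision to skip authors with too little text are all arbitrary choices for illustration, not settled project decisions.

```python
import random


def segment(text, size=1000):
    """Split a text into consecutive chunks of `size` words (partial tail dropped)."""
    words = text.split()
    return [
        " ".join(words[i:i + size])
        for i in range(0, len(words) - size + 1, size)
    ]


def balanced_sample(texts_by_author, per_author=20, size=1000, seed=0):
    """Draw the same number of random segments for every author."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = {}
    for author, texts in texts_by_author.items():
        pool = [chunk for text in texts for chunk in segment(text, size)]
        if len(pool) >= per_author:  # authors with too little text are skipped
            sample[author] = rng.sample(pool, per_author)
    return sample
```

Every author who clears the threshold then contributes the same amount of text, at the cost of dropping thinly represented authors entirely, which is exactly the trade-off raised above.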

Derivatives

There are a few dozen translations in the corpus, the majority of them by Robert Bly (accounting for a little under half of his included works).  We will likely not include them in initial stages of analysis, but I’m very interested to see if we notice distinctive differences between his original works and his translations!

An interesting tidbit: we also have a few works in the set that are musical arrangements of poetry by Workshop writers. It seems that we have those particular texts present in other formats, so they likely won’t be included in the analysis, but it was a neat detail.


* The Workshop was not founded until 1936.  Included publications predating the Workshop were written by early faculty, such as Edwin Ford Piper.

Collaborating with HathiTrust

HathiScreen.png

The screenshot above documents an exciting moment for our ongoing collaboration with the HathiTrust Research Center. It is a screenshot of my computer, remotely connected to a HathiTrust machine, running my Program Era Project text-mining tools on sample HathiTrust text data. In short, it’s a proof-of-concept, a confirmation that the PEP tools are ready to begin collecting data on thousands of texts produced by creative writers affiliated with the University of Iowa, and that the PEP team can begin to join that data with the wealth of institutional, biographical, and demographic data they have already collected.

This screenshot is a result of HathiTrust’s selection of the Program Era Project as a 2017 Advanced Collaborative Support award winner. HTRC is a collaboration between partner universities that houses an expansive digital library. As HathiTrust’s site explains, the ACS program is:

a scholarly service at the HathiTrust Research Center (HTRC) offering collaboration between external scholars and HTRC staff to solve challenging problems related to computational analysis. By working together with scholars, we facilitate computational-oriented analytical access to HathiTrust based on individual scholarly or educational need.

For the HathiTrust/PEP collaboration, the approach chosen was to establish a “Data Capsule,” a machine maintained and secured by HathiTrust, that PEP team members can remotely access and then run text mining experiments on a corpus of texts held in HathiTrust’s collections. The Data Capsule approach is crucial, as the texts to which we require access remain in copyright; they simply aren’t accessible in digital form for large-scale data collection. The Data Capsule configuration allows full texts of HathiTrust works to be measured by text mining software, but only the metrics collected by the tools can be moved off the Data Capsule machine. In PEP’s case, this means .csv spreadsheets of data on individual texts.

Now, thanks to the HathiTrust/PEP collaboration, the tools I created for the Program Era Project (described a bit more here) can be employed on a large volume of digital texts. They can be used not just for experiments, but to begin building a database of metrics on features of creative writing at the University of Iowa. For the Data Capsule collection, the Program Era Project team assembled a list of roughly 400 selected authors associated with the Writers’ Workshop and the Nonfiction Writing Program. Since receiving this list, the HathiTrust team has endeavored to find all the works held by HTRC associated with these authors. At present, over 2000 volumes have been connected to the PEP authors list. These items are then made accessible on the Data Capsule and, using the PEP tools, converted into metrics which are stored in the PEP database.

So, what data is the Program Era Project collecting? Currently, I’ve built two text mining tools for the Program Era Project. Both are written in Python and draw on the Natural Language Toolkit (NLTK). We call them Style Card and LitMap.

Style Card is a text analysis tool that measures features of literary style such as vocabulary size, sentence length, adverb and adjective usage, and frequency of male and female pronouns. The last metric is particularly interesting, as it can provide a quick impression of gender representation trends in a work or collection of works. Additionally, by collecting the same metrics from multiple authors or multiple works from one author, stylistic comparisons can be made between aspects of later or earlier works of a single author or the complete corpora of two authors. It is, in short, like creating baseball cards for authors and literary works, snapshots of information that can be used to establish or test hypotheses.
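For readers who want a feel for how metrics of this kind are computed, here is a simplified sketch. It is not the Style Card code: it uses plain regular expressions in place of NLTK's tokenizers, the pronoun lists are my own short selections, and it leaves out the adverb and adjective counts, which need a part-of-speech tagger.

```python
import re

# Illustrative pronoun lists, not the ones used by Style Card.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def style_metrics(text):
    """Compute vocabulary size, average sentence length, and a pronoun ratio."""
    # Crude sentence split on ., !, and ?; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    male = sum(w in MALE for w in words)
    female = sum(w in FEMALE for w in words)
    return {
        "vocabulary_size": len(set(words)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # None when there are no female pronouns, to avoid dividing by zero.
        "male_to_female_pronouns": male / female if female else None,
    }
```

A ratio above 1 means male pronouns appear more often than female pronouns in the text.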

LitMap is a software package that tracks location references in literary corpora, making it possible to analyze regional representation in literary works. This allows us to see the influence of an author’s biography on their literary output as well as measure the influence of authors migrating to and from creative writing programs on the settings of their writing. Using LitMap, we’ve already made some interesting discoveries about the frequency with which works written by authors who taught at and/or attended the University of Iowa mention the state of Iowa and the region of the midwest. The team is looking forward to sharing more with you on that topic in the future.
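The core idea of counting location references can be sketched with a fixed list of place names, as below. This is a deliberately simple stand-in, not how LitMap itself works: a real tool has to find place names it was not given in advance and tell places apart from people or ordinary words that share a name.

```python
import re
from collections import Counter


def count_places(text, places):
    """Count case-insensitive whole-phrase mentions of each place name."""
    counts = Counter()
    for place in places:
        # \b keeps "Iowa" from matching inside a longer word.
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts
```

Note that with this approach a mention of "Iowa City" also counts toward "Iowa", so overlapping names need to be handled deliberately before the counts are mapped.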

What’s truly significant (and truly promising) about the data these tools collect is that it will be stored in the PEP team’s database. When the ACS data is incorporated into the PEP database and available to future users of the PEP web presence, users will be able, at a glance, to rank and compare features themselves.

The images below represent a proposal for the eventual look and feel of the Program Era Project web presence. The numbers used are drawn from data already collected with the PEP text mining tools. As the first figure shows, a user could rank writers by average sentence length, learning, at a glance, which authors typically create sprawling (or terse) sentences. The second image ranks authors based on the ratio of male pronouns to female pronouns. The larger the number, the more often male pronouns appear compared to female pronouns.

StyleCard Sentences.png

StyleCardRankingDepiction.png

Users could also compare two authors—or an individual author to a control corpus—and look at differences such as first-person and third-person pronoun use (a potential indicator of narrational patterns and preferences) or adverb and adjective ratios (which can index spare or detailed prose). Scholars could see at a glance how an author’s stylistic features might compare to their advisor, how they compare to other writers in the corpus, or to a baseline corpus of writing in English.

StyleCard Compare Depiction 2.png

StyleCard Control Compare Depiction.png

In the following image, produced using plotly, the platform we currently use to visualize LitMap data, we see another way the PEP text tools will provide new insights into literary corpora. The image shows the strong representation of Iowa in a literary corpus comprising 75 novels by Workshop-affiliated writers, documenting how their time at Iowa has left a mark on where they write about.

Workshop75Updated

The idea behind offering these metrics to future users of the Program Era Project website is that access to this information will prompt curiosity and exploration. Moreover, when users find an interesting pattern or phenomenon in the data, we hope it will prompt a direct investigation of the works included in the data. In short, beyond just presenting this information, we believe that the ability to skim over these metrics will inspire scholars to dive deeper into the texts the data is drawn from. These objectives of encouraging emergent research and driving curiosity are at the heart of Style Card and LitMap’s other principal innovation: the use of clear, easy to understand metrics. The fields of stylometry and text analysis have developed techniques that allow for astounding technical and scholarly achievements, author attribution being a notable example. However, understanding how a piece of software or a quantitative approach arrived at the conclusions it did can be difficult for users not familiar with the theoretical foundations or technologies employed. To this end, metrics tracked by Style Card were selected so that users are offered information that is easy to understand and transparent. By using simple numbers, StyleCard metrics allow any scholar—whatever their experience and training with quantitative analysis—to benefit from the Program Era Project website, broadening the number of academic projects that might be inspired by quantitative analysis.

Even better, both the Style Card and LitMap tools were developed in such a way that anyone can use them. You simply click the program file, type in the name of the author and work you are scanning, and select the name you want for the output file you will create. The tool does everything else. What this means is twofold. First, it allows more collaboration in building our database of text metrics. If a team member can access the Data Capsule, they can easily run the software to collect metrics. Second, these tools will eventually be made freely available online. Therefore, any other project team that wishes to collect the same metrics will have the option available to them. Because the tools will be open source, users will also have the option to modify, adjust, and tweak the technology to their own needs. Moreover, any school that might be interested in learning more about their own history of Creative Writing, or any school that might wish to establish a partner project to the one here at the University of Iowa, will have the necessary technology.

All that said, I hope you can see why we are so excited about the screenshot of our text tools running in conjunction with the HathiTrust Data Capsule. The image represents another significant step along the way to our goal of providing students and scholars of literature the ability to explore the history of the Iowa Writers’ Workshop—and the history of Creative Writing at Iowa—in a way that was never before possible.

Public Records Request

For folks who are interested in the backstory behind the article recently published in the Iowa City Press-Citizen, this blog post will fill you in.

http://www.press-citizen.com/story/news/2017/06/19/iowa-writers-workshop-archive-costly-search-scholar-finds/391011001/

After lengthy deliberation and extensive work in the University Archives, I’ve decided to write a literary history of Iowa City to be called City of Literature.  I have already repeatedly consulted the papers of Paul Engle, George Starbuck, and John Leggett, but material from the Conroy era was under restriction.  According to the online finding aid, this material was accessible with permission from the Writers’ Workshop. In March, I decided to request permission for access to these records.

To my surprise, I was instructed to make a request through the Transparency Office. When I wrote them to confirm, they agreed that this was highly unusual; indeed, they later confirmed that they had never received a comparable request and had never worked with the University Archives. Here is their portal:

https://publicrecordsrequests.iowa.uiowa.edu

After some deliberation, I decided to go ahead and submit the request. A few months later I received this cost estimate:

Professor Glass,

Below are descriptions of Series IV-VI, the approximate number of pages to be reviewed in each series, and the estimated staff time involved with reviewing each page for confidential information or other information exempt from disclosure.

This estimate includes only the time it will take to review all of the records and to identify which ones are subject to disclosure (in full, or with redactions) under the state open records law. If you decide to proceed with this request, we will send you an invoice and begin the process of reviewing the records after we receive your payment. Once the initial review has been completed, we will send you a second estimate for the time it will take to then copy or scan the records which may be disclosed and to make any necessary redactions. At this time, we are not able to provide you with the second estimate because we won’t know how many records will need to be copied/scanned or redacted until the initial review is complete.

As we have previously informed you, and as the descriptions below indicate, many of these records are likely to contain information exempt from disclosure under Iowa Code Chapter 22, such as confidential personnel information or student/FERPA records. Although we are estimating that a considerable amount of time will be required to review all of these records, please be aware that many records may still end up being withheld.

The review times below were calculated based on an average rate of 8 seconds per sheet for Series IV, and 10 seconds for each sheet for Series V and VI (due to greater complexity among the records in the latter two series).

Series IV (Faculty series)

Description: Letters of recommendation, evaluations, criticisms of submissions, applications, appointment sheets (which contain hire and separation dates, position description, date of birth, salary, SSN). Arranged alpha by name.

1 box containing approximately 3,240 sheets

Time estimate for review: 7.2 hours

Series V (Director’s files series)

Description: Administrative correspondence of Workshop director Frank Conroy, which includes correspondence regarding faculty and staff appointments, internal funding, external fundraising, personal professional development activities, and letters of recommendation

1 box containing approximately 3,240 sheets

Time estimate for review: 9 hours

Series VI (Administrative series)

Description: A wide range of records, such as financial aid records, Ida Beam Professorship files (including applications, evaluations of candidates), gift acknowledgments, correspondence including letters of recommendation, award competitions, self-studies.

21 boxes containing approximately 63,720 sheets

Time estimate for review: 177 hours

Total time estimate for initial review: 193.2 hours (billed at a rate of $30/hour, with the first hour free of charge)

Please let us know if you would like to proceed with all or part of this request, and we will send you the first invoice. We will begin reviewing the records once we receive your payment.

 

NOTICE TO RECIPIENT: THIS MESSAGE AND ANY RESPONSE TO IT MAY CONSTITUTE A PUBLIC RECORD, AND THEREFORE, MAY BE AVAILABLE UPON REQUEST IN ACCORDANCE WITH IOWA PUBLIC RECORDS LAW, IOWA CODE CHAPTER 22.

As you can see, this adds up to nearly $6,000, and that doesn’t include the cost of acquiring the records I need for my project. I am currently looking into funding possibilities. In the meantime, I will be posting my progress, since this entire process is now a matter of public record and public concern. Feel free to post comments and suggestions.
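For readers who want to check the arithmetic in the estimate, here is a minimal sketch in Python. The sheet counts, per-sheet review rates, the $30 hourly rate, and the free first hour all come from the email above; the variable names, and the assumption that the free hour is simply subtracted once from the total, are mine.

```python
# Figures taken from the Transparency Office email quoted above.
# Each entry: (approximate number of sheets, seconds of review per sheet).
series = {
    "IV": (3240, 8),
    "V": (3240, 10),
    "VI": (63720, 10),
}

HOURLY_RATE = 30  # dollars per hour, as stated in the email
FREE_HOURS = 1    # assumption: the free first hour is subtracted once

# Convert sheets x seconds-per-sheet into hours for each series.
hours = {name: sheets * secs / 3600 for name, (sheets, secs) in series.items()}
total_hours = sum(hours.values())
initial_review_cost = (total_hours - FREE_HOURS) * HOURLY_RATE

for name, h in hours.items():
    print(f"Series {name}: {h:.1f} hours")
print(f"Total: {total_hours:.1f} hours, about ${initial_review_cost:,.0f}")
```

On these assumptions the per-series hours match the email (7.2, 9, and 177), and the initial review alone works out to roughly $5,800, with the second estimate for copying and redaction still to come.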

More to follow!

Gender Trends at the Iowa Writers’ Workshop

By Loren Glass and Nicholas M Kelly

Thanks to the ongoing efforts of the Program Era Project (PEP) team and the resources of both the University of Iowa Libraries and the University of Iowa’s Digital Scholarship & Publishing Studio (DSPS), work continues on our database. This database will offer an extensive overview of the professional itineraries, accomplishments, and connections of students and faculty involved with creative writing at the University from the founding of the Writers’ Workshop up to the present day, and this information will be accessible and explorable through a future PEP website.

As the PEP team continues to aggregate data, we gain the capability to observe macro-level demographics and trends in creative writing cohorts. We now have enough data to visualize the gender breakdown of all the students who attended the Iowa Writers’ Workshop from 1932 to the present.

While there are no remarkable revelations here, our data does confirm certain anecdotal assumptions about the demographic composition of MFA programs over time. This potential to support or contest anecdotal assumptions about Workshop demographics with macro-level data in itself is something of a victory. Moreover, as more demographic data is collected and, eventually, made public via the PEP web presence, this information will be accessible for scholars and students of creative writing.

Before we discuss our visualization, a quick note on the methods employed here. Whenever possible, the PEP team turned to biographical data available on authors to ensure correct gender identification. When this was not possible, gender was inferred from first name, and only when the name clearly indicated a specific gender. If these criteria were not met, writers were left in the unknown category. This means there is a certain margin of error, one which we feel does not affect our observations. As we continue to develop our database, we will work on refining how we define gender, exploring more inclusive, non-binary methods for tracking gender identities.
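The order of checks described above can be sketched in a few lines of Python. This is only an illustration, not the PEP team's actual code: the function name, the tiny name table, and the category labels are all hypothetical, and the real criteria for a "clearly" gendered name are not spelled out in this post.

```python
from typing import Optional

# Hypothetical table of first names treated as clearly gendered.
# The project's real criteria are not given in the post.
CLEARLY_GENDERED_NAMES = {"mary": "female", "john": "male"}


def classify_gender(first_name: str, biographical_gender: Optional[str] = None) -> str:
    """Return 'female', 'male', or 'unknown', following the order described above."""
    # 1. Prefer biographical data whenever it is available.
    if biographical_gender in ("female", "male"):
        return biographical_gender
    # 2. Otherwise infer from the first name, but only if it is unambiguous.
    inferred = CLEARLY_GENDERED_NAMES.get(first_name.strip().lower())
    if inferred is not None:
        return inferred
    # 3. Everything else stays in the unknown category.
    return "unknown"
```

The point of the fallback order is that a name-based guess never overrides biographical information, and an ambiguous name is never forced into a category.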

Below is the gender breakdown in its entirety, showing trends over time in the male, female, and unknown categories. On closer inspection, discrete trends reveal themselves.

First of all, it is worth noting that before WWII, women predominated as students in what was a small and fledgling program. We also know from high school graduation dates that most of these women were older than standard college age.  The program in its early years, in other words, was less about professional credentialing than continuing education.

Then, after WWII, with the inception of the GI Bill, the program is flooded with male students, usually older than standard college age and (based on anecdotal accounts of this period) frequently married, and this pattern persists into the seventies. The image also shows the overall growth of the Workshop program.

Looking closer to the present, we see that the first year in which female students predominate is 1983, and from then on there is a marginally larger percentage of female students, which reflects national trends as established by the Integrated Postsecondary Education Data System (IPEDS).

We hope this illustrates how the PEP will assist scholars and those interested in the history of creative writing to obtain insights and information about connections and trends in the field. As PEP’s database of institutional and biographical data continues to grow, accompanied by a host of computer-collected text analysis metrics from Workshop-affiliated writing, we hope to offer scholars new ways to make discoveries that will lead to new lines of scholarly inquiry on the rise and spread of creative writing in the United States and the world.

 

Geography and Creative Writing with Google Maps: Part Two, a Program Era Project Sample Visualization

Here is a link to the Google map I’ll be discussing for this second post on geographic information assembled by the Program Era Project. Again, feel free to click and explore. Layer toggles are activated on the left bar. In order to ensure privacy, all names have been removed from the assembled records. This map was made possible by the efforts of University of Iowa students Emma Husar and Abby Sevcik, who were instrumental in collecting and organizing the data presented here.

In our earlier post, “Geography and Creative Writing with Google Maps,” the Program Era Project provided a sample visualization of some of the geographical data the Project had collected in its ongoing effort to document and better understand the expansion of the Creative Writing programs at universities across the United States during the second half of the 20th century. The visualization, made using Google Maps, employed data assembled from resources found in the University of Iowa Special Collections Library and the University Archives to illustrate the migration of creative writers to and from Iowa City and the Iowa Writers’ Workshop. Charting the hometowns of prominent Workshop-affiliated writers and the locations of creative writing programs founded by, directed by, or employing Workshop-affiliated writers, the map helped demonstrate how a single institution such as the Workshop could have connections to writing programs across the nation. It also showed how a single program could draw writers from a wide variety of locations both inside and outside of the United States.

homecompare3

By using the Google Maps filters feature to separate historic time frames, it was possible, within the previous Google Maps visualization, to detect what regions of the United States were common origin points for prominent Workshop writers. The previous visualization suggested that the Northeast, Midwest, and South had been home to a number of Workshop writers. Meanwhile, the map simultaneously suggested a scarcity of writers from states such as Montana, the Dakotas, Idaho, and Wyoming (though, again, that may simply be a feature of that intentionally limited data set).

At the end of the previous post, I had written that the Program Era Project’s Nikki White had been working with a different collection of University of Iowa archival records to bring together a larger set of geographic data about the Workshop and its graduates. This post will highlight a new Google Maps visualization based on that data. This new visualization documents the hometowns of over 200 University of Iowa graduate students connected to the Iowa Writers’ Workshop, graduates who received advanced degrees at Iowa between 1938 and 1960 (this distinction will be clarified in a moment). In order to maintain the privacy of the graduates, we have stripped the records of all names. The aim of the data visualization is to explore broad demographic trends, not chart any single author’s professional itinerary.

Illustrating, once again, the wealth of historical records the Program Era Project has been able to access in the University of Iowa Libraries’ Special Collections and University Archives, the data for this visualization was assembled by examining graduation programs and pamphlets distributed at University of Iowa commencement ceremonies between 1938 and 1960. In these materials, students were allowed to list their hometowns, and so, when this information was collected, a data set could be born. For their efforts in undertaking the sizable task of assembling and organizing these hometown records, the Program Era Project owes significant thanks to University of Iowa students Emma Husar and Abby Sevcik, who were crucial in stewarding this information from the archive to the database.

While this post is on the topic of organizing and assembling data, it’s important to note that this current visualization charts the hometowns of “over 200 University of Iowa graduate students who attended the Iowa Writers’ Workshop,” not “Iowa Writers’ Workshop MFA graduates.” There are a number of reasons to use this broader language, and I will outline some of them in order to illustrate the challenges the Program Era Project faces in collecting and organizing archival records. First of all, this broader language helps account for the variety of degrees Workshop students earned at Iowa in the early days of the program, as well as the variety of ways those degrees were earned. Workshop-affiliated writers and scholars were receiving PhDs as well as MAs and, later, MFAs. Critical studies as well as creative works were turned in to satisfy completion requirements.

Additionally, as Nikki mentioned in her post on the Program Era Project blog, “Building the Program Era Project Database,” another challenge faced in refining datasets is the shifting administrative relationship between the Writers’ Workshop and the English Department at Iowa. Until the 1990s, English and Creative Writing remained, to varying degrees, institutionally intertwined, the two units only fully separating after the 1992 study mentioned in the previous post. Because of the closeness between these two units, deciphering the precise institutional position of an individual can require frequent cross-referencing between resources and archives.

One other particularly interesting and unusual feature of the data is the significant number of graduates listing Iowa City itself as their hometown. This is both something of an anomaly and a specific issue the Program Era Project is addressing as it builds its database. Throughout the period covered by the map there are a significant number of graduates listing Iowa City as home, a number that seems both out of place and comparable to much larger, more populous areas. Looking at the data in more detail, with names still connected to data points, shows how this anomaly likely occurred. For instance, in the dataset, poet W.D. Snodgrass listed his hometown as Iowa City. However, Snodgrass’s biographical information makes it clear this is not the case. Iowa City was not the poet’s hometown and he only lived in Iowa City for a relatively small portion of his life. While, on the one hand, this represents a statistical blip, it also points to a potential phenomenon, one of writers, at least on an official record, “adopting” the home of the Workshop as home for themselves. That said, because of this unique coincidence of bookkeeping, and because of some of the other challenges I’ve mentioned above, the visualization presented here should be seen as a look at broad contours in early demographic trends with Workshop-affiliated writers. It is still an evolving data set that will be further shaped by ongoing research.

usafull

Methods, provisions, and historical anomalies covered, let’s look at the maps. When compared to the previous visualization, what becomes quickly apparent is the much larger number of individuals included. Because of this higher density of information, this map lends itself much better to looking at aggregate changes as opposed to examining individual points. That isn’t to say, however, it can’t also be fun to speculate about who particular points might be, points like a sole 1947 graduate from Milledgeville, GA. When the map is zoomed closer on particular cities and regions, a clearer picture emerges of the numbers of graduates associated with a location. Moreover, like the previous Google Maps visualization, the most interesting perspectives emerge by toggling layers of the map on and off, as these layers group the data points by time increments. By toggling layers, the chronology of the Workshop’s growth, its expansion into an international institution, and its accumulation of graduates from specific cities and regions can be seen in greater detail.

globalcomposite

These images show overall growth in the number of Workshop graduates in the United States between 1938 and 1960, as well as the Workshop’s eventual turn towards drawing writers from outside the United States, graduates from England, the Philippines, and South Korea all appearing on the map.

midandcoastcomp

Here is a set of images demonstrating the incremental growth of graduate hometowns in the Midwest and on the East Coast of the United States, the regions which a large portion of the early graduates listed as their home.

nyccomposite

Lastly, we can see how toggling layers on and off can help illustrate the proliferation of graduates listing a single metropolitan area as their hometown, in this case, New York City and its surrounding areas. While only one writer in our dataset listed the New York City area as home in 1946, as the layers increase, the number of graduates also increases sharply, indicating a growing relationship between the two cities.

So, while our previous map allowed viewers to get a sense of the migration of specific individuals into the Iowa Writers’ Workshop and out to positions of institutional significance at newly forming creative writing programs across the country, this visualization offers the chance to get a better large-scale sense of what places writers were leaving to arrive at the Iowa Writers’ Workshop. This visualization also documents the Workshop’s expansion into an institution that drew international attention, and it shows how the number of writers coming to Iowa City from American metropolitan centers would grow throughout the second half of the twentieth century. In short, by linking the Program Era Project’s data up with Google Maps we have a chance to show off another example of how the Program Era Project is assembling the information needed to chart patterns and take a macro look at statistical and geographic trends in the history of Creative Writing.

 

Year One

updated snip[1]

Here’s a mockup of one visualization we’re considering to access MFA cohorts by year. As you can see, 1932 conveniently emerges as Year One of the program era insofar as the first two directors of the Workshop, Paul Engle and Wilbur Schramm, both received MAs, alongside Wallace Stegner, who would go on to launch the program at Stanford. Together, they cover the three main genres–poetry, fiction, and nonfiction–that would come to dominate creative writing programs. Norman Foerster, founder of the College of Letters and persistent advocate of the creative thesis, directed Schramm’s thesis. We still haven’t identified the directors for Engle and Stegner, but Foerster is a pretty likely candidate.

Building the Program Era Project Database

We were invited last week to talk briefly about the sources and process of building up the database for the Program Era Project at the Digital Bridges Summer Institute co-hosted by Grinnell and the U of I.  Here I’ll give a more in-depth version of that talk.  If you note problems in our process that we haven’t considered or have suggestions of additional sources, let us know!  Building up and refining this dataset is proving to be a fascinating challenge with a lot of twists and turns.

When we first started out last fall we came up with a list of the sort of information we wanted to gather.  The obvious starting point was a list of Workshop graduates, ideally with their full name and year of graduation.  If possible we wanted to gather demographic information such as gender, race, and where they were from.  We wanted to know their thesis, whether they were Poetry or Fiction, and who their advisor was; and lastly, we wanted to know what they did after graduation.  Did they go on to teach at, direct, or found other writing centers?  Did they publish? Win awards?

So far we have found a variety of data sources that provide answers to some of these questions, with variable coverage and consistency.

sources timeline
Items with a black outline are digital; the lighter colored portions of bars are partial records; the striped bars are patchy, arbitrarily filled records.

 

We initially built up a foundation for the database using digital graduation records of English Masters degrees from the Office of the Registrar and MFA thesis records from the Libraries catalog.  Together these gave us a list of years, graduation dates, and usually thesis titles from the beginning of the period we wanted to examine up to the present.  Records from the mid-1990s were reasonably complete in terms of providing actual program of study, thesis director, and thesis genre.  However, most of the dataset included students from other writing programs at Iowa, as well as students outside the writing programs – English criticism MAs and theatre MFAs, for example.  For the most part these were unmarked.

To filter out the records we don’t want and fill in the missing information, we’ve been checking this initial dataset against the commencement programs from graduation ceremonies, which are available in the University Archives.  In most cases the commencement programs are more specific about program of study, and usually also list the hometown of the student, which will make geographic analysis possible.
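As a rough illustration of this cross-checking step, here is a minimal Python sketch. It is not the project's actual workflow, which is done by hand in the archives: the record fields, the sample names, the program labels, and the function name are all hypothetical, and the sketch assumes a record can be matched on name and graduation year alone.

```python
# Hypothetical records standing in for the Registrar/catalog dataset.
registrar_records = [
    {"name": "Student A", "year": 1950, "thesis": "Untitled Stories"},
    {"name": "Student B", "year": 1950, "thesis": "A Critical Study"},
    {"name": "Student C", "year": 1951, "thesis": "Untitled Poems"},
]

# Hypothetical entries transcribed from commencement programs,
# keyed by (name, graduation year).
commencement_programs = {
    ("Student A", 1950): {"program": "Creative Writing", "hometown": "Des Moines, Iowa"},
    ("Student B", 1950): {"program": "English Criticism", "hometown": "Chicago, Illinois"},
}


def cross_check(records, programs, wanted_program="Creative Writing"):
    """Split records into those confirmed for the wanted program and those unresolved."""
    kept, unresolved = [], []
    for record in records:
        entry = programs.get((record["name"], record["year"]))
        if entry is None:
            # No commencement entry found: leave for manual checking.
            unresolved.append(record)
        elif entry["program"] == wanted_program:
            # Confirmed: keep the record and attach program and hometown.
            kept.append({**record, **entry})
        # Records confirmed for a different program are dropped.
    return kept, unresolved


kept, unresolved = cross_check(registrar_records, commencement_programs)
```

Keeping a separate list of unresolved records reflects the fact that a missing commencement entry is not evidence either way; those cases still need another source.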

The Graduate College was able to supply us with thesis advisor information from index cards they have on file going back to the 1970s.  To fill in the remaining gaps we will need to consult the hard copies of the MFA theses.  This project is currently on the backburner while the Libraries moves its offline storage materials from one facility to another, but we have plenty to do in the meantime.

Wilbers Survey Response
A survey response from Wilbers’ dissertation work.

Two other Special Collections finds have supplied us with information on accomplishments of Iowa writing grads after graduation.

In the 1970s, as part of his dissertation research, Stephen Wilbers surveyed writing programs to find Iowa grads who were directing, or had at some point directed, those programs, and the original hand-filled survey responses are held in the Libraries.

Special Collections also holds records of a self-study performed by the Writers’ Workshop in 1992, as it was in the process of separating from the English Department.  The appendices of the report include lists of graduate accomplishments like writing programs founded elsewhere, publications, and writing awards.

The biggest challenge of the data so far has been determining just who should be in the dataset.  The vague and changing administrative status of the various writing programs over time is reflected in the records, and so determining what program (or sometimes programs) an individual graduated from in some cases requires cross-checking multiple sources.

None of this work would be possible without our student workers, who have put, and continue to put, patient hours into comparing lists of records in the archives.

Geography and Creative Writing with Google Maps: a Program Era Project Sample Visualization (Reposted from New Readia)

– The following first appeared on the NewReadia blog 05/24/16. It has been reposted here to document some of the Program Era Project’s ongoing experimentation with visualizing the data it has collected. – NMK

Here is a link to the Google map I’ll be discussing throughout this post, “Workshop-Affiliated Directors and Founders of Creative Writing Programs (1976, 1992).”

Since the last post on New Readia, the team behind Mapping the Program Era, now renamed the Program Era Project, has continued its work on collecting historical and institutional records to chart the evolution of both the Iowa Writers’ Workshop and the literary phenomenon of creative writing programs during the second half of the 20th century.

As I mentioned in my last post, Mark McGurl opens The Program Era: Postwar Fiction and the Rise of Creative Writing by remarking, “the rise of the creative writing program stands as the most important event in postwar American literary history,” and he emphasizes the need to document the growth of the creative writing enterprise (ix). Earlier this month, the Program Era Project team, along with new team member John J. Witte of Iowa’s Department of Communication Studies, had the chance to go to Stanford University and meet with Professor McGurl, Professors Mark Algee-Hewitt and Franco Moretti, and other members of Stanford University’s Literary Lab. There, we were able to share some of the work we’ve done on the Program Era Project and to bounce ideas off our gracious hosts regarding how the Program Era Project might pursue the objective McGurl lays out in his book.

Over the course of the summer, I’ll be sharing online some of the work we presented and the experiments we’ve conducted with visualizing the data we’ve collected. As I’ve mentioned before, the Program Era Project is interested, whenever possible, in using our data to create sample visualizations and proof-of-concept work. We do this both because it helps the team get a sense of what types of research questions our data can help answer and, more importantly, because it helps us see the potential the Program Era Project has to offer new perspectives on the literary phenomenon of creative writing.

For this post, I’ll be showing (more accurately, “providing access to”) a sample visualization put together using Google Maps, which presents geographic information about the migration of Workshop-affiliated writers to and from Iowa. It also allows users to see a collection of creative writing programs founded by Workshop writers and where Workshop-affiliated writers were serving as directors of other creative writing programs at two specific points in time: 1992 and approximately 1976.

The key information for this new visualization came, as is often the case with the Program Era Project, from resources available through the University of Iowa Special Collections Library and the University Archives. In this case, the document in question was a department self-study produced, in 1992, by the Writers’ Workshop for the College of Liberal Arts and Sciences. In the 1992 study, the Workshop reported to the University on its current size and overall growth. The self-study offered other information, including extensive lists of awards won by Workshop students and faculty, as well as one appendix, titled “Directors of Writing Programs with University of Iowa MFA’s,” which provided a list of Workshop-affiliated writers and the programs for which they were then serving as director. Interestingly, the study was produced as the Workshop was undertaking efforts to separate from The University of Iowa’s English department, becoming its own institutional entity.

Beyond its status as a fascinating historical document, the 1992 study opened the opportunity for some new geographical visualization experiments and proof-of-concept work, particularly given that the Program Era Project had already collected an earlier list of Workshop-affiliated writers who served as directors (or founders) of other creative writing programs. That list was drawn from information assembled by Stephen Wilbers for his 1976 English dissertation at Iowa, a work which went on to be The Iowa Writers’ Workshop. Because we had these two collections of data, assembled at two specific points in time, we could see how the number of creative writing programs with Workshop-affiliated directors or founders had grown or changed over the span of 16 years. We could also get a greater sense of the ongoing movement of Workshop-affiliated writers into and between creative writing programs across the country.

mpeusmap

For my last post, some network maps made in Gephi provided the basis of a rough mockup visualization of the spread of the Workshop-affiliated writers across the US. While it gives a sense of how and where Workshop-trained writers had moved on to teach by the time of Wilbers’ 1976 survey, the image could benefit from better legibility and it doesn’t account for changes over time. So, for this experiment in data visualization, we turned to Google Maps.

FullMapWithPoint

The new Google map experiment offers a sample of some of the geographical information about the history of creative writing we are working to document in the Program Era Project. By taking advantage of different layers of information we can have on one Google Map, we can both account for (slight) variations in time and allow for different types of information to be toggled on and off. The static image above illustrates a number of things tracked by the map. First, the light blue points are schools that listed Workshop-affiliated writers as directors of their creative writing programs in Wilbers’ survey. The dark blue points show schools listed as having Iowa MFAs as directors in the 1992 self-study. Green markers are the locations of creative writing programs that reported, for the Wilbers survey, that they were founded by Workshop writers. Clicking on a blue or green point gives the name of the school and the Workshop writer(s) listed as director or founder.

timecompare1

timecompare2

Toggling layers on and off, such as the 1976 and 1992 directors’ lists, allows for some changes over time to be seen. By switching between 1976 only and both 1992 and 1976, users can see, for instance, the new schools where Workshop-affiliated writers became directors. The map also shows whether Workshop writers stayed at a particular school. Oakley Hall, for instance, was listed as the director of the creative writing program at the University of California, Irvine both in Wilbers’ survey and in 1992. Moreover, with both layers on, the overall growth of schools where Workshop writers have been employed in positions of institutional significance is also illustrated.
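The comparison the two layers make possible can also be expressed as simple set operations. The following Python sketch is only an illustration: apart from Oakley Hall at the University of California, Irvine, which is mentioned above, every school and writer name is a hypothetical placeholder, and the map itself was built in Google Maps rather than with code like this.

```python
# School -> listed director, one dictionary per snapshot.
# Only the UC Irvine entry comes from the post; the rest are placeholders.
directors_1976 = {
    "University of California, Irvine": "Oakley Hall",
    "Hypothetical College A": "Writer A",
}
directors_1992 = {
    "University of California, Irvine": "Oakley Hall",
    "Hypothetical University B": "Writer B",
}

# Schools that appear only in the 1992 list (the "new" points on the map).
new_in_1992 = directors_1992.keys() - directors_1976.keys()

# Schools listed in both snapshots with the same director.
same_director = {
    school
    for school in directors_1976.keys() & directors_1992.keys()
    if directors_1976[school] == directors_1992[school]
}

# All schools across both layers, i.e. what the map shows with both toggled on.
all_schools = directors_1976.keys() | directors_1992.keys()

print(sorted(new_in_1992))
print(sorted(same_director))
print(len(all_schools))
```

The difference, intersection, and union here correspond to viewing the 1992 layer against the 1976 layer, spotting directors who stayed put, and viewing both layers together.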

The Program Era Project is also interested in where Workshop writers came from, not just where they went after the Workshop. A history of the Workshop (or creative writing) would be incomplete without considering what regional backgrounds have converged in creative writing communities. So, in cases where information about Workshop author hometowns was available via author biographies, that hometown information was added to this map. The information, like the school information, is separated by the time frames of 1992 and 1976.

homecompare1

homecompare2

homecompare3

Here, the hometowns of 1992 directors are in dark red and 1976 directors/founders are in light red. Admittedly, the hometown data is less complete than the school information. In a later blog post, I will show some of the other approaches team member Nikki White has taken towards mapping hometown geographic data. However, for now, the hometown information on the map still gives a small sense of some of the places people were traveling from to arrive at Iowa City. The East Coast, South, and Midwest all have a number of Workshop writers. Both in terms of schools and hometowns, the map also bears a noticeable gap in states like Montana, the Dakotas, Idaho, and Wyoming, though this may just be an unusual feature of this data.

This map, as is the case with the previous data visualizations, covers very specific historical snapshots and uses an intentionally limited collection of information. It is, fundamentally, a proof-of-concept. However, I hope the map helps illustrate some of the information the Program Era Project hopes to make available in its efforts to document the history of creative writing. I encourage you to play around with the map and we hope it gives you a sense of our aim to provide interactive digital research tools for the scholar and the curious alike, as well as the potential the Program Era Project has to offer new perspectives on the literary history of the 20th century.

– Nicholas M Kelly