Getting Started with HTRC Corpora: Tips, Tool, and Walkthrough

Before we begin: here is the link to my GitHub page containing the extractor script I’ll be discussing below. This script automates the process of extracting and organizing an HTRC corpus, turning the compressed archive of page files into a collection of assembled full text volumes.

This post follows on the heels of the recent announcement that HathiTrust Research Center has expanded access to its collection of digital texts, making its complete 16-million-item corpus available for computational text analysis work. Given the challenges scholars can face with gaining access to digital texts—particularly texts under copyright—this is exciting news for computational literary scholarship. The Program Era Project’s own efforts to conduct computational text analysis on contemporary literature were only made possible through an earlier collaboration with HTRC, as I have discussed before.

Given that the scope of who can gain access to HTRC texts has greatly expanded, I wanted to offer an overview of the process I went through to prepare the Program Era Project’s HTRC corpus for text analysis. Hopefully, my walkthrough of how I extracted and organized our corpus will offer some ideas or assistance to future HTRC users.

Moreover, I’ll also tell you about a tool I built that automates this extraction process, one which is now freely available on my GitHub. This tool can also be used in conjunction with other Python-based text analysis scripts to automate the process of collecting text data from multiple volumes at once. In a moment, I’ll go into greater detail on how to use the tool and how to pair it with your own text analysis scripts. First, however, let’s see what the process of extracting and organizing a HathiTrust corpus looks like.


Figure 1: The Original Archive and its Contents

In this image, you see the initial archive we received from HTRC, a list of volumes and a compressed file which contains a series of folders. Each folder contains additional files, which will be accessible and useful to us as soon as we extract the initial compressed archive.


Figure 2: The Original File, Extracted


Figure 3: An Individual Folder in the Extracted Archive

The first image above (Figure 2) shows the contents of the extracted file, with each volume contained in a separate folder. The second image (Figure 3) shows the contents of a single volume folder. In each volume folder, we find a ZIP file containing the full text of our volume and a JSON file which contains metadata on the volume in question. There are important things to note about both the ZIP and JSON files, things which will impact how we extract and organize full texts.

First, it’s important to recognize that within each ZIP file the full text for a volume is not collected together as a single document. Instead, each page of a full text is included as a separate .txt file. We need to reassemble these pages to turn our volumes’ pages into a complete work.

Second, the JSON file provides a great deal of information on the text in question, information we can use for research purposes and to help organize our texts. Included in the JSON files are things like author names, book titles, and unique HTRC ID numbers for each book. These will be useful for automatically naming and placing files with our extractor tool. The JSON files also often (though not always) contain information like ISBN numbers, publisher names, and publication dates. The PEP team was particularly excited to discover this metadata, as it was information we could incorporate into our database of text metrics. This new information also enabled new ways to look at text data. Publication dates, for instance, make it possible to view trends in text metrics we’ve measured over time.
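As a rough illustration of how this metadata might be read, here is a minimal Python sketch. The key names (author, title, htrc_id, isbn, publisher, pub_date) are assumptions made for the example, not the actual HTRC field names, so check one of your own JSON files and adjust them before relying on it.

```python
import json

# NOTE: these key names are assumptions for illustration only; inspect one
# of your own JSON files and adjust them to the real HTRC metadata layout.
REQUIRED_KEYS = ("author", "title", "htrc_id")
OPTIONAL_KEYS = ("isbn", "publisher", "pub_date")


def read_volume_metadata(json_path):
    """Return a flat dict of the fields used for naming and analysis."""
    with open(json_path, encoding="utf-8") as handle:
        raw = json.load(handle)

    record = {}
    for key in REQUIRED_KEYS:
        # Fail loudly: without these we cannot name or place the volume.
        if key not in raw:
            raise KeyError(f"{json_path}: missing required field {key!r}")
        record[key] = raw[key]
    for key in OPTIONAL_KEYS:
        # Optional fields are often absent, so default to None.
        record[key] = raw.get(key)
    return record
```

Treating the optional fields as None rather than raising an error reflects the point above that ISBNs, publishers, and dates are often, but not always, present.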


Figure 4: PEP’s Extractor Tool Running on an HTRC Corpus

The above image shows my Program Era Project extractor tool working on our HTRC corpus. It uses the metadata in the JSON to automatically organize volumes by author and title. It also extracts, sorts, and assembles the individual pages in each ZIP into a complete work. While the tech-minded are more than welcome to look directly at the tool’s code, in basic terms, the steps to turn an individual volume’s compressed files into a readable work look like this:

  • The tool pulls author name, work name, and HathiTrust ID data from the associated JSON.
  • It uses the author name to create a folder for that author’s name (if one doesn’t already exist).
  • It creates a sub folder within the author folder named after the volume’s title.
  • It opens the ZIP file and extracts all page files to the book folder.
  • It sorts the page files in numerical order and creates a single text file out of them, named “full.txt,” in the book folder.
  • It places a copy of the metadata file in the folder, renamed with the HTRC ID number (e.g., a volume with the HTRC ID 123456 would be 123456.JSON).
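The steps above can be sketched in Python roughly as follows. This is an illustration written for this walkthrough, not the extractor tool's actual code: the metadata keys (author, title, htrc_id), the helper names, and the assumption that each page file name ends in its page number are all hypothetical.

```python
import json
import re
import shutil
import zipfile
from pathlib import Path


def page_number(name):
    """Sort key: the last run of digits in a page file name (0 if none)."""
    digits = re.findall(r"\d+", Path(name).stem)
    return int(digits[-1]) if digits else 0


def safe_name(text):
    """Replace characters that are awkward in folder names."""
    return re.sub(r'[\\/:*?"<>|]+', "_", text).strip() or "unknown"


def assemble_volume(volume_dir, output_root):
    """Extract one volume's pages and join them into full.txt."""
    volume_dir = Path(volume_dir)
    json_path = next(volume_dir.glob("*.json"))
    zip_path = next(volume_dir.glob("*.zip"))

    with open(json_path, encoding="utf-8") as handle:
        meta = json.load(handle)
    # Hypothetical keys; adjust to the real metadata layout.
    author, title, htrc_id = meta["author"], meta["title"], meta["htrc_id"]

    # Author folder, then a title subfolder (created only if missing).
    book_dir = Path(output_root) / safe_name(author) / safe_name(title)
    book_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(book_dir)

    # Numeric sort, so that page 10 follows page 9 rather than page 1.
    pages = sorted(
        (p for p in book_dir.rglob("*.txt") if p.name != "full.txt"),
        key=lambda p: page_number(p.name),
    )
    full_path = book_dir / "full.txt"
    with open(full_path, "w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")

    # Keep the metadata next to the text, named after the HTRC ID.
    shutil.copy(json_path, book_dir / f"{safe_name(str(htrc_id))}.json")
    return full_path
```

The numeric sort key is the part most worth keeping if you write your own version: a plain alphabetical sort would place a page named 10.txt before 2.txt whenever the file names are not zero-padded.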

This gives us a single folder where all the volumes, their page files, and their associated metadata can be found. It is a catch-all repository for our HTRC corpus, which is now extracted and assembled as full texts. That said, this archive can be somewhat large and unwieldy if all we’re interested in is the text files themselves, particularly in cases where the metadata might not be consistent. Discrepancies in author names, for instance, can create multiple folders for a single author, an issue that can be seen in the image below and can lead to data cleaning efforts.

Luckily, our tool gathers full texts together in another way. After the pages for a volume are organized and reassembled, a copy of the full text is placed in a folder set to be a complete repository of the full texts. In this folder (I call it “collected” above), each volume is named after its HTRC ID number. For example, a volume with the HTRC ID 7891011 would be 7891011.txt. This gives us a place where we have quick access to all our text files, and each is uniquely, consistently identified.
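A minimal sketch of this collecting step might look like the following. The function name and the choice to stop on a name clash are assumptions for the example rather than a description of how the tool itself behaves.

```python
import shutil
from pathlib import Path


def collect_full_text(full_text_path, htrc_id, collected_dir):
    """Copy an assembled full text into the flat folder as <HTRC ID>.txt."""
    collected_dir = Path(collected_dir)
    collected_dir.mkdir(parents=True, exist_ok=True)
    target = collected_dir / f"{htrc_id}.txt"
    if target.exists():
        # IDs should be unique, so a clash suggests a duplicate volume.
        raise FileExistsError(f"{target} already exists")
    shutil.copy(full_text_path, target)
    return target
```

Because the file name depends only on the HTRC ID, inconsistent author or title metadata cannot scatter one author's texts across several locations here.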

At this point, the extractor tool has done its work. However, we took a few additional steps to organize our collection and make it easier to navigate. Looking through the list of works provided by HTRC, the PEP team found a number of false positives, works not written by workshop authors that had been inadvertently included. We wanted to remove these works from our list. We also wanted to find a way to relabel files in our collected folder to make it easier to find a specific volume or a specific author’s works.

We turned to the HTRC ID numbers to help us remove false positives and rename the files for easy navigation. The PEP team created a spreadsheet of all the volumes in our collection. On this spreadsheet, author name entries were made consistent and all false positives were removed. Each volume entry also listed the title’s HTRC ID. Because we had the unique HTRC ID, I was able to build a script that searched for that unique identifier in each file name and automatically replaced that HTRC ID name with a new naming scheme (authorname+year). Below you can see our collected archive before and immediately after I have run the script.

Using these unique HTRC IDs gave us a way to automatically enter clean, consistent data for our file names. As an added benefit, it made finding and removing all false positives simple. Because the false positives were not included in our spreadsheet, their names were not corrected. Therefore, any file that retained its HTRC ID was not on our master list and should be removed from our archive.
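For readers who want to try a similar renaming pass, here is a hedged sketch. It assumes the spreadsheet has been exported as a CSV with columns named htrc_id, author, and year (hypothetical names), and it matches each file's base name against the ID exactly, which may differ from how the original script searched file names.

```python
import csv
from pathlib import Path


def load_master_list(csv_path):
    """Map HTRC ID -> new base name (author + year) from the spreadsheet."""
    mapping = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Hypothetical column names: htrc_id, author, year.
        for row in csv.DictReader(handle):
            new_base = f"{row['author'].strip()}{row['year'].strip()}"
            mapping[row["htrc_id"].strip()] = new_base
    return mapping


def rename_collected(collected_dir, mapping):
    """Rename <HTRC ID>.txt files; return the files not on the master list."""
    leftovers = []
    for path in sorted(Path(collected_dir).glob("*.txt")):
        new_base = mapping.get(path.stem)
        if new_base is None:
            # Not on the cleaned spreadsheet: a likely false positive.
            leftovers.append(path)
            continue
        target = path.with_name(f"{new_base}.txt")
        counter = 2
        while target.exists():
            # Same author and year twice: add a suffix instead of overwriting.
            target = path.with_name(f"{new_base}_{counter}.txt")
            counter += 1
        path.rename(target)
    return leftovers
```

The returned list is the set of files that kept their HTRC ID names, so it doubles as a list of candidates for removal. Note that this sketch is meant to be run once: on a second run, the already-renamed files would also show up as leftovers.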

At this point, the PEP corpus had gone from a single compressed file to a collection of over 1000 full text volumes, assembled from individual pages, organized, and ready for future work. In the particular case of the Program Era Project, I incorporated our text mining scripts into the extraction process, producing datasets for each volume we could gather together into a larger dataset (after removing false positives).

Hopefully this account gives readers curious about working with an HTRC corpus a clearer picture of what to do to get it organized and readable. However, as I said at the beginning of this post, I’m also happy to note that my GitHub page contains a freely available version of the extractor script I built to automate the extraction process.

The tool is designed to be as intuitive as possible. No programming knowledge should be required, and the interface is almost entirely point and click. I’ll go through the steps to run the script here, but you can also find the readme for the extractor script on my GitHub.

To begin using the tool, your first step is to extract the initial compressed archive provided by HTRC. This should give you a collection of folders, each containing a ZIP file and a JSON file. If you find a folder without a ZIP or JSON file, remove it, as it will crash the tool. If, during extraction, you find a folder crashing the tool, check it for a missing ZIP or JSON.
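If your archive is large, checking every folder by hand can be tedious. The short sketch below is not part of the extractor tool; it is one possible way to list the folders that lack a ZIP or a JSON file before you run it. The function name is made up for the example.

```python
from pathlib import Path


def find_incomplete_folders(archive_root):
    """Return (folder, missing extensions) for folders lacking a ZIP or JSON."""
    incomplete = []
    for folder in sorted(Path(archive_root).iterdir()):
        if not folder.is_dir():
            continue
        # Compare lowercased extensions so that .JSON and .json both count.
        suffixes = {p.suffix.lower() for p in folder.iterdir() if p.is_file()}
        missing = [ext for ext in (".zip", ".json") if ext not in suffixes]
        if missing:
            incomplete.append((folder, missing))
    return incomplete
```

Printing the result of find_incomplete_folders on your extracted archive gives you a list of folders to inspect or remove before starting the extraction.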


Figure 5: Opening the terminal

Once the initial archive is extracted, run the script. A simple way to do it can be seen above. First, right click somewhere in the folder where the script is located and select “open in terminal.” You can also navigate to the folder manually if you prefer. After you have done that, type in the following command:


At this point, prompts will appear on the screen. They will ask you to select three folders:

  • The first folder you select is the one which contains the subfolders with ZIP and JSON files.
  • The second folder is where the tool will place all the page files, metadata files, and full text files, organized by author folders and work name subfolders.
  • The third folder is where the tool will place all the full text files (labeled by HTRC ID).
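For readers building a similar point-and-click workflow of their own, three folder prompts like these can be produced with Python's standard tkinter library. Whether the extractor tool itself uses tkinter is an assumption here, and the prompt wording and dictionary keys below are invented for the example.

```python
def choose_folders(ask_directory=None):
    """Ask for the three folders in order and return them as a dict."""
    if ask_directory is None:
        # Imported here so the function can be exercised without a display.
        import tkinter
        from tkinter import filedialog

        root = tkinter.Tk()
        root.withdraw()  # hide the empty main window behind the dialogs

        def ask_directory(title):
            return filedialog.askdirectory(title=title)

    prompts = [
        ("source", "Select the folder of extracted HTRC volume folders"),
        ("organized", "Select the folder for author and title output"),
        ("collected", "Select the folder for collected full texts"),
    ]
    chosen = {}
    for key, title in prompts:
        path = ask_directory(title)
        if not path:
            # askdirectory returns an empty string when the user cancels.
            raise SystemExit(f"No folder selected for: {title}")
        chosen[key] = path
    return chosen
```

Calling choose_folders() with no arguments opens the three dialogs in turn. Passing your own function instead makes it possible to script or test the same logic without a graphical display.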

The tool should then run, letting you know as it works through the files in your archive. It will also notify you when it has completed. It lists my e-mail address should you encounter any bugs or have any questions, comments, etc…

Note: If you are working with the tool for the first time in an HTRC data capsule, you will want to run it first in maintenance mode, so that it can import any modules it needs to run. Once the initial folder select prompt appears, you can simply close the tool. You can then switch to secure mode and run the tool as I explain above.

For users who’d like to integrate their own text mining scripts into the tool, please examine the code. I’ve marked a place that allows you to insert your own scripts easily, running them on each volume you extract. Inserting text mining scripts there also allows you access to the HTRC metadata for the volume. I’ve also included some sample code to get the assembled file up and running for text-mining, though you may likely have alternative ways you’d like to work with or open the data.
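To give a sense of what a plugged-in analysis could look like, here is a generic sketch of the pattern. It is not the code from the tool: the function names, the metadata key htrc_id, and the idea of passing a list of analysis functions are all assumptions chosen for illustration.

```python
def count_words(full_text, metadata):
    """Example analysis: a word count labeled with the volume's ID."""
    return {
        "htrc_id": metadata.get("htrc_id"),  # hypothetical metadata key
        "word_count": len(full_text.split()),
    }


def run_analyses(full_text_path, metadata, analyses):
    """Run every analysis function on one assembled volume."""
    with open(full_text_path, encoding="utf-8", errors="replace") as handle:
        full_text = handle.read()
    results = {}
    for analysis in analyses:
        # Each analysis receives the text and the metadata, returns a dict.
        results.update(analysis(full_text, metadata))
    return results
```

Each analysis is an ordinary function that takes the assembled text and the volume's metadata and returns a dictionary, so adding a new metric means writing one more function and adding it to the list.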

To those of you interested in getting started with HTRC corpora, I hope this walkthrough and the tool provide some useful starting points. The more we increase the accessibility of computational text analysis tools and corpora (particularly corpora still in copyright) the more we all stand to benefit from greater diversity of computational analysis research projects as well as more experimentation (and, ideally, innovation) within the field of computational text analysis.

Happy mining.

Collaborating with HathiTrust


The screenshot above documents an exciting moment for our ongoing collaboration with the HathiTrust Research Center. It is a screenshot of my computer, remotely connected to a HathiTrust machine, running my Program Era Project text-mining tools on sample HathiTrust text data. In short, it’s a proof-of-concept, a confirmation that the PEP tools are ready to begin collecting data on thousands of texts produced by creative writers affiliated with the University of Iowa, and that the PEP team can begin to join that data with the wealth of institutional, biographical, and demographic data they have already collected.

This screenshot is a result of HathiTrust’s selection of the Program Era Project as a 2017 Advanced Collaborative Support award winner. HTRC is a collaboration between partner universities that houses an expansive digital library. As HathiTrust’s site explains, the ACS program is:

a scholarly service at the HathiTrust Research Center (HTRC) offering collaboration between external scholars and HTRC staff to solve challenging problems related to computational analysis. By working together with scholars, we facilitate computational-oriented analytical access to HathiTrust based on individual scholarly or educational need.

For the HathiTrust/PEP collaboration, the approach chosen was to establish a “Data Capsule,” a machine maintained and secured by HathiTrust, that PEP team members can remotely access and then run text mining experiments on a corpus of texts held in HathiTrust’s collections. The Data Capsule approach is crucial, as the texts to which we require access remain in copyright; they simply aren’t accessible in digital form for large-scale data collection. The Data Capsule configuration allows full texts of HathiTrust works to be measured by text mining software, but only the metrics collected by the tools can be moved off the Data Capsule machine. In PEP’s case, this means .csv spreadsheets of data on individual texts.

Now, thanks to the HathiTrust/PEP collaboration, the tools I created for the Program Era Project (described a bit more here) can be employed on a large volume of digital texts. They can be used not just for experiments, but to begin building a database of metrics on features of creative writing at the University of Iowa. For the Data Capsule collection, the Program Era Project team assembled a list of roughly 400 selected authors associated with the Writers’ Workshop and the Nonfiction Writing Program. Since receiving this list, the HathiTrust team has endeavored to find all the works held by HTRC associated with these authors. At present, over 2000 volumes have been connected to the PEP authors list. These items are then made accessible on the Data Capsule and, using the PEP tools, converted into metrics which are stored in the PEP database.

So, what data is the Program Era Project collecting? Currently, I’ve built two text mining tools for the Program Era Project. Both are written in Python and draw on the Natural Language Toolkit (NLTK). We call them Style Card and LitMap.

Style Card is a text analysis tool that measures features of literary style such as vocabulary size, sentence length, adverb and adjective usage, and frequency of male and female pronouns. The last metric is particularly interesting, as it can provide a quick impression of gender representation trends in a work or collection of works. Additionally, by collecting the same metrics from multiple authors or multiple works from one author, stylistic comparisons can be made between aspects of later or earlier works of a single author or the complete corpora of two authors. It is, in short, like creating baseball cards for authors and literary works, snapshots of information that can be used to establish or test hypotheses.
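To make these metrics concrete, here is a deliberately simplified sketch that computes a few of them with nothing but Python's standard library. It is not Style Card's code: the real tool draws on NLTK, while this version uses rough regular-expression tokenizing, and the pronoun lists are assumptions. Adverb and adjective counts are left out because they need part-of-speech tagging.

```python
import re

# Assumed pronoun lists for illustration; a real tool may count others.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def style_metrics(text):
    """Return a few simple style metrics for one text."""
    # Rough tokenizing: runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z'’]+", text.lower())
    # Rough sentence splitting on ., ! and ? (abbreviations will mislead it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "vocabulary_size": len(set(words)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "male_pronouns": sum(w in MALE for w in words),
        "female_pronouns": sum(w in FEMALE for w in words),
    }
```

Because every value is a plain count or average, the output for one volume fits on a single spreadsheet row, which suits the kind of side-by-side comparison described above.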

LitMap is a software package that tracks location references in literary corpora, making it possible to analyze regional representation in literary works. This allows us to see the influence of an author’s biography on their literary output as well as measure the influence of authors migrating to and from creative writing programs on the settings of their writing. Using LitMap, we’ve already made some interesting discoveries about the frequency with which works written by authors who taught at and/or attended the University of Iowa mention the state of Iowa and the region of the Midwest. The team is looking forward to sharing more with you on that topic in the future.
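The basic idea of counting location references can be illustrated with a simple lookup against a list of place names. This is a stand-in technique chosen for brevity, not a description of how LitMap actually identifies locations, and the place list you pass in is entirely up to you.

```python
import re
from collections import Counter


def count_place_mentions(text, places):
    """Count whole-word, case-sensitive mentions of each place name."""
    counts = Counter()
    for place in places:
        # \b keeps "Iowa" from matching inside a longer word.
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] = len(re.findall(pattern, text))
    return counts
```

One limitation to keep in mind: a mention of "Iowa City" is also counted as a mention of "Iowa" if both names are in the list, and ambiguous names (a character called Georgia, for instance) will be counted as places.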

What’s truly significant (and truly promising) about the data these tools collect is that it will be stored in the PEP team’s database. When the ACS data is incorporated into the PEP database and available to future users of the PEP web presence, users will be able to rank and compare features themselves at a glance.

The images below represent a proposal for the eventual look and feel of the Program Era Project web presence. The numbers used are drawn from data already collected with the PEP text mining tools. As the first figure shows, a user could rank writers by average sentence length, learning, at a glance, which authors typically create sprawling (or terse) sentences. The second image ranks authors based on the ratio of male pronouns to female pronouns. The larger the number, the more often male pronouns appear compared to female pronouns.
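The two rankings described here come down to simple arithmetic on stored metrics. The sketch below shows one way to compute a pronoun ratio and sort records by any metric. The record layout and field names are hypothetical and are not the PEP database schema.

```python
def pronoun_ratio(male, female):
    """Male-to-female pronoun ratio; None when there are no female pronouns."""
    return male / female if female else None


def rank_by(records, key, descending=True):
    """Sort author records by one metric, skipping records that lack it."""
    usable = [r for r in records if r.get(key) is not None]
    return sorted(usable, key=lambda r: r[key], reverse=descending)
```

For example, given records such as {"author": "Author A", "avg_sentence_length": 21.5}, calling rank_by(records, "avg_sentence_length") lists the authors with the longest average sentences first. A ratio above 1 means male pronouns appear more often than female pronouns.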



Users could also compare two authors—or an individual author to a control corpus—and look at differences such as first-person and third-person pronoun use (a potential indicator of narrational patterns and preferences) or adverb and adjective ratios (which can index spare or detailed prose). Scholars could see at a glance how an author’s stylistic features might compare to their advisor, how they compare to other writers in the corpus, or to a baseline corpus of writing in English.


In the following image, produced using plotly, the platform we currently use to visualize LitMap data, we see another way the PEP text tools will provide new insights into literary corpora. The image shows the strong representation of Iowa in a literary corpus comprising 75 novels by Workshop-affiliated writers, documenting how their time at Iowa has left a mark on where they write about.


The idea behind offering these metrics to future users of the Program Era Project website is that access to this information will prompt curiosity and exploration. Moreover, when users find an interesting pattern or phenomenon in the data, we hope it will prompt a direct investigation of the works included in the data. In short, beyond just presenting this information, we believe that the ability to skim over these metrics will inspire scholars to dive deeper into the texts the data is drawn from.

These objectives of encouraging emergent research and driving curiosity are at the heart of Style Card and LitMap’s other principal innovation: the use of clear, easy-to-understand metrics. The fields of stylometry and text analysis have developed techniques that allow for astounding technical and scholarly achievements, author attribution being a notable example. However, understanding how a piece of software or a quantitative approach arrived at the conclusions it did can be difficult for users not familiar with the theoretical foundations or technologies employed. To this end, metrics tracked by Style Card were selected so that users are offered information that is easy to understand and transparent. By using simple numbers, Style Card metrics allow any scholar, whatever their experience and training with quantitative analysis, to benefit from the Program Era Project website, broadening the number of academic projects that might be inspired by quantitative analysis.

Even better, both the Style Card and LitMap tools were developed in such a way that anyone can use them. You simply click the program file, type in the name of the author and work you are scanning, and select the name you want for the output file you will create. The tool does everything else. What this means is twofold. First, it allows more collaboration in building our database of text metrics. If a team member can access the Data Capsule, they can easily run the software to collect metrics. Second, these tools will eventually be made freely available online. Therefore, any other project team that wishes to collect the same metrics will have the option available to them. Because the tools will be open source, users will also have the option to modify, adjust, and tweak the technology to their own needs. Moreover, any school that might be interested in learning more about its own history of Creative Writing, or any school that might wish to establish a partner project to the one here at the University of Iowa, will have the necessary technology.

All that said, I hope you can see why we are so excited about the screenshot of our text tools running in conjunction with the HathiTrust Data Capsule. The image represents another significant step along the way to our goal of providing students and scholars of literature the ability to explore the history of the Iowa Writers’ Workshop—and the history of Creative Writing at Iowa—in a way that was never before possible.

Gender Trends at the Iowa Writers’ Workshop

By Loren Glass and Nicholas M Kelly

Thanks to the ongoing efforts of the Program Era Project (PEP) team and the resources of both the University of Iowa Libraries and the University of Iowa’s Digital Scholarship & Publishing Studio (DSPS), work continues on our database. This database will offer an extensive overview of the professional itineraries, accomplishments, and connections of students and faculty involved with creative writing at the University from the founding of the Writers’ Workshop up to the present day. This information will be accessible and explorable through a future PEP website.

As the PEP team continues to aggregate data, we gain the capability to observe macro-level demographics and trends in creative writing cohorts. We now have enough data to visualize the gender breakdown of all the students who ever attended the Iowa Writers’ Workshop from 1932 to the present.

While there are no remarkable revelations here, our data does confirm certain anecdotal assumptions about the demographic composition of MFA programs over time. This potential to support or contest anecdotal assumptions about Workshop demographics with macro-level data in itself is something of a victory. Moreover, as more demographic data is collected and, eventually, made public via the PEP web presence, this information will be accessible for scholars and students of creative writing.

Before we discuss our visualization, a quick note on the methods employed here. Whenever possible, the PEP team turned to biographical data available on authors to ensure correct gender identification. When this was not possible, gender identity was inferred based on first name, doing so only when the name clearly indicated a specific gender. If these criteria were not met, writers were left in the unknown category. This means there is a certain margin of error, one which we feel does not affect our observations. As we continue to develop our database, we will work on refining how we define gender, exploring more inclusive, non-binary methods for tracking gender identities.
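The order of checks described above can be written out as a small decision function. This is only a sketch of the stated procedure, not the PEP team's actual workflow: the function name is invented, and the name table is a hypothetical lookup that you would need to supply, containing only first names judged to be unambiguous.

```python
def infer_gender(first_name, biographical=None, name_table=None):
    """Return 'female', 'male', or 'unknown', following the stated order."""
    if biographical in ("female", "male"):
        # Documented biographical data always takes priority.
        return biographical
    # Hypothetical lookup of unambiguous first names, supplied by the caller.
    name_table = name_table or {}
    guess = name_table.get(first_name.strip().lower())
    # Anything not clearly listed stays in the unknown category.
    return guess if guess in ("female", "male") else "unknown"
```

Keeping "unknown" as the default means that an incomplete name table widens the unknown category rather than producing wrong labels, which matches the cautious approach described above.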

Below is the gender breakdown in its entirety, showing trends over time in the male, female, and unknown categories. Looking more closely, discrete trends reveal themselves.

First of all, it is worth noting that before WWII, women predominated as students in what was a small and fledgling program. We also know from high school graduation dates that most of these women were older than standard college age. The program in its early years, in other words, was less about professional credentialing than continuing education.

Then, after WWII, with the inception of the GI Bill, the program was flooded with male students, usually older than standard college age and (based on anecdotal accounts of this period) frequently married, and this pattern persisted into the seventies. The image also shows the overall growth of the Workshop program.

Looking closer to the present, we see that the first year in which female students predominate is 1983. From then on there is a marginally larger percentage of female students, which reflects national trends as established by the Integrated Postsecondary Education Data System (IPEDS).

We hope this illustrates how the PEP will assist scholars and those interested in the history of creative writing to obtain insights and information about connections and trends in the field. As PEP’s database of institutional and biographical data continues to grow, and is accompanied by a host of computer-collected text analysis metrics from Workshop-affiliated writing, we hope to offer scholars new ways to make discoveries that will lead to new lines of scholarly inquiry on the rise and spread of creative writing in the United States and the world.


Geography and Creative Writing with Google Maps: Part Two, a Program Era Project Sample Visualization

Here is a link to the Google map I’ll be discussing for this second post on geographic information assembled by the Program Era Project. Again, feel free to click and explore. Layer toggles are activated on the left bar. In order to ensure privacy, all names have been removed from the assembled records. This map was made possible by the efforts of University of Iowa students Emma Husar and Abby Sevcik, who were instrumental in collecting and organizing the data presented here.

In our earlier post, “Geography and Creative Writing with Google Maps,” the Program Era Project provided a sample visualization of some of the geographical data the Project had collected in its ongoing effort to document and better understand the expansion of the Creative Writing programs at universities across the United States during the second half of the 20th century. The visualization, made using Google Maps, employed data assembled from resources found in the University of Iowa Special Collections Library and the University Archives to illustrate the migration of creative writers to and from Iowa City and the Iowa Writers’ Workshop. Charting the hometowns of prominent Workshop-affiliated writers and the locations of creative writing programs founded by, directed by, or employing Workshop-affiliated writers, the map helped demonstrate how a single institution such as the Workshop could have connections to writing programs across the nation. It also showed how a single program could draw writers from a wide variety of locations both inside and outside of the United States.


By using the Google Maps filters feature to separate historic time frames, it was possible, within the previous Google Maps visualization, to detect what regions of the United States were common origin points for prominent Workshop writers. The previous visualization suggested that the Northeast, Midwest, and South had been home to a number of Workshop writers. Meanwhile, the map simultaneously suggested a scarcity of writers from states such as Montana, the Dakotas, Idaho, and Wyoming (though, again, that may simply be a feature of that intentionally limited data set).

At the end of the previous post, I had written that the Program Era Project’s Nikki White had been working with a different collection of University of Iowa archival records to bring together a larger set of geographic data about the Workshop and its graduates. This post will highlight a new Google Maps visualization based on that data. This new visualization documents the hometowns of over 200 University of Iowa graduate students connected to the Iowa Writers’ Workshop, graduates who received advanced degrees at Iowa between 1938 and 1960 (this distinction will be clarified in a moment). In order to maintain the privacy of the graduates, we have stripped the records of all names. The aim of the data visualization is to explore broad demographic trends, not chart any single author’s professional itinerary.

Illustrating, once again, the wealth of historical records the Program Era Project has been able to access in the University of Iowa Libraries’ Special Collections and University Archives, the data for this visualization was assembled by examining graduation programs and pamphlets distributed at University of Iowa commencement ceremonies between 1938 and 1960. In these materials, students were allowed to list their hometowns, so once this information was collected, a data set could be born. For their efforts in undertaking the sizable task of assembling and organizing these hometown records, the Program Era Project owes significant thanks to University of Iowa students Emma Husar and Abby Sevcik, who were crucial in stewarding this information from the archive to the database.

While this post is on the topic of organizing and assembling data, it’s important to note that this current visualization charts the hometowns of “over 200 University of Iowa graduate students who attended the Iowa Writers’ Workshop,” not “Iowa Writers’ Workshop MFA graduates.” There are a number of reasons to use this broader language, and I will outline some of them in order to illustrate the challenges the Program Era Project faces in collecting and organizing archival records. First of all, this broader language helps account for the variety of degrees Workshop students earned at Iowa in the early days of the program, as well as the variety of ways those degrees were earned. Workshop-affiliated writers and scholars were receiving PhDs as well as MAs and, later, MFAs. Critical studies as well as creative works were turned in to satisfy completion requirements.

Additionally, as Nikki mentioned in her post on the Program Era Project blog, “Building the Program Era Project Database,” another challenge faced in refining datasets is the shifting administrative relationship between the Writers’ Workshop and the English Department at Iowa. Until the 1990s, English and Creative Writing remained, to varying degrees, institutionally intertwined, the two units only fully separating after the 1992 study mentioned in the previous post. Because of the closeness between these two units, deciphering the precise institutional position of an individual can require frequent cross-referencing between resources and archives.

One other particularly interesting and unusual feature of the data is the significant number of graduates listing Iowa City itself as their hometown. This is both something of an anomaly and a specific issue the Program Era Project is addressing as it builds its database. Throughout the period covered by the map there are a significant number of graduates listing Iowa City as home, a number that seems both out of place and comparable to much larger, more populous areas. Looking at the data in more detail, with names still connected to data points, shows how this anomaly likely occurred. For instance, in the dataset poet W.D. Snodgrass listed his hometown as Iowa City. However, Snodgrass’s biographical information makes it clear this is not the case. Iowa City was not the poet’s hometown and he only lived in Iowa City for a relatively small portion of his life. While, on the one hand, this represents a statistical blip, it also points to a potential phenomenon, one of writers, at least on an official record, “adopting” the home of the workshop as home for themselves. That said, because of this unique coincidence of bookkeeping, and because of some of the other challenges I’ve mentioned above, the visualization presented here should be seen as a look at broad contours in early demographic trends with Workshop-affiliated writers. It is still an evolving data set that will be further shaped by ongoing research.


Methods, provisions, and historical anomalies covered, let’s look at the maps. When compared to the previous visualization, what becomes quickly apparent is the much larger number of individuals included. Because of this higher density of information, this map lends itself much better to looking at aggregate changes as opposed to examining individual points. That isn’t to say, however, it can’t also be fun to speculate who points might be, points like a sole 1947 graduate from Milledgeville, GA. When the map is zoomed closer on particular cities and regions, a clearer picture emerges of the numbers of graduates associated with a location. Moreover, like the previous Google Maps visualization, the most interesting perspectives emerge by toggling layers of the map on and off, as these layers group the data points by time increments. By toggling layers, the chronology of the Workshop’s growth, its expansion into an international institution, its accumulation of graduates from specific cities and regions, can be seen in greater detail.
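Grouping points into time-based layers is straightforward to prepare before the data ever reaches a mapping tool. The sketch below shows one way to bucket records by year. The record format (year, hometown), the function name, and the choice of increment are assumptions for illustration, since the post does not state the increments used for the map's layers.

```python
from collections import defaultdict


def group_into_layers(records, start_year, span):
    """Group (year, hometown) records into layers of `span` years each."""
    layers = defaultdict(list)
    for year, hometown in records:
        # First year of the layer this record falls into.
        first = start_year + ((year - start_year) // span) * span
        layers[f"{first}-{first + span - 1}"].append(hometown)
    return dict(layers)
```

Each resulting group can then be exported as its own file and imported into the map as a separate layer, so that toggling a layer shows or hides one time period.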


These images show overall growth in the number of Workshop graduates in the United States between 1938 and 1960, as well as the Workshop’s eventual turn towards drawing writers from outside the United States, graduates from England, the Philippines, and South Korea all appearing on the map.


Here is a set of images demonstrating the incremental growth of graduate hometowns in the Midwest and East Coast of the United States, the regions which a large portion of the early graduates listed as their home.


Lastly, we can see how toggling on and off layers can help illustrate the proliferation of graduates listing a single metropolitan area as their hometown, in this case, New York City and its surrounding areas. While only one writer in our dataset listed the New York City area as home in 1946, as the layers increase, the number of graduates also increases sharply, indicating an ongoing growth of a relationship between the two cities.

So, while our previous map allowed viewers to get a sense of the migration of specific individuals into the Iowa Writers’ Workshop and out to positions of institutional significance at newly-forming creative writing programs across the country, this visualization offers the chance to get a better large-scale sense of what places writers were leaving to arrive at the Iowa Writers’ Workshop. This visualization also documents the Workshop’s expansion into an institution that drew international attention and it shows how the number of writers coming to Iowa City from American metropolitan centers would grow throughout the second half of the twentieth century. In short, by linking the Program Era Project’s data up with Google Maps we have a chance to show off another example of how the Program Era Project is assembling the information needed to chart patterns and take a macro look at statistical and geographic trends in the history of Creative Writing.


Geography and Creative Writing with Google Maps: a Program Era Project Sample Visualization (Reposted from New Readia)

– The following first appeared on the NewReadia blog 05/24/16. It has been reposted here to document some of the Program Era Project’s ongoing experimentation with visualizing the data it has collected. – NMK

Here is a link to the Google map I’ll be discussing throughout this post, “Workshop-Affiliated Directors and Founders of Creative Writing Programs (1976, 1992).”

Since the last post on New Readia, the team behind Mapping the Program Era (now renamed the Program Era Project) has continued its work on collecting historical and institutional records to chart the evolution of both the Iowa Writers’ Workshop and the literary phenomenon of creative writing programs during the second half of the 20th century.

As I mentioned in my last post, Mark McGurl opens The Program Era: Postwar Fiction and the Rise of Creative Writing, remarking, “the rise of the creative writing program stands as the most important event in postwar American literary history” and he emphasizes the need to document the growth of the creative writing enterprise (ix). Earlier this month, the Program Era Project team, along with new team member John J. Witte of Iowa’s Department of Communication Studies, had the chance to go to Stanford University and meet with Professor McGurl, Professors Mark Algee-Hewitt and Franco Moretti, and other members of Stanford University’s Literary Lab. There, we were able to share some of the work we’ve done on the Program Era Project and to bounce ideas off our gracious hosts regarding how the Program Era Project might pursue the objective McGurl lays out in his book.

Over the course of the summer, I’ll be sharing online some of the work we presented and the experiments we’ve conducted with visualizing the data we’ve collected. As I’ve mentioned before, the Program Era Project is interested, whenever possible, in using our data to create sample visualizations and proof-of-concept work. We do this both because it helps the team get a sense of what types of research questions our data can help answer, and, more importantly, because it helps us see the potential the Program Era Project has to offer new perspectives on the literary phenomenon of creative writing.

For this post, I’ll be showing (more accurately, “providing access to”) a sample visualization put together using Google Maps, which presents geographic information about the migration of Workshop-affiliated writers to and from Iowa. It also allows users to see a collection of creative writing programs founded by Workshop writers and where Workshop-affiliated writers were serving as directors of other creative writing programs at two specific points in time: 1992 and approximately 1976.

The key information for this new visualization came, as is often the case with the Program Era Project, from resources available through the University of Iowa Special Collections Library and the University Archives. In this case, the document in question was a department self-study produced, in 1992, by the Writers’ Workshop for the College of Liberal Arts and Sciences. In the 1992 study, the Workshop reported to the University on its current size and overall growth. The self-study offered other information, including extensive lists of awards won by Workshop students and faculty, as well as one appendix, titled “Directors of Writing Programs with University of Iowa MFA’s,” which provided a list of Workshop-affiliated writers and the programs for which they were then serving as director. Interestingly, the study was produced as the Workshop was undertaking efforts to separate from The University of Iowa’s English department, becoming its own institutional entity.

Beyond its status as a fascinating historical document, the 1992 study opened the opportunity for some new geographical visualization experiments and proof-of-concept work. The Program Era Project had already collected an earlier list of Workshop-affiliated writers who served as directors (or founders) of other creative writing programs, a list compiled from information Stephen Wilbers gathered for his 1976 English dissertation at Iowa, a work which went on to be The Iowa Writers’ Workshop. Because we had these two collections of data, assembled at two specific points in time, we could see how the number of creative writing programs with Workshop-affiliated directors or founders had grown or changed over the span of 16 years. We could also get a greater sense of the ongoing movement of Workshop-affiliated writers into and between creative writing programs across the country.


For my last post, some network maps made in Gephi provided the basis of a rough mockup visualization of the spread of the Workshop-affiliated writers across the US. While it gives a sense of how and where Workshop-trained writers had moved on to teach by the time of Wilbers’ 1976 survey, the image could benefit from better legibility and it doesn’t account for changes over time. So, for this experiment in data visualization, we turned to Google Maps.


The new Google map experiment offers a sample of some of the geographical information about the history of creative writing we are working to document in the Program Era Project. By taking advantage of the different layers of information we can have on one Google Map, we can both account for (slight) variations in time and allow for different types of information to be toggled on and off. The static image above illustrates a number of things tracked by the map. First, the light blue points are schools that listed Workshop-affiliated writers as directors of their creative writing programs in Wilbers’ survey. The dark blue points show schools listed as having Iowa MFAs as directors in the 1992 self-study. Green markers are the locations of creative writing programs that reported, for the Wilbers survey, that they were founded by Workshop writers. Clicking on a blue or green point gives the name of the school and the Workshop writer(s) listed as director or founder.



Toggling layers on and off, such as the 1976 and 1992 directors’ lists, allows for some changes over time to be seen. By switching between 1976 only and both 1992 and 1976, users can see, for instance, the new schools where Workshop-affiliated writers became directors. The map also shows whether Workshop writers stayed at a particular school. Oakley Hall, for instance, was listed as the director of the creative writing program at the University of California, Irvine both in Wilbers’ survey and in 1992. Moreover, with both layers on, the overall growth of schools where Workshop writers have been employed in positions of institutional significance is also illustrated.
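The same comparison between layers can be worked out directly from the underlying lists. What follows is only a minimal Python sketch, not the project's actual tooling. It assumes each layer is stored as a dictionary mapping a school to its listed director. Only the Oakley Hall entry comes from the records discussed above; every other school and name is a hypothetical placeholder.

```python
# Each layer maps a school to the Workshop-affiliated director listed for it.
# Only the UC Irvine / Oakley Hall entry is taken from the post itself;
# every other entry is a hypothetical placeholder for illustration.
directors_1976 = {
    "University of California, Irvine": "Oakley Hall",
    "Placeholder University A": "Writer A",
}
directors_1992 = {
    "University of California, Irvine": "Oakley Hall",
    "Placeholder University A": "Writer B",
    "Placeholder University B": "Writer C",
}


def compare_layers(earlier, later):
    """Return schools that appear only in the later layer, plus the
    directors listed at the same school in both layers."""
    new_schools = sorted(set(later) - set(earlier))
    stayed = {
        school: name
        for school, name in later.items()
        if earlier.get(school) == name
    }
    return new_schools, stayed


new_schools, stayed = compare_layers(directors_1976, directors_1992)
print("New in 1992:", new_schools)
print("Listed at the same school in both years:", stayed)
```

A set difference finds the schools that only appear once the later layer is switched on, and a name match on shared schools picks out directors who, like Hall, are listed at the same program in both records.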

The Program Era Project is also interested in where Workshop writers came from, not just where they went after the Workshop. A history of the Workshop (or creative writing) would be incomplete without considering what regional backgrounds have converged in creative writing communities. So, in cases where information about Workshop author hometowns was available via author biographies, that hometown information was added to this map. The information, like the school information, is separated by the time frames of 1992 and 1976.




Here, the hometowns of 1992 directors are in dark red and 1976 directors/founders are in light red. Admittedly, the hometown data is less complete than the school information. In a later blog post, I will show some of the other approaches team member Nikki White has taken towards mapping hometown geographic data. However, for now, the hometown information on the map still gives a small sense of some of the places people were traveling from to arrive at Iowa City. The East coast, South and Midwest all have a number of Workshop writers. Both in terms of schools and hometowns, the map also bears a noticeable gap in states like Montana, the Dakotas, Idaho, and Wyoming, though this may just be an unusual feature of this data.

This map, as is the case with the previous data visualizations, covers very specific historical snapshots and uses an intentionally limited collection of information. It is, fundamentally, a proof-of-concept. However, I hope the map helps illustrate some of the information the Program Era Project hopes to make available in its efforts to document the history of creative writing. I encourage you to play around with the map and we hope it gives you a sense of our aim to provide interactive digital research tools for the scholar and the curious alike, as well as the potential the Program Era Project has to offer new perspectives on the literary history of the 20th century.

– Nicholas M Kelly

Mapping the Program Era: Sample Data Visualizations (Reposted from New Readia)


– The following first appeared on the NewReadia blog 10/18/15. It has been reposted here to document some of the Program Era Project’s ongoing experimentation with visualizing the data it has collected. For clarity, the project name has been updated. – NMK

Last week, University of Iowa Professor Loren Glass, University of Iowa librarian Nikki White, and I had the opportunity to give a talk at the University of Iowa’s Digital Scholarship and Publishing Studio about a Digital Humanities project we began earlier this year called the Program Era Project. The Program Era Project employs data visualization software and network analysis tools to chart the growth of creative writing programs after World War II, discerning, in the process, lines of aesthetic and institutional influence. Our initial efforts have centered on our home institution, the University of Iowa, and the influential Iowa Writers’ Workshop. For our talk, we presented sample visualizations drawn from a small-scale dataset on the Workshop the team assembled.

To provide a demonstration of the tremendous potential of the project, the team created a sample visualization in Gephi that served as a proof-of-concept for the Program Era Project. It illustrated the connections that could be seen even in a dataset that covered only a single moment in time. The origins of this dataset lay in dissertation research conducted in 1976 by English Ph.D. candidate Stephen Wilbers. For his 1978 dissertation, Emergence of the Iowa Writers’ Workshop (later adapted to become The Iowa Writers’ Workshop), Wilbers attempted to assess the influence of the Workshop by finding out which Workshop graduates had helped found other creative writing programs or had become directors or instructors at creative writing programs outside the Workshop. To do this, Wilbers sent a survey to 125 creative writing programs across the United States. The list of programs was “compiled from the CEA Chap Book (1970), the Associated Writing Programs 1975 Catalog of Programs (including the directory at the end), and an ‘in-house’ list of 32 programs that the Iowa Writers’ Workshop staff recognizes as top programs” (from Emergence, page 203).

The survey responses, available in Iowa’s University Archives, provided a list of Workshop graduates and Workshop-affiliated writers connected to other creative writing programs. Going, again, to Iowa’s libraries and looking at title pages of these graduates’ theses and dissertations offered a way to find the advisors connected to each graduate. Thus, connections between Workshop instructors, Workshop graduates, and other creative writing programs began to emerge. In Gephi, the visualizations could map the lines of connection.

(Edit 10/26/15) – Note: Again, these samples are only intended to demonstrate a proof-of-concept for what the project aims to do. The data was constructed using an intentionally limited and incomplete dataset and corrections may need to be made.


This visualization demonstrates the full array of relationships in the proof-of-concept dataset. It shows the connections (the lines, or “edges”) from the Workshop itself (blue circle, or “node”) out to its instructors (orange circles). Then, the lines move from the instructors to the graduates they advised (green) and, finally, from graduates out to the institutions at which they were employed (yellow). Larger circles indicate nodes with more edge connections. For instance, an instructor with more students mentioned in Wilbers’ survey or an institution with more Workshop-affiliated faculty will be a larger circle.
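To make the idea of sizing circles by their connections concrete, here is a small Python sketch that counts edge connections per node. It is an illustration only and does not reproduce Gephi's own sizing: the node names are hypothetical placeholders rather than records from the dataset, and the base and step values in node_size are arbitrary assumptions.

```python
from collections import Counter

# Hypothetical edges following the chain described above:
# Workshop -> instructor -> graduate -> institution.
# None of these names come from the actual dataset.
edges = [
    ("Iowa Writers' Workshop", "Instructor A"),
    ("Iowa Writers' Workshop", "Instructor B"),
    ("Instructor A", "Graduate 1"),
    ("Instructor A", "Graduate 2"),
    ("Instructor B", "Graduate 3"),
    ("Graduate 1", "Institution X"),
    ("Graduate 2", "Institution X"),
    ("Graduate 3", "Institution Y"),
]


def degree_counts(edge_list):
    """Count how many edges touch each node (its degree)."""
    degrees = Counter()
    for source, target in edge_list:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


def node_size(degree, base=10, step=5):
    """Map a degree to a display size; base and step are arbitrary."""
    return base + step * degree


degrees = degree_counts(edges)
for node, degree in degrees.most_common():
    print(f"{node}: degree {degree}, size {node_size(degree)}")
```

In this toy example, Instructor A touches three edges (one from the Workshop and two to advisees), so it would be drawn larger than Institution Y, which touches only one.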

Here, graduates connected to Donald Justice are specifically highlighted, with the years for graduates’ thesis/dissertation completion also marked. The visualization also demonstrates how, in some cases, Workshop graduates would themselves become Workshop instructors. The line between Justice and Paul Engle charts Engle’s status as Justice’s advisor. Justice advisee Eugene Garber (marked in orange, not green) was later listed as advisor for future Iowa graduates.

Paul Engle's Students with Graduation Dates

This image shows graduates connected to Paul Engle and lists other information the database tracks, such as whether graduates were listed in Wilbers’ survey results as faculty members instrumental in founding a creative writing program. The image shows that half of Engle’s students listed in Wilbers’ survey data were considered significant founding figures at programs outside of the Workshop.

University of Massachusetts Amherst - Affiliated Faculty with Graduation Dates

This slide shows another way Program Era Project data can be visualized. Here, the University of Massachusetts Amherst is highlighted, showing that five different Workshop graduates were associated with the program at the time of Wilbers’ survey. Four of them are considered key in founding the creative writing component at Amherst.

This final image is a composite, assembled from a Gephi visualization placed on top of a map of the United States. It demonstrates the geographical distribution of Workshop graduates and looks forward to some of the other visualization options the Program Era Project team is exploring as we move forward and expand the project.

In the introduction to his study of the expansion of creative writing programs across the U.S., The Program Era: Postwar Fiction and the Rise of Creative Writing, Mark McGurl writes “the rise of the creative writing program stands as the most important event in postwar American literary history” (ix) and adds,

We need to start documenting this phenomenon, moving out from the illustrious cases of the Iowa Writers’ Workshop and Stanford University and a few others to grasp the reality of an enterprise that now numbers some 350 institutional participants and continues to grow. This enterprise is our literary history. (xii).

It is the aim of the Program Era Project team to document the growth and evolution of the creative writing enterprise. These samples offer a glimpse of the ways we are endeavoring to do that. The images above represent a snapshot of a specific historical moment and account for only specific individuals. They are produced by an incomplete and intentionally constrained dataset. However, they illustrate the enormous potential of the Program Era Project and they offer evidence of how data visualization tools might help us take a new look at the history of creative writing.

 – Nicholas M Kelly