Data cleaning and descriptives
Our team’s goal is to rebuild the structure of the emails well enough to make the data searchable for users. The emails were originally in PDF format, so they had to be restructured and sorted before a searchable database could be built. To do this, we converted the PDFs to JSON using OCR and natural language processing tools. The rest of the data cleaning is ongoing, and our team continues to address new challenges as they arise.
The main purpose of the database is to make the emails searchable, organized, and useful for Flint residents, researchers, and others interested in water and environmental management. Available data includes email senders, receivers, dates, text, and attachments. The database is still under construction.
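As a rough illustration of how records with the fields listed above might be searched once they are in JSON, here is a minimal sketch. The field names (`sender`, `receivers`, `date`, `text`, `attachments`), the sample records, and the `search_emails` helper are all assumptions made for illustration; they are not the project's actual schema or code.

```python
import json

# Hypothetical records; field names are assumed, not the project's real schema.
SAMPLE_JSON = """
[
  {"sender": "a@example.org", "receivers": ["b@example.org"],
   "date": "2015-01-12", "text": "Water quality report attached.",
   "attachments": ["report.pdf"]},
  {"sender": "b@example.org", "receivers": ["a@example.org", "c@example.org"],
   "date": "2015-01-13", "text": "Thanks, will review the sampling plan.",
   "attachments": []}
]
"""


def search_emails(emails, keyword=None, sender=None):
    """Return emails matching every filter that was supplied.

    keyword: case-insensitive substring match against the email text.
    sender:  exact match against the sender address.
    """
    results = []
    for email in emails:
        if keyword is not None and keyword.lower() not in email["text"].lower():
            continue
        if sender is not None and email["sender"] != sender:
            continue
        results.append(email)
    return results


emails = json.loads(SAMPLE_JSON)
matches = search_emails(emails, keyword="water")
for email in matches:
    print(email["date"], email["sender"], email["text"])
```

A plain substring match like this is only a starting point; text produced by OCR often contains misread characters, so a real search would likely need some tolerance for spelling variation.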
Timeline and calendar
The calendar visualization offers a timeline of communications during the years leading up to the Flint Water Crisis and the period after it became public knowledge. The interactive calendar shows the frequency of emails sent from 2011 to 2016. It is currently in phase one; a detailed accompanying timeline will follow to provide context for the high-frequency values.
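The frequencies behind a calendar like this can be thought of as a count of emails per date. The sketch below shows one way such counts could be computed; it assumes dates are stored as ISO strings (`YYYY-MM-DD`) and that counts are kept per day, neither of which is confirmed by the project. The `emails_per_day` function and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def emails_per_day(iso_dates, first_year=2011, last_year=2016):
    """Count emails per calendar day, keeping only dates in the year range."""
    counts = Counter()
    for iso in iso_dates:
        day = date.fromisoformat(iso)  # expects "YYYY-MM-DD"
        if first_year <= day.year <= last_year:
            counts[day] += 1
    return counts


# Hypothetical send dates; the last one falls outside 2011-2016 and is skipped.
sample_dates = ["2015-01-12", "2015-01-12", "2015-01-13", "2017-03-01"]
counts = emails_per_day(sample_dates)

# The day with the most emails is a candidate for added timeline context.
busiest_day, busiest_count = counts.most_common(1)[0]
print(busiest_day.isoformat(), busiest_count)
```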
Data visualization is a tool for understanding nuances in datasets and exploring relationships in the data. Our team has been putting together visualizations at various stages of the data cleaning process to identify and narrow down preliminary patterns. Examples of those visualizations can be found in the gallery. Because they were made at multiple points in the data collection and cleaning process, they should not be taken as final products.
Email annotation tool
Our email annotator allows users to explore the content of individual emails. One of its purposes is to help correct OCR errors and to recognize patterns in those errors, which assists in cleaning the emails in the database. The annotator also lets us mark phrases, words, and other information of interest in the body of email data for further study. The tool is still being refined.
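To make the idea of recognizing patterns in OCR errors concrete, the sketch below tallies character-level substitutions between OCR output and a hand-corrected version. It uses Python's standard `difflib.SequenceMatcher` as a stand-in technique; this is not the annotator's actual method, and the `substitution_counts` helper and the sample word pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(pairs):
    """Tally (ocr_fragment, corrected_fragment) substitutions across word pairs."""
    counts = Counter()
    for ocr_word, corrected_word in pairs:
        matcher = SequenceMatcher(None, ocr_word, corrected_word)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only "replace" spans are counted; insertions and deletions are ignored.
            if tag == "replace":
                counts[(ocr_word[i1:i2], corrected_word[j1:j2])] += 1
    return counts


# Hypothetical (OCR output, human correction) pairs, invented for illustration.
corrections = [
    ("F1int", "Flint"),
    ("samp1e", "sample"),
    ("rnodel", "model"),
]

patterns = substitution_counts(corrections)
for (wrong, right), n in patterns.most_common():
    print(f"{wrong!r} -> {right!r}: {n}")
```

Substitutions that recur often, such as a digit read in place of a letter, could then be reviewed as candidates for bulk correction rather than being fixed one email at a time.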