Data cleaning and descriptives
Our team’s goal is to rebuild the email structure well enough to make the data searchable for users. The emails were originally in PDF format, so they had to be restructured and sorted before they could be searched. To do this and to build the searchable database, we converted the PDFs to JSON using OCR and natural language processing tools. Data cleaning is ongoing, and our team continues to work on new challenges as they arise.
The database is meant to make the emails searchable, so that they are organized and useful for Flint residents, researchers, and others interested in water and environmental management. Available data includes email senders, receivers, dates, texts, and attachments. This database is still under construction.
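To illustrate what searching data of this kind might look like, here is a minimal sketch in Python. It assumes each email is stored as a JSON object with fields for sender, receivers, date, text, and attachments. The field names, the sample records, and the search_emails helper are all hypothetical and chosen for illustration only; they do not describe the actual database schema or its contents.

```python
import json

# Hypothetical sample records. Field names and values are placeholders,
# not real data and not the project's actual schema.
SAMPLE_JSON = """
[
  {"sender": "alice@example.org",
   "receivers": ["bob@example.org"],
   "date": "2015-01-12",
   "text": "Water sampling results attached.",
   "attachments": ["results.pdf"]},
  {"sender": "bob@example.org",
   "receivers": ["alice@example.org", "carol@example.org"],
   "date": "2015-01-13",
   "text": "Thanks, scheduling a meeting.",
   "attachments": []}
]
"""


def search_emails(records, keyword=None, sender=None):
    """Return the records that match every filter that was supplied.

    keyword: case-insensitive substring match against the email text.
    sender: exact match against the sender field.
    """
    results = []
    for record in records:
        if keyword is not None and keyword.lower() not in record["text"].lower():
            continue
        if sender is not None and record["sender"] != sender:
            continue
        results.append(record)
    return results


records = json.loads(SAMPLE_JSON)
matches = search_emails(records, keyword="water")
```

A plain substring match like this is only a starting point. Text recovered through OCR often contains recognition errors, so a real search tool would likely need more tolerant matching.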
Data visualization is a tool for understanding nuances in a dataset and exploring relationships within it. Our team has put together visualizations at various stages of the data cleaning process to identify and narrow down preliminary patterns. Examples of those visualizations can be found in the gallery. Because they were made at multiple points in the data collection and cleaning process, they should not be taken as final products.