Data Cleaning & Text Prep
We’re sorry that we’ll be unable to meet you all in June to explore Data Cleaning and Text Preparation at DSI 2020. Listed below are some suggested introductory readings, lessons, and software packages for those interested in self-directed learning on this topic.
Hopefully, we’ll have the opportunity to work with you all in 2021.
Alex Provo, NYU Libraries
Jay Brodeur, McMaster University Library
1. Suggested readings
- Against Cleaning, by Katie Rawson and Trevor Muñoz.
2. Recommended lessons and useful resources
Below are recommended self-directed lessons and utilities that cover a variety of approaches to data cleaning and text preparation using common tools. Some of what you’ll learn here is specific to the software package being used, while other approaches and concepts can be combined (spreadsheets and regular expressions, for example).
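As a rough illustration of combining spreadsheet data with regular expressions, here is a minimal Python sketch that uses only the standard library. The sample data, the column names (`name`, `phone`), and the helper functions are hypothetical and made up for this example; they do not come from any of the lessons listed here.

```python
import csv
import io
import re

# Hypothetical sample data standing in for a spreadsheet exported as CSV.
RAW_CSV = """name,phone
  Ada   Lovelace ,(555) 010-1234
Grace Hopper,555.010.5678
"""


def clean_name(value):
    # Collapse runs of whitespace to a single space, then trim both ends.
    return re.sub(r"\s+", " ", value).strip()


def clean_phone(value):
    # Remove every character that is not a digit.
    return re.sub(r"\D", "", value)


def clean_rows(text):
    # Read the CSV text row by row and apply the regex-based cleaners.
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"name": clean_name(row["name"]), "phone": clean_phone(row["phone"])}
        for row in reader
    ]


cleaned = clean_rows(RAW_CSV)
for row in cleaned:
    print(row)
```

The same two patterns (`\s+` and `\D`) should also work in the find/replace dialog of most regex-aware tools, so the approach is not tied to Python.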
Spreadsheets for data cleaning
- Data Organization in Spreadsheets for Social Scientists, from Data Carpentry.
- Tidy data for librarians, from Library Carpentry.
- Data Organization in Spreadsheets for Ecologists, from Data Carpentry.
- Data cleaning in spreadsheets, from Duke Library.
OpenRefine
- OpenRefine for Social Science Data, from Data Carpentry.
- Library Carpentry: OpenRefine, from Library Carpentry.
- OpenRefine Recipes. “A collection of useful recipes for achieving certain tasks in OpenRefine”.
Regular Expressions
- Regular expressions
- Using find/replace in Notepad++
- Library Carpentry: Introduction to Working with Data (Regular Expressions)
- Regular Expressions Tester
- RegEx101: Regex explainer and tester
Other software packages
- Data Cleaning with R and the Tidyverse: Detecting Missing Values, using Tidyverse in R.
- Data Analysis and Visualization in R for Ecologists, from Data Carpentry using Tidyverse in R.
- Pythonic Data Cleaning With Pandas and NumPy
3. Software packages for data cleaning & text preparation
Below is a list of software packages and services that are used for common data cleaning and text preparation applications.
Free/open tools
- Notepad++ is a powerful text editor. A wide variety of plugins are available to assist with and automate common tasks.
- Tidyverse in R
- Pandas with Python
- Data Cleaner
- OpenRefine