Getting Started With Tidy Data#
This chapter provides an overview of tidy data principles (Wickham et al.). It covers how to recognize and address pattern errors in structured data using the data cleaning tool OpenRefine and common spreadsheet programs like Microsoft Excel or Google Sheets. It also covers how to use survey design and data validation options to minimize user error in data entry.
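As a rough illustration of the core tidy data idea (each variable gets its own column and each observation its own row), the sketch below reshapes a small "wide" table into a tidy one using only the Python standard library. The table contents, the column names, and the helper `to_tidy` are invented for this example and are not part of the chapter's baseball data.

```python
# Hypothetical wide table: one row per team, one column per season.
# The season labels are stored as column headers, which hides a variable.
wide_rows = [
    {"team": "A", "2019": 90, "2020": 35},
    {"team": "B", "2019": 81, "2020": 29},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                }
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "team", "season", "wins")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the tidy version, `season` and `wins` are explicit columns, so each row describes exactly one team in one season. The same reshaping can be done in OpenRefine or a spreadsheet program; the Python version is only meant to make the target shape concrete.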
Acknowledgements#
The author consulted the following resources when building this tutorial:
Data#
This chapter will work with two tables, versions of the baseball data we used for relational databases and SQL. Link to access via Google Drive
players
teams
You can copy the project to your own Drive workspace, download the workbook as an .xlsx file, and/or download individual sheets as .csv files.
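If you download a sheet as a .csv file, a quick sanity check before cleaning is to confirm the column names and row count. The sketch below is one minimal way to do that with Python's built-in `csv` module. The helper name `summarize_csv` and the sample columns are assumptions made for illustration; the actual columns in the players and teams sheets may differ.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column_names, row_count) for an open CSV file handle."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)  # counts data rows, not the header
    return reader.fieldnames, row_count


# Stand-in for a downloaded sheet; these columns are invented for illustration.
sample = io.StringIO("player_id,name,team_id\n1,Ada,10\n2,Ben,11\n")
columns, n_rows = summarize_csv(sample)
print(columns, n_rows)
```

To run the same check on a real download, open the file with `open(path, newline="", encoding="utf-8")` and pass the handle to `summarize_csv`, where `path` points to wherever you saved the .csv file.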
Software#
Spreadsheet Programs#
We’ll be opening some structured data files as part of our work in this lab. You can use a spreadsheet program or text editor to access these files.
Text editors:
Spreadsheet programs:
Microsoft Excel (Windows or Mac)
ND students have free access to the Microsoft Office suite through Office365
LibreOffice Calc (open-source Excel/Numbers alternative for Mac or Windows users)
Google Sheets (web-based option available through Google Drive)
OpenRefine#
We’ll also be working with a free software program called OpenRefine as part of our work in this lab.
Navigate to https://openrefine.org/download.html in a web browser and download the appropriate version for your operating system.
If you are getting memory-related error messages, visit https://docs.openrefine.org/manual/installing#increasing-memory-allocation to troubleshoot.
Application#
Click here for a Google Doc template for this chapter’s application problems.
To include screenshots in your doc:
Tutorial for adding images/tables/drawings to a Google Doc
Windows (Snipping Tool for folks running older versions of Windows; Snip & Sketch for folks running updated versions)
We’ll also use a combined spreadsheet template for the data cleaning outputs.
Click here to make a copy of the Google Sheets template.