Getting Started With Tidy Data#

This chapter provides an overview of tidy data principles (Wickham et al.). It covers how to recognize and fix pattern errors in structured data using the data cleaning tool OpenRefine and common spreadsheet programs such as Microsoft Excel or Google Sheets. It also covers how to use survey design and data validation options to minimize user error during data entry.
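As a rough illustration of what "tidy" means in practice (each variable in its own column, each observation in its own row), the sketch below reshapes a small wide table into long form using only the Python standard library. The team names, column names, and win counts are invented for illustration and are not taken from this chapter's baseball data; `to_tidy` is a hypothetical helper, not part of any library.

```python
# Hypothetical "wide" table: one row per team, one column per season.
# The season columns hold values of a single variable (wins), so the
# column headers themselves are data rather than variable names.
wide_rows = [
    {"team": "Team A", "2019": 90, "2020": 35},
    {"team": "Team B", "2019": 78, "2020": 29},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Return one row per (id, variable) pair, i.e. one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


tidy_rows = to_tidy(wide_rows, "team", "season", "wins")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the reshaped table, `team`, `season`, and `wins` are each a column, and every row describes one team in one season. OpenRefine and spreadsheet programs offer their own ways to do this kind of reshaping; the code is only meant to make the target layout concrete.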

Acknowledgements#

The author consulted the following resources when building this tutorial:

Data#

This chapter works with two tables, versions of the baseball data we used for relational databases and SQL. Link to access via Google Drive

  • players

  • teams

You can copy the project to your own Drive workspace, download the workbook as an .xlsx file, and/or download individual sheets as .csv files.
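If you download a sheet as a .csv file, a quick way to confirm that it arrived intact is to read it and look at the headers and row count. The sketch below uses Python's built-in `csv` module. The sample text stands in for a downloaded file, and its column names and values are hypothetical, not the actual contents of the players table.

```python
import csv
import io

# Stand-in for the contents of a downloaded "players.csv".
# These column names and values are hypothetical placeholders.
sample_csv = io.StringIO(
    "player_id,name,team_id\n"
    "1,Example Player,10\n"
    "2,Another Player,20\n"
)


def load_rows(handle):
    """Read an open CSV file handle into a list of dicts keyed by header."""
    return list(csv.DictReader(handle))


rows = load_rows(sample_csv)
print(f"{len(rows)} rows, columns: {list(rows[0].keys())}")
```

To run this against a real download, you could replace `sample_csv` with a handle from `open("players.csv", newline="", encoding="utf-8")`, assuming the file is saved under that name in your working directory.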

Software#

Spreadsheet Programs#

We’ll be opening some structured data files as part of our work in this lab. You can use a spreadsheet program or text editor to access these files.

OpenRefine#

We’ll also work with OpenRefine, a free software program, in this lab.

Navigate to https://openrefine.org/download.html in a web browser and download the appropriate version for your operating system.

Application#

Click here for a Google Doc template for this chapter’s application problems.

To include screenshots in your doc:

We’ll also use a combined spreadsheet template for the data cleaning outputs.

Click here to make a copy of the Google Sheets template.