Tidy Data Principles#
Hadley Wickham’s 2014 article in the Journal of Statistical Software outlines the foundations and principles of tidy data. These principles have become widely used in data science and other statistical software applications.
“A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.” (Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.)
As part of this week’s work, we read Karl W. Broman and Kara H. Woo’s 2018 “Data Organization in Spreadsheets” from The American Statistician.
“Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.” (Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989)
What Are the Principles#
Designing spreadsheets that are “tidy, consistent, and as resistant to mistakes as possible” (2)
Be Consistent:
Use consistent codes for categorical variables
Use a consistent fixed code for any missing values
Use consistent variable names
Use consistent subject identifiers
Use a consistent data layout in multiple files
Use consistent file names
Use a consistent format for all dates
Use consistent phrases in your notes
Be careful about extra spaces within cells
Choose Good Names for Things:
Avoid spaces
Avoid special characters
Be short but meaningful
Write Dates as YYYY-MM-DD
Or have separate columns for YEAR, MONTH, DATE
No Empty Cells
Put Just One Thing in a Cell
Make it a Rectangle
Single first row with variable names
7.- Create a Data Dictionary
“This is part of the metadata that you will want to prepare: information about the data” (6)
You might also find this information in a codebook that goes with a dataset
Things to include:
The exact variable name as in the data file
A version of the variable name that might be used in data visualizations
A longer explanation of what the variable means
The measurement units
Expected minimum and maximum values
No Calculations in the Raw Data Files
Do Not Use Font Color or Highlighting as Data
Make Backups
Multiple locations (OneDrive, local computer, etc.)
Version control program (i.e. Git)
Write protect the file when not entering data
Use Data Validation to Avoid Errors
Save a Copy of the Data in Plain Text Files
File formats can include comma-separated values (CSV) or plain-text (TXT)
The principles are also available as a PDF:
Common Spreadsheet Errors#
As described in Library Carpentry’s “Tidy data for librarians” tutorial, common formatting problems for data in spreadsheets include:
Multiple tables
Multiple tabs
Not filling in zeros
Using bad null values
Using formatting to convey information
Using formatting to make the data sheet look pretty
Placing comments or units in cells
More than one piece of information in a cell
Field name problems
Special characters in data
Inclusion of metadata in data table
Date formatting