Structured Data in Python

Structured Data in Python#

In programming languages and computing more broadly, I/O stands for Input and Output.

We’ve seen I/O at at work elsewhere in teh course, where we provided inputs to the computer (using a CPU simulator or working at the terminal), a task or operation was performed, and there was an endpoint or output for that process.

Similarly, programming languages can take a variety of inputs (user-provided values, data, files, etc) and return outputs in a variety of formats (data stored in memory, output that shows up in the console, newly-created or -modified files, etc). We’ve seen I/O in action in previous units, with Python’s print() and input() functions.

In this chapter, we’re going to look at aspects of I/O that have to do with reading and writing files, with a focus on two key data structures:

  • Tabular data, or data stored as a table consisting of rows and columns

  • Data stored as name-value pairs, specifically in the JavaScript Object Notation (.json) format

We’ll do so by interacting with Python’s csv module and json package, along with the pandas Python library.

Goals#

This first part of this chapter provides an overview of fundamental programming concepts in the areas of file I/O and working with structured data, with a focus on Python syntax. Topics covered include:

  • Delimited (.csv, .tsv) and name-value (.json, .xml) data structures

  • Escape characters in relation to delimited data structures

  • Working with delimited and name-value data structures in the Python programming environment, with a focus on the csv + json modules and list + dictionary data structures

  • File I/O and file methods/access modes, with a focus on Python workflows

Panopto logo Lab overview

This second part of this chapter covers the core components of pandas, including Series and DataFrame objects. It covers how to manually create and interact with Series and DataFrame objects in the Python programming environment. It covers loading a structured data file (CSV and JSON) as a DataFrame, and sorting, selecting, and filtering the resulting DataFrame. The lab also covers common data parsing and wrangling challenges like duplicate entries and missing data.

By the end of this chapter, students will be able to;

  • Understand the basic components of Series and DataFrame objects in pandas

  • Manually create Series and DataFrame objects in Python using pandas

  • Load a structured data file as a DataFrame in Python using pandas

  • Interact with a DataFrame using sorting, selecting, and filtering operations

  • Remove duplicate rows from a DataFrame

  • Understand how to approach common DataFrame parsing and loading errors using pandas

  • Understand the basic components of how to handle missing values in a DataFrame

Acknowledgements#

Peer review and editing on the CSV portion of this chapter was provided by Spring 2021 graduate teaching assistant Aidan Draper.

Peer review and editing on the JSON portion of this chapter was provided by Spring 2021 graduate teaching assistant Subhadyuti Sahoo.

When building the CSV, JSON & file methods sections, the author consulted the following materials:

Information and exercises in the pandas sections were developed in consultation with the following resources:

Data Files#

You’ll need four data files for this chapter.

You can also access them via Google Drive (ND users only).

You’ll need to download these files and put them in the same folder as your Jupyter Notebook (or upload them to Google Colab).

Application#

Your answers to this chapter’s application questions should be added to the notebook template.

Submit the Colab link on Canvas for the assignment submission.

Chapter Contents#