Loading Structured Data in Python#

Much of our work this semester will utilize Pandas, a powerful Python library for working with structured data. But we can also work with structured data using Python’s built-in secondary data structures and the csv and json modules.

Data Files#

We’ll be working with two data files in this section.

Without Pandas#

csv#

The csv module allows us to create a reader object that iterates over lines in a .csv file. We can use it to read or load an existing .csv file into Python.

# loading delimited data as list of lists
import csv # import statement
with open('example.csv', newline='') as file: # open file; it closes automatically after the block
    reader = csv.reader(file) # create reader object
    data = list(reader) # create list of lists
print(data) # show output

# if we wanted to combine those steps
import csv # import statement
with open('example.csv', newline='') as file: # open file
    data = list(csv.reader(file)) # create reader object, convert to list of lists
print(data) # show output

# loading delimited data as list of dictionaries
import csv # import statement
with open('example.csv', newline='') as file: # open file
    dictReader = csv.DictReader(file) # DictReader object
    data = list(dictReader) # convert DictReader to list of dictionaries
print(data) # show output

# if we wanted to combine those steps
import csv # import statement
with open('example.csv', newline='') as file: # open file
    data = list(csv.DictReader(file)) # create DictReader object, convert to list of dictionaries
print(data) # show output
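To see how the two reader objects differ, here is a minimal, self-contained sketch. It assumes a small made-up dataset (the `sample` string is hypothetical and stands in for example.csv) and uses `io.StringIO` so that no file on disk is needed.

```python
import csv
import io

# Hypothetical sample data standing in for example.csv
sample = "name,species,age\nZophie,cat,7\nRex,dog,3\n"

# csv.reader yields each row as a list of strings; the header is just the first row
rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]
print(header)   # ['name', 'species', 'age']
print(records)  # [['Zophie', 'cat', '7'], ['Rex', 'dog', '3']]

# csv.DictReader uses the first row as the keys for every later row
dict_rows = list(csv.DictReader(io.StringIO(sample)))
print(dict_rows[0]["name"])  # Zophie

# Every value arrives as a string, so numeric columns need explicit conversion
ages = [int(row["age"]) for row in dict_rows]
print(ages)     # [7, 3]
```

Note that neither reader converts data types for us. Deciding which columns are numbers is part of the loading workflow.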

JSON#

We can read JSON into Python using the json module, which includes a few key functions for converting between JSON and Python values:

  • json.loads() takes a single string of JSON and loads it as a Python value

  • json.load() takes a JSON file (or file-like object) and loads it as a Python value

  • json.dumps() takes a Python value and serializes it to a string of JSON
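A quick sketch of the three functions side by side. The sample values are made up for illustration, and `io.StringIO` stands in for an open file.

```python
import io
import json

# json.loads(): JSON text in a string -> Python value
value = json.loads('{"name": "Zophie", "miceCaught": 0}')

# json.load(): JSON text in a file-like object -> Python value
# (io.StringIO stands in for an open file here)
from_file = json.load(io.StringIO('[1, 2, 3]'))

# json.dumps(): Python value -> string of JSON text
text = json.dumps({"isCat": True, "felineIQ": None})

print(value)      # {'name': 'Zophie', 'miceCaught': 0}
print(from_file)  # [1, 2, 3]
print(text)       # {"isCat": true, "felineIQ": null}
```

One way to keep them straight: the trailing "s" in loads() and dumps() stands for "string".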

Values stored in JSON format must be one of the following data types:

  • string

  • number

  • object (JSON object)

  • array

  • boolean

  • null

Translation table for how Python’s json module interprets or renders these data types:

| JSON | Python |
| --- | --- |
| object | dict |
| array | list |
| string | str |
| number (int) | int |
| number (real) | float |
| true | True |
| false | False |
| null | None |

To translate a string of JSON data into a Python value, we pass it to the loads() function contained in the json module.

import json # import statement
jString = '{"name": "Zophie", "isCat": true, "miceCaught": 0, "felineIQ": null}' # string of JSON data
data = json.loads(jString) # load JSON data as Python value
print(data) # show output
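To check the translation table against real output, we can inspect the type of each loaded value. This sketch extends the string above with two hypothetical keys, `weightKg` and `toys`, which are not part of the original example, so that a real number and an array are covered too.

```python
import json

# JSON string with one value for each JSON data type
# ("weightKg" and "toys" are hypothetical additions for illustration)
jString = (
    '{"name": "Zophie", "isCat": true, "miceCaught": 0, '
    '"felineIQ": null, "weightKg": 4.2, "toys": ["yarn", "ball"]}'
)

data = json.loads(jString)  # the outer JSON object becomes a dict

# Print each key with the name of the Python type its value was given
for key, value in data.items():
    print(key, type(value).__name__)
```

Running this shows each key paired with str, bool, int, NoneType, float, or list, in line with the table.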

When loading a JSON file (or file-like object), we need to use the json.load() function instead.

import json # import statement
with open("example.json") as file: # open file; it closes automatically after the block
    data = json.load(file) # parse as JSON
print(data) # show output
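If no example.json is at hand, the same workflow can be tried end to end with a temporary file. This is a minimal sketch. The `records` list is hypothetical sample data, and json.dump() (the file-writing counterpart of json.dumps()) is used only to create the file.

```python
import json
import os
import tempfile

# Hypothetical records standing in for the contents of example.json
records = [{"name": "Zophie", "isCat": True}, {"name": "Rex", "isCat": False}]

# Write the sample to a temporary folder so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(records, file)

    # The with statement closes the file once json.load() has parsed it
    with open(path, encoding="utf-8") as file:
        data = json.load(file)

print(data == records)  # True
```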

Additional Resources#

For more on using the csv module and this workflow:

For more on using the json module and this workflow:

If you’re feeling fuzzy on file methods:

With Pandas#

pandas has two main data structures: Series and DataFrame.

  • “A Series is a one-dimensional, array-like object containing a sequence of values…and an associated array of data labels, called its index” (McKinney, 126)

  • A DataFrame includes a tabular data structure “and contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130).
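A small sketch of both structures, assuming pandas is installed. The names and numbers are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
mice = pd.Series([0, 4, 2], index=["Zophie", "Tom", "Felix"])
print(mice["Tom"])       # 4

# A DataFrame: a table whose columns can hold different value types
df = pd.DataFrame({
    "name": ["Zophie", "Tom", "Felix"],
    "miceCaught": [0, 4, 2],
    "isCat": [True, True, True],
})
print(df.shape)          # (3, 3)

# Selecting a single column returns a Series
print(type(df["name"]))
```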

Delimited file -> DataFrame#

Loading a delimited file as a DataFrame is fairly straightforward.

import pandas as pd # import statement
df = pd.read_csv("example.csv") # load file, create dataframe
print(df) # show output
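The sketch below runs the same call on in-memory text so it works without example.csv. The sample strings are hypothetical. The second half illustrates two real read_csv() keyword arguments, `sep` and `dtype`, on a semicolon-delimited variant.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for example.csv
sample = "name,species,age\nZophie,cat,7\nRex,dog,3\n"

# read_csv accepts a path or a file-like object; the first row becomes the column labels
df = pd.read_csv(io.StringIO(sample))
print(df.columns.tolist())  # ['name', 'species', 'age']
print(df["age"].sum())      # 10

# Semicolon-delimited variant: sep sets the delimiter, dtype keeps zip codes as text
sample2 = "name;zip\nZophie;02134\nRex;10001\n"
df2 = pd.read_csv(io.StringIO(sample2), sep=";", dtype={"zip": str})
print(df2["zip"].tolist())  # ['02134', '10001']
```

Unlike the csv module, read_csv() infers numeric types, which is why the age column can be summed directly. Without the `dtype` argument, the leading zero in 02134 would be lost when the column is read as integers.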

Additional Resources#

.read_csv() can include a number of other arguments or parameters to handle different data loading challenges, for example other delimiter characters, escape characters, and data type conversion. For more on parsing delimited data in Pandas:

Pandas provides other read_ functions, such as read_excel() and read_json(), for other structured data file formats. For more on Pandas and file I/O:

Pandas & JSON#

Since a DataFrame is a flat two-dimensional structure, highly nested JSON can present a challenge.

Pandas includes a few JSON-specific functions. Documentation links are included below.

We’ll come back to these functions when we start working with web API returns.
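As a preview, here is a minimal sketch of one way nested JSON can be flattened. It assumes `pd.json_normalize()` is among the functions meant here (the text above does not name them), and the `records` list is hypothetical data shaped like a simple API return.

```python
import pandas as pd

# Hypothetical nested records, similar in shape to a web API return
records = [
    {"name": "Zophie", "owner": {"first": "Al", "city": "Boston"}},
    {"name": "Rex", "owner": {"first": "Sam", "city": "Chicago"}},
]

# json_normalize flattens nested dictionaries into dotted column names
df = pd.json_normalize(records)
print(sorted(df.columns))            # ['name', 'owner.city', 'owner.first']
print(df["owner.city"].tolist())     # ['Boston', 'Chicago']
```

Each nested key becomes its own column, so the result fits the flat two-dimensional shape a DataFrame expects.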

Application#

Question 1#

Q1A: Navigate to an open data portal and download a .csv file.

A few places to start:

These open data portals are catalogs of datasets. You will need to explore the websites to identify and then download a specific dataset.

Open the data in a spreadsheet program and/or text editor.

  • What do you see?

  • How can we start to make sense of the data based on available documentation?

Q1B: Write three programs that load the data in Python using the different approaches highlighted in the previous section of the lab:

  • Lists and sublists

  • Dictionaries

  • Pandas DataFrame

Your answer to this question should include your programs plus comments that document your process and explain your code.

Q1C: What challenges did you encounter? How did you address or solve them?

Question 2#

Q2A: Navigate to an open data portal and download a JSON file.

Some options that can get you started:

Q2B: Open the data in a spreadsheet program and/or text editor. Describe what you are seeing. How can we start to make sense of this data? What documentation is available?

Q2C: Write a program that loads the JSON data into Python, using the workflow shown in the previous section of the lab.

  • You’re welcome to explore creating a Pandas DataFrame, but that’s not required.

  • Your answer to this question should include your program plus comments that document your process and explain your code.

Q2D: What challenges did you encounter? How did you address or solve them?