Loading Structured Data in Python#
Much of our work this semester will utilize Pandas, a powerful Python library for working with structured data. But we can also work with structured data using Python’s built-in secondary data structures and the csv and json modules.
Data Files#
We’ll be working with two data files in this section.
Without Pandas#
csv#
The csv module allows us to create a reader object that iterates over the rows of a .csv file. We can use it to read an existing .csv file into Python.
# loading delimited data as a list of lists
import csv # import statement
with open('example.csv', newline='') as file: # open file; it closes automatically after this block
    reader = csv.reader(file) # create reader object
    data = list(reader) # create list of lists
print(data) # show output
# if we wanted to combine those steps
import csv # import statement
with open('example.csv', newline='') as file: # open file
    data = list(csv.reader(file)) # create reader object, convert to list of lists
print(data) # show output
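To see what the list-of-lists structure looks like without needing a file on disk, here is a minimal sketch. The sample text and its column names are hypothetical stand-ins for the contents of a .csv file, and io.StringIO is used only so the string can be read like an open file.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a .csv file
sample = "name,species,age\nZophie,cat,7\nRex,dog,3\n"

# io.StringIO lets csv.reader treat the string like an open file
rows = list(csv.reader(io.StringIO(sample)))

# The first sublist holds the column names; the rest are data rows
header, records = rows[0], rows[1:]

# csv.reader returns every value as a string, so numbers need converting
ages = [int(row[2]) for row in records]

print(header)
print(records)
print(ages)
```

Separating the header row from the data rows is usually the first step when working with this structure, since the reader does not treat the first line differently from any other.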
# loading delimited data as a list of dictionaries
import csv # import statement
with open('example.csv', newline='') as file: # open file; it closes automatically after this block
    dictReader = csv.DictReader(file) # DictReader object
    data = list(dictReader) # convert DictReader to list of dictionaries
print(data) # show output
# if we wanted to combine those steps
import csv # import statement
with open('example.csv', newline='') as file: # open file
    data = list(csv.DictReader(file)) # create DictReader object, convert to list of dictionaries
print(data) # show output
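One reason to prefer the dictionary approach is that values can be looked up by column name instead of by position. The sketch below illustrates this with hypothetical sample data (the column names are assumptions, not the contents of example.csv).

```python
import csv
import io

# Hypothetical sample standing in for the contents of a .csv file
sample = "name,species,age\nZophie,cat,7\nRex,dog,3\n"

# DictReader uses the first row as keys for every following row
records = list(csv.DictReader(io.StringIO(sample)))

# Look up values by column name rather than by index
names = [row["name"] for row in records]

# Filter rows on a column value; values are still strings
cats = [row for row in records if row["species"] == "cat"]

print(names)
print(cats)
```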
JSON#
We can read JSON into Python using the json module, which includes a few key functions for converting between JSON and Python values:
json.loads() takes a single string of JSON and loads it as a Python value
json.load() takes a JSON file (or file-like object) and loads it as a Python value
json.dumps() takes a Python value and transforms it into a JSON-formatted string
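A short round trip can make the direction of each function clearer. This is a minimal sketch using a small hypothetical dictionary: json.dumps() produces a string, and json.loads() turns that string back into an equivalent Python value.

```python
import json

# A small Python dictionary used only for illustration
pet = {"name": "Zophie", "isCat": True, "miceCaught": 0, "felineIQ": None}

text = json.dumps(pet)        # Python value -> JSON-formatted string
restored = json.loads(text)   # JSON-formatted string -> Python value

print(text)
print(type(text).__name__)
print(restored == pet)
```

Note that the result of json.dumps() is an ordinary Python str; it only becomes a file when we write it out ourselves.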
Values stored in JSON format must be one of the following data types:
string
number
object (JSON object)
array
boolean
null
Translation table for how Python’s json module interprets or renders these data types:
| JSON | Python |
|---|---|
| object | dict |
| array | list |
| string | str |
| number (int) | int |
| number (real) | float |
| true | True |
| false | False |
| null | None |
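The mapping in the table can be checked directly by loading one value of each JSON type and printing the Python type that comes back. The keys in this sample string are hypothetical labels chosen for the illustration.

```python
import json

# One value of each JSON type; the key names are arbitrary labels
text = (
    '{"obj": {"k": "v"}, "arr": [1, 2], "s": "hi", '
    '"i": 3, "r": 2.5, "t": true, "f": false, "n": null}'
)

data = json.loads(text)

# Record the name of the Python type produced for each JSON value
types = {key: type(value).__name__ for key, value in data.items()}

for key, type_name in types.items():
    print(key, type_name)
```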
To translate a string of JSON data into a Python value, we pass it to the loads() function contained in the json module.
import json # import statement
jString = '{"name": "Zophie", "isCat": true, "miceCaught": 0, "felineIQ": null}' # string of JSON data
data = json.loads(jString) # load JSON data as Python value
print(data) # show output
When loading a JSON file (or file-like object), we need to use the json.load() function instead.
import json # import statement
with open("example.json") as file: # open file; it closes automatically after this block
    data = json.load(file) # parse as JSON
print(data) # show output
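Once a JSON file is loaded, the result is ordinary nested Python dictionaries and lists, so we navigate it with keys and indexes. The sketch below writes a small hypothetical record to a temporary file (the file name pets.json and the record's fields are assumptions for illustration), loads it back with json.load(), and pulls out nested values.

```python
import json
import os
import tempfile

# Hypothetical nested record used only for illustration
record = {
    "name": "Zophie",
    "toys": [{"kind": "mouse", "count": 2}, {"kind": "ball", "count": 1}],
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "pets.json")  # hypothetical file name

    # Write the record so there is a JSON file to load
    with open(path, "w", encoding="utf-8") as file:
        json.dump(record, file)

    # Load the file back as a Python value
    with open(path, encoding="utf-8") as file:
        data = json.load(file)

# Navigate nested data with dictionary keys and list indexes
first_toy = data["toys"][0]["kind"]
total_toys = sum(toy["count"] for toy in data["toys"])

print(first_toy)
print(total_toys)
```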
Additional Resources#
For more on using the csv module and this workflow:
For more on using the json module and this workflow:
If you’re feeling fuzzy on file methods:
With Pandas#
pandas has two main data structures: Series and DataFrame.
“A Series is a one-dimensional, array-like object containing a sequence of values…and an associated array of data labels, called its index” (McKinney, 126)
A DataFrame is a tabular data structure “and contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130).
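To make these two definitions concrete, here is a minimal sketch that builds each structure by hand. The labels, column names, and values are hypothetical, and the example assumes pandas is installed.

```python
import pandas as pd

# A Series: a sequence of values plus an index of labels
ages = pd.Series([7, 3], index=["Zophie", "Rex"])

# A DataFrame: a table whose columns can hold different value types
pets = pd.DataFrame(
    {
        "name": ["Zophie", "Rex"],   # strings
        "age": [7, 3],               # integers
        "is_cat": [True, False],     # booleans
    }
)

print(ages["Zophie"])   # look up a value by its index label
print(pets.dtypes)      # each column reports its own data type
print(pets.shape)       # (number of rows, number of columns)
```

Selecting a single column from a DataFrame, for example pets["age"], returns a Series, which is one way to see how the two structures relate.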
Delimited file -> DataFrame#
Loading a delimited file as a DataFrame is fairly straightforward.
import pandas as pd # import statement
df = pd.read_csv("example.csv") # load file, create dataframe
print(df) # show output
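Not every delimited file uses commas or has columns that should be read with their default types. As a sketch of how read_csv() parameters can help, the example below reads hypothetical semicolon-delimited text using the sep and dtype parameters. The sample data is an assumption made for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with zero-padded identifiers
sample = "id;name;age\n001;Zophie;7\n002;Rex;3\n"

df = pd.read_csv(
    io.StringIO(sample),   # file-like object standing in for a file path
    sep=";",               # use a semicolon as the delimiter
    dtype={"id": str},     # keep ids as text so leading zeros survive
)

print(df)
print(df.dtypes)
```

Without the dtype argument, the id column would be parsed as integers and the leading zeros would be lost.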
Additional Resources#
.read_csv() accepts a number of other parameters to handle different data loading challenges, for example other delimiter characters, escape characters, and data type conversion. For more on parsing delimited data in Pandas:
The read_ prefix can be used with other structured data file formats. For more on Pandas and file I/O:
Pandas & JSON#
Since a DataFrame is a flat two-dimensional structure, highly nested JSON can present a challenge.
Pandas includes a few JSON-specific functions. Documentation links are included below.
We’ll come back to these functions when we start working with web API returns.
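As a preview of the nesting challenge, the sketch below uses pd.json_normalize(), one of the JSON-related functions in pandas. Whether it is among the functions this lab has in mind is an assumption, and the records are hypothetical.

```python
import pandas as pd

# Hypothetical records where each "owner" value is itself a nested object
records = [
    {"name": "Zophie", "owner": {"first": "Al", "city": "Austin"}},
    {"name": "Rex", "owner": {"first": "Bo", "city": "Boise"}},
]

# json_normalize flattens nested objects into dot-separated column names
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```

Each nested key becomes its own column (for example owner.city), which is what lets nested data fit into a flat two-dimensional table.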
Application#
Question 1#
Q1A: Navigate to an open data portal and download a .csv file.
A few places to start:
These open data portals are catalogs of datasets; you will need to explore the websites to identify and then download a specific dataset.
Open the data in a spreadsheet program and/or text editor.
What do you see?
How can we start to make sense of the data based on available documentation?
Q1B: Write three programs that load the data in Python using the different approaches highlighted in the previous section of the lab:
Lists and sublists
Dictionaries
Pandas DataFrame
Your answer to this question should include the program plus comments that document your process and explain your code.
Q1C: What challenges did you encounter? How did you address or solve them?
Question 2#
Q2A: Navigate to an open data portal and download a JSON file.
Some options that can get you started:
These open data portals are catalogs of datasets; you will need to explore the websites to identify and then download a specific dataset.
Q2B: Open the data in a spreadsheet program and/or text editor. Describe what you are seeing. How can we start to make sense of this data? What documentation is available?
Q2C: Write a program that loads the JSON data into Python, using the workflow shown in the previous section of the lab.
You’re welcome to explore creating a Pandas DataFrame, but that’s not required.
Your answer to this question should include the program plus comments that document your process and explain your code.
Q2D: What challenges did you encounter? How did you address or solve them?