Pandas Basics

Data Structures

pandas has two main data structures: Series and DataFrame.

“A Series is a one-dimensional, array-like object containing a sequence of values…and an associated array of data labels, called its index” (McKinney, 126).
A DataFrame is a tabular data structure “and contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130).

Series

In pandas, “a Series is a one-dimensional, array-like object containing a sequence of values…and an associated array of data labels, called its index” (McKinney, 126).

# series example
import pandas as pd # import statement
obj = pd.Series([4, 7, -5, 3]) # create series
obj # show output

# series example with index attributes
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c']) # create series
obj2['a'] # access value using index

# series comparison operator example
obj2[obj2 > 0] # returns index-value pairs for values greater than 0

# series arithmetic operator example
obj2 * 2 # multiplies all values by 2
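
To make the behavior of the examples above concrete, the following sketch rebuilds `obj2` and checks what each operation returns. The names `positives` and `doubled` are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the labeled Series from the example above
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# The index holds the labels, in the order they were supplied
print(list(obj2.index))  # ['d', 'b', 'a', 'c']
print(obj2['a'])         # -5

# Boolean filtering keeps the label attached to each surviving value
positives = obj2[obj2 > 0]
print(positives.to_dict())  # {'d': 4, 'b': 7, 'c': 3}

# Arithmetic is applied element by element and the labels are preserved
doubled = obj2 * 2
print(doubled.to_dict())    # {'d': 8, 'b': 14, 'a': -10, 'c': 6}
```

In both results the labels travel with the values, which is the main practical difference between a Series and a plain list.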

DataFrame

While a Series object is a one-dimensional array, a DataFrame is a tabular data structure “and contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130). A pandas DataFrame has both a row index and a column index; we can think of it as a collection of Series, one per column, that all share the same row index. Behold, a two-dimensional data structure!
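
One way to see the relationship between the two structures is to build a small DataFrame out of Series. This is a minimal sketch; the names `age`, `state`, and `people` and all of the values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical sample values)
age = pd.Series([34, 51, 27], index=['r1', 'r2', 'r3'])
state = pd.Series(['OH', 'IN', 'MI'], index=['r1', 'r2', 'r3'])

# Combining them yields a DataFrame with a row index and a column index
people = pd.DataFrame({'age': age, 'state': state})

print(list(people.index))    # ['r1', 'r2', 'r3']
print(list(people.columns))  # ['age', 'state']

# Selecting a single column returns a Series that shares the row index
age_column = people['age']
print(type(age_column).__name__)  # Series

# Each column keeps its own value type
print(people.dtypes)
```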

The “Data Structures & Sources” chapter walks through some data loading options in pandas. In this example, we’ll work with data returned by the Census Bureau API.

Specific dataset we’re working with: American Community Survey 1-Year Estimates Public Use Microdata Sample

# creating dataframe from API return
import pandas as pd # import statements
import requests
r = requests.get("https://api.census.gov/data/2022/acs/acs1/pums?get=SEX,AGEP,MAR&SCHL=24") # request the data
r.raise_for_status() # stop here if the request failed
data = r.json() # store response as a list of lists
df = pd.DataFrame(data[1:], columns=data[0]) # create the dataframe, using the first sublist as the column headers and the remaining sublists as the rows (this avoids duplicating the headers)
df # show output
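
The same pattern can be tried without a network connection. The sketch below assumes, as the code above does, that the response is a list of lists whose first sublist holds the column names. The `sample` data is a hypothetical stand-in with invented values, not real survey records.

```python
import pandas as pd

# Hypothetical stand-in for r.json(): a header row followed by data rows.
# The values are invented for illustration and are not real survey records.
sample = [
    ["SEX", "AGEP", "MAR", "SCHL"],
    ["1", "45", "1", "24"],
    ["2", "62", "3", "24"],
    ["2", "38", "5", "24"],
]

# The first sublist becomes the column labels; the rest become the rows
sample_df = pd.DataFrame(sample[1:], columns=sample[0])

print(sample_df.shape)          # (3, 4)
print(list(sample_df.columns))  # ['SEX', 'AGEP', 'MAR', 'SCHL']

# Values that arrive as text stay text until converted explicitly
sample_df["AGEP"] = pd.to_numeric(sample_df["AGEP"])
print(sample_df["AGEP"].max())  # 62
```

Converting with `pd.to_numeric` is one option for columns that hold numbers stored as text; coded categories such as `SEX` or `MAR` may be better left as labels.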

Understanding Your DataFrame

pandas includes several useful commands for getting a sense of what a DataFrame contains.

# show the dataframe (large dataframes display only the first and last five rows)
df

# show initial rows (the default is 5, but you can put another integer in the parentheses)
df.head()

# show final rows (the default is 5, but you can put another integer in the parentheses)
df.tail()
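
As a quick illustration of the optional integer argument, the sketch below uses a hypothetical ten-row DataFrame named `demo` so that the row counts are easy to see.

```python
import pandas as pd

# Hypothetical ten-row DataFrame, used only to make the row counts visible
demo = pd.DataFrame({"value": range(10)})

first_three = demo.head(3)  # rows 0, 1, 2
last_two = demo.tail(2)     # rows 8, 9

print(len(demo.head()))         # 5, the default
print(list(first_three.index))  # [0, 1, 2]
print(list(last_two.index))     # [8, 9]
```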

# technical summary for the dataframe
df.info()

# summary statistics for the dataframe
df.describe()

The last command, .describe(), returns summary statistics for the values in a DataFrame. For columns that hold numeric values, these include:

count
mean
standard deviation
minimum
25th percentile
50th percentile
75th percentile
maximum
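
The statistics listed above appear only for numeric columns. The sketch below, which uses a hypothetical `ages` DataFrame with invented values, shows what `.describe()` reports when numbers are stored as text and how the output changes after converting them with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical values stored as text, invented for illustration
ages = pd.DataFrame({"AGEP": ["45", "62", "38", "51"]})

# With text values, describe() reports count, unique, top, and freq
print(list(ages.describe().index))  # ['count', 'unique', 'top', 'freq']

# After conversion to numbers, the statistical summary appears
ages["AGEP"] = pd.to_numeric(ages["AGEP"])
summary = ages.describe()
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "AGEP"])  # 49.0
```

If a DataFrame built from an API response shows `unique`, `top`, and `freq` instead of a mean, checking the column types with `df.info()` is a reasonable first step.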