Pandas Basics#
Data Structures#
pandas has two main data structures: Series and DataFrame.
“A Series is a one-dimensional, array-like object containing a sequence of values…and an associated array of data labels, called its index” (McKinney, 126).
A DataFrame is a tabular data structure “and contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130).
Series#
In pandas, “a Series is a one-dimensional, array-like object containing a sequence of values…and an associated array of data labels, called its index” (McKinney, 126).
# series example
import pandas as pd # import statement
obj = pd.Series([4, 7, -5, 3]) # create series
obj # show output
# series example with index attributes
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c']) # create series
obj2['a'] # access value using index
# series comparison operator example
obj2[obj2 > 0] # returns index-value pairs for values greater than 0
# series arithmetic operator example
obj2 * 2 # multiplies all values by 2
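The Series operations above can be pulled together in one short sketch. It rebuilds the same values and labels as obj2 under a new name (s, chosen here only for illustration) and shows that the index and the values travel together through filtering and arithmetic.

```python
import pandas as pd

# Same values and labels as obj2 above; the name `s` is just for this sketch
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

labels = list(s.index)   # the data labels that make up the index
values = s.to_list()     # the underlying sequence of values

mask = s > 0             # a boolean Series aligned on the same index
positives = s[mask]      # keeps only the entries where the mask is True
doubled = s * 2          # arithmetic is element-wise; labels are preserved

print(labels)                # ['d', 'b', 'a', 'c']
print(values)                # [4, 7, -5, 3]
print(positives.to_dict())   # {'d': 4, 'b': 7, 'c': 3}
print(doubled["a"])          # -10
```

Filtering drops the label 'a' because its value is negative, while multiplication leaves every label in place.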
DataFrame#
While a Series object is a one-dimensional array, a DataFrame is a tabular data structure “and contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130). A pandas DataFrame has both a row index and a column index; we can think of its columns as Series that all share the same row index. Behold, a two-dimensional data structure!
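One way to see the "columns are Series that share an index" idea is to build a small DataFrame by hand. The data below is a made-up toy example (the column names and values are assumptions for illustration, not part of the Census dataset).

```python
import pandas as pd

# Hypothetical toy data, not drawn from the Census dataset
toy = pd.DataFrame(
    {"age": [34, 51, 27], "state": ["OH", "TX", "CA"]},
    index=["a", "b", "c"],
)

age = toy["age"]  # selecting a single column returns a Series

print(type(age).__name__)           # Series
print(age.index.equals(toy.index))  # True: the column shares the row index
print(toy.shape)                    # (3, 2): three rows, two columns
print(list(toy.columns))            # ['age', 'state']
```

Each column keeps its own value type (numbers in one, text in the other), which is the "different value type" point in the definition above.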
The “Data Structures & Sources” chapter walks through some data loading options in pandas. In this example, we’ll work with data returned by the Census Bureau API.
Specific dataset we’re working with: American Community Survey 1-Year Estimates Public Use Microdata Sample
# creating dataframe from API return
import pandas as pd # import statements
import requests
r = requests.get("https://api.census.gov/data/2022/acs/acs1/pums?get=SEX,AGEP,MAR&SCHL=24") # request the data
r.raise_for_status() # stop here if the request failed
data = r.json() # store response as a list of lists
df = pd.DataFrame(data[1:], columns=data[0]) # the first sublist becomes the column headers; the remaining sublists are the rows, which avoids duplicating the headers
df # show output
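To try the same pattern without a network call, here is a minimal sketch using a hand-written stand-in for the API response. The rows are invented for illustration, and the sketch assumes the values arrive as strings, which is common for JSON returned by an API; the names sample, sample_df, and numeric_df are hypothetical.

```python
import pandas as pd

# Made-up stand-in for r.json(): header sublist first, then data rows
sample = [
    ["SEX", "AGEP", "MAR", "SCHL"],
    ["1", "45", "1", "24"],
    ["2", "62", "3", "24"],
    ["2", "29", "5", "24"],
]

# Same pattern as above: first sublist as headers, the rest as rows
sample_df = pd.DataFrame(sample[1:], columns=sample[0])

# Assumption: the values are strings, so convert each column to numbers
numeric_df = sample_df.apply(pd.to_numeric)

print(list(sample_df.columns))   # ['SEX', 'AGEP', 'MAR', 'SCHL']
print(len(sample_df))            # 3
print(numeric_df["AGEP"].sum())  # 136
```

Converting to numeric types is optional for display, but it matters once you want arithmetic or summary statistics on the columns.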
Understanding Your DataFrame#
Pandas includes several useful commands for getting a sense of what is in a DataFrame.
# show first and last five rows
df
# show initial rows; the default is 5, but you can pass another integer in the parentheses
df.head()
# show final rows; the default is 5, but you can pass another integer in the parentheses
df.tail()
# technical summary for the dataframe
df.info()
# summary statistics for the dataframe
df.describe()
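The .describe() command returns statistical information about the numeric columns in our dataset. If the columns hold strings, as values parsed from a JSON API response often do, convert them first (for example with pd.to_numeric); otherwise .describe() reports count, unique, top, and freq instead. For numeric columns, the output includes: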
count
mean
standard deviation
minimum
25th percentile
50th percentile
75th percentile
maximum
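The inspection commands in this section can be tried on a small frame where the answers are easy to check by hand. The data here is a hypothetical toy column of ages, not the Census dataset.

```python
import pandas as pd

# Hypothetical toy data; the ages are invented for illustration
toy = pd.DataFrame({"age": [20, 30, 40, 50, 60, 70]})

first_two = toy.head(2)  # pass an integer to override the default of 5 rows
last_two = toy.tail(2)

summary = toy.describe()  # summary statistics for the numeric column

print(first_two["age"].to_list())  # [20, 30]
print(last_two["age"].to_list())   # [60, 70]
print(list(summary.index))         # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "age"])  # 45.0
print(summary.loc["50%", "age"])   # 45.0
```

The row labels of the summary line up with the list above: std is the standard deviation, and 25%, 50%, and 75% are the percentiles.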