Summary & Descriptive Statistics
The .mean(), .median(), .describe(), and .agg() methods apply to columns with numeric data and, by default, skip missing values.
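To make the missing-data behavior concrete, here is a minimal sketch on a small made-up DataFrame. The column name and values are illustrative assumptions, not taken from the Census data. By default pandas skips NaN when computing a mean; passing skipna=False makes the missing value propagate instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0]})

mean_default = toy["age"].mean()             # NaN is skipped: (20 + 30 + 50) / 3
mean_strict = toy["age"].mean(skipna=False)  # NaN propagates into the result
count_non_missing = toy["age"].count()       # counts only non-missing values

print(mean_default, mean_strict, count_non_missing)
```

Comparing .count() with the number of rows is a quick way to see how many values were left out of a statistic.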
.info() is a good place to start when checking data types.
In these examples, we’ll work with data returned by the Census Bureau API.
The specific dataset is the American Community Survey (ACS) 1-Year Estimates Public Use Microdata Sample (PUMS).
# creating dataframe from API return
import pandas as pd # data analysis library
import requests # HTTP client used to call the API
r = requests.get("https://api.census.gov/data/2022/acs/acs1/pums?get=SEX,AGEP,MAR&SCHL=24")
data = r.json() # store response
df = pd.DataFrame(data[1:], columns=data[0]) # create the dataframe: the first sublist supplies the column headers, the remaining sublists are the data rows
df # show output
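The API response is a list of lists with the header row first. The sketch below uses a small hand-written stand-in for that response, with invented values, so the construction step can be tried without a network call. The names sample and sample_df are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for r.json(): header row first, then data rows.
# Values are written as strings, as they arrive from the API.
sample = [
    ["SEX", "AGEP", "MAR", "SCHL"],
    ["1", "34", "1", "24"],
    ["2", "58", "3", "24"],
    ["2", "41", "5", "24"],
]

# First sublist becomes the column labels; the rest become rows
sample_df = pd.DataFrame(sample[1:], columns=sample[0])

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # not numeric until the values are converted
```

Slicing with sample[1:] is what keeps the header row from also appearing as the first row of data.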
We can check the data types and general summary.
df.info() # show info
The API returns every value as a string, so in order to perform calculations we’ll want to convert the columns to integers:
df = df.astype(int) # change datatype for all columns
df.info() # show updated technical summary
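Note that .astype(int) raises an error if any value cannot be parsed as an integer. If that happens, one alternative is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN. The following is a minimal sketch on invented data, where "N/A" is an assumed placeholder for a bad value, not something the Census API is known to return.

```python
import pandas as pd

# Hypothetical string data; "N/A" stands in for a value that cannot be parsed
raw = pd.DataFrame({"AGEP": ["34", "58", "N/A"], "SEX": ["1", "2", "2"]})

# Convert each column; unparseable strings become NaN instead of raising
converted = raw.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted["AGEP"].mean())  # NaN is skipped: (34 + 58) / 2
```

A column that picks up NaN this way is stored as floating point rather than integer, because NaN has no integer representation in the default dtypes.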
Now we can start to explore summary statistics for specific columns.
df['AGEP'].mean() # mean value for AGEP column
df['AGEP'].median() # median value for AGEP column
We can also use .describe() to get a predetermined set of summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for all numeric columns.
df.describe() # summary statistics for entire dataframe
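As a small illustration of what .describe() returns, the sketch below runs it on five invented ages and then asks for different percentiles through the percentiles parameter. The data and variable names are assumptions for the example.

```python
import pandas as pd

ages = pd.DataFrame({"AGEP": [20, 30, 40, 50, 60]})  # hypothetical values

summary = ages.describe()  # default set of statistics
custom = ages.describe(percentiles=[0.1, 0.9])  # request the 10th and 90th percentiles

print(summary)
print(custom.index.tolist())  # row labels show which statistics were computed
```

The result is itself a DataFrame, so a single statistic can be pulled out with summary.loc["mean", "AGEP"].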
We can use .agg() to return specific combinations of aggregate statistics for specific columns.
df.agg(
    {
        'AGEP': ['min', 'max', 'median', 'mean', 'skew'],
        'MAR': ['mean']
    }
)
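The shape of the .agg() result can be surprising at first: rows are the statistic names, columns are the original column names, and any statistic that was not requested for a column shows up as NaN. A minimal sketch with invented values:

```python
import pandas as pd

# Hypothetical values standing in for the Census columns
toy = pd.DataFrame({"AGEP": [20, 30, 40], "MAR": [1, 5, 3]})

result = toy.agg({"AGEP": ["min", "max", "mean"], "MAR": ["mean"]})

print(result)
print(result.loc["mean", "AGEP"])  # mean of AGEP
print(result.loc["min", "MAR"])    # NaN: 'min' was not requested for MAR
```

Because the result is a regular DataFrame, individual values can be read with .loc using the statistic name and the column name.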