Summary & Descriptive Statistics

The .mean(), .median(), .describe(), and .agg() methods apply to columns with numeric data and, by default, skip missing values.
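To see how missing values are handled, here is a minimal sketch on a small made-up Series (the values are hypothetical and not taken from the Census data). It compares the default behavior with skipna=False, which makes a missing value propagate into the result.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; one value is missing
ages = pd.Series([20, 30, np.nan, 50], name="age")

mean_default = ages.mean()              # NaN is skipped: (20 + 30 + 50) / 3
mean_strict = ages.mean(skipna=False)   # NaN propagates, so the result is NaN
count_non_missing = ages.count()        # counts only the non-missing values

print(mean_default, mean_strict, count_non_missing)
```

If a mean looks suspicious, comparing .count() with len() is a quick way to see how many values were skipped.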

A good first step is .info(), which lists each column’s data type and non-null count.

In these examples, we’ll work with data returned by the Census Bureau API.

# creating dataframe from API return
import pandas as pd # import statements
import requests
r = requests.get("https://api.census.gov/data/2022/acs/acs1/pums?get=SEX,AGEP,MAR&SCHL=24") # request the data
data = r.json() # store response as a list of lists
df = pd.DataFrame(data[1:], columns=data[0]) # create the dataframe, using the first sublist as the column headers and the remaining sublists as rows (to avoid duplicating the headers)
df # show output
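The slicing in the DataFrame call is easier to follow on a small offline example. The sketch below assumes the response is a list of lists with the header row first; the sample values are made up for illustration and are not real API output.

```python
import pandas as pd

# Hypothetical sample shaped like the response: header row first, then data rows
sample = [
    ["SEX", "AGEP", "MAR", "SCHL"],
    ["1", "45", "1", "24"],
    ["2", "62", "3", "24"],
    ["2", "38", "5", "24"],
]

# sample[0] supplies the column names; sample[1:] supplies the rows
sample_df = pd.DataFrame(sample[1:], columns=sample[0])

print(sample_df.shape)
print(sample_df.dtypes)  # the columns hold strings at this point, not numbers
```

Because the values arrive as strings in this sketch, numeric methods will not work on them until the columns are converted.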

We can check the data types and general summary.

df.info() # show info

To perform calculations, we’ll first convert the columns to integers:

df = df.astype(int) # change datatype for all columns
df.info() # show updated technical summary
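Note that .astype(int) raises an error if any value cannot be parsed as an integer. As one possible alternative, the sketch below uses pd.to_numeric with errors="coerce", which turns unparseable values into NaN. The data here is hypothetical; "N/A" simply stands in for a value that is not a number.

```python
import pandas as pd

# Hypothetical string data; "N/A" stands in for a non-numeric value
raw = pd.DataFrame({"AGEP": ["45", "62", "N/A"], "MAR": ["1", "3", "5"]})

# Strict conversion fails as soon as one value is not an integer
try:
    raw.astype(int)
    strict_failed = False
except ValueError:
    strict_failed = True

# Lenient conversion, column by column; unparseable values become NaN
converted = raw.apply(pd.to_numeric, errors="coerce")

print(strict_failed)
print(converted)
```

The trade-off is that a column containing NaN is stored as float rather than int, so the strict version is still useful when we expect every value to be a valid integer.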

Now we can start to explore summary statistics for specific columns.

df['AGEP'].mean() # mean value for AGEP column
df['AGEP'].median() # median value for AGEP column

We can also use .describe() to get a predetermined set of summary statistics for all numeric columns.

df.describe() # summary statistics for entire dataframe
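The output of .describe() is itself a dataframe, so we can pull individual statistics out of it by label. Here is a minimal sketch on made-up values (not the Census data) that also shows the percentiles parameter for requesting different cut points.

```python
import pandas as pd

# Hypothetical ages for illustration
ages = pd.DataFrame({"AGEP": [20, 30, 40, 50, 60]})

summary = ages.describe()
print(summary.index.tolist())  # the row labels are the statistic names

# Individual statistics can be selected by row and column label
mean_age = summary.loc["mean", "AGEP"]
median_age = summary.loc["50%", "AGEP"]

# Request the 10th and 90th percentiles instead of the default quartiles
custom = ages.describe(percentiles=[0.1, 0.9])
print(custom)
```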

We can use .agg() to return specific combinations of aggregate statistics for specific columns.

df.agg(
    {
        'AGEP': ['min', 'max', 'median', 'mean', 'skew'],
        'MAR': ['mean']
    }
)
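When different columns get different lists of statistics, the result has one row per statistic name and NaN wherever a statistic was not requested for that column. A small sketch with made-up values (not the Census data) shows the shape of the result.

```python
import pandas as pd

# Hypothetical values for illustration
toy = pd.DataFrame({"AGEP": [20, 30, 40], "MAR": [1, 1, 5]})

result = toy.agg(
    {
        "AGEP": ["min", "max"],
        "MAR": ["mean"],
    }
)

print(result)
# 'mean' was not requested for AGEP, so that cell is NaN;
# likewise 'min' and 'max' are NaN for MAR
```

Seeing NaN in this output therefore does not indicate missing data in the original columns.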

Additional Resources