.groupby() & .value_counts()#

.groupby()#

Generally speaking, this type of pandas operation follows a split-apply-combine pattern. pandas syntax combines the apply and combine steps.

  • First we split the data into groups.

  • Then we apply a function to each group independently.

  • Then we combine the function results into a data structure.

In this example, we’ll work with the Census Bureau API return.

# creating dataframe from API return
import pandas as pd, requests, json # import statements
page = requests
r = requests.get("https://api.census.gov/data/2022/acs/acs1/pums?get=SEX,AGEP,MAR&SCHL=24")
data = r.json() # store response
df = pd.DataFrame(data[1:], columns=data[0]) # create the dataframe, making the first sublist the column headers, and starting with the first row of data to avoid duplicating headers)
df = df.astype(int) # change datatype for all columns
df.info() # show updated technical summary
df # show output

For example, let’s say we wanted to calculate average age by marital status.

df[['MAR', 'AGEP']].groupby('MAR').mean() # average age by marital status

The first set of brackets ([]) isolates a subselection with only these two columns. Then we apply the .groupby() method to the MAR column to calculate .mean() for the AGEP values.

Another way to calculate average age by marital status, using a slightly different syntax:

df.groupby('MAR')['AGEP'].mean() # average age by marital status

In this alternate example, the column name in the .groupby() parenthesis specifies the column to group by. The column name in brackets ([]) specifies the column to perform the mathematical function on.

Additional Resources#

`.value_counts()#

We can also count the number of records by category using .value_counts(). The .value_counts() method returns the number of records for each category in a column.

Let’s say we want to know the number of records for each marital status. Thinking through the underlying logic, we need a program that groups by marital status and calculates the number of records for each unique value.

df['MAR'].value_counts() # count number of records for each marital status unique value

We could also break out these steps using a combination of .groupby() and .count().

df.groupby('MAR')['MAR'].count() # count number of records for

In the second example, we are explicitly categorizing the data by category in the MAR column, then performing a .count() operation on the records for each MAR category.

Putting It All Together#

A few key takeaways:

  • We can calculate aggregate statistics for entire rows or columns

  • .groupby() follows a split-apply-combine pattern

  • .value_counts() can be a shorthand for getting the number of entries for each category in a field