Labels & Indexing#
We’ve already seen a few different ways to select specific rows and columns and filter based on different kinds of conditions.
As a result or as part of data wrangling operations, you may need to rename columns or renumber the rows in a dataframe.
Renaming Columns#
We can use .rename()
to rename columns in a dataframe.
# creating dataframe from API return
import pandas as pd, requests, json # import statements
page = requests
r = requests.get("https://api.census.gov/data/2022/acs/acs1/pums?get=SEX,AGEP,MAR&SCHL=24")
data = r.json() # store response
df = pd.DataFrame(data[1:], columns=data[0]) # create the dataframe, making the first sublist the column headers, and starting with the first row of data to avoid duplicating headers)
df.info() # show output
# renaming columns using a dictionary
df2 = df.rename(columns={'SEX':'sex', 'AGEP': 'age', 'MAR': 'maritalStatus', 'SCHL': 'education'}, axis='columns')
df2 # show output
# renaming columns in place using index labels
df.columns.values[0] = 'sex'
df.columns.values[1] = 'age'
df.columns.values[2] = 'maritalStatus'
df.columns.values[3] = 'education'
df # show output
Reindexing#
We can also imagine a scenario in which you have sliced or filtered the data and now have row index labels that no longer make sense. For example, if your original rows had sequential numerical index labels, the transformed data will retain the original index labels. In these situations, we can reset the index to a simple ascending integer index using .reset_index()
.
df = df.astype(int) # change datatype for all clumns
subset = df[(df['maritalStatus'] == 2) | (df['age'] > 50)] # filter using two conditions and the OR operator
subset.reset_index(inplace=True) # reset index
subset # show output
Working with Large Datasets#
Renaming columns is fairly manageable for our Census Bureau example. Let’s look at the raw data from the City of South bend’s 2022 Digital Literacy Survey, which includes 72 columns.
# creating dataframe from API return
import pandas as pd, requests, json # import statements
page = requests
r = requests.get("https://opendata.arcgis.com/datasets/c97085b608604f5c8c07487c24dcaff4_0/FeatureServer/0/query?outFields=*&where=1%3D1")
data = r.json() # store response
surveyData = [] # empty list for responses
for d in data['features']: # iterate over list of dictionaries
surveyData.append(d['attributes']) # isolate value and append to list
df = pd.DataFrame(surveyData) # create df
df # show output
There are different approaches for relabeling- I’m going to isolate the column labels in a separate metadata DataFrame
, then relabel using my own variable codes.
metadata = pd.DataFrame(df.columns.tolist()) # create new dataframe with existing column labels
metadata.rename(columns={0:'label'}, inplace=True) # rename column
metadata # show output
codes = [] # empty list for new codes
for i in range(0,72):
codes.append("V"+str(i))
metadata['abbreviation'] = codes # add column with codes
metadata # show output
codes = dict(zip(metadata.label, metadata.abbreviation)) # create dictionary from dataframe
codes # show output
df.rename(columns=codes, inplace=True) # rename columns using dictionary
df # show output
In this example, the metadata
DataFrame is crucial to understanding what each variable or column represents, similar to how the Census Bureau API documentation is critical to make senes of that data.
Additional Resources#
For more on renaming, mapping & labels:
For more on indexing: