Filtering#
We can use Python’s comparison operators to return rows in our DataFrame
that meet specific conditions.
subset = df[df['MAR'] == '1'] # create new dataframe with specific MAR values
subset # show output
In this example, we had to put 1
in quotation marks, because this column is being treated as a string object.
We would need to change the data type to be able to make numeric comparisons or filters.
df = df.astype(int) # change datatype for all clumns
df.info() # show updated technical summary
subset2 = df[df['AGEP'] > 30] # create new df with AGEP values over specific threshold
subset2 # show output
We use brackets ([]
) to set a condition rows must meet to be assigned to the new dataframe. If we just wanted to see whether rows meet this condition in the original DataFrame
, we could just test for the condition without creating a new DataFrame
.
df['AGEP'] > 30 # return boolean values for conditional test
Advanced Filtering#
.isin()
#
The isin()
conditional function on its own would return a True
or False
value. By nesting the isin()
function in brackets ([]
), we are filtering rows based on rows that meet the function critera, or return as True
from this function.
subset3 = df[df['MAR'].isin([2,3])] # create new df with specific MAR values
subset3 # show output
Boolean Operators#
We could also break out the chained or compound conditional statement using an OR
operator, |
.
subset4 = df[(df['MAR'] == 2) | (df['AGEP'] > 50)] # filter using two conditions and the OR operator
subset4 # show output
SEX | PWGTP | MAR | SCHL | |
---|---|---|---|---|
1 | 2 | 23 | 2 | 24 |
3 | 1 | 80 | 5 | 24 |
5 | 1 | 107 | 3 | 24 |
8 | 2 | 127 | 5 | 24 |
12 | 1 | 70 | 5 | 24 |
... | ... | ... | ... | ... |
44072 | 1 | 67 | 1 | 24 |
44073 | 1 | 127 | 1 | 24 |
44074 | 2 | 127 | 1 | 24 |
44075 | 2 | 56 | 1 | 24 |
44078 | 1 | 102 | 1 | 24 |
34170 rows × 4 columns
For more on Boolean indexing and the isin()
function:
Missing Data#
We could use the .isna()
and .notna()
funtions to handle missing data.
.notna()
is a conditional function that returns True
for rows that do not have a Null
value. .isna()
accomplishes the inverse operation.
For more on missing values and related functions, check out the “Working with missing data” package documentation.
Duplicates#
A useful place to start is identifying and removing any duplicate rows in a dataframe. We can do this using a few key functions. .duplicated()
will return a True
or False
value indicating if a row is a duplicate of a previously occuring row.