Filtering

Filtering#

We can use Python’s comparison operators to return rows in our DataFrame that meet specific conditions.

subset = df[df['MAR'] == '1'] # create new dataframe with specific MAR values
subset # show output

In this example, we had to put 1 in quotation marks, because this column is being treated as a string object.

We would need to change the data type to be able to make numeric comparisons or filters.

df = df.astype(int) # change datatype for all clumns
df.info() # show updated technical summary

subset2 = df[df['AGEP'] > 30] # create new df with AGEP values over specific threshold
subset2 # show output

We use brackets ([]) to set a condition rows must meet to be assigned to the new dataframe. If we just wanted to see whether rows meet this condition in the original DataFrame, we could just test for the condition without creating a new DataFrame.

df['AGEP'] > 30 # return boolean values for conditional test

Advanced Filtering#

`.isin()`#

The isin() conditional function on its own would return a True or False value. By nesting the isin() function in brackets ([]), we are filtering rows based on rows that meet the function critera, or return as True from this function.

subset3 = df[df['MAR'].isin([2,3])] # create new df with specific MAR values
subset3 # show output

Boolean Operators#

We could also break out the chained or compound conditional statement using an OR operator, |.

subset4 = df[(df['MAR'] == 2) | (df['AGEP'] > 50)] # filter using two conditions and the OR operator
subset4 # show output

	SEX	PWGTP	MAR	SCHL
1	2	23	2	24
3	1	80	5	24
5	1	107	3	24
8	2	127	5	24
12	1	70	5	24
...	...	...	...	...
44072	1	67	1	24
44073	1	127	1	24
44074	2	127	1	24
44075	2	56	1	24
44078	1	102	1	24

34170 rows × 4 columns

For more on Boolean indexing and the isin() function:

Missing Data#

We could use the .isna() and .notna() funtions to handle missing data.

.notna() is a conditional function that returns True for rows that do not have a Null value. .isna() accomplishes the inverse operation.

For more on missing values and related functions, check out the “Working with missing data” package documentation.

Duplicates#

A useful place to start is identifying and removing any duplicate rows in a dataframe. We can do this using a few key functions. .duplicated() will return a True or False value indicating if a row is a duplicate of a previously occuring row.