{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"toc_visible":true,"authorship_tag":"ABX9TyOTs3nAt5cC2sUcX14ItFdE"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# Filtering\n","\n","We can use Python's comparison operators to return rows in our `DataFrame` that meet specific conditions."],"metadata":{"id":"CrtOEmT1Cqpe"}},{"cell_type":"code","source":["subset = df[df['MAR'] == '1'] # create new dataframe with specific MAR values\n","subset # show output"],"metadata":{"id":"cxhD-n8eDDL2"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["In this example, we had to put `1` in quotation marks, because this column is being treated as a string object.\n","\n","We would need to change the data type to be able to make numeric comparisons or filters."],"metadata":{"id":"OuvaCX2DDfGP"}},{"cell_type":"code","source":["df = df.astype(int) # change datatype for all clumns\n","df.info() # show updated technical summary"],"metadata":{"id":"jadPtagPDt9u"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["subset2 = df[df['AGEP'] > 30] # create new df with AGEP values over specific threshold\n","subset2 # show output"],"metadata":{"id":"L4qQBA9fDVlo"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We use brackets (`[]`) to set a condition rows must meet to be assigned to the new dataframe. If we just wanted to see whether rows meet this condition in the original `DataFrame`, we could just test for the condition without creating a new `DataFrame`."],"metadata":{"id":"LBtJNGi1D_AP"}},{"cell_type":"code","source":["df['AGEP'] > 30 # return boolean values for conditional test"],"metadata":{"id":"wslIhBYeEAKw"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Advanced Filtering"],"metadata":{"id":"yFULDdyaFBbT"}},{"cell_type":"markdown","source":["### `.isin()`\n","\n","The `isin()` conditional function on its own would return a `True` or `False` value. By nesting the `isin()` function in brackets (`[]`), we are filtering rows based on rows that meet the function critera, or return as `True` from this function."],"metadata":{"id":"mV5YNcuvEKpO"}},{"cell_type":"code","source":["subset3 = df[df['MAR'].isin([2,3])] # create new df with specific MAR values\n","subset3 # show output"],"metadata":{"id":"yquy1BbQETaV"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### Boolean Operators\n","\n","We could also break out the chained or compound conditional statement using an `OR` operator, `|`."],"metadata":{"id":"J_jMoMK0Egr-"}},{"cell_type":"code","source":["subset4 = df[(df['MAR'] == 2) | (df['AGEP'] > 50)] # filter using two conditions and the OR operator\n","subset4 # show output"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":424},"id":"1A7a_zDbEliA","executionInfo":{"status":"ok","timestamp":1705958923149,"user_tz":300,"elapsed":124,"user":{"displayName":"Katherine Walden","userId":"17094108395123900917"}},"outputId":"9f64d9b9-c446-4b2d-899f-6dca1a299350"},"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" SEX PWGTP MAR SCHL\n","1 2 23 2 24\n","3 1 80 5 24\n","5 1 107 3 24\n","8 2 127 5 24\n","12 1 70 5 24\n","... ... ... ... ...\n","44072 1 67 1 24\n","44073 1 127 1 24\n","44074 2 127 1 24\n","44075 2 56 1 24\n","44078 1 102 1 24\n","\n","[34170 rows x 4 columns]"],"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
SEXPWGTPMARSCHL
1223224
3180524
51107324
82127524
12170524
...............
44072167124
440731127124
440742127124
44075256124
440781102124
\n","

34170 rows × 4 columns

\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","source":["For more on Boolean indexing and the `isin()` function:\n","- [\"Boolean indexing,\" Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-boolean)\n","- [\"Indexing with isin,\" Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-basics-indexing-isin)"],"metadata":{"id":"3dGFfoP0EkCx"}},{"cell_type":"markdown","source":["### Missing Data\n","\n","We could use the `.isna()` and `.notna()` funtions to handle missing data.\n","\n","`.notna()` is a conditional function that returns `True` for rows that do not have a `Null` value. `.isna()` accomplishes the inverse operation.\n","\n","For more on missing values and related functions, check out [the \"Working with missing data\" package documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data).\n"],"metadata":{"id":"sRhBx0EWFOpc"}},{"cell_type":"markdown","source":["### Duplicates\n","\n","A useful place to start is identifying and removing any duplicate rows in a dataframe. We can do this using a few key functions. `.duplicated()` will return a `True` or `False` value indicating if a row is a duplicate of a previously occuring row.\n","\n","- [.duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)\n","- [.drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)\n"],"metadata":{"id":"7sRTvJePF18R"}}]}