{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"toc_visible":true,"authorship_tag":"ABX9TyOTs3nAt5cC2sUcX14ItFdE"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# Filtering\n","\n","We can use Python's comparison operators to return only the rows in our `DataFrame` that meet specific conditions."],"metadata":{"id":"CrtOEmT1Cqpe"}},{"cell_type":"code","source":["subset = df[df['MAR'] == '1'] # create new dataframe with specific MAR values\n","subset # show output"],"metadata":{"id":"cxhD-n8eDDL2"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["In this example, we had to put `1` in quotation marks because the values in this column are being treated as strings.\n","\n","To make numeric comparisons or apply numeric filters, we first need to change the data type."],"metadata":{"id":"OuvaCX2DDfGP"}},{"cell_type":"code","source":["df = df.astype(int) # change datatype for all columns\n","df.info() # show updated technical summary"],"metadata":{"id":"jadPtagPDt9u"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["subset2 = df[df['AGEP'] > 30] # create new df with AGEP values over specific threshold\n","subset2 # show output"],"metadata":{"id":"L4qQBA9fDVlo"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We use brackets (`[]`) to set the condition that rows must meet to be included in the new `DataFrame`. 
If we only wanted to see which rows in the original `DataFrame` meet this condition, we could test the condition on its own, without creating a new `DataFrame`."],"metadata":{"id":"LBtJNGi1D_AP"}},{"cell_type":"code","source":["df['AGEP'] > 30 # return boolean values for conditional test"],"metadata":{"id":"wslIhBYeEAKw"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Advanced Filtering"],"metadata":{"id":"yFULDdyaFBbT"}},{"cell_type":"markdown","source":["### `.isin()`\n","\n","On its own, the `isin()` conditional function returns a `True` or `False` value for each row, indicating whether the row's value appears in a given list. By nesting the `isin()` function in brackets (`[]`), we keep only the rows that meet the function criteria, that is, the rows for which the function returns `True`."],"metadata":{"id":"mV5YNcuvEKpO"}},{"cell_type":"code","source":["subset3 = df[df['MAR'].isin([2,3])] # create new df with specific MAR values\n","subset3 # show output"],"metadata":{"id":"yquy1BbQETaV"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### Boolean Operators\n","\n","We can also combine multiple conditions into a compound conditional statement using the `OR` operator, `|`. Each condition needs its own set of parentheses."],"metadata":{"id":"J_jMoMK0Egr-"}},{"cell_type":"code","source":["subset4 = df[(df['MAR'] == 2) | (df['AGEP'] > 50)] # filter using two conditions and the OR operator\n","subset4 # show output"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":424},"id":"1A7a_zDbEliA","executionInfo":{"status":"ok","timestamp":1705958923149,"user_tz":300,"elapsed":124,"user":{"displayName":"Katherine Walden","userId":"17094108395123900917"}},"outputId":"9f64d9b9-c446-4b2d-899f-6dca1a299350"},"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":["       SEX  PWGTP  MAR  SCHL\n","1        2     23    2    24\n","3        1     80    5    24\n","5        1    107    3    24\n","8        2    127    5    24\n","12       1     70    5    24\n","...    ...    ...  ...   ...\n","44072    1     67    1    24\n","44073    1    127    1    24\n","44074    2    127    1    24\n","44075    2     56    1    24\n","44078    1    102    1    24\n","\n","[34170 rows x 4 columns]"],"text/html":["\n","  <div id=\"df-61b577c6-499f-4a3e-9dc0-246207433dd7\" class=\"colab-df-container\">\n","    <div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>SEX</th>\n","      <th>PWGTP</th>\n","      <th>MAR</th>\n","      <th>SCHL</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>1</th>\n","      <td>2</td>\n","      <td>23</td>\n","      <td>2</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>1</td>\n","      <td>80</td>\n","      <td>5</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>1</td>\n","      <td>107</td>\n","      <td>3</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>8</th>\n","      <td>2</td>\n","      <td>127</td>\n","      <td>5</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>12</th>\n","      <td>1</td>\n","      <td>70</td>\n","      <td>5</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>...</th>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","    </tr>\n","    <tr>\n","      <th>44072</th>\n","      <td>1</td>\n","      <td>67</td>\n","      <td>1</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>44073</th>\n","      <td>1</td>\n","      <td>127</td>\n","      <td>1</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>44074</th>\n","      <td>2</td>\n","      <td>127</td>\n","      <td>1</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>44075</th>\n","      <td>2</td>\n","      <td>56</td>\n","      <td>1</td>\n","      <td>24</td>\n","    </tr>\n","    <tr>\n","      <th>44078</th>\n","      <td>1</td>\n","      <td>102</td>\n","      <td>1</td>\n","      <td>24</td>\n","    </tr>\n","  </tbody>\n","</table>\n","<p>34170 rows × 4 columns</p>\n","</div>\n","  </div>\n"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","source":["For more on Boolean indexing and the `isin()` function:\n","- [\"Boolean indexing,\" Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-boolean)\n","- [\"Indexing with isin,\" Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-basics-indexing-isin)"],"metadata":{"id":"3dGFfoP0EkCx"}},{"cell_type":"markdown","source":["### Missing Data\n","\n","We can use the `.isna()` and `.notna()` functions to identify missing data.\n","\n","`.notna()` is a conditional function that returns `True` for values that are not missing (not `NaN` or `None`). `.isna()` performs the inverse operation, returning `True` for missing values.\n","\n","For more on missing values and related functions, check out [the \"Working with missing data\" package documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data).\n"],"metadata":{"id":"sRhBx0EWFOpc"}},{"cell_type":"markdown","source":["### Duplicates\n","\n","A useful place to start is identifying and removing any duplicate rows in a `DataFrame`. We can do this using two key functions. `.duplicated()` returns a `True` or `False` value for each row, indicating whether the row is a duplicate of a previously occurring row. `.drop_duplicates()` returns a `DataFrame` with the duplicate rows removed.\n","\n","- [.duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)\n","- [.drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)\n"],"metadata":{"id":"7sRTvJePF18R"}}]}