{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Text Processing Workflows\n",
        "\n",
        "A deep dive into natural language processing is outside the scope of this course, but we'll introduce a few building blocks here for working with text data.\n",
        "\n",
        "<blockquote>\n",
        "\"Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of \"understanding\" the contents of documents, including the contextual nuances of the language within them\" (<a href=\"https://en.wikipedia.org/wiki/Natural_language_processing\">Wikipedia</a>)</blockquote>"
      ],
      "metadata": {
        "id": "JJPyblhmrsEt"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Acknowledgements\n",
        "\n",
        "The explanations and examples in this section are adopted from the Distributed AI Research Institute's \"[Fundamentals of NLP](https://dair.ai/notebooks/nlp/2020/03/19/nlp_basics_tokenization_segmentation.html)\" resource.\n",
        "- Eric Saravia, \"Chapter 1 - Tokenization, Lemmatization, Stemming, and Sentence Segmentation\" *Fundamentals of NLP* (DAIR, 19 March 2020)"
      ],
      "metadata": {
        "id": "_l2qsG9ZrvkW"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Setup\n",
        "\n",
        "We'll start explore some of these workflows using Python's [spaCy](https://spacy.io) library.\n",
        "\n",
        "\n",
        "First, we need to install and load `spaCy`.\n",
        "- *There will be lots of output- don't panic!*"
      ],
      "metadata": {
        "id": "IG7kDQIZrx-9"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0WAOxpebM52y"
      },
      "outputs": [],
      "source": [
        "#!pip install -q spacy # install library\n",
        "!pip install -U spacy-lookups-data\n",
        "import spacy # import  library\n",
        "!spacy download en_core_web_md # download program components"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "JIV3tjHE5QD5"
      },
      "outputs": [],
      "source": [
        "nlp = spacy.load('en_core_web_md') # load language model\n",
        "from spacy import displacy\n",
        "from spacy.lookups import Lookups"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q gensim\n",
        "import gensim\n",
        "from gensim.corpora import Dictionary\n",
        "from gensim.models import LdaModel"
      ],
      "metadata": {
        "id": "GlxcOfZSbvmv"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Tokenization\n",
        "\n",
        "<p align=\"center\"><img class=\" size-full wp-image-55 aligncenter\" src=\"https://miro.medium.com/v2/resize:fit:1400/1*PZYP2nL6Zc_jpkaHLRxLQQ.png\" alt=\"Capture_2\"  /></p>\n",
        "\n",
        "**Tokenization** involves extracting **tokens** from a piece of text."
      ],
      "metadata": {
        "id": "XX0qGoaRrwXz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "doc = \"Cheer cheer for old Notre Dame\" # string\n",
        "for i, w in enumerate(doc.split(\" \")): # tokenize string\n",
        "    print(\"Token \" + str(i) + \": \" + w) # output tokens"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EpadC18YsbHm",
        "outputId": "19316701-f808-4474-f9f5-93fc635e2ff6"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Token 0: Cheer\n",
            "Token 1: cheer\n",
            "Token 2: for\n",
            "Token 3: old\n",
            "Token 4: Notre\n",
            "Token 5: Dame\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Lemmamitization\n",
        "\n",
        "<p align=\"center\"><img class=\" size-full wp-image-55 aligncenter\" src=\"https://d2mk45aasx86xg.cloudfront.net/Example_to_understand_lemmatization_11zon_000b43c193.webp\" width=500 alt=\"Capture_2\"  /></p>\n",
        "\n",
        "**Lemmatization** reduces **tokens** to their **base form**."
      ],
      "metadata": {
        "id": "oP8dtgtIrw7-"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lTSgpZXZqRuN"
      },
      "outputs": [],
      "source": [
        "doc = nlp(\"Our words are buttressed by our deeds, and our deeds are inspired by our convictions.\") # string, courtesy of Fr. Hesburgh\n",
        "for word in doc: # iterate over string\n",
        "    print(word.text, \"=>\", word.lemma_) # lemmatize tokens"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Stemming\n",
        "\n",
        "<p align=\"center\"><img class=\" size-full wp-image-55 aligncenter\" src=\"https://devopedia.org/images/article/218/8583.1569386710.png\" width=500 alt=\"Capture_2\"  /></p>\n",
        "\n",
        "**Stemming** determines what base form a **token** is derived or inflected from."
      ],
      "metadata": {
        "id": "lOovciZJvKoX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk.stem.snowball import SnowballStemmer # import statement\n",
        "stemmer = SnowballStemmer(language='english')\n",
        "doc = \"Our words are buttressed by our deeds, and our deeds are inspired by our convictions.\" # string, courtesy of Fr. Hesburgh\n",
        "for token in doc.split(\" \"):\n",
        "    print(token, '=>' , stemmer.stem(token))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HpvSEYVrvwAT",
        "outputId": "1902cd0b-6918-4fd7-fac5-7e61cb855511"
      },
      "execution_count": 16,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Our => our\n",
            "words => word\n",
            "are => are\n",
            "buttressed => buttress\n",
            "by => by\n",
            "our => our\n",
            "deeds, => deeds,\n",
            "and => and\n",
            "our => our\n",
            "deeds => deed\n",
            "are => are\n",
            "inspired => inspir\n",
            "by => by\n",
            "our => our\n",
            "convictions. => convictions.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Lemmatization Versus Stemming\n",
        "\n",
        "<p align=\"center\"><img class=\" size-full wp-image-55 aligncenter\" src=\"https://www.johnsnowlabs.com/wp-content/uploads/2023/08/img_blog_2-2.jpg\" width=1000 alt=\"Capture_2\"  /></p>\n"
      ],
      "metadata": {
        "id": "IiVhoa5Jv_XU"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Sentence Segmentation\n",
        "\n",
        "**Sentence segmentation** breaks up text using sentence boundaries."
      ],
      "metadata": {
        "id": "WIB-lx7DwO6y"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "doc = nlp(\"I love coding and programming. I also love sleeping!\") # string\n",
        "for sent in doc.sents: # segment sentences\n",
        "    print(sent.text) # show output"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UdR_9OY7woUr",
        "outputId": "cb17060d-3d5a-41b4-ac35-4e2929476aca"
      },
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I love coding and programming.\n",
            "I also love sleeping!\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Additional Resources\n",
        "\n",
        "We've already seen some of these workflows in action:\n",
        "- [Jupyter Notebook](https://colab.research.google.com/drive/10HsDRPknC6EK8WPunl0quRLAVzrLt0Gt?usp=sharing) for Elements I (F23) NLP explorations\n",
        "- [Jupyter Notebook](https://colab.research.google.com/drive/1GwF-ADakMMJK6r9EohitzvNCKgdEeNzM?usp=sharing) from our South Bend State of the City NLP explorations\n",
        "\n",
        "Tutorials that are a good starting point:\n",
        "- [\"Fundamentals of NLP\" resource this section is based on](https://dair.ai/notebooks/nlp/2020/03/19/nlp_basics_tokenization_segmentation.html)\n",
        "- [Kaggle tutorial](https://www.kaggle.com/code/astraz93/beginners-tokenization-stemming-and-lemmatization)"
      ],
      "metadata": {
        "id": "RaWitj9swu8b"
      }
    }
  ]
}