# Text Processing Workflows

A deep dive into natural language processing is outside the scope of this course, but we'll introduce a few building blocks here for working with text data.

<blockquote>
"Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them" (<a href="https://en.wikipedia.org/wiki/Natural_language_processing">Wikipedia</a>)</blockquote>

## Acknowledgements

The explanations and examples in this section are adapted from dair.ai's "[Fundamentals of NLP](https://dair.ai/notebooks/nlp/2020/03/19/nlp_basics_tokenization_segmentation.html)" resource.
- Elvis Saravia, "Chapter 1 - Tokenization, Lemmatization, Stemming, and Sentence Segmentation" *Fundamentals of NLP* (dair.ai, 19 March 2020)

## Setup

We'll start exploring some of these workflows using Python's [spaCy](https://spacy.io) library.


First, we need to install and load `spaCy`.
- *There will be lots of output; don't panic!*

In [None]:
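#!pip install -q spacy # install library (uncomment if spaCy is not already installed)
!pip install -U spacy-lookups-data # install extra lookup tables (e.g. for lemmatization)
import spacy # import library
!python -m spacy download en_core_web_md # download the medium English pipeline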

In [7]:
nlp = spacy.load('en_core_web_md') # load language model
from spacy import displacy
from spacy.lookups import Lookups

In [3]:
!pip install -q gensim
import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel

## Tokenization

<p align="center"><img class=" size-full wp-image-55 aligncenter" src="https://miro.medium.com/v2/resize:fit:1400/1*PZYP2nL6Zc_jpkaHLRxLQQ.png" alt="Capture_2"  /></p>

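**Tokenization** splits a piece of text into smaller units called **tokens**, typically words and punctuation marks. The simplest approach, used here, is to split a string on spaces.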

In [4]:
doc = "Cheer cheer for old Notre Dame" # string
for i, w in enumerate(doc.split(" ")): # tokenize string by splitting on spaces
    print(f"Token {i}: {w}") # output each token with its position

Token 0: Cheer
Token 1: cheer
Token 2: for
Token 3: old
Token 4: Notre
Token 5: Dame
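
Splitting on spaces works for this sentence because it contains no punctuation; with commas or periods, the punctuation would stay attached to the neighboring word. A minimal sketch of one alternative, using only Python's built-in `re` module, might look like this. The function name `simple_tokenize` and the pattern are illustrative assumptions, not how spaCy tokenizes text.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

for i, token in enumerate(simple_tokenize("Cheer, cheer for old Notre Dame!")):
    print(f"Token {i}: {token}")
```

This pattern is deliberately simple: it would also split a contraction such as "don't" into three tokens, which is one reason real tokenizers carry extra rules.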


## Lemmatization

<p align="center"><img class=" size-full wp-image-55 aligncenter" src="https://d2mk45aasx86xg.cloudfront.net/Example_to_understand_lemmatization_11zon_000b43c193.webp" width=500 alt="Capture_2"  /></p>

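**Lemmatization** reduces each **token** to its **base form** (its *lemma*, the form you would look up in a dictionary). For example, "deeds" becomes "deed" and "are" becomes "be".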

In [None]:
doc = nlp("Our words are buttressed by our deeds, and our deeds are inspired by our convictions.") # string, courtesy of Fr. Hesburgh
for word in doc: # iterate over the tokens in the processed text
    print(word.text, "=>", word.lemma_) # show each token and its lemma

## Stemming

<p align="center"><img class=" size-full wp-image-55 aligncenter" src="https://devopedia.org/images/article/218/8583.1569386710.png" width=500 alt="Capture_2"  /></p>

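**Stemming** strips suffixes from a **token** using fixed rules to approximate the base form it is derived or inflected from. The resulting stem is not always a real word: in the output for this example, "inspired" becomes "inspir".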

In [16]:
from nltk.stem.snowball import SnowballStemmer # import the Snowball stemmer from NLTK
stemmer = SnowballStemmer(language='english') # create an English stemmer
doc = "Our words are buttressed by our deeds, and our deeds are inspired by our convictions." # string, courtesy of Fr. Hesburgh
for token in doc.split(" "): # split on spaces (punctuation stays attached to tokens)
    print(token, '=>' , stemmer.stem(token)) # show each token and its stem

Our => our
words => word
are => are
buttressed => buttress
by => by
our => our
deeds, => deeds,
and => and
our => our
deeds => deed
are => are
inspired => inspir
by => by
our => our
convictions. => convictions.
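
In the output above, `deeds,` and `convictions.` come back unchanged, most likely because the attached punctuation keeps the stemmer's suffix rules from matching. One way to handle this, sketched here with only the standard library, is to strip punctuation from each token before stemming. The helper name `strip_punctuation` is hypothetical and chosen only for illustration.

```python
import string

def strip_punctuation(tokens):
    """Remove leading and trailing punctuation; drop tokens that end up empty."""
    cleaned = (token.strip(string.punctuation) for token in tokens)
    return [token for token in cleaned if token]

text = "Our words are buttressed by our deeds, and our deeds are inspired by our convictions."
tokens = strip_punctuation(text.split(" "))
print(tokens)
```

Each cleaned token could then be passed to `stemmer.stem` in the same loop as before.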


### Lemmatization Versus Stemming

<p align="center"><img class=" size-full wp-image-55 aligncenter" src="https://www.johnsnowlabs.com/wp-content/uploads/2023/08/img_blog_2-2.jpg" width=1000 alt="Capture_2"  /></p>
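
The difference can also be sketched in a few lines of plain Python. The suffix list and the lookup table here are tiny, made-up stand-ins (they are not what Snowball or spaCy actually use), but they illustrate the general idea: a stemmer applies mechanical rules, while a lemmatizer maps each word to a dictionary form.

```python
# Hypothetical rule list, far smaller than a real stemmer's
SUFFIXES = ("ing", "ed", "s")

# Hypothetical lookup table, far smaller than a real lemmatizer's
LEMMAS = {"are": "be", "deeds": "deed", "inspired": "inspire", "better": "good"}

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the table; fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

for word in ["deeds", "inspired", "are", "better"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemma(word)}")
```

In this sketch the two approaches agree on "deeds", but only the lookup handles irregular forms such as "are" and "better", and only the suffix rule produces a non-word such as "inspir".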


## Sentence Segmentation

**Sentence segmentation** splits text into individual sentences by detecting sentence boundaries.

In [17]:
doc = nlp("I love coding and programming. I also love sleeping!") # string
for sent in doc.sents: # segment sentences
    print(sent.text) # show output

I love coding and programming.
I also love sleeping!
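
To see why sentence boundaries are harder than they look, consider a naive rule: split wherever `.`, `!`, or `?` is followed by whitespace. The sketch here uses only Python's `re` module; the function name `naive_sentences` is an illustrative assumption. It handles the example above but trips over an abbreviation, which is part of why a library such as spaCy is preferable to a hand-written rule.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works for simple text
print(naive_sentences("I love coding and programming. I also love sleeping!"))

# Splits incorrectly after the abbreviation "Dr."
print(naive_sentences("Dr. Hesburgh spoke. We listened."))
```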


## Additional Resources

We've already seen some of these workflows in action:
- [Jupyter Notebook](https://colab.research.google.com/drive/10HsDRPknC6EK8WPunl0quRLAVzrLt0Gt?usp=sharing) for Elements I (F23) NLP explorations
- [Jupyter Notebook](https://colab.research.google.com/drive/1GwF-ADakMMJK6r9EohitzvNCKgdEeNzM?usp=sharing) from our South Bend State of the City NLP explorations

Tutorials that are a good starting point:
- ["Fundamentals of NLP" resource this section is based on](https://dair.ai/notebooks/nlp/2020/03/19/nlp_basics_tokenization_segmentation.html)
- [Kaggle tutorial](https://www.kaggle.com/code/astraz93/beginners-tokenization-stemming-and-lemmatization)