{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Text Processing Workflows\n", "\n", "A deep dive into natural language processing is outside the scope of this course, but we'll introduce a few building blocks here for working with text data.\n", "\n", "
\n", "\"Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of \"understanding\" the contents of documents, including the contextual nuances of the language within them\" (Wikipedia)" ], "metadata": { "id": "JJPyblhmrsEt" } }, { "cell_type": "markdown", "source": [ "## Acknowledgements\n", "\n", "The explanations and examples in this section are adapted from the Distributed AI Research Institute's \"[Fundamentals of NLP](https://dair.ai/notebooks/nlp/2020/03/19/nlp_basics_tokenization_segmentation.html)\" resource.\n", "- Elvis Saravia, \"Chapter 1 - Tokenization, Lemmatization, Stemming, and Sentence Segmentation,\" *Fundamentals of NLP* (DAIR, 19 March 2020)" ], "metadata": { "id": "_l2qsG9ZrvkW" } }, { "cell_type": "markdown", "source": [ "## Setup\n", "\n", "We'll start exploring some of these workflows using Python's [spaCy](https://spacy.io) library.\n", "\n", "\n", "First, we need to install and load `spaCy`.\n", "- *There will be lots of output; don't panic!*" ], "metadata": { "id": "IG7kDQIZrx-9" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0WAOxpebM52y" }, "outputs": [], "source": [ "#!pip install -q spacy # install library (uncomment if spaCy is not already installed)\n", "!pip install -U spacy-lookups-data\n", "import spacy # import library\n", "!spacy download en_core_web_md # download the English language model" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "JIV3tjHE5QD5" }, "outputs": [], "source": [ "nlp = spacy.load('en_core_web_md') # load language model\n", "from spacy import displacy\n", "from spacy.lookups import Lookups" ] }, { "cell_type": "code", "source": 
[ "!pip install -q gensim # install gensim library\n", "import gensim # import library\n", "from gensim.corpora import Dictionary # maps tokens to integer ids\n", "from gensim.models import LdaModel # latent Dirichlet allocation topic model" ], "metadata": { "id": "GlxcOfZSbvmv" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Tokenization\n", "\n", "