Working With Documents
If you took Elements of Computing I with Prof. Walden, you had some exposure to how we can start to work with unstructured text, looking at things like term relationships, frequency, etc. Web scraping is a great workflow for information that is online in web pages. But what about text information contained in documents?
Enter Optical Character Recognition, or OCR.
“Optical character recognition (OCR) is sometimes referred to as text recognition. An OCR program extracts and repurposes data from scanned documents, camera images and image-only pdfs. OCR software singles out letters on the image, puts them into words and then puts the words into sentences, thus enabling access to and editing of the original content…OCR systems use a combination of hardware and software to convert physical, printed documents into machine-readable text” (IBM Cloud Education, “What is Optical Character Recognition?”, 5 January 2022)
OCR is an example of computer vision in action. A full exploration of OCR or computer vision is outside the scope of this course. We’re going to focus on some Python workflows for extracting text and data from PDF documents that already contain text.
We’re going to be using a couple of Python libraries built for extracting text and tables from PDF files: pypdf for text and tabula-py for tables.
And we’ll be testing out these workflows on a PDF file from the City of South Bend’s Document Search Center. I’m looking at the Inclusive Procurement & Contracting Board’s 2020 report.
Note: I renamed this document pdf20.pdf for the next section of the lab.
This file on GitHub:
Python Workflows
We’ll start by installing and importing these packages.
!pip install pypdf tabula-py # install packages
# import statements
from pypdf import PdfReader
import tabula
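One thing to know before going further: tabula-py is a wrapper around a Java program, so it needs Java installed on your machine. Here is a minimal sketch for checking that up front. The function name java_available is made up for this example, and the check only looks for a java executable on your PATH, so treat it as a rough first test rather than a guarantee.

```python
import shutil


def java_available():
    """Return True if a `java` executable is found on the system PATH."""
    return shutil.which("java") is not None


# Warn early, before any tabula call has a chance to fail
if not java_available():
    print("Java was not found on PATH; tabula-py will likely fail until it is installed.")
```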
Next, we’ll use PdfReader to extract text from a single page. Page indexes start at 0, so reader.pages[2] is the third page of the document.
reader = PdfReader("pdf20.pdf") # load renamed file as reader object
page = reader.pages[2] # isolate single page
print(page.extract_text()) # show extracted text output
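Since this workflow only works on PDFs that already contain text, it can help to check which pages actually return any. The sketch below is one way to do that. Both function names (pages_with_text and load_page_texts) are hypothetical helpers written for this example, not part of pypdf, and the sample list is made-up text standing in for real extract_text() output.

```python
def pages_with_text(page_texts):
    """Return the indexes of pages whose extracted text is not blank."""
    return [i for i, text in enumerate(page_texts) if (text or "").strip()]


def load_page_texts(pdf_path):
    """Extract text from every page of a PDF (hypothetical helper)."""
    # Imported here so pages_with_text() can be tried without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


# Made-up page text standing in for real extract_text() output
sample = ["Annual report", "", "   \n", "Table of contents"]
print(pages_with_text(sample))  # prints [0, 3]
```

With a real file, pages_with_text(load_page_texts("pdf20.pdf")) gives the same kind of list. If it comes back empty, the document is probably made of scanned images and would need OCR first.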
To iterate over all pages in the document and write the contents to a .txt file:
with open("output.txt", "w", encoding="utf-8") as f: # create (or overwrite) the output file
    for p in reader.pages: # iterate over pages
        f.write(p.extract_text() + "\n") # extract text and write to file, ending each page with a newline
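When all the pages land in one file, it gets hard to tell where one page ends and the next begins. A small sketch of one option is to add a marker line before each page. The function name label_pages and the "--- Page N ---" marker format are assumptions made for this example; pick whatever marker suits your document.

```python
def label_pages(page_texts):
    """Join page texts into one string, with a marker line before each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Strip stray whitespace so each page starts right under its marker
        parts.append(f"--- Page {number} ---\n{(text or '').strip()}\n")
    return "\n".join(parts)


# Made-up page text standing in for real extract_text() output
sample = ["First page text", "Second page text"]
print(label_pages(sample))

# With a real PdfReader object named reader, the same idea looks like:
# with open("output_labeled.txt", "w", encoding="utf-8") as f:
#     f.write(label_pages(p.extract_text() for p in reader.pages))
```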
Folks may be wondering how we might handle tables or tabular data in a PDF file. We can use tabula-py’s read_pdf() function to extract all tables in our document as a list of Pandas DataFrames.
dfs = tabula.read_pdf("pdf20.pdf", pages="all") # load file, isolate dataframes
dfs[5].to_csv("output.csv", index=False) # write single dataframe to CSV file
dfs[5] # show single dataframe
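The index 5 above is specific to this document, and another PDF may yield more or fewer tables. Here is a rough sketch for saving every table instead of one. The function name save_tables, the tables folder, and the table_<n>.csv naming scheme are all assumptions made for this example.

```python
from pathlib import Path


def save_tables(dfs, out_dir="tables"):
    """Write each non-empty DataFrame in dfs to its own CSV file.

    Returns the list of file paths written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for i, df in enumerate(dfs):
        if df.empty:  # skip tables that have no rows or no columns
            continue
        target = out_path / f"table_{i}.csv"
        df.to_csv(target, index=False)
        written.append(target)
    return written


# With the list returned by tabula.read_pdf(...):
# for i, df in enumerate(dfs):
#     print(i, df.shape)  # index plus (rows, columns) for each table
# save_tables(dfs)
```

Printing each table's shape first is a quick way to spot which index holds the table you actually want.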
There’s more we would probably want to do with this output, like cleaning up column names or dropping empty rows, but this gets us started.
Application
Q5A: Apply this workflow to another document from the City of South Bend Document Search Center.
Note: you’ll need to find a document that already includes text, not one made up of scanned images.
Q5B: Briefly skim the document. What kinds of information does it contain?
Q5C: Write a Python program that extracts text from the document and writes the output to a .txt file, using the workflows covered in this lab.
Your answer to this question should include the program plus comments that document your process and explain your code.
Q5D: Write a Python program that extracts tables from the document and writes one of these tables to a .csv file, using the Pandas workflow covered in this lab.
Your answer to this question should include the program plus comments that document your process and explain your code.
Q5E: What challenges did you encounter? How did you address or solve them?