From SQLite to Python#

Imagine we want to work with a relational database from within Python.

We could work in SQL using a combination of queries and joins to create a new table with specific fields we want to work with, then export that table as a .csv file and load it into Python. But we also have the option to interact with a SQL-based relational database from within Python, using a couple of key packages.

Establishing a Connection#

The sqlite3 module “provides a SQL interface” from within Python (sqlite3 documentation), letting us establish a connection with a SQL database from within Python and then we can interact with the database using modified SQL commands.

The basic steps for sqlite3:

  • Install and import sqlite3 module

  • Establish a connection with the database

  • Create a cursor object

  • Use the execute() method in combination with the cursor object to execute SQL queries

  • Close the connection using close() when done working with the database

To express this programmatically in Python:

# install sqlite3 module
!pip install sqlite3
# import sqlite3 module
import sqlite3

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# closes the connection
cursor.close()

Modifying SQL Query Syntax#

Now we can use our cursor object to interact with the database using modified SQL syntax. The basic template for sqlite3 syntax:

cursor.execute("SQL QUERY/COMMANDS GO HERE")

ALTERNATE WORKFLOW:

sql_command = """SQL QUERY/COMMANDS GO HERE;"""

cursor.execute(sql_command)

cursor.execute() followed by the SQL command (in quotation marks) lets us use SQL statements on our database from within Python.

  • Alternatively, saving the SQL query as a string (with three double quotation marks and a semi colon) and running cursor.execute() on the string variable.

Check out the sqlite3 documentation to learn more about the module's functionality and syntax.

For example, let’s say we wanted to select the unique player ids from the Player_Birthplaces table in our database. Remember the original SQL syntax for this query:

-- select unique values from table field
SELECT DISTINCT playerID
FROM players;

The modified syntax for this query in Python using sqlite3:

# import sqlite3 module
import sqlite3

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# creates a new variable player ids
player_ids = cursor.execute("SELECT DISTINCT playerID FROM players")

# get the query return
player_id_results = cursor.fetchall()

# print the list of unique player ids now contained in the player_id_results variable
print(player_id_results)

ALTERNATE WORKFLOW:

# import sqlite3 module
import sqlite3

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# save query as string
sql_command = """SELECT playerID FROM birthplaces;"""

# creates a new variable player ids
player_ids = cursor.execute(sql_command)

# get the query return
player_id_results = cursor.fetchall()

# print the list of unique player ids now contained in the player_id_results variable
print(player_id_results)

# closes the connection
cursor.close()

The cursor.execute() function runs the SELECT DISTINCT statement. The new player_ids variable contains the playerID column from the players table as a list in Python.

Another example using the WHERE statement for filtering. The original SQL query:

-- select all values from table where specific condition is met
SELECT *
FROM locations
WHERE country='DO';

This query returns all columns from the locations table where data in the country field is equal to DO. The modified syntax for this query in Python using sqlite3:

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# creates a new variable for query results
do = cursor.execute("SELECT * FROM locations WHERE country='DO'")

# get the query return
results = cursor.fetchall()

# print the new do_players variable
print(results)

# closes the connection
cursor.close()

ALTERNATIVE WORKFLOW

# import sqlite3 module
import sqlite3

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# save query as string
sql_command = """SELECT * FROM locations WHERE country='DO';"""

# creates a new variable player ids
do = cursor.execute(sql_command)

# get the query return
results = cursor.fetchall()

# print the list of unique player ids now contained in the player_id_results variable
print(results)

# closes the connection
cursor.close()

Let’s say we wanted to just see a query result and not load it to a new variable. A modified version of the previous example:

# creates a new variable for query results
cursor.execute("SELECT * FROM locations WHERE country='DO'")

# set a row count variable
count = 0

# for loop that prints row contents and reassigns count value
for row in cursor:
  print(row)
  count = count + 1

# print statement that takes final count value (i.e. the number of rows) and prints that total
print(count, 'rows')

In this example, the for loop iterates over the rows in the database and uses the count variable to track how many rows in the database meet the WHERE condition.

Creating a Pandas DataFrame#

The previous section of the lab included sample code for a query that returns all columns from the locations table where data in the country field is equal to DO.

In this example, the query results are being stored to the results variable as a list with sublists or nested lists. We can use pd.DataFrame to create a Pandas DataFrame from that list with sublists.

# import pandas
import pandas as pd

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# creates a new variable for query results
do = cursor.execute("SELECT * FROM locations WHERE country='DO'")

# get the query return
results = cursor.fetchall()

# closes the connection
cursor.close()

# create dataframe
df = pd.DataFrame(results)

# show df
df

ALTERNATIVE WORKFLOW:

# import sqlite3 module
import sqlite3

# establish connection to database
connection = sqlite3.connect("data.db")

# creates the cursor object
cursor = connection.cursor()

# save query as string
sql_command = """SELECT * FROM locations WHERE country='DO';"""

# creates a new variable player ids
do = cursor.execute(sql_command)

# get the query return
results = cursor.fetchall()

# closes the connection
cursor.close()

# create dataframe
df = pd.DataFrame(results)

# show df
df

From there, a single line of code will save the DataFrame as a .csv file.

# save to csv
df.to_csv("output.csv", index=False)
# runs query
results = cursor.execute(sql_command)

# store column names as list
cols = [column[0] for column in results.description]

# create dataframe with query results and column names
df = pd.DataFrame.from_records(data= results.fetchall(), columns=cols)

# close connection
cursor.close()

Additional Considerations#

Why would we want to work with a relational database from within Python? The short answer is program performance and memory load. Storing your data in an SQLite database and loading query results into Python requires significantly less memory than storing all the database data in Python. Less stored data = lower memory needs = improved program performance.

Interacting with a database from Python can also be a workflow consideration. Imagine your company has data stored in a relational database system (Oracle, AWS, Microsft Access, etc.) and there are specific aggregations or calculations that need to be performed on that data regularly (say, for quarterly or annual reports). Provided the underlying data structure remains largely consistent (this is true for most proprietary/commercial database systems), you could write a program in Python to automatically generate those aggregations and calculations.

Application#

Q6: Take at least 3 of the queries you wrote in previous sections and modify them to run within a Python environment. Include code + comments.

  • For this question, your program needs to:

    • Establish a connection to the database

      • sqlite3.connect()

    • Create the cursor object

      • connection.cursor()

    • Include modified query syntax

      • cursor.execute()

    • Get query return and store to variable

      • cursor.fetchall()

    • Close connection

      • cursor.close()

Q7: For at least one of the Q6 queries, create a Pandas DataFrame from the query results and write to a .csv file. Include code + comments.

  • For this question, your program needs to:

    • Create the dataframe

      • pd.DataFrame()

    • Save as CSV

      • pd.to_csv()