From Table to Database

From Table to Database#

Let’s imagine we want to know how many professional baseball players that were born in Puerto Rico played for teams located in the state of Indiana during the 2016 season.

Our first step is to break down the different components of this research question, with an eye toward what data we would need to answer this question.

  • We would need to have a list of people who played professional baseball during the 2016 season.

  • We would need to know country birthplace information for everyone on that list.

  • We would need to know what professional teams were located in Indiana during the 2016 season.

  • We would need to know which players were on specific teams

Our next step is to possible sources or resources that might provide this data. Baseball Reference, “the complete source for current and historical baseball players, teams, scores and leaders,” seems like a good place to start.

We could start by looking up the records for an individual player to see what information is available. Let’s take a look at the individual player page on Baseball Reference for former Notre Dame baseball and football player Jeff Samardzija.

We can get some pieces of information from this page, like player birthplace and teams played for. But not all of that information is located in a table structure. And the table structure that is present doesn’t lend itself to answering our specific research question.

We could also look at the individual team page for a team located in Indiana. Let’s take a look at the 2019 season team page for the South Bend Cubs.

We can get some pieces of information from this page, like the team location and roster. But again, not all of that information is located in a table structure. And the table structure that is present doesn’t lend itself to answering our specific research questions.

Since there are only 3 affiliated professional teams in Indiana, manually downloading the spreadsheets for these teams for a single season and wrangling the data into the structure needed to answer our specific research question wouldn’t be overly time intensive. But let’s say we might have other research questions down the line, or we might want to make our data available to other researchers.

Only gathering data that responds to our specific research question might save time in the short term but foreclose possibilities long-term. Imagine the types of research questions we could ask and answer with data that covered multiple seasons, for teams in multiple locations.

We could do some automated scraping and manual wrangling to end up with data structures that answer our specific research question but also would work to answer a variety of other research questions.

Let’s imagine…

A players table that includes information about where players were born.

personIDDoBlocationID
11985SanPedrodeMacoris_SanPedrodeMacoris_DO
501969Santurce_PR
631989Bani_Peravia_DO

A teams table that includes information about affiliations.

teamIDseasonaffiliateLeagueaffiliateleagueteamNameclasslocationID
1960_AmericanAssociation_Indianapolis1960NationalPHIAmerican AssociationIndianapolisAAAIndianapolis_IN_US
1960_Midwest_Kokomo1960NationalLADMidwestKokomoDKokomo_IN_US
1961_AmericanAssociation_Indianapolis1961NationalCINAmerican AssociationIndianapolisAAAIndianapolis_IN_US

A locations table that includes additional information about locations.

locationIDcitystatecountryregiongrouplatlon
Abbeville_LA_USAbbevilleLAUSNorthern AmericaNorthern America31.571835-85.250489
Abbeville_SC_USAbbevilleSCUSNorthern AmericaNorthern America34.178172-82.379015
Abbotsford_BC_CAAbbotsfordBCCANorthern AmericaNorthern America49.05-122.3
Aberdeen_MD_USAberdeenMDUSNorthern AmericaNorthern America39.509556-76.16412

A transactions table that includes information about which players played on specific teams in a given season.

personIDteamID
251999_International_Indianapolis
1252013_International_Indianapolis
1252014_International_Indianapolis

Let’s take a look at at Excel workbook that includes these tables. Open the Google Sheets project in a spreadsheet program (Google Sheets, Microsoft Excel, Apple Numbers).

You may need to consult the following resources as needed to understand this data:

So now that we have some structured data, let’s return to our original research question.

  • The players table includes information about where players were born.

  • The teams table includes information about teams.

  • The locations table includes additional location information.

  • The transactions table tells us which players played for which teams.

But the transactions table on its own doesn’t include the information about player birthplace and team location we need to answer the question of how many baseball players born in Puerto Rico played for teams located in the state of Indiana. We could add that location information to the transactions table, but we would end up with significant redundant and duplicate information.

Behold the magic of relational databases!