From Table to Database#
Let’s imagine we want to know how many professional baseball players that were born in Puerto Rico played for teams located in the state of Indiana during the 2016 season.
Our first step is to break down the different components of this research question, with an eye toward what data we would need to answer this question.
We would need to have a list of people who played professional baseball during the 2016 season.
We would need to know country birthplace information for everyone on that list.
We would need to know what professional teams were located in Indiana during the 2016 season.
We would need to know which players were on specific teams
Our next step is to possible sources or resources that might provide this data. Baseball Reference, “the complete source for current and historical baseball players, teams, scores and leaders,” seems like a good place to start.
We could start by looking up the records for an individual player to see what information is available. Let’s take a look at the individual player page on Baseball Reference for former Notre Dame baseball and football player Jeff Samardzija.
We can get some pieces of information from this page, like player birthplace and teams played for. But not all of that information is located in a table structure. And the table structure that is present doesn’t lend itself to answering our specific research question.
We could also look at the individual team page for a team located in Indiana. Let’s take a look at the 2019 season team page for the South Bend Cubs.
We can get some pieces of information from this page, like the team location and roster. But again, not all of that information is located in a table structure. And the table structure that is present doesn’t lend itself to answering our specific research questions.
Since there are only 3 affiliated professional teams in Indiana, manually downloading the spreadsheets for these teams for a single season and wrangling the data into the structure needed to answer our specific research question wouldn’t be overly time intensive. But let’s say we might have other research questions down the line, or we might want to make our data available to other researchers.
Only gathering data that responds to our specific research question might save time in the short term but foreclose possibilities long-term. Imagine the types of research questions we could ask and answer with data that covered multiple seasons, for teams in multiple locations.
We could do some automated scraping and manual wrangling to end up with data structures that answer our specific research question but also would work to answer a variety of other research questions.
Let’s imagine…
A players
table that includes information about where players were born.
personID | DoB | locationID |
---|---|---|
1 | 1985 | SanPedrodeMacoris_SanPedrodeMacoris_DO |
50 | 1969 | Santurce_PR |
63 | 1989 | Bani_Peravia_DO |
A teams
table that includes information about affiliations.
teamID | season | affiliateLeague | affiliate | league | teamName | class | locationID |
---|---|---|---|---|---|---|---|
1960_AmericanAssociation_Indianapolis | 1960 | National | PHI | American Association | Indianapolis | AAA | Indianapolis_IN_US |
1960_Midwest_Kokomo | 1960 | National | LAD | Midwest | Kokomo | D | Kokomo_IN_US |
1961_AmericanAssociation_Indianapolis | 1961 | National | CIN | American Association | Indianapolis | AAA | Indianapolis_IN_US |
A locations
table that includes additional information about locations.
locationID | city | state | country | region | group | lat | lon |
---|---|---|---|---|---|---|---|
Abbeville_LA_US | Abbeville | LA | US | Northern America | Northern America | 31.571835 | -85.250489 |
Abbeville_SC_US | Abbeville | SC | US | Northern America | Northern America | 34.178172 | -82.379015 |
Abbotsford_BC_CA | Abbotsford | BC | CA | Northern America | Northern America | 49.05 | -122.3 |
Aberdeen_MD_US | Aberdeen | MD | US | Northern America | Northern America | 39.509556 | -76.16412 |
A transactions
table that includes information about which players played on specific teams in a given season.
personID | teamID |
---|---|
25 | 1999_International_Indianapolis |
125 | 2013_International_Indianapolis |
125 | 2014_International_Indianapolis |
Let’s take a look at at Excel workbook that includes these tables. Open the Google Sheets project in a spreadsheet program (Google Sheets, Microsoft Excel, Apple Numbers).
You may need to consult the following resources as needed to understand this data:
So now that we have some structured data, let’s return to our original research question.
The
players
table includes information about where players were born.The
teams
table includes information about teams.The
locations
table includes additional location information.The
transactions
table tells us which players played for which teams.
But the transactions
table on its own doesn’t include the information about player birthplace and team location we need to answer the question of how many baseball players born in Puerto Rico played for teams located in the state of Indiana. We could add that location information to the transactions
table, but we would end up with significant redundant and duplicate information.
Behold the magic of relational databases!