.merge()
#
.merge()
connects rows in DataFrames based on one or more key fields, similar to SQL JOIN operations.
# general syntax
merged = pd.DataFrame(left, right, on='key', how='right')
Parameters:
left
: leftDataFrame
right
: rightDataFrame
on=
: common field(s)how=
: type of merge
In a scenario where the columns don’t have common names…
# syntax when labels don't match
merged = pd.DataFrame(left, right, left_on='key', right_on='key')
In a scenario where we want to join on multiple fields…
# syntax for joining on multiple fields
merged = pd.DataFrame(left, right, on=['key1', 'key2'])
As with SQL, we can specify the type of join or merge.
Merge Method |
SQL Join Name |
Description |
---|---|---|
|
|
Use keys from left frame only |
|
|
Use keys from right frame only |
|
|
Use union of keys from both frames |
|
|
Use intersection of keys from both frames |
|
|
Create the cartesian product of rows of both frames |
Let’s work with a few air quality datasets.
air_quality_no2
provies NO2 values for three measurement stations.air_quality_pm25
provides PM25 values (particulate matter less than 2.5 micrometers) for the same three measurement stations.air_quality_stations
provides latitude and longitude coordinates for five different measurement stations.air_quality_parameters
provides parameter full description and name for five different element types.
import pandas as pd # import statement
# load data
no2 = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_no2_long.csv", parse_dates=True)
pm25 = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_pm25_long.csv", parse_dates=True)
stations = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_stations.csv")
param = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_parameters.csv")
Example #1#
Let’s say we wanted to add station location coordinates to corresponding rows in the measurements table. We have this data loaded as stations_coord
.
We can merge the stations
and no2
data frames based on a common field, location
.
merged = pd.merge(no2, stations, on='location', how='left') # left merged based on location
merged # show output
Example #2#
Let’s say we want to add the columns from the parameters
table to the pm25
table.
There are no commonly-labeled fields, so we’ll need to use left_on
and right_on
arguments.
param.head()
id | description | name | |
---|---|---|---|
0 | bc | Black Carbon | BC |
1 | co | Carbon Monoxide | CO |
2 | no2 | Nitrogen Dioxide | NO2 |
3 | o3 | Ozone | O3 |
4 | pm10 | Particulate matter less than 10 micrometers in... | PM10 |
pm25.head()
city | country | date.utc | location | parameter | value | unit | |
---|---|---|---|---|---|---|---|
0 | Antwerpen | BE | 2019-06-18 06:00:00+00:00 | BETR801 | pm25 | 18.0 | µg/m³ |
1 | Antwerpen | BE | 2019-06-17 08:00:00+00:00 | BETR801 | pm25 | 6.5 | µg/m³ |
2 | Antwerpen | BE | 2019-06-17 07:00:00+00:00 | BETR801 | pm25 | 18.5 | µg/m³ |
3 | Antwerpen | BE | 2019-06-17 06:00:00+00:00 | BETR801 | pm25 | 16.0 | µg/m³ |
4 | Antwerpen | BE | 2019-06-17 05:00:00+00:00 | BETR801 | pm25 | 7.5 | µg/m³ |
merged = pd.merge(param, pm25, left_on='id', right_on='parameter', how='right') # merge
merged # show output
Additional Resources#
For more on different join/merge types: