.merge()#

.merge() connects rows in DataFrames based on one or more key fields, similar to SQL JOIN operations.

# general syntax
merged = pd.DataFrame(left, right, on='key', how='right')

Parameters:

  • left: left DataFrame

  • right: right DataFrame

  • on=: common field(s)

  • how=: type of merge

In a scenario where the columns don’t have common names…

# syntax when labels don't match
merged = pd.DataFrame(left, right, left_on='key', right_on='key')

In a scenario where we want to join on multiple fields…

# syntax for joining on multiple fields
merged = pd.DataFrame(left, right, on=['key1', 'key2'])

As with SQL, we can specify the type of join or merge.

Merge Method

SQL Join Name

Description

left

LEFT OUTER JOIN

Use keys from left frame only

right

RIGHT OUTER JOIN

Use keys from right frame only

outer

FULL OUTER JOIN

Use union of keys from both frames

inner

INNER JOIN

Use intersection of keys from both frames

cross

CROSS JOIN

Create the cartesian product of rows of both frames

Let’s work with a few air quality datasets.

  • air_quality_no2 provies NO2 values for three measurement stations.

  • air_quality_pm25 provides PM25 values (particulate matter less than 2.5 micrometers) for the same three measurement stations.

  • air_quality_stations provides latitude and longitude coordinates for five different measurement stations.

  • air_quality_parameters provides parameter full description and name for five different element types.

import pandas as pd # import statement

# load data
no2 = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_no2_long.csv", parse_dates=True)
pm25 = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_pm25_long.csv", parse_dates=True)
stations = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_stations.csv")
param = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_parameters.csv")

Example #1#

Let’s say we wanted to add station location coordinates to corresponding rows in the measurements table. We have this data loaded as stations_coord.

We can merge the stations and no2 data frames based on a common field, location.

merged = pd.merge(no2, stations, on='location', how='left') # left merged based on location
merged # show output

Example #2#

Let’s say we want to add the columns from the parameters table to the pm25 table.

There are no commonly-labeled fields, so we’ll need to use left_on and right_on arguments.

param.head()
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
4 pm10 Particulate matter less than 10 micrometers in... PM10
pm25.head()
city country date.utc location parameter value unit
0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³
merged = pd.merge(param, pm25, left_on='id', right_on='parameter', how='right') # merge
merged # show output

Additional Resources#

For more on different join/merge types: