.merge()

`.merge()`#

.merge() connects rows in DataFrames based on one or more key fields, similar to SQL JOIN operations.

# general syntax
merged = pd.DataFrame(left, right, on='key', how='right')

Parameters:

left: left DataFrame
right: right DataFrame
on=: common field(s)
how=: type of merge

In a scenario where the columns don’t have common names…

# syntax when labels don't match
merged = pd.DataFrame(left, right, left_on='key', right_on='key')

In a scenario where we want to join on multiple fields…

# syntax for joining on multiple fields
merged = pd.DataFrame(left, right, on=['key1', 'key2'])

As with SQL, we can specify the type of join or merge.

Merge Method	SQL Join Name	Description
`left`	`LEFT OUTER JOIN`	Use keys from left frame only
`right`	`RIGHT OUTER JOIN`	Use keys from right frame only
`outer`	`FULL OUTER JOIN`	Use union of keys from both frames
`inner`	`INNER JOIN`	Use intersection of keys from both frames
`cross`	`CROSS JOIN`	Create the cartesian product of rows of both frames

Let’s work with a few air quality datasets.

air_quality_no2 provies NO₂ values for three measurement stations.
air_quality_pm25 provides PM₂₅ values (particulate matter less than 2.5 micrometers) for the same three measurement stations.
air_quality_stations provides latitude and longitude coordinates for five different measurement stations.
air_quality_parameters provides parameter full description and name for five different element types.

import pandas as pd # import statement

# load data
no2 = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_no2_long.csv", parse_dates=True)
pm25 = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_pm25_long.csv", parse_dates=True)
stations = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_stations.csv")
param = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air_quality_parameters.csv")

Example #1#

Let’s say we wanted to add station location coordinates to corresponding rows in the measurements table. We have this data loaded as stations_coord.

We can merge the stations and no2 data frames based on a common field, location.

merged = pd.merge(no2, stations, on='location', how='left') # left merged based on location
merged # show output

Example #2#

Let’s say we want to add the columns from the parameters table to the pm25 table.

There are no commonly-labeled fields, so we’ll need to use left_on and right_on arguments.

param.head()

	id	description	name
0	bc	Black Carbon	BC
1	co	Carbon Monoxide	CO
2	no2	Nitrogen Dioxide	NO2
3	o3	Ozone	O3
4	pm10	Particulate matter less than 10 micrometers in...	PM10

pm25.head()

	city	country	date.utc	location	parameter	value	unit
0	Antwerpen	BE	2019-06-18 06:00:00+00:00	BETR801	pm25	18.0	µg/m³
1	Antwerpen	BE	2019-06-17 08:00:00+00:00	BETR801	pm25	6.5	µg/m³
2	Antwerpen	BE	2019-06-17 07:00:00+00:00	BETR801	pm25	18.5	µg/m³
3	Antwerpen	BE	2019-06-17 06:00:00+00:00	BETR801	pm25	16.0	µg/m³
4	Antwerpen	BE	2019-06-17 05:00:00+00:00	BETR801	pm25	7.5	µg/m³

merged = pd.merge(param, pm25, left_on='id', right_on='parameter', how='right') # merge
merged # show output

Additional Resources#

For more on different join/merge types:

.merge()

Contents

.merge()#

Example #1#

Example #2#

Additional Resources#

`.merge()`#