Plotting with seaborn
#
As we’ve covered previously, matplotlib gives you a wide range of base components to work with when generating plots. But matplotlib does have limitations: building a plot from the ground up, specifying each component, can be cumbersome and can result in significant boilerplate code. Advanced statistical analysis with matplotlib is possible, but it takes a lot of manual work.
If you need to engage in advanced statistical analysis beyond what is easily accessible in matplotlib, seaborn is a statistical data visualization library that works well with pandas. “seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics” (“Seaborn”). It works with numpy, scipy, pandas, and matplotlib to provide simplified, high-level functions for common statistical plots.
To install seaborn, run one of the following in the terminal (pip or conda, depending on your environment):
pip install seaborn
conda install seaborn
A simple way to incorporate seaborn with matplotlib is to use seaborn’s plot styling. matplotlib also has a few different style sheets based on seaborn.
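As a minimal sketch of the styling approach, the block below applies a seaborn theme and then draws an ordinary matplotlib line plot. The `whitegrid` style, the data values, and the `Agg` backend (used so the script runs without a display) are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Apply seaborn's styling to every matplotlib plot created afterwards
sns.set_theme(style="whitegrid")

# A plain matplotlib plot; the values are made up for illustration
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("matplotlib plot with seaborn styling")

# The whitegrid style switches grid lines on in matplotlib's settings
grid_enabled = plt.rcParams["axes.grid"]
```

matplotlib’s own seaborn-derived style sheets can be listed with `plt.style.available`; their exact names vary by matplotlib version, so check that list before passing a name to `plt.style.use()`.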
Let’s look at a couple of more sophisticated statistical plots built using seaborn. We can create a plot that highlights the relationship between restaurant bill amount, tip, and meal time. This data comes from the tips example dataset already packaged in seaborn.
import seaborn as sns # import statements
sns.set_theme() # set seaborn theme
tips = sns.load_dataset("tips") # load data
tips # inspect data
sns.relplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker", style="smoker", size="size") # create plot
seaborn draws on matplotlib, but we don’t need to directly load or import matplotlib. We load the tips dataset as a DataFrame. We then create a visualization that shows the relationships among five variables in the dataset through a single call to the .relplot() function. We can start to think through all of the work happening behind the scenes in matplotlib to generate this visualization:
create figure/axes object with a 1, 2 subplot grid (one row, with one column for each value of time)
create subplots
set tick values and labels for each axis for both subplots
set axis labels and plot titles for both subplots
set plot type for each subplot
set symbol types, color, and size for each type of datapoint, for each subplot
generate legend
And the list goes on… seaborn handles all of those translations from the DataFrame to matplotlib arguments. This simplifies the work of writing code to generate this plot.
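To make the list above concrete, here is a rough sketch of part of that work done by hand in matplotlib. It is not what seaborn runs internally. It assumes a few made-up rows with the same column names as the tips dataset, and it only handles the `time` panels and the `smoker` colors, leaving out marker style and size.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows (NOT the real tips data), using the same column names
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
})

times = ["Lunch", "Dinner"]
colors = {"Yes": "tab:blue", "No": "tab:orange"}  # arbitrary color choice

# One row of subplots, with one column per value of "time"
fig, axes = plt.subplots(1, len(times), sharex=True, sharey=True, figsize=(8, 4))

for ax, time in zip(axes, times):
    panel = df[df["time"] == time]
    # Draw each smoker group separately so it gets its own color and legend entry
    for smoker, group in panel.groupby("smoker"):
        ax.scatter(group["total_bill"], group["tip"],
                   color=colors[smoker], label=smoker)
    ax.set_title(f"time = {time}")
    ax.set_xlabel("total_bill")

axes[0].set_ylabel("tip")
axes[-1].legend(title="smoker")
```

Even this reduced version needs explicit loops, a color mapping, and per-axes labeling, which is the bookkeeping a single `relplot` call takes over.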
.relplot()
#
seaborn’s .relplot() function is designed to visualize statistical relationships. Sometimes scatterplots are the most effective way to show these relationships. But in a relationship where one variable is a measure of time, a line can be a more effective representation.
We can use the kind parameter with the .relplot() function to make this change. In the first example below, the size parameter controls line width and the style parameter controls the dash pattern, rather than marker size and marker shape as in the previous example.
A few different relplot examples.
dots = sns.load_dataset('dots') # load data
dots # inspect data
# line plot showing relationships
sns.relplot(
data=dots, kind="line",
x="time", y="firing_rate", col="align",
hue="choice", size="coherence", style="choice",
facet_kws=dict(sharex=False),
)
fmri = sns.load_dataset("fmri") # load data
fmri # inspect data
# relationship plot that presents average of one variable as a function of other variables
sns.relplot(
data=fmri, kind="line",
x="timepoint", y="signal", col="region",
hue="event", style="event",
)
.lmplot()
#
seaborn estimates statistical values using bootstrapping to compute confidence intervals, and it draws error bars or shaded bands to show uncertainty. We could go back to our bill and tip data to generate a scatterplot that includes a linear regression model.
sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker") # linear model plot
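The bootstrapping idea can be illustrated with numpy alone. The following is a minimal sketch of a percentile bootstrap for a regression slope on made-up data; it is not seaborn's internal implementation, and the sample size, noise level, and number of resamples are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y depends linearly on x, plus random noise
x = rng.uniform(5, 50, size=100)
y = 1.0 + 0.1 * x + rng.normal(0, 1, size=100)

# Point estimate of the slope from an ordinary least-squares line
slope, intercept = np.polyfit(x, y, deg=1)

# Bootstrap: refit the line on many resampled sets of (x, y) pairs
n_boot = 1000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(x), size=len(x))  # sample rows with replacement
    boot_slopes[i] = np.polyfit(x[idx], y[idx], deg=1)[0]

# 95% percentile interval for the slope
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {slope:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The spread of the refitted slopes stands in for the sampling uncertainty of the estimate, which is the same intuition behind the shaded band around the regression line.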
We can visualize variable distribution with kernel density estimation.
sns.displot(data=tips, x="total_bill", col="time", kde=True) # distribution plot with kernel density estimation
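Kernel density estimation smooths a set of observations into a continuous curve by centering a small "bump" on each observation and averaging the bumps. Here is a minimal numpy sketch with a Gaussian kernel. The helper name `gaussian_kde_1d`, the sample values, and the fixed bandwidth are assumptions for illustration; seaborn chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Made-up bill amounts
bills = np.array([10.3, 12.5, 16.9, 18.2, 21.0, 23.7, 24.6, 30.1])

grid = np.linspace(-20, 70, 901)
density = gaussian_kde_1d(bills, grid, bandwidth=3.0)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
```

A smaller bandwidth gives a bumpier curve that follows individual observations; a larger one gives a smoother curve that can hide structure.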
.displot()
#
seaborn can also calculate and plot the empirical cumulative distribution function (ecdf).
sns.displot(data=tips, kind="ecdf", x="total_bill", col="time", hue="smoker", rug=True) # distribution plot with empirical cumulative distribution function
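The empirical cumulative distribution function is simple to compute by hand: sort the values, then assign each one the proportion of observations at or below it. A minimal sketch, with a hypothetical helper `ecdf` and made-up values:

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up bill amounts
bills = [24.6, 10.3, 16.9, 21.0]
x, y = ecdf(bills)
# x is [10.3, 16.9, 21.0, 24.6] and y is [0.25, 0.5, 0.75, 1.0]
```

Plotting `y` against `x` as a step function gives the same kind of curve that `kind="ecdf"` draws.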
.catplot()
#
We can also generate plots that are geared toward categorical data.
A swarm plot is a scatterplot with adjusted point positions on the categorical axis to minimize overlap.
sns.catplot(data=tips, kind="swarm", x="day", y="total_bill", hue="smoker") # categorical swarm plot
We could also display this categorical data using kernel density estimation and a violin plot.
sns.catplot(data=tips, kind="violin", x="day", y="total_bill", hue="smoker", split=True) # categorical violin plot
We could also display this data with a grouped bar chart that shows mean values and confidence intervals for each category.
sns.catplot(data=tips, kind="bar", x="day", y="total_bill", hue="smoker") # categorical bar plot
Additional Resources
#
This is just a taste of how seaborn works to generate more advanced statistical plots. Because the package is built on matplotlib, customizing seaborn plots often requires knowledge of matplotlib functionality and syntax. Dropping down to the matplotlib layer is not always necessary (as shown in these examples), but a robust matplotlib foundation transfers directly to working with seaborn.
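As a brief sketch of what dropping down to the matplotlib layer can look like: figure-level functions such as `catplot` return a `FacetGrid`, and for a single-panel plot its `ax` attribute is an ordinary matplotlib `Axes`. The data and label text below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up stand-in rows (NOT the real tips data)
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
})

# seaborn builds the plot and returns a FacetGrid
g = sns.catplot(data=df, kind="bar", x="day", y="total_bill")

# seaborn helper for relabeling the axes
g.set_axis_labels("Day of week", "Mean total bill")

# Plain matplotlib calls on the underlying Axes
g.ax.set_title("Average bill by day")
g.ax.tick_params(axis="x", rotation=45)
```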
For more on seaborn: