Plotting with seaborn#

As we’ve covered previously, matplotlib gives you a wide range of base components to work with when generating plots. But matplotlib does have limitations: building a plot from the ground up and specifying each component can be cumbersome and can result in significant boilerplate code. Advanced statistical analysis with matplotlib is possible, but it requires the same kind of manual effort.

If you need to engage in advanced statistical analysis beyond what is easily accessible in matplotlib, seaborn is a statistical data visualization library that works well with pandas. “seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics” (“Seaborn”). It works with numpy, scipy, pandas, and matplotlib to provide simplified, high-level functions for common statistical plots.

To install seaborn (run in the terminal):

  • pip install seaborn

  • conda install seaborn

A simple way to incorporate seaborn with matplotlib is to use seaborn’s plot styling. matplotlib also has a few different style sheets based on seaborn.
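As a minimal sketch of both options: the snippet below applies a seaborn theme to an ordinary matplotlib plot and lists the seaborn-derived style sheets that ship with matplotlib. The `whitegrid` style and the plotted numbers are illustrative choices, and the style sheet names depend on the installed matplotlib version (recent versions prefix them with `seaborn-v0_8`).

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Option 1: apply seaborn's styling to every matplotlib plot drawn afterwards
sns.set_theme(style="whitegrid")

# Option 2: look up the seaborn-derived style sheets bundled with matplotlib
# (the exact names depend on the installed matplotlib version)
seaborn_styles = [name for name in plt.style.available if "seaborn" in name]
print(seaborn_styles)

# A plain matplotlib plot now picks up the seaborn theme set above
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax.set_title("matplotlib plot with seaborn styling")
```

Any name printed in the list can be passed to `plt.style.use()` if you want the look without importing seaborn.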

Let’s look at a couple of more sophisticated statistical plots built using seaborn. We can create a plot that highlights the relationship between restaurant bill amount, tip, and meal time. This data comes from the tips example dataset already packaged in seaborn.

import seaborn as sns # import statements
sns.set_theme() # set seaborn theme

tips = sns.load_dataset("tips") # load data
tips # inspect data
# create plot
sns.relplot(
    data=tips, x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

seaborn draws on matplotlib, but we don’t need to directly load or import matplotlib. We load the tips dataset as a DataFrame. We then create a visualization that shows the relationships between the five dataset variables through the single call to the .relplot() function. We can start to think through all of the work happening behind the scenes in matplotlib to generate this visualization:

  • create figure/axes object with a 1, 2 subplot grid (one row, two columns, one subplot per meal time)

  • create subplots

  • set tick values and labels for each axis for both subplots

  • set axis labels and plot titles for both subplots

  • set plot type for each subplot

  • set symbol types, color, and size for each type of datapoint, for each subplot

  • generate legend

And the list goes on. seaborn handles all of those translations from the DataFrame to matplotlib arguments, which simplifies the work of writing code to generate this plot.
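To make that behind-the-scenes work concrete, here is a rough sketch of the steps listed above written directly in matplotlib. It is not seaborn’s actual implementation. The `demo` DataFrame is a synthetic stand-in for the tips data (made-up values, so the snippet runs without downloading anything), and the colors, markers, and size scaling are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small synthetic stand-in for the tips data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "size": rng.integers(1, 7, n),
})
demo["tip"] = 0.15 * demo["total_bill"] + rng.normal(0, 1, n)

# Manual mappings from category to color and marker (arbitrary choices)
colors = {"Yes": "tab:blue", "No": "tab:orange"}
markers = {"Yes": "o", "No": "X"}

# One row, two columns: one subplot per meal time
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, meal in zip(axes, ["Lunch", "Dinner"]):
    for smoker in ["Yes", "No"]:
        subset = demo[(demo["time"] == meal) & (demo["smoker"] == smoker)]
        ax.scatter(
            subset["total_bill"], subset["tip"],
            s=subset["size"] * 15,          # marker area scales with party size
            color=colors[smoker], marker=markers[smoker],
            label=f"smoker = {smoker}",
        )
    ax.set_title(f"time = {meal}")
    ax.set_xlabel("total_bill")
axes[0].set_ylabel("tip")
axes[1].legend()
```

Even this simplified version leaves out the size legend that seaborn generates automatically.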

.relplot()#

seaborn’s .relplot() function is designed to visualize statistical relationships. Sometimes scatterplots are the most effective way to show these relationships. But, in a relationship where one variable is a measure of time, a line can be a more effective representation.

We can use the kind parameter with the .relplot() function to make this change. In the first example below, the size parameter controls line width and the style parameter controls the dash pattern, rather than marker size and marker shape as in the previous example.

A few different relplot examples:

dots = sns.load_dataset('dots') # load data
dots # inspect data
# line plot showing relationships
sns.relplot(
    data=dots, kind="line",
    x="time", y="firing_rate", col="align",
    hue="choice", size="coherence", style="choice",
    facet_kws=dict(sharex=False),
)
fmri = sns.load_dataset("fmri") # load data
fmri # inspect data
# relationship plot that presents average of one variable as a function of other variables
sns.relplot(
    data=fmri, kind="line",
    x="timepoint", y="signal", col="region",
    hue="event", style="event",
)

.lmplot()#

seaborn can also estimate statistical values, using bootstrapping to compute confidence intervals and drawing error bars or bands to show uncertainty. We could go back to our bill and tip data to generate a scatterplot that includes a linear regression model.

sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker") # linear model plot
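To build intuition for the bootstrapping idea, here is a minimal numpy-only sketch that resamples the data with replacement, refits a straight line each time, and reads a 95% interval off the resulting slopes. It illustrates the general technique rather than seaborn’s exact internals (the band in the plot describes the fitted line, not just the slope), and the bill and tip values are synthetic.

```python
import numpy as np

# Synthetic bill/tip pairs (made-up values for illustration)
rng = np.random.default_rng(42)
total_bill = rng.uniform(5, 50, size=200)
tip = 1.0 + 0.1 * total_bill + rng.normal(0, 1, size=200)

# Least-squares line through the full sample
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Bootstrap: refit the line on many resamples drawn with replacement
n_boot = 1000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(total_bill), size=len(total_bill))
    boot_slopes[i] = np.polyfit(total_bill[idx], tip[idx], deg=1)[0]

# 95% percentile interval for the slope
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {slope:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A narrow interval means the resampled fits agree closely; a wide one means the estimate is sensitive to which observations happen to be in the sample.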

.displot()#

We can visualize variable distribution with kernel density estimation.

sns.displot(data=tips, x="total_bill", col="time", kde=True) # distribution plot with kernel density estimation

seaborn can also calculate and plot the empirical cumulative distribution function (ECDF).

sns.displot(data=tips, kind="ecdf", x="total_bill", col="time", hue="smoker", rug=True) # distribution plot with empirical cumulative distribution function
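The ECDF itself is simple to compute by hand, which can help when reading the plot: sort the observations, then pair each one with the proportion of observations at or below it. A minimal sketch with made-up bill amounts (all distinct, to keep the illustration simple):

```python
import numpy as np

# Made-up bill amounts for illustration (all values distinct)
total_bill = np.array([10.5, 24.0, 16.3, 31.8, 18.7, 8.9, 45.2, 20.1])

# Sort the observations; the ECDF at each sorted value is the
# proportion of observations less than or equal to it
x = np.sort(total_bill)
y = np.arange(1, len(x) + 1) / len(x)

for value, proportion in zip(x, y):
    print(f"{value:5.1f} -> {proportion:.3f}")
```

The curve always ends at 1.0, since every observation is less than or equal to the maximum.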

.catplot()#

We can also generate plots that are geared toward categorical data.

A swarm plot is a scatterplot with adjusted point positions on the categorical axis to minimize overlap.

sns.catplot(data=tips, kind="swarm", x="day", y="total_bill", hue="smoker") # categorical swarm plot

We could also display this categorical data using kernel density estimation and a violin plot.

sns.catplot(data=tips, kind="violin", x="day", y="total_bill", hue="smoker", split=True) # categorical violin plot

We could also display this data with a grouped bar chart that shows mean values and confidence intervals for each category.

sns.catplot(data=tips, kind="bar", x="day", y="total_bill", hue="smoker") # categorical bar plot
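The bar heights in a plot like this correspond to a per-group mean, which you can check with a pandas groupby. The sketch below uses a handful of made-up rows shaped like the tips data; the `demo` DataFrame is hypothetical.

```python
import pandas as pd

# Made-up rows in the shape of the tips data
demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "smoker": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "total_bill": [12.0, 18.0, 20.0, 24.0, 30.0, 34.0, 22.0, 28.0],
})

# Mean total_bill per day and smoker group: the values the bar heights represent
group_means = demo.groupby(["day", "smoker"])["total_bill"].mean().unstack()
print(group_means)
```

Groups with no observations show up as missing values in the table, and the corresponding bar is simply absent from the plot.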

Additional Resources#

This is just a taste of how seaborn works to generate more advanced statistical plots. Because the package is built on matplotlib, customizing seaborn plots beyond the defaults requires knowledge of matplotlib functionality and syntax. Dropping down to the matplotlib layer is not always necessary (as shown in these examples), but a solid matplotlib foundation carries over directly when working with seaborn.
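As a brief sketch of what dropping down to the matplotlib layer can look like: figure-level functions such as .relplot() return a FacetGrid object, which exposes both convenience methods and the underlying matplotlib figure and axes. The data, labels, and titles below are made up for illustration, and the `figure` attribute assumes a reasonably recent seaborn release (older versions expose it as `fig`).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (made-up values) so the example runs without downloading anything
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "x": rng.normal(size=80),
    "group": rng.choice(["a", "b"], size=80),
})
demo["y"] = 2 * demo["x"] + rng.normal(size=80)

# Figure-level functions return a FacetGrid that wraps the matplotlib figure
g = sns.relplot(data=demo, x="x", y="y", col="group", col_order=["a", "b"])

# FacetGrid helpers cover common tweaks...
g.set_axis_labels("Predictor", "Response")
g.set_titles("Group {col_name}")

# ...and the underlying matplotlib objects stay available for everything else
g.figure.suptitle("Customized with matplotlib", y=1.03)
for ax in g.axes.flat:
    ax.axhline(0, color="gray", linewidth=1, linestyle="--")
```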

For more on seaborn: