Tutorial: Data Visualization in Python with Seaborn

Author: Joanna Yao (xinyao@andrew.cmu.edu)

(Estimated reading time: 20 minutes)

Introduction

This tutorial provides an overview of statistical data visualization in Python using a library called seaborn. Data visualization is an important component of data science, and is usually the first step in a data analysis process after data collection and preprocessing. It gives researchers a quick sense of the distribution of their data and of potential relationships, which helps shed light on directions for further analysis. It is also a powerful yet intuitive way of presenting data and conveying messages to a target audience.

The tutorial will briefly introduce different choices of graphs and explain their advantages, their disadvantages, the scenarios in which they can be used, and ways to code them using seaborn. After the tutorial, readers will have a basic understanding of common data visualization methods and be able to implement and customize them in Python using seaborn.

Overview

The tutorial consists of the following sections:

  1. Installing and Importing Seaborn
  2. Loading the Data
  3. Visualization: 1D Quantitative Data
  4. Visualization: Incorporating Categorical Data into 1D Quantitative Data
  5. Visualization: 1D Categorical Data
  6. Visualization: Incorporating More Dimensions of Categorical Data
  7. Visualization: 2D Quantitative Data
  8. Higher Dimensional Data in General
  9. Seaborn vs Matplotlib vs Other Choices

1. Installing and Importing Seaborn

Seaborn is only supported on Python 3.6+. To check your Python version, run the following code chunk, or copy the command (without the exclamation mark) and run it from the command line.
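As an alternative to the shell command, the version can also be checked from inside Python itself. The snippet below is a minimal sketch using only the standard library; the variable name meets_requirement is ours, not part of the original tutorial.

```python
import sys

# Print the full interpreter version string.
print(sys.version)

# True if this interpreter satisfies the 3.6+ requirement stated above.
meets_requirement = sys.version_info >= (3, 6)
print("Python 3.6+:", meets_requirement)
```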

If you need to install or upgrade Python, check out the official website.

Once the correct Python version has been installed, we can install the library from PyPI or Anaconda by running one of the following two lines of code, or by copying one of the two commands (without the exclamation mark) and running it from the command line.

Required dependencies of seaborn include NumPy, SciPy, pandas, and matplotlib. When seaborn is installed, it will automatically check for these required libraries and install them if needed. If you run into any trouble during installation, visit Installing and getting started on seaborn's official website for more information.

After successful installation, we will load the libraries by running the following code. Following the recommendations on seaborn's official website, we will import both seaborn and its four required dependencies so that their full functionality is available.
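A minimal sketch of such an import cell is shown here. The aliases np, pd, plt and sns are the conventional ones and are assumed throughout the sketches in this tutorial. Printing the versions is an optional extra that helps when an argument behaves differently across releases.

```python
import numpy as np
import scipy
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Record the installed versions of seaborn and its dependencies.
versions = {lib.__name__: lib.__version__
            for lib in (np, scipy, pd, matplotlib, sns)}
for name, version in versions.items():
    print(name, version)
```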

2. Loading the Data

The sample data we will be using in this tutorial is the daily data on the COVID-19 pandemic in California, Pennsylvania and Massachusetts, United States, which is provided by the COVID Tracking Project at The Atlantic under a Creative Commons CC BY 4.0 license.

To keep our demonstration simple, we will retain five variables as shown below, and only keep observations between April 1st, 2020 and December 31st, 2020. For demonstration purposes, we will add another categorical variable, hospitalizedLevel, based on hospitalizedCurrently.
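The preprocessing steps just described can be sketched as follows. This is an illustration only: the raw frame is randomly generated rather than downloaded, the column names state and positiveIncrease are assumptions, and the cut points for hospitalizedLevel (2000 and 5000) are hypothetical, since the tutorial does not state them.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw file (NOT the real COVID Tracking Project data).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-03-01", "2021-01-31", freq="D")
states = ["CA", "PA", "MA"]
raw = pd.DataFrame({
    "date": np.tile(dates, len(states)),
    "state": np.repeat(states, len(dates)),
})
n = len(raw)
raw["positiveIncrease"] = rng.poisson(2000, n)                    # assumed column name
raw["hospitalizedCurrently"] = rng.gamma(2.0, 1500.0, n).round()
raw["deathIncrease"] = rng.gamma(2.0, 25.0, n).round()
raw["unusedColumn"] = 0                                           # a column we will drop

# Keep five variables and the April 1 to December 31, 2020 window (inclusive).
keep = ["date", "state", "positiveIncrease", "hospitalizedCurrently", "deathIncrease"]
in_window = raw["date"].between(pd.Timestamp("2020-04-01"), pd.Timestamp("2020-12-31"))
df = raw.loc[in_window, keep].reset_index(drop=True)

# Derive a categorical variable from a quantitative one (hypothetical thresholds).
df["hospitalizedLevel"] = pd.cut(
    df["hospitalizedCurrently"],
    bins=[-np.inf, 2000, 5000, np.inf],
    labels=["low", "medium", "high"],
)
print(df.shape)
```

With three states and 275 days in the window, the sketch yields the same dimensions as the tutorial's data frame.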

As shown above, our final data frame consists of 825 rows (observations) and 6 columns (variables). The columns contain date, two categorical variables, and three quantitative variables.

Now that we have preprocessed our data, we will move on to using seaborn for visualization.

3. Visualization: 1D Quantitative Data

When we have 1D quantitative data, some information we may be interested in includes its mean (average), median (50% quantile), range, spread, and shape of distribution. Below we will see several methods that focus on different information.

(a) Histogram

A histogram can show the distribution of a quantitative variable. On the x-axis is the variable we are visualizing, whose values are separated into different bins, and on the y-axis is usually the count or density of the observations that fall into each bin.

Let's try to visualize the distribution of deathIncrease, the daily increase in death counts in a state.

By default, histplot puts count on the y-axis. To put density on the y-axis, we specify stat = "density", which normalizes the counts to make the total area of bins equal to 1.

There are two other values we can use for the stat argument: frequency, which is the count of each bin divided by the bin width, and probability, which normalizes the counts to make the total height of bars equal to 1.

We can easily change the width of the bins by specifying bins or binwidth. If bins is set to one value, the histogram uses this value as the total number of bins. If bins is set to a vector or tuple of values, the histogram uses these values as the breaks of the bins.

binwidth allows us to control the width of each bin. If both bins and binwidth are specified, binwidth will override bins.

If we don't want to show all values on the x-axis, we can specify the smallest and the largest bin edges using binrange.

We can overlay a smoothed density curve on the histogram by setting kde = True. Since the curve is a density estimate, the plot is easiest to interpret when we also put density on the y-axis of the histogram.

To change the color of the bins, we can specify color.

For a comprehensive list of arguments for histplot, check out seaborn.histplot.

(b) Density Curve

Similar to histograms, a density curve is useful if we want to see the shape of the distribution. kdeplot generates a kernel density estimate curve using Gaussian kernels for the given data.

We can change the color of the curve using color. To fill the area under the curve, we set fill = True (older seaborn releases call this argument shade, which has since been deprecated). Seaborn will automatically use a fill color with the same hue as the curve.

We can limit the range of x values over which the density curve is evaluated using clip, which takes a (minimum, maximum) pair.

To change the smoothing bandwidth, we use the bw_method argument. This argument will be passed to the gaussian_kde function in SciPy for further calculation. For more information, check out scipy.stats.gaussian_kde.

We can also use bw_adjust to control the level of smoothing. Larger values correspond to more smoothing and vice versa.

The kernel density curve we get from kdeplot estimates the probability density function of the original distribution, but we can also estimate the cumulative distribution function by setting cumulative = True.

We used to be able to choose non-Gaussian kernels for the density estimate, but unfortunately this is no longer supported. If we want to use non-Gaussian kernels, we can use either statsmodels or scikit-learn.

For a comprehensive list of arguments for kdeplot, check out seaborn.kdeplot.

(c) Box Plot

The two methods above are good for visualizing the shape of the distribution, but not statistics such as median and outliers. If we want to visualize the "summary" statistics of a distribution and do not care about the shape, we can resort to box plots.

A box plot shows the "five-number summary" of a distribution, which includes the minimum, the first quartile (25% quantile), the median, the third quartile (75% quantile), and the maximum. It also marks the outliers separately.

In the plot above, the line inside the box is the median. The left and right edges of the box are the first and third quartiles, respectively. The two lines extending out from the box are called whiskers. The left whisker ends at the smallest observation that is not considered an outlier, that is, the smallest value no less than Q1 - 1.5 IQR (the first quartile minus 1.5 times the interquartile range); if there are no low outliers, this is simply the minimum. Similarly, the right whisker ends at the largest observation no greater than Q3 + 1.5 IQR. The points outside the whiskers are outliers.
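The quantities a box plot draws can also be computed by hand. The sketch below does so with NumPy on synthetic stand-in data. It follows the 1.5 IQR rule described above, and its quartiles may differ slightly from the plotted ones if the plotting library uses a different quantile interpolation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tutorial's data frame (NOT the real COVID data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"deathIncrease": rng.gamma(2.0, 25.0, 825).round()})
values = df["deathIncrease"].to_numpy()

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the limits beyond which a point counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
lower_whisker = values[values >= lower_fence].min()
upper_whisker = values[values <= upper_fence].max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, lower_whisker, upper_whisker, len(outliers))
```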

We can flip the box plot easily by specifying the data as argument y, instead of x. Needless to say, we can change the color of the box using color.
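Both orientations can be sketched on synthetic stand-in data as follows. The color "lightgreen" is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the tutorial's data frame (NOT the real COVID data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"deathIncrease": rng.gamma(2.0, 25.0, 825).round()})

fig, (ax_h, ax_v) = plt.subplots(1, 2, figsize=(10, 4))

# Horizontal box plot: the variable is passed as x.
sns.boxplot(data=df, x="deathIncrease", ax=ax_h)

# Vertical box plot in a custom color: the variable is passed as y.
sns.boxplot(data=df, y="deathIncrease", color="lightgreen", ax=ax_v)
```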

For a comprehensive list of arguments for boxplot, check out seaborn.boxplot.

(d) Strip Plot

A strip plot is a variation of a dot plot, which plots every single observation. The density of a value is visualized as the literal density of points on the plot at a given x value. The x-coordinate of each point is its value of the variable on the x-axis, but the y-coordinate is meaningless: the points are spread out along the y-axis only to make them overlap less.

We can control how spread out the points are by setting the jitter argument. jitter = 0 means no spread at all; in this case, all points fall on the same horizontal line. jitter = True is the default and applies a modest built-in amount of jitter (some older releases treat jitter = 1 the same way). Any other number sets the amount of jitter explicitly, so a larger value means points that are more spread out.
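The effect of jitter can be sketched on synthetic stand-in data as follows. The value 0.3 is an arbitrary choice. Reading the point coordinates back from the plot shows that jitter only changes the y positions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the tutorial's data frame (NOT the real COVID data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"deathIncrease": rng.gamma(2.0, 25.0, 825).round()})

fig, (ax_flat, ax_spread) = plt.subplots(2, 1, figsize=(8, 5))

# No jitter: every point sits on one horizontal line.
sns.stripplot(data=df, x="deathIncrease", jitter=0, ax=ax_flat)

# Explicit jitter amount: points are spread out vertically.
sns.stripplot(data=df, x="deathIncrease", jitter=0.3, ax=ax_spread)

y_flat = np.asarray(ax_flat.collections[0].get_offsets())[:, 1]
y_spread = np.asarray(ax_spread.collections[0].get_offsets())[:, 1]
print(y_flat.min(), y_flat.max(), y_spread.min(), y_spread.max())
```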

A strip plot alone doesn't contain much information. However, we can overlay it on a box plot as a complement, since a box plot does not visualize density. In the plot below, the color and transparency of the points in the strip plot are changed using color and alpha so that we can still see the box plot.
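A sketch of such an overlay on synthetic stand-in data is shown here. Both plots are drawn on the same axes, and the colors and the alpha value of 0.3 are arbitrary choices.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tutorial's data frame (NOT the real COVID data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"deathIncrease": rng.gamma(2.0, 25.0, 825).round()})

# Draw the box plot first, then the semi-transparent points on the same axes.
ax = sns.boxplot(data=df, x="deathIncrease", color="lightblue")
sns.stripplot(data=df, x="deathIncrease", color="black", alpha=0.3, ax=ax)
```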

We can still flip the coordinates of a strip plot by specifying the data as the y argument.

For a comprehensive list of arguments for stripplot, check out seaborn.stripplot.

(e) Swarm Plot

A swarm plot is similar to a strip plot, but it spreads the points out more at values with higher densities. Note that swarm plots scale very poorly to large data sets. In the code below, we only plot the first 300 observations. If we include more points than can be placed on the plot, we will receive a warning to either decrease the point size or use a strip plot instead. We can control the point size using the size argument.
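A sketch of the call on synthetic stand-in data follows. The point size of 3 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tutorial's data frame (NOT the real COVID data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"deathIncrease": rng.gamma(2.0, 25.0, 825).round()})

# Plot only the first 300 observations, with smaller-than-default points.
subset = df.head(300)
ax = sns.swarmplot(data=subset, x="deathIncrease", size=3)

n_points = len(np.asarray(ax.collections[0].get_offsets()))
print(n_points)
```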

We can flip the coordinates of a swarm plot or overlay it on top of a box plot using the same methods mentioned earlier.