(Estimated reading time: 20 minutes)

This tutorial provides an overview of statistical data visualization in Python using a library called seaborn. Data visualization is an important component of data science, and is usually the first step in a data analysis process after data collection and preprocessing. It gives researchers a quick sense of the distribution of their data and of potential relationships, which helps shed light on directions for further analysis. It is also a powerful yet intuitive way of presenting data and conveying messages to any target audience.

The tutorial will briefly introduce different choices of graphs, explain their advantages and disadvantages and the scenarios in which they can be used, and show how to code them using seaborn. After the tutorial, readers will have a basic understanding of common data visualization methods and be able to implement and customize them in Python using seaborn.

The tutorial consists of the following sections:

- Installing and Importing Seaborn
- Loading the Data
- Visualization: 1D Quantitative Data
- Visualization: Incorporating Categorical Data into 1D Quantitative Data
- Visualization: 1D Categorical Data
- Visualization: Incorporating More Dimensions of Categorical Data
- Visualization: 2D Quantitative Data
- Higher Dimensional Data in General
- Seaborn vs Matplotlib vs Other Choices

Seaborn is only supported on Python 3.6+. To check your Python version, run the following code chunk, or copy the command (without the exclamation mark) and run it from the command line.

In [3]:

```
!python -V
```

Python 3.8.8

If you need to install or upgrade Python, check out the official website.
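The same check can also be done from inside Python, which avoids depending on which `python` executable the shell happens to find. A minimal sketch using only the standard library; the 3.6 threshold is the one stated above, and the name `MIN_VERSION` is just illustrative:

```python
import sys

# Minimum version stated in this tutorial for seaborn support
MIN_VERSION = (3, 6)

# sys.version_info compares element-wise, like a tuple
is_supported = sys.version_info >= MIN_VERSION

print("Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Meets the 3.6+ requirement:", is_supported)
```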

Once the correct Python version has been installed, we can install the library from `PyPI` or `Anaconda` by running one of the following two lines of code, or by copying one of the two commands (without the exclamation mark) and running it from the command line.

In [ ]:

```
# installing using pip
!pip install seaborn
# installing using conda
#!conda install seaborn
```

Required dependencies of seaborn include NumPy, SciPy, pandas, and matplotlib. When seaborn is installed, it will automatically check for these required libraries and install them if needed. If you run into any trouble during installation, visit Installing and getting started on seaborn's official website for more information.

After successful installation, we will load the libraries by running the following code. Following the recommendations on seaborn's official website, we will import seaborn together with NumPy, pandas, and matplotlib, which we will use alongside it throughout the tutorial.

In [2]:

```
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

The sample data we will be using in this tutorial is the daily data on the COVID-19 pandemic in California, Pennsylvania, and Massachusetts, United States, which is provided by the COVID Tracking Project at The Atlantic under a Creative Commons CC BY 4.0 license.

In [3]:

```
# read the data from the csv files
covid_ca = pd.read_csv("https://covidtracking.com/data/download/california-history.csv")
covid_pa = pd.read_csv("https://covidtracking.com/data/download/pennsylvania-history.csv")
covid_ma = pd.read_csv("https://covidtracking.com/data/download/massachusetts-history.csv")
# concatenate the three data frames
covid = pd.concat([covid_ca, covid_pa, covid_ma])
```

To keep our demonstration simple, we will retain the five variables shown below, and only keep observations between April 1st, 2020 and December 31st, 2020. For demonstration purposes, we will add another categorical variable, `hospitalizedLevel`, based on `hospitalizedCurrently`.

In [5]:

```
# retain only the five columns listed (copy so that later assignments do not act on a view)
covid = covid[["date", "state", "deathIncrease", "hospitalizedCurrently", "totalTestResultsIncrease"]].copy()
# change the date from string to datetime
covid["date"] = pd.to_datetime(covid["date"], format = "%Y-%m-%d")
# keep data between 2020-04-01 and 2020-12-31
covid = covid[(covid["date"] >= "2020-04-01") & (covid["date"] <= "2020-12-31")].copy()
# add a new column hospitalizedLevel
def addHospitalizedLevel(row):
    if row["hospitalizedCurrently"] >= 10000:
        return "high"
    elif row["hospitalizedCurrently"] < 5000:
        return "low"
    else:
        return "medium"
covid["hospitalizedLevel"] = covid.apply(addHospitalizedLevel, axis = 1)
# specify categorical variables
covid = covid.astype({"state": "category", "hospitalizedLevel": "category"})
# show the first five rows of the data frame
covid.head()
```

Out[5]:

| | date | state | deathIncrease | hospitalizedCurrently | totalTestResultsIncrease | hospitalizedLevel |
|---|---|---|---|---|---|---|
| 66 | 2020-12-31 | CA | 428 | 21449.0 | 232406 | high |
| 67 | 2020-12-30 | CA | 432 | 21433.0 | 248605 | high |
| 68 | 2020-12-29 | CA | 242 | 21240.0 | 245955 | high |
| 69 | 2020-12-28 | CA | 64 | 20642.0 | 301820 | high |
| 70 | 2020-12-27 | CA | 237 | 20059.0 | 380154 | high |

In [66]:

```
# show dimension of the data frame
covid.shape
```

Out[66]:

(825, 6)

As shown above, our final data frame consists of 825 rows (observations) and 6 columns (variables). The columns contain `date`, two categorical variables, and three quantitative variables.
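The `hospitalizedLevel` column was built with a row-wise `apply`. As an alternative, the same thresholds can be expressed with `pd.cut`, which bins a whole column at once. The sketch below runs on a small made-up frame (`toy` and its values are illustrative, not taken from the COVID data). One difference to be aware of: `pd.cut` leaves missing values as missing, whereas the row-wise function would label them "medium".

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hospitalizedCurrently column (values are made up)
toy = pd.DataFrame(
    {"hospitalizedCurrently": [1200.0, 5000.0, 9999.0, 10000.0, 21449.0]}
)

# Half-open bins [low, high): below 5000 is "low", 5000 up to (not including)
# 10000 is "medium", and 10000 or more is "high", matching the thresholds
# used in addHospitalizedLevel
toy["hospitalizedLevel"] = pd.cut(
    toy["hospitalizedCurrently"],
    bins=[-np.inf, 5000, 10000, np.inf],
    labels=["low", "medium", "high"],
    right=False,
)

print(toy)
```

The result is already a categorical column, so no separate `astype` step would be needed for it.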

Now that we have preprocessed our data, we will move on to using seaborn for visualization.

When we have 1D quantitative data, we may be interested in its mean (average), median (50% quantile), range, spread, and the shape of its distribution. Below we will see several methods, each of which highlights different information.

A histogram can show the distribution of a quantitative variable. On the x-axis is the variable we are visualizing, whose values are separated into different bins, and on the y-axis is usually the count or density of the observations that fall into each bin.

Let's try to visualize the distribution of `deathIncrease`, the daily increase in death counts in a state.

In [538]:

```
# set the figure size using matplotlib
plt.figure(figsize=(8, 5))
# histogram with count on the y-axis
sns.histplot(x = "deathIncrease", data = covid)
```

Out[538]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Count'>

By default, `histplot` puts count on the y-axis. To put density on the y-axis, we specify `stat = "density"`, which normalizes the counts to make the total area of bins equal to 1.

In [540]:

```
plt.figure(figsize=(8, 5))
# histogram with density on the y-axis
sns.histplot(x = "deathIncrease", data = covid, stat = "density")
```

Out[540]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

There are two other values we can use for the `stat` argument: `frequency`, which is the count of each bin divided by the bin width, and `probability`, which normalizes the counts to make the total height of bars equal to 1.
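To see how the four statistics relate to each other, the sketch below derives them by hand from the same set of bin counts. It uses NumPy's `np.histogram` on made-up values rather than calling seaborn, so it illustrates the definitions given above and not seaborn's internal code.

```python
import numpy as np

values = np.array([0, 5, 10, 20, 20, 35, 40, 80])  # made-up observations

# Four equal-width bins between the minimum and the maximum
counts, edges = np.histogram(values, bins=4)
widths = np.diff(edges)

frequency = counts / widths                  # count divided by bin width
probability = counts / counts.sum()          # bar heights sum to 1
density = counts / (counts.sum() * widths)   # bar areas sum to 1

print("count:      ", counts)
print("frequency:  ", frequency)
print("probability:", probability, "sum of heights =", probability.sum())
print("density:    ", density, "total area =", (density * widths).sum())
```

The bars have the same relative heights in all four cases when the bins are equally wide; only the scale of the y-axis changes.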

We can easily change the width of the bins by specifying `bins` or `binwidth`. If `bins` is set to one value, the histogram uses this value as the total number of bins. If `bins` is set to a vector or tuple of values, the histogram uses these values as the breaks of the bins.
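The two forms of `bins` can be made concrete with NumPy's `np.histogram`. This is a sketch on made-up values, under the assumption that `histplot` interprets `bins` in the same way. It also shows a side effect worth keeping in mind: with explicit breaks, observations outside the outermost breaks are not counted.

```python
import numpy as np

# Made-up observations; 720 lies beyond the last break used below
values = np.array([3, 50, 120, 250, 250, 410, 590, 720])

# A single integer: that many equal-width bins spanning min..max
counts_n, edges_n = np.histogram(values, bins=4)

# A sequence: the values are used directly as the bin edges
breaks = (0, 100, 200, 300, 400, 500, 600)
counts_b, edges_b = np.histogram(values, bins=breaks)

print("4 bins, edges:", edges_n)
print("4 bins, counts:", counts_n)
print("explicit breaks, counts:", counts_b)
print("observations counted:", counts_b.sum(), "of", values.size)
```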

In [541]:

```
plt.figure(figsize=(8, 5))
# specify the total number of bins
sns.histplot(x = "deathIncrease", data = covid, bins = 10)
```

Out[541]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Count'>

In [542]:

```
plt.figure(figsize=(8, 5))
# specify the breaks of the bins
sns.histplot(x = "deathIncrease", data = covid,
             bins = (0, 100, 200, 300, 400, 500, 600))
```

Out[542]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Count'>

`binwidth` allows us to control the width of each bin. If both `bins` and `binwidth` are specified, `binwidth` will override `bins`.

In [543]:

```
plt.figure(figsize=(8, 5))
# specify the width of each bin
sns.histplot(x = "deathIncrease", data = covid, binwidth = 50)
```

Out[543]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Count'>

If we don't want to show all values on the x-axis, we can specify the smallest and the largest bin edges using `binrange`.

In [545]:

```
plt.figure(figsize=(8, 5))
# limit the histogram to between 0 and the maximum value of deathIncrease
sns.histplot(x = "deathIncrease", data = covid,
             binrange = (0, max(covid["deathIncrease"])))
```

Out[545]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Count'>

We can overlay a smoothed density curve on the histogram by setting `kde = True`. Note that since we are adding a density curve, it only makes sense if we put `density` on the y-axis of the histogram.

In [6]:

```
plt.figure(figsize=(8, 5))
# histogram with density on the y-axis and kernel density estimate overlaid
sns.histplot(x = "deathIncrease", data = covid, stat = "density", kde = True)
```

Out[6]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

To change the color of the bins, we can specify `color`.

In [546]:

```
plt.figure(figsize=(8, 5))
# histogram with a different color
sns.histplot(x = "deathIncrease", data = covid, color = "darksalmon")
```

Out[546]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Count'>

For a comprehensive list of arguments for `histplot`, check out seaborn.histplot.

Similar to histograms, a density curve is useful if we want to see the shape of the distribution. `kdeplot` generates a kernel density estimate curve using Gaussian kernels for the given data.
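To make the idea of a Gaussian kernel density estimate concrete, the sketch below builds one by hand with NumPy: every observation contributes a small Gaussian bump, and the curve is the average of those bumps. The sample values are made up, the helper name `gaussian_kde_curve` is hypothetical, and the bandwidth uses Scott's rule of thumb, which is assumed here as a reasonable default rather than taken from seaborn's code.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one per sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

samples = np.array([2.0, 3.0, 3.5, 8.0, 9.0, 15.0])  # made-up values

# Scott's rule of thumb for 1D data: n^(-1/5) times the sample standard deviation
bandwidth = samples.size ** (-1 / 5) * samples.std(ddof=1)

# Evaluate well beyond the data so the tails are included
grid = np.linspace(samples.min() - 5 * bandwidth,
                   samples.max() + 5 * bandwidth, 2001)
density = gaussian_kde_curve(samples, grid, bandwidth)

# Riemann-sum check that the curve encloses an area of about 1
area = density.sum() * (grid[1] - grid[0])
print("bandwidth:", round(bandwidth, 3), "area under curve:", round(area, 3))
```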

In [547]:

```
plt.figure(figsize=(8, 5))
# kernel density estimate
sns.kdeplot(x = "deathIncrease", data = covid)
```

Out[547]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

We can change the color of the curve using `color`. To fill the area under the curve, we set `fill = True`. Seaborn will automatically use a fill color that matches the color of the curve.

In [548]:

```
plt.figure(figsize=(8, 5))
# kernel density estimate with area under curve filled
sns.kdeplot(x = "deathIncrease", data = covid,
            color = "darksalmon", fill = True)
```

Out[548]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

We can restrict the range over which the density curve is evaluated using `clip`, so that the curve is not drawn beyond the given minimum and maximum values.

In [549]:

```
plt.figure(figsize=(8, 5))
# limit the density curve to between 0 and the maximum value of deathIncrease
sns.kdeplot(x = "deathIncrease", data = covid,
            clip = (0, max(covid["deathIncrease"])))
```

Out[549]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

To change the smoothing bandwidth, we use the `bw_method` argument. This argument will be passed to the `gaussian_kde` function in SciPy for further calculation. For more information, check out scipy.stats.gaussian_kde.

We can also use `bw_adjust` to control the level of smoothing. Larger values correspond to more smoothing and vice versa.
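The effect of scaling the bandwidth can be checked numerically. The sketch below uses a hand-written Gaussian kernel density estimate in NumPy (not seaborn or SciPy), with made-up data forming two clusters. It assumes that `bw_adjust` acts as a multiplier on a base bandwidth, here taken from Scott's rule; the helper name `kde_at` is hypothetical.

```python
import numpy as np

def kde_at(samples, grid, bandwidth):
    # Gaussian kernel density estimate evaluated on `grid`
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return (np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))).mean(axis=1)

samples = np.array([1.0, 1.5, 2.0, 9.0, 9.5, 10.0])  # two made-up clusters
base_bw = samples.size ** (-1 / 5) * samples.std(ddof=1)  # Scott's rule
grid = np.linspace(-20, 30, 1001)

peak_heights = {}
modes = {}
for bw_adjust in (0.5, 1, 3):
    curve = kde_at(samples, grid, base_bw * bw_adjust)
    peak_heights[bw_adjust] = curve.max()
    # count local maxima: interior points higher than both neighbours
    modes[bw_adjust] = int(
        np.sum((curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:]))
    )
    print("bw_adjust =", bw_adjust,
          "| peak height:", round(peak_heights[bw_adjust], 3),
          "| local maxima:", modes[bw_adjust])
```

With little smoothing the two clusters show up as separate bumps; with heavy smoothing they merge into one low, wide bump.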

In [550]:

```
plt.figure(figsize=(8, 5))
# less smoothing
sns.kdeplot(x = "deathIncrease", data = covid, bw_adjust = 0.5)
```

Out[550]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

In [551]:

```
plt.figure(figsize=(8, 5))
# more smoothing
sns.kdeplot(x = "deathIncrease", data = covid, bw_adjust = 3)
```

Out[551]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

The kernel density curve we get from `kdeplot` estimates the probability density function of the original distribution, but we can also estimate the cumulative distribution function by setting `cumulative = True`.
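One way to see why a kernel density estimate also yields a cumulative estimate: the cumulative distribution of an average of Gaussian bumps is the average of the corresponding Gaussian cumulative distributions. The sketch below computes this by hand on made-up values with an arbitrary bandwidth. It illustrates the idea only and is not necessarily how seaborn performs the calculation; the helper name `gaussian_kde_cdf` is hypothetical.

```python
import math
import numpy as np

def gaussian_kde_cdf(samples, grid, bandwidth):
    # CDF of an average of Gaussian bumps = average of Gaussian CDFs
    z = (grid[:, None] - samples[None, :]) / bandwidth
    normal_cdf = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return normal_cdf.mean(axis=1)

samples = np.array([2.0, 3.0, 3.5, 8.0, 9.0, 15.0])  # made-up values
bandwidth = 2.0                                      # arbitrary choice
grid = np.linspace(-10, 30, 401)

cdf = gaussian_kde_cdf(samples, grid, bandwidth)

# The estimate rises smoothly from about 0 to about 1
print("left end:", round(cdf[0], 4), "right end:", round(cdf[-1], 4))
```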

In [552]:

```
plt.figure(figsize=(8, 5))
# estimate the cumulative density
sns.kdeplot(x = "deathIncrease", data = covid, cumulative = True)
```

Out[552]:

<AxesSubplot:xlabel='deathIncrease', ylabel='Density'>

We used to be able to choose non-Gaussian kernels for the density estimate, but unfortunately this is no longer supported. If we want to use non-Gaussian kernels, we can use either statsmodels or scikit-learn.

For a comprehensive list of arguments for `kdeplot`, check out seaborn.kdeplot.

The two methods above are good for visualizing the shape of the distribution, but not statistics such as median and outliers. If we want to visualize the "summary" statistics of a distribution and do not care about the shape, we can resort to box plots.

A box plot shows the "five-number summary" of a distribution, which includes the minimum, the first quartile (25% quantile), the median, the third quartile (75% quantile), and the maximum. It also marks the outliers separately.

In [567]:

```
plt.figure(figsize=(8, 4))
# box plot - horizontal
sns.boxplot(x = "deathIncrease", data = covid)
```

Out[567]:

<AxesSubplot:xlabel='deathIncrease'>

In the plot above, the line inside the box is the median. The left and right edges of the box are the first and third quartiles, respectively. The two lines extending out from the box are called whiskers. The left whisker ends at the smallest observation that is not considered an outlier, that is, the smallest value no less than Q1 - 1.5 IQR (the first quartile minus 1.5 times the interquartile range); if there are no low outliers, this is simply the minimum. Similarly, the right whisker ends at the largest observation no greater than Q3 + 1.5 IQR. The points outside the whiskers are outliers.
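The quantities a box plot displays can be computed by hand, which helps when reading the figure. The sketch below uses NumPy on made-up values with one extreme observation; the variable names are illustrative, and the 1.5 IQR rule is the one described above.

```python
import numpy as np

# Made-up data with one extreme value
values = np.array([2, 4, 5, 7, 8, 9, 11, 12, 14, 40])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: anything beyond them is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers stop at actual observations, not at the fences themselves.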

We can flip the box plot easily by passing the variable as the `y` argument instead of `x`. As before, we can change the color of the box using `color`.

In [574]:

```
plt.figure(figsize=(4, 6))
# box plot - vertical
sns.boxplot(y = "deathIncrease", data = covid, color = "darksalmon")
```

Out[574]:

<AxesSubplot:ylabel='deathIncrease'>

For a comprehensive list of arguments for `boxplot`, check out seaborn.boxplot.

A strip plot is a variation of a dot plot that draws every single observation. The density of a value is visualized as the literal density of points on the plot at a given x value. The x-coordinate of each point is the value of the plotted variable, while the y-coordinate carries no meaning: the points are spread out along the y-axis only to reduce overlap.

In [568]:

```
plt.figure(figsize=(8, 4))
# strip plot
sns.stripplot(x = "deathIncrease", data = covid)
```

Out[568]:

<AxesSubplot:xlabel='deathIncrease'>

We can control how spread out the points are by setting the `jitter` argument. `jitter = 0` means no spread at all, in which case all points lie on the same horizontal line. `jitter = True` is the default and lets seaborn choose a modest amount of jitter. A numeric value sets the amount of jitter directly, so a larger value spreads the points out more.
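Conceptually, jitter is random noise added along the axis that carries no information, so that tied or nearby observations do not sit on top of each other. The sketch below mimics this with NumPy, under the assumption that the offsets are drawn uniformly from `[-jitter, jitter]`. The helper `jittered_positions` is hypothetical and the values are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
values = np.array([3, 3, 3, 10, 10, 42])  # made-up observations with ties

def jittered_positions(n_points, jitter, rng):
    # Uniform offsets in [-jitter, jitter] around a baseline at 0
    return rng.uniform(-jitter, jitter, size=n_points)

no_jitter = jittered_positions(values.size, 0.0, rng)
some_jitter = jittered_positions(values.size, 0.3, rng)

# The plotted variable itself is untouched; only the other coordinate moves
for value, offset in zip(values, some_jitter):
    print("value:", value, "offset:", round(offset, 3))
```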

In [569]:

```
plt.figure(figsize=(8, 4))
# strip plot with more jitter
sns.stripplot(x = "deathIncrease", data = covid, jitter = 0.3)
```

Out[569]:

<AxesSubplot:xlabel='deathIncrease'>

A strip plot alone doesn't contain much information. However, we can overlay it on a box plot as a complement, since a box plot does not visualize density. In the plot below, the color and transparency of the points in the strip plot are changed using `color` and `alpha` so that we can still see the box plot.

In [570]:

```
plt.figure(figsize=(8, 4))
# strip plot on top of box plot
sns.boxplot(x = "deathIncrease", data = covid)
sns.stripplot(x = "deathIncrease", data = covid,
              color = "darksalmon", jitter = 0.4, alpha = 0.4)
```

Out[570]:

<AxesSubplot:xlabel='deathIncrease'>

We can still flip the coordinates of a strip plot by specifying the data as the `y` argument.

In [573]:

```
plt.figure(figsize=(4, 6))
# strip plot on top of box plot, coordinates flipped
sns.boxplot(y = "deathIncrease", data = covid)
sns.stripplot(y = "deathIncrease", data = covid,
              color = "darksalmon", jitter = 0.4, alpha = 0.4)
```

Out[573]:

<AxesSubplot:ylabel='deathIncrease'>

For a comprehensive list of arguments for `stripplot`, check out seaborn.stripplot.

A swarm plot is similar to a strip plot, but it spreads the points out more at values with higher densities. Note that swarm plots do not scale well to large data sets. In the code below, we only plot the first 300 observations. If we include more points than can be placed on the plot, we will receive a warning to either decrease the point size or use a strip plot instead. We can control the point size using the `size` argument.
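Because the data frame was built by concatenating one state after another, taking the first 300 rows with `iloc` draws mostly from the first state. If a subset that covers all states is preferred, a random sample is one option. The sketch below shows the difference on a toy frame. The 275-rows-per-state layout is an assumption that mirrors the 825-row frame used here, the values are placeholders, and `random_state` is set only for reproducibility.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concatenated data: 275 rows per state, stacked in order
toy = pd.DataFrame({
    "state": np.repeat(["CA", "PA", "MA"], 275),
    "deathIncrease": np.arange(825) % 50,  # placeholder values
})

first_300 = toy.iloc[0:300]                     # positional slice
random_300 = toy.sample(n=300, random_state=0)  # random rows without replacement

print("first 300 rows: ", first_300["state"].value_counts().to_dict())
print("random 300 rows:", random_300["state"].value_counts().to_dict())
```

Either subset can be passed as `data` to `swarmplot`; which one is appropriate depends on whether the plot is meant to describe one state or all of them.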

In [575]:

```
plt.figure(figsize=(8, 4))
# swarm plot
sns.swarmplot(x = "deathIncrease", data = covid.iloc[0:300,])
```

Out[575]:

<AxesSubplot:xlabel='deathIncrease'>

We can flip the coordinates of a swarm plot or overlay it on top of a box plot using the same methods mentioned earlier.

In [577]:

```
plt.figure(figsize=(4, 6))
# swarm plot on top of box plot, coordinates flipped
sns.boxplot(y = "deathIncrease", data = covid)
sns.swarmplot(y = "deathIncrease", data = covid.iloc[0:300,], color = "darksalmon", alpha = 0.5)
```

Out[577]:

<AxesSubplot:ylabel='deathIncrease'>