Python data visualization | 7. How Seaborn makes distributions more intuitive

%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
from warnings import filterwarnings
filterwarnings('ignore')
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes=True)
np.random.seed(sum(map(ord, "distributions")))

Univariate distribution

Histogram

The quickest and most convenient way:

x = np.random.normal(size=100)
sns.distplot(x, kde=True)
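# kde defaults to True, so a kernel density estimate is drawn over the histogram
# Note: distplot is deprecated as of seaborn 0.11; histplot and displot are its replacements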

# Want finer detail?
# Adjust bins!
sns.distplot(x, kde=False, bins=30)
# bins=30 splits the data range into thirty bars

# Want to see the individual observations too?
sns.distplot(x, kde=False, bins=30, rug=True)
# rug=True draws a rug plot on the support axis: one small tick per observation

What is the advantage of viewing the histogram together with the individual observations?
A: it guides you toward an appropriate number of bins.
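
To see why the bin count matters, the histogram can be computed without plotting anything. The sketch below is not seaborn's internal code; it uses NumPy's `np.histogram` on a fresh sample (the seed and variable names are assumptions made for illustration) and compares a hand-picked 30 bins with the Freedman-Diaconis rule that NumPy exposes as `bins="fd"`.

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed, for reproducibility only
sample = rng.normal(size=100)            # same kind of data as x above

# Manual choice: 30 equal-width bins, as with bins=30.
counts_30, edges_30 = np.histogram(sample, bins=30)

# Rule-based choice: bin edges from the Freedman-Diaconis rule.
edges_fd = np.histogram_bin_edges(sample, bins="fd")
counts_fd, _ = np.histogram(sample, bins=edges_fd)

print("30 bins:", counts_30)
print("FD rule:", len(counts_fd), "bins ->", counts_fd)
```

With only 100 observations, 30 bins tend to leave many bars nearly empty, which is exactly what the rug ticks make visible.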

Note: whether or not kde is enabled, the plots above use the default bandwidth, which is about 0.3 here.

Kernel density estimation (KDE)

KDE estimates the shape of the probability density function from the observations. What is it good for? Once the shape is known, the density function can be worked out by the method of undetermined coefficients.

Steps of kernel density estimation:

Place a normal distribution curve on each observation
Superimpose (sum) the normal distribution curves of all observations
Normalize the result so that the area under the curve is 1
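
The three steps above can be sketched directly in NumPy. This is a minimal illustration, not seaborn's implementation: the function name `gaussian_kde_1d`, the seed, the grid, and the hand-picked bandwidth of 0.4 are all assumptions.

```python
import numpy as np

def gaussian_kde_1d(observations, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    observations = np.asarray(observations, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Step 1: one normal curve per observation
    # (rows are grid points, columns are observations).
    z = (grid[:, None] - observations[None, :]) / bandwidth
    curves = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Step 2: superimpose (sum) the curves.
    total = curves.sum(axis=1)
    # Step 3: normalize so the area under the estimate is 1.
    return total / observations.size

rng = np.random.default_rng(1)            # assumed seed
obs = rng.normal(size=100)
grid = np.linspace(-6.0, 6.0, 1201)
density = gaussian_kde_1d(obs, grid, bandwidth=0.4)   # bandwidth chosen by hand

# Riemann-sum check: the estimate should integrate to roughly 1.
area = density.sum() * (grid[1] - grid[0])
print("area under the estimate:", round(area, 3))
```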

How do you draw it in seaborn?

sns.kdeplot(x)

Bandwidth: the width of the normal distribution curve placed on each observation
The larger the bandwidth, the smoother the resulting curve
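
The effect of bandwidth can also be checked numerically. The sketch below is only an illustration under assumed names (`kde`, `count_peaks`), an assumed seed, and hand-picked bandwidths: it counts the local maxima of a hand-rolled Gaussian KDE, so a bumpy curve gives a large count and a smooth one gives a small count.

```python
import numpy as np

def kde(observations, grid, bandwidth):
    """Gaussian KDE: average of normal curves centered on the observations."""
    z = (grid[:, None] - observations[None, :]) / bandwidth
    scale = observations.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / scale

def count_peaks(curve):
    """Number of strict interior local maxima of a sampled curve."""
    inner = curve[1:-1]
    return int(np.sum((inner > curve[:-2]) & (inner > curve[2:])))

rng = np.random.default_rng(2)            # assumed seed
obs = rng.normal(size=100)
grid = np.linspace(-8.0, 8.0, 1601)

results = {}
for bw in (0.05, 0.4, 2.0):
    curve = kde(obs, grid, bw)
    results[bw] = (count_peaks(curve), curve.max())
    print(f"bandwidth={bw}: peaks={results[bw][0]}, max height={results[bw][1]:.3f}")
```

A very small bandwidth follows every observation and produces many spikes; a very large one flattens the curve and can hide real structure.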

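sns.kdeplot(x, label="bw: 'scott'")  # default bandwidth rule
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2")  # oversmoothed
# Note: in seaborn 0.11+, bw is deprecated in favor of bw_method and bw_adjust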
plt.legend()

Model parameter fitting

x = np.random.gamma(6, size=200)  # sample from a gamma distribution
sns.distplot(x, kde=True, fit=stats.gamma)  # our tentative guess is a gamma distribution

The blue line is the KDE curve drawn by sns.distplot(x)
The black line is the gamma fit drawn by sns.distplot(x, fit=stats.gamma)
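
The `fit` argument expects an object with `fit` and `pdf` methods, such as `stats.gamma`. What the fitted curve represents can be reproduced with SciPy alone. This is a sketch under assumptions (the seed and the variable names are made up here), not a copy of seaborn's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)                  # assumed seed
sample = rng.gamma(shape=6.0, size=200)         # same distribution family as above

# Estimate shape, location and scale from the data.
shape, loc, scale = stats.gamma.fit(sample)

# Evaluate the fitted density over the range of the data.
grid = np.linspace(sample.min(), sample.max(), 200)
fitted_pdf = stats.gamma.pdf(grid, shape, loc=loc, scale=scale)

print(f"shape={shape:.2f}, loc={loc:.2f}, scale={scale:.2f}")
```

If the guessed family is wrong, the fitted curve and the KDE curve will visibly disagree, which is the point of drawing both.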

Bivariate distribution

mean, cov = [0, 1], [(1, 0.5), (0.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
# np.random.multivariate_normal() draws samples from a multivariate normal distribution with the given mean and covariance
# The means are 0 and 1, both variances are 1, and the covariance between x and y is 0.5 (so their correlation coefficient is 0.5)
df = pd.DataFrame(data, columns=["x", "y"])
df.head()

Two correlated, normally distributed variables.
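
As a quick sanity check on what `mean` and `cov` encode, the sketch below draws a larger sample and compares the sample statistics with the parameters. The seed and the sample size of 20000 are assumptions chosen only to make the check stable.

```python
import numpy as np

mean = [0, 1]
cov = [(1, 0.5), (0.5, 1)]

rng = np.random.default_rng(4)                       # assumed seed
draws = rng.multivariate_normal(mean, cov, 20000)    # larger sample than in the text

sample_mean = draws.mean(axis=0)
sample_cov = np.cov(draws, rowvar=False)             # columns are the variables

# Correlation implied by the covariance matrix: cov_xy / (sigma_x * sigma_y)
implied_rho = cov[0][1] / np.sqrt(cov[0][0] * cov[1][1])

print("sample mean:", sample_mean)
print("sample covariance:\n", sample_cov)
print("implied correlation:", implied_rho)
```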

Scatter plot

For two correlated variables, seaborn provides the powerful sns.jointplot() function:

sns.jointplot(x="x", y="y", data=df).annotate(stats.pearsonr)
# Note: JointGrid.annotate was deprecated in seaborn 0.9 and removed in later releases

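Information in the figure: the scatter plot of x against y / the marginal histograms of x and y / the Pearson correlation coefficient (pearsonr) / the p-value, which reflects sampling error (the smaller p is, the less likely the observed correlation is a sampling artifact)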

About the Pearson correlation coefficient
- Related links: Discussion of simulation metrics, zh.wikipedia.org
- Calculation of the Pearson correlation coefficient:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
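
The formula can be evaluated term by term. The sketch below is an illustration with an assumed helper name (`pearson_r`) and an assumed seed; it computes the covariance and both standard deviations explicitly and compares the result with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance: mean of the products of the deviations.
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    # x.std() and y.std() are population standard deviations,
    # consistent with the covariance above.
    return covariance / (x.std() * y.std())

rng = np.random.default_rng(5)                        # assumed seed
xy = rng.multivariate_normal([0, 1], [(1, 0.5), (0.5, 1)], 200)

r = pearson_r(xy[:, 0], xy[:, 1])
r_numpy = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
print("by hand:", round(r, 4), " np.corrcoef:", round(r_numpy, 4))
```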

Simple classification of correlation coefficients:
- 0.8-1.0 very strong correlation
- 0.6-0.8 strong correlation
- 0.4-0.6 moderate correlation
- 0.2-0.4 weak correlation
- 0.0-0.2 very weak correlation or no correlation
Pearson correlation coefficient game: http://guessthecorrelation.com
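
The bands listed above translate into a small lookup helper. Two points are assumptions of this sketch rather than statements from the list: the bands are applied to the absolute value of the coefficient, and a value that sits exactly on a boundary (for example 0.8) is assigned to the stronger band. The function name is hypothetical.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the verbal bands listed above."""
    strength = abs(r)                      # assumption: bands apply to |r|
    if not 0.0 <= strength <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if strength >= 0.8:                    # assumption: boundaries go to the stronger band
        return "very strong"
    if strength >= 0.6:
        return "strong"
    if strength >= 0.4:
        return "moderate"
    if strength >= 0.2:
        return "weak"
    return "very weak or none"

for value in (0.95, -0.7, 0.5, 0.3, 0.05):
    print(value, "->", describe_correlation(value))
```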

Hexbin plot

x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("ticks"):
    sns.jointplot(x=x, y=y, kind="hex").annotate(stats.pearsonr)
    # kind selects the plot type; "hex" draws hexagonal bins
# np.random.multivariate_normal(mean, cov, 10).T
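
A hexbin plot is a two-dimensional histogram: the plane is divided into cells and each cell is colored by the number of observations it contains. The sketch below illustrates the counting idea with square cells via `np.histogram2d`, which is a swapped-in simplification and not the hexagonal binning that `kind="hex"` performs. The seed, variable names, and the 20 x 20 grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)                        # assumed seed
px, py = rng.multivariate_normal([0, 1], [(1, 0.5), (0.5, 1)], 1000).T

# Count observations in a 20 x 20 grid of square cells.
counts, x_edges, y_edges = np.histogram2d(px, py, bins=20)

# The densest cell plays the role of the darkest hexagon.
ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
print("points counted:", int(counts.sum()))
print("densest cell x-range:", x_edges[ix], "to", x_edges[ix + 1])
print("densest cell y-range:", y_edges[iy], "to", y_edges[iy + 1])
```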

Kernel density estimation

# Contour plot
sns.jointplot(x="x", y="y", data=df, kind="kde").annotate(stats.pearsonr)

f, ax = plt.subplots(figsize=(8, 8))  # create the figure and axes
sns.kdeplot(df.x, df.y, ax=ax, shade=False)
# shade=False leaves the contours unfilled, so only contour lines are drawn
sns.rugplot(df.x, color="b", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax, color="r")
# sns.rugplot draws rug plots; vertical=True places the ticks along the y-axis
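
The contour plot is the two-dimensional version of the same recipe: place a bell-shaped surface on each point, sum, and normalize. The sketch below is a simplified illustration and not seaborn's implementation. It uses one hand-picked bandwidth for both axes, and the function name, seed, and grids are assumptions.

```python
import numpy as np

def gaussian_kde_2d(points, grid_x, grid_y, bandwidth):
    """Product-Gaussian KDE evaluated on the grid_x by grid_y mesh."""
    # 1D kernel values along each axis: shape (grid size, number of points).
    kx = np.exp(-0.5 * ((grid_x[:, None] - points[None, :, 0]) / bandwidth) ** 2)
    ky = np.exp(-0.5 * ((grid_y[:, None] - points[None, :, 1]) / bandwidth) ** 2)
    # kx @ ky.T sums kx * ky over the points; then normalize to unit volume.
    norm = points.shape[0] * 2.0 * np.pi * bandwidth**2
    return kx @ ky.T / norm

rng = np.random.default_rng(7)                        # assumed seed
pts = rng.multivariate_normal([0, 1], [(1, 0.5), (0.5, 1)], 200)
gx = np.linspace(-6.0, 6.0, 241)
gy = np.linspace(-5.0, 7.0, 241)
density = gaussian_kde_2d(pts, gx, gy, bandwidth=0.4)  # bandwidth chosen by hand

# Riemann-sum check: the estimate should integrate to roughly 1.
cell_area = (gx[1] - gx[0]) * (gy[1] - gy[0])
volume = density.sum() * cell_area
print("volume under the surface:", round(volume, 3))
```

The contour lines in the figure are level sets of a surface like `density`.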

Want a smoother, more continuous, dreamlike effect?

f, ax = plt.subplots(figsize=(6, 6))
# cubehelix color system, brightness proportional to intensity, used for astronomical image rendering. http://www.mrao.cam.ac.uk/~dag/CUBEHELIX/
cmap = sns.cubehelix_palette(as_cmap=True, dark=1, light=0) 
# cmap: the colormap that maps density values to colors
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True)

g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)  # make the background of the central (joint) plot transparent
g.set_axis_labels("$X$", "$Y$")  # LaTeX-formatted axis labels

Note: one-dimensional KDE plots are mainly useful for guessing the underlying distribution. If a two-dimensional KDE plot shows several distinct centers, the data may be a good candidate for clustering.

Pairwise relationships in datasets

iris = pd.read_csv("iris.csv")  # the Iris dataset
iris.head()

sns.pairplot(iris)  # by default: histograms on the diagonal, scatter plots off the diagonal

Pairwise relationships between attributes, plus a histogram of each attribute

g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)  # diagonal: univariate plot of each attribute
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=20)  # off-diagonal: bivariate plot of each pair of attributes
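
How the grid is divided between `map_diag` and `map_offdiag` can be sketched without plotting. The column names below are hypothetical, since the actual headers in iris.csv are not shown here, and the loop only enumerates the panels.

```python
from itertools import product

# Hypothetical column names; the real headers in iris.csv may differ.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

diagonal = []
off_diagonal = []
for row_var, col_var in product(columns, repeat=2):
    if row_var == col_var:
        diagonal.append(row_var)                  # panel handled by map_diag
    else:
        off_diagonal.append((row_var, col_var))   # panel handled by map_offdiag

print(len(diagonal), "diagonal panels,", len(off_diagonal), "off-diagonal panels")
```

With four numeric attributes the grid has 16 panels, so the number of panels grows quadratically with the number of columns.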

Summary

distplot(bins, rug, fit)
kdeplot(bw)
jointplot(kind)
pairplot

Keywords: Python AI

Added by PHPTOM on Sun, 19 Dec 2021 03:06:43 +0200
