19. Scatter plots, connected scatter plots, bubble charts, scatter diagrams of different colors, histograms

19. Scatter plots
19.1. Connected scatter plot
19.2. Bubble chart
19.3. Scatter diagram of different colors
20. Histograms

19. Scatter plots

Use a scatter chart to show the relationship between two numerical variables. Scatter charts are sometimes called correlation charts because the shape of the point cloud reveals whether, and how strongly, the two variables are correlated.

import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x=range(77, 770, 10),
            y=np.random.randn(70)*55+range(77, 770, 10),
            s=200, alpha=0.6)
plt.tick_params(labelsize=12)
plt.xlabel('Surface(m2)', size=12)
plt.ylabel('Turnover (K dollars)', size=12)
plt.xlim(left=0)
plt.ylim(bottom=0)

plt.show()


This figure shows a positive correlation between a store's floor area and its turnover: larger stores tend to have a higher turnover.

import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x=range(20, 80, 1), y=np.abs(np.random.randn(60)*40),
            s=200,
            alpha=0.6)
plt.xlabel('Age', size=12)
plt.ylabel('Average purchase cost per week(dollars)', size=12)

plt.show()


This figure shows that there is no correlation between a customer's age and their weekly purchase cost.
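
The visual impression from the two figures can be checked numerically. The following is a minimal sketch that rebuilds the same kind of random data and computes the Pearson correlation coefficient with np.corrcoef. The seed and the variable names are assumptions made for this illustration, so the exact values will differ from the figures above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary choice) for reproducibility

# Same construction as the first figure: turnover grows with surface, plus noise.
surface = np.arange(77, 770, 10)
turnover = rng.standard_normal(surface.size) * 55 + surface

# Same construction as the second figure: spending does not depend on age.
age = np.arange(20, 80)
spending = np.abs(rng.standard_normal(age.size) * 40)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r_store = np.corrcoef(surface, turnover)[0, 1]
r_age = np.corrcoef(age, spending)[0, 1]

print(f"surface vs turnover: r = {r_store:.2f}")
print(f"age vs spending:     r = {r_age:.2f}")
```

A coefficient close to 1 indicates a strong positive linear relationship, while a coefficient close to 0 indicates no linear relationship.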

19.1. Connected scatter plot

A connected scatter chart is a mixture of a scatter chart and a line chart. It uses line segments to connect consecutive points, for example to illustrate a trajectory over time.

A connected scatter chart can also visualize two related time series: each point pairs the two values observed at the same time, and the points are connected by lines in chronological order.

import numpy as np
import matplotlib.pyplot as plt

turnover = [30, 38, 26, 20, 21, 15, 8, 5, 3, 9, 25, 27]
plt.plot(np.arange(12), turnover, marker='o')

plt.show()

Suppose the figure above shows the monthly sales turnover over one year, starting in January. The chart shows that sales peak in winter, decline through spring and summer, and recover toward the end of the year.
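
The example above plots a single series against time. As a rough sketch of the two-series case, the snippet below plots turnover against a second series and joins the points in chronological order. The ad_budget values are made-up illustrative data and are not part of the original example.

```python
import matplotlib.pyplot as plt

# Turnover values from the example above; ad_budget is hypothetical data
# invented for this illustration.
turnover = [30, 38, 26, 20, 21, 15, 8, 5, 3, 9, 25, 27]
ad_budget = [12, 15, 11, 9, 10, 8, 5, 4, 3, 5, 10, 12]

fig, ax = plt.subplots()
# plot() joins the points in list order, which here is chronological order.
(line,) = ax.plot(ad_budget, turnover, marker='o')

# Label each point with its month index so the direction of travel is readable.
for month, (x, y) in enumerate(zip(ad_budget, turnover)):
    ax.annotate(str(month), (x, y), textcoords='offset points', xytext=(5, 5))

ax.set_xlabel('Advertising budget (hypothetical)')
ax.set_ylabel('Turnover')
```

Call plt.show() afterwards to display the figure. The month labels matter here because, unlike in a line chart, neither axis represents time.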

19.2. Bubble chart

A bubble chart is a scatter chart that displays three variables: two of them give the position of each point, and the value of the third is represented by the size of the point.

import numpy as np
import matplotlib.pyplot as plt

nbclients = range(10, 494, 7)
plt.scatter(x=range(77, 770, 10),
            y=np.random.randn(70)*55+range(77, 770, 10),
            s=nbclients, alpha=0.6)

plt.show()
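
The s parameter of scatter is interpreted as the marker area in points squared, so raw values of the third variable may produce bubbles that are too small or too large. One possible approach is to rescale the values to a chosen range of areas first. The helper scale_sizes and its default limits below are assumptions made for this sketch, not part of matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

def scale_sizes(values, min_area=20.0, max_area=500.0):
    """Map values linearly onto marker areas in [min_area, max_area] (points**2)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: use a single medium-sized marker.
        return np.full(values.shape, (min_area + max_area) / 2)
    return min_area + (values - values.min()) / span * (max_area - min_area)

rng = np.random.default_rng(0)     # arbitrary seed for reproducibility
surface = np.arange(77, 770, 10)
turnover = rng.standard_normal(surface.size) * 55 + surface
nbclients = np.arange(10, 494, 7)  # third variable, as in the example above

fig, ax = plt.subplots()
ax.scatter(surface, turnover, s=scale_sizes(nbclients), alpha=0.6)
```

Call plt.show() afterwards to display the figure.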

19.3. Scatter diagram of different colors

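plt.scatter does not assign colors from the labels of a categorical variable automatically. A simple workaround is to call scatter once per category and overlay the resulting plots; each call is drawn in the next color of the default color cycle unless a color is given explicitly.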

import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x=range(40, 70, 1),
            y=np.abs(np.random.randn(30)*20),
            s=200,
            #c = 'blue',
            alpha=0.6,
            label='40-69')
plt.scatter(x=range(20, 40, 1),
            y=np.abs(np.random.randn(20)*40),
            s=200,
            #c = 'red',
            alpha=0.6,
            label='20-39')
plt.legend()  # show the legend built from the label of each scatter call
plt.show()

This two-color scatter chart clearly shows the difference in weekly purchase costs between young customers and middle-aged or elderly customers: on average, young customers spend about twice as much per week as middle-aged or elderly customers.
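
When the category is stored as a column of labels, the per-category calls can be written as a loop. The following is one possible sketch; the group array, the colors dictionary and the seed are assumptions introduced for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
age = np.arange(20, 70)
# Younger customers get a wider spread, mirroring the example above.
spending = np.abs(rng.standard_normal(age.size) * np.where(age < 40, 40, 20))
group = np.where(age < 40, '20-39', '40-69')  # categorical variable

# Hypothetical color choice; any valid matplotlib colors would do.
colors = {'20-39': 'tab:red', '40-69': 'tab:blue'}

fig, ax = plt.subplots()
for name, color in colors.items():
    mask = group == name  # select the points of this category
    ax.scatter(age[mask], spending[mask], s=200, alpha=0.6, c=color, label=name)
ax.legend()
```

Call plt.show() afterwards to display the figure. The loop scales to any number of categories without repeating the scatter call by hand.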

20. Histograms

A histogram is a statistical chart that shows the frequency distribution of numerical data as a series of adjacent vertical bars of different heights. The horizontal axis represents the value intervals (bins) and the vertical axis represents how many values fall into each interval.

A histogram can also be read as an estimate of the probability distribution of a continuous variable.

To construct a histogram, first divide the whole range of values into a series of intervals (bins), then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals; they must be adjacent and are usually of equal size.
pyplot.hist documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
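
The two construction steps can be sketched by hand with NumPy. This is only an illustration of the idea, not how matplotlib implements it; the variable names, the seed and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
data = rng.normal(size=1000)

nbins = 10
# Step 1: split the range [min, max] into nbins adjacent, equal-width intervals.
edges = np.linspace(data.min(), data.max(), nbins + 1)

# Step 2: count the values in each interval. Every bin is half-open
# [left, right) except the last one, which also includes its right edge.
counts = np.zeros(nbins, dtype=int)
for i in range(nbins):
    left, right = edges[i], edges[i + 1]
    if i == nbins - 1:
        in_bin = (data >= left) & (data <= right)
    else:
        in_bin = (data >= left) & (data < right)
    counts[i] = in_bin.sum()

# np.histogram uses the same bin convention, so the counts should agree.
np_counts, np_edges = np.histogram(data, bins=edges)
print(counts, counts.sum())
```

Because the bins are adjacent and do not overlap, every value is counted exactly once and the counts add up to the number of values.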

Now let's create a histogram of random numbers:

import matplotlib.pyplot as plt
import numpy as np
gaussian_numbers = np.random.normal(size=10000)
print(gaussian_numbers)
'''
Output result:
[-2.88618646 -0.15302214 -0.35230715 ... -0.42074156  0.41650123 0.56230326]
'''

plt.hist(gaussian_numbers)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()


Return value:
 n: the values of the histogram bins.
 bins: the edges of the bins. Length nbins + 1 (nbins left edges and the right edge of the last bin).
 patches: container of the individual patches (bars) used to create the histogram, or a list of such containers if there are multiple input datasets.

import matplotlib.pyplot as plt
import numpy as np
gaussian_numbers = np.random.normal(size=10000)
print(gaussian_numbers)
'''
Output result:
[-2.88618646 -0.15302214 -0.35230715 ... -0.42074156  0.41650123 0.56230326]
'''


n, bins, patches = plt.hist(gaussian_numbers)
print("n: ", n, sum(n))
print("bins: ", bins)
for i in range(len(bins)-1):
    print(bins[i+1] - bins[i])  # width of bin i (all bins are equally wide)
print("patches: ", patches)
print(patches[1])
"""
Output result:
n:  [  25.  182.  940. 2230. 2951. 2268. 1066.  300.   34.    4.] 10000.0
bins:  [-3.44422377 -2.68683456 -1.92944534 -1.17205613 -0.41466692  0.3427223
  1.10011151  1.85750072  2.61488994  3.37227915  4.12966836]
0.7573892130572868
0.757389213057287
0.757389213057287
0.7573892130572868
0.7573892130572868
0.7573892130572872
0.7573892130572863
0.7573892130572872
0.7573892130572872
0.7573892130572859
patches:  <BarContainer object of 10 artists>
Rectangle(xy=(-2.68683, 0), width=0.757389, height=182, angle=0)
"""

plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

Parameter description:
 x: required, the data to plot
 bins: the number of bars (bins) in the histogram. Optional; the default value is 10. A sequence of bin edges is also accepted
 density: whether to normalize the histogram. Optional; the default value is False, which shows counts. With density=True the bar heights are probability densities. Older matplotlib versions called this parameter normed
 facecolor: fill color of the bars
 edgecolor: color of the bar borders
 alpha: transparency

Let's increase the number of bins. With 10000 random values we can afford a finer resolution, so we set the keyword parameter bins to 100:

plt.hist(gaussian_numbers, bins=100)
plt.show()

Another important keyword parameter of hist is density. It is optional and defaults to False. If it is set to True, the first element of the returned tuple contains probability densities instead of counts: each count is divided by the total number of values and by the bin width, so that the total area (the integral) under the histogram is 1.

plt.hist(gaussian_numbers, bins=100, density=True)
plt.show()
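
The normalization can be verified numerically. The sketch below recomputes the densities from the raw counts and checks that the area under the histogram is 1. The seed and the names manual_density and area are assumptions introduced for this check.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
gaussian_numbers = rng.normal(size=10000)

# Raw counts with the same 100 equal-width bins.
counts, edges = np.histogram(gaussian_numbers, bins=100)

# Densities as returned by hist with density=True.
n, bins, patches = plt.hist(gaussian_numbers, bins=100, density=True)

widths = np.diff(bins)
# density = count / (total number of values * bin width)
manual_density = counts / (counts.sum() * widths)

# Area under the histogram: sum of bar height times bar width.
area = np.sum(n * widths)
print(area)
```

The printed area should be 1 up to floating-point rounding, and manual_density should match n.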


hist also accepts the parameters edgecolor and color, which set the border color and the fill color of the bars:

plt.hist(gaussian_numbers, 
         bins=100, 
         density=True, 
         edgecolor="#6A9662",
         color="#DDFFDD")
plt.show()

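By setting the parameter cumulative to True, each bin also includes the values of all bins to its left. Combined with density=True, the histogram approximates the cumulative distribution function and the last bar has height 1.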

n, bins, patches = plt.hist(gaussian_numbers,
                            bins=100,
                            density=True,
                            edgecolor="#BB5566",
                            color="#DDFFDD",
                            cumulative=True)
plt.show()
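
As a rough check of this interpretation, the cumulative bar heights can be compared with the empirical cumulative distribution function evaluated at the right edge of each bin. The seed and the names sorted_values and ecdf are assumptions introduced for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
gaussian_numbers = rng.normal(size=10000)

n, bins, patches = plt.hist(gaussian_numbers,
                            bins=100,
                            density=True,
                            cumulative=True)

# Empirical CDF at the right edge of each bin: fraction of values <= edge.
sorted_values = np.sort(gaussian_numbers)
ecdf = np.searchsorted(sorted_values, bins[1:], side='right') / sorted_values.size

print(n[-1])                     # height of the last bar
print(np.max(np.abs(n - ecdf)))  # largest deviation from the empirical CDF
```

The bar heights never decrease from left to right, and the last bar reaches 1 because it accounts for all values.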

Added by angelkay73 on Wed, 09 Mar 2022 15:07:06 +0200