Machine Learning, Chapter 4: Advanced Linear Regression Algorithms, Section 4.1 (learning notes on Big Talk Python Machine Learning)

Chapter 4: Advanced Linear Regression Algorithms

   The multivariable linear regression algorithm extends the univariate linear regression algorithm and overcomes its practical limitation of allowing only one feature variable, so it is widely used.
   The conventional solution of multivariable linear regression places specific requirements on the variables, which often cannot be met in practice; problems such as overfitting also arise. Therefore, on top of the basic solution, regularization, ridge regression, and Lasso regression are introduced to further optimize and extend the multivariable linear regression algorithm.
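The extensions mentioned here all share the same fit/predict interface in scikit-learn; below is a minimal sketch on synthetic data, where the alpha values and the data itself are illustrative assumptions rather than examples from the book:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 3 features, only the first two actually matter
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

models = {
    'ols': LinearRegression(),      # ordinary least squares
    'ridge': Ridge(alpha=1.0),      # L2 penalty shrinks coefficients
    'lasso': Lasso(alpha=0.01),     # L1 penalty can zero coefficients out
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_)
```

All three estimators expose the fitted coefficients as `coef_`, which makes it easy to compare how each penalty changes the solution.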

4.1 Multivariable linear regression algorithm

4.1.1 Least-squares solution of the multivariable linear regression algorithm

Basic model:

    hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

  • hθ(x) is the hypothesis function; θ0, θ1, θ2, θ3, …, θn are the regression parameters to be solved
       since there are n feature variables x1, …, xn, setting x0 = 1 lets the model be expressed in matrix form:

    hθ(x) = θᵀx

    where:

    θ = (θ0, θ1, …, θn)ᵀ,  x = (x0, x1, …, xn)ᵀ

    Cost function:

    J(θ) = (1 / 2m) · Σ_{i=1}^{m} (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

  • m: number of training samples in the training set
  • (x⁽ⁱ⁾, y⁽ⁱ⁾): the ith training sample; the superscript i is an index indicating the ith sample
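The cost function above can be sketched directly in NumPy; the design matrix, targets, and parameter values below are a tiny made-up example, not data from the book:

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/(2m) * sum((X @ theta - y) ** 2)."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

# Two samples, an intercept column (x0 = 1) and one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 3.0])

print(cost(np.array([1.0, 1.0]), X, y))  # perfect fit: cost is 0.0
print(cost(np.zeros(2), X, y))           # (4 + 9) / (2 * 2) = 3.25
```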

Solving for θ (the normal equation):

    θ = (XᵀX)⁻¹ Xᵀy

  • Here it is assumed that matrix X has full column rank n+1, i.e. that (XᵀX)⁻¹ exists
  • Causes of (XᵀX) being non-invertible:
       there is a high degree of multicollinearity between the independent variables, e.g. x2 = 2x1
       there are too many feature variables (the model is too complex) relative to the amount of training data. Solution: use regularization and ridge regression
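Both the normal-equation solution and the multicollinearity failure mode can be demonstrated with NumPy; the synthetic data below is an illustration, not the book's cinema dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: an intercept column of ones plus 3 features
X = np.column_stack([np.ones(50), rng.random((50, 3))])
true_theta = np.array([2.0, 0.5, -1.0, 3.0])
y = X @ true_theta + rng.normal(scale=0.01, size=50)

# Normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# Perfect multicollinearity (x2 = 2 * x1) makes X^T X singular;
# the pseudo-inverse still yields a minimum-norm least-squares solution
X_bad = X.copy()
X_bad[:, 2] = 2 * X_bad[:, 1]
theta_bad = np.linalg.pinv(X_bad.T @ X_bad) @ X_bad.T @ y
```

With clean data the recovered `theta` is close to `true_theta`; with the duplicated column, `np.linalg.inv` would raise or be numerically unstable, which is exactly the situation regularization and ridge regression address.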

4.1.2 Python implementation of multivariable linear regression: fitting cinema audience numbers (I)

Visualization of multivariate data

  1. Histogram (including the data import)

# histogram
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('D:/PythonProject/machine/data/3_film.csv')
#     xlabelsize=12: x-axis tick label size
#     ylabelsize=12: y-axis tick label size
#     figsize=(12,7): size of the entire figure
df.hist(xlabelsize=12, ylabelsize=12, figsize=(12, 7))
plt.show()

  2. Density plot

# density plot
#     kind: plot type
#     subplots=True: draw multiple subplots
#     layout=(2,2): arrange the subplots in a 2*2 grid
#     sharex=False: subplots do not share the x axis
#     fontsize=8: font size
df.plot(kind='density', subplots=True, layout=(2, 2), sharex=False, fontsize=8)
plt.show()

  3. Box plot

# box plot (subplot layout assumed to mirror the density plot above)
df.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, fontsize=8)
plt.show()

  4. Correlation-coefficient heat map

# correlation-coefficient heat map
# set the variable names
names = ['filmnum','filmsize','ratio','quality']
# compute the correlation-coefficient matrix between the variables
correlations = df.corr()
# call figure to create a figure object
fig = plt.figure()
# add the first (and only) subplot to the figure
ax = fig.add_subplot(111)
# draw the heat map, with the color scale ranging from 0.3 to 1
cax = ax.matshow(correlations, vmin=0.3, vmax=1)
# add a color gradient bar for the heat map generated by matshow
fig.colorbar(cax)
# generate 0-3 in steps of 1 (one tick per variable)
ticks = np.arange(0, 4, 1)
# set the x/y axis ticks
ax.set_xticks(ticks)
ax.set_yticks(ticks)
# set the x/y axis labels
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

  5. Scatter-plot matrix

# scatter-plot matrix
#     df: data source
#     figsize=(8,8): figure size
#     c='b': color of the scatter points
pd.plotting.scatter_matrix(df, figsize=(8, 8), c='b')
plt.show()

Data fitting and prediction with the multivariable linear regression algorithm

  1. Select the feature variables and the response variable, and split the data

from sklearn.model_selection import train_test_split

df = pd.read_csv('D:/PythonProject/machine/data/3_film.csv')
# select the X variables from the data
#     :,1:4: selects columns 2-4 of the dataset
X = df.iloc[:,1:4]
# set filmnum as the target y
y = df.filmnum
# convert X and y into array form for easy calculation
X = np.array(X.values)
y = np.array(y.values)
# use 25% of the data as test samples and the rest as training samples
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(94, 3) (32, 3) (94,) (32,)

  2. Run the linear regression and output the results

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
# print the fitted coefficients
print('a =', lr.coef_)

a=[ 0.37048549 -0.03831678 0.23046921]

  3. Predict on the test set using the fitted parameters

y_hat = lr.predict(X_test)
print(y_hat)

[20.20848598 74.31231952 66.97828797 50.61650336 50.53930128 44.72762082
57.00320531 35.55222669 58.49953514 19.43063402 27.90136964 40.25616051
40.81879843 40.01387623 24.56900454 51.36815239 38.97648053 39.25651308
65.4877603 60.82558336 54.29943364 40.45641818 29.69241868 49.29096985
44.60028689 48.05074366 35.23588166 72.29071323 53.79760562 51.94308584
46.42621262 73.37680499]

  4. Compare the actual values in the test set with the predicted values

# create the t variable (index of the test samples)
t = np.arange(len(X_test))
# draw the y_test curve (colors and line widths assumed)
plt.plot(t, y_test, 'r', linewidth=2, label='y_test')
# draw the y_hat curve
plt.plot(t, y_hat, 'g', linewidth=2, label='y_hat')
plt.legend()
plt.show()

  5. Evaluate the prediction results

from sklearn import metrics
print('R2 =', metrics.r2_score(y_test, y_hat))   # goodness of fit R2, method 1
print('R2 =', lr.score(X_test, y_test))          # goodness of fit R2, method 2
print('MAE =', metrics.mean_absolute_error(y_test, y_hat))
print('MSE =', metrics.mean_squared_error(y_test, y_hat))
print('RMSE =', np.sqrt(metrics.mean_squared_error(y_test, y_hat)))


Data file:
Extraction code: h4ex


Added by genix2011 on Tue, 04 Jan 2022 17:00:26 +0200