Python Learning Notes: Linear Regression with StatsModels

1. Background knowledge

1.1 Interpolation, Fitting, Regression and Prediction

Interpolation, fitting, regression and prediction are all concepts often mentioned in mathematical modeling and are often confused.

  • Interpolation constructs a continuous function on the basis of discrete data so that the resulting curve passes through all of the given data points. It is an important method for approximating a function: the values of the function at a limited number of points are used to estimate its values at other points.
  • Fitting finds a continuous function (curve) that approximates the given discrete data so that the curve matches the data as closely as possible overall.

Therefore, interpolation and fitting both seek approximate curves that reflect the variation and characteristics of the known data points. The difference is that interpolation requires the curve to pass exactly through every given data point, while fitting only requires the curve to be as close to the points as possible overall, reflecting the trend and pattern of the data. Interpolation can be seen as a special kind of fitting in which the error is required to be 0. Since data points usually contain errors, an error of 0 often means over-fitting, and an over-fitted model generalizes poorly to data outside the training set. In practice, therefore, interpolation is mostly used in image processing, while fitting is mostly used in processing experimental data.

  • Regression is a statistical analysis method for studying the relationship between one set of random variables and another. It includes establishing a mathematical model, estimating the model parameters, and verifying the reliability of the model, as well as using the fitted model and estimated parameters for prediction or control.

  • Forecasting is a very broad concept. In mathematical modeling it refers to the quantitative study of the available data and information, from which a mathematical model suited to the purpose of forecasting is built, and then future development and change are predicted quantitatively. Interpolation and fitting are generally methods of the prediction class.

Regression is a data analysis method, while fitting is a specific data processing technique. Fitting focuses on optimizing the curve parameters so that the curve matches the data; regression focuses on the relationship between two or more variables.
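The difference between interpolation and fitting is easy to see in code. The following minimal sketch (the variable names are illustrative, not from the original program) interpolates a set of noisy points with np.interp, which passes through every sample exactly, and fits them with np.polyfit, which only minimizes the overall error:

import numpy as np

x = np.linspace(0, 10, 11)                # known sample points
y = 2.0 * x + np.random.normal(size=11)   # noisy observations

# Interpolation: the curve passes through every given point exactly
xNew = np.linspace(0, 10, 101)
yInterp = np.interp(xNew, x, y)           # piecewise-linear interpolation

# Fitting: a straight line that is only close to the points overall
b1, b0 = np.polyfit(x, y, deg=1)          # least-squares line y = b0 + b1*x
yFit = b0 + b1 * xNew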

1.2 Linear Regression

Regression analysis is a statistical analysis method that studies the quantitative relationship between independent and dependent variables and is often used in predictive analysis, time series models, and the discovery of causal relationships among variables. Regression analysis can be divided into linear regression and non-linear regression according to the type of relationship between variables.

Linear regression assumes that the target (y) and the features (X) in a given dataset have a linear relationship, that is, they satisfy a multivariate linear equation. When the regression includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called univariate linear regression. If two or more independent variables are included and they have a linear relationship with the dependent variable, it is called multivariate linear regression.

Based on the sample data, the least squares method can be used to estimate the parameters of the linear regression model, minimizing the sum of squared errors between the model values calculated from the estimated parameters and the given sample data.
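For reference, the least squares estimate has the closed form β = (XᵀX)⁻¹ Xᵀ y. A minimal numpy sketch of this computation (illustrative only; the statsmodels route used below is the recommended one):

import numpy as np

x = np.linspace(0, 10, 100)
y = 2.36 + 1.58 * x + np.random.normal(size=100)   # noisy sample data

X = np.column_stack((np.ones_like(x), x))    # design matrix [x0, x1], x0 = [1,...,1]
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares solution [b0, b1]
residual = y - X @ beta                      # errors between model and sample data
print(beta, (residual ** 2).sum())           # estimates and the sum of squared errors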

Furthermore, it is necessary to analyze whether linear regression can be applied to the sample data at all: whether the assumption of a linear relationship is reasonable, and whether the linear model is stable. This requires significance testing through statistical analysis, to check whether the linear relationship between the dependent and independent variables is significant and can be properly described by a linear model.

2. Linear Regression by Statsmodels

This section introduces linear fitting and regression analysis using the Statsmodels statistical analysis package. The linear model can be expressed as follows:
  y = β0 + β1*x1 + β2*x2 + ... + βm*xm + e

2.1 Import Toolkit

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

2.2 Importing Sample Data

Sample data is usually stored in a data file, and importing it means reading that file. To make the program easy to read and test, this article uses random numbers to generate the sample data; the method of reading data files to import data is described later.
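As a quick preview of the file-based route (a minimal sketch assuming a hypothetical CSV file 'data.csv' with columns named x and y):

import pandas as pd

df = pd.read_csv('data.csv')   # hypothetical file with columns 'x' and 'y'
x1 = df['x'].values            # independent variable as a numpy array
yTest = df['y'].values         # dependent variable as a numpy array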

# Generate sample data:
nSample = 100
x1 = np.linspace(0, 10, nSample)  # start at 0, end at 10, divided into nSample points
e = np.random.normal(size=len(x1))  # random numbers with a normal distribution
yTrue = 2.36 + 1.58 * x1  # y = b0 + b1*x1
yTest = yTrue + e  # generate model data

This case is a univariate linear regression problem; (yTest, x1) is the imported sample data, and we use linear regression to obtain the quantitative relationship between the dependent variable y and the independent variable x. yTrue is the value of the ideal model; yTest simulates experimentally measured data by adding normally distributed random errors to the ideal model.

2.3 Modeling and Fitting

The equation for the univariate linear regression model is:
  y = β0 + β1 * x + e
First, sm.add_constant() adds an intercept column to the matrix X; then sm.OLS() establishes the ordinary least squares model; finally, model.fit() fits the linear regression model and returns a summary of the fitting and statistical analysis results.

X = sm.add_constant(x1)  # add intercept column x0=[1,...,1] to the left of x1
model = sm.OLS(yTest, X)  # build the ordinary least squares model (OLS)
results = model.fit()  # return the model fitting results

sm.OLS (statsmodels.regression.linear_model.OLS) takes four parameters: (endog, exog, missing, hasconst).

The first parameter, endog, is the dependent variable y(t) of the regression model; it is a 1-d array.

The second parameter, exog, holds the independent variables x0(t), x1(t), ..., xm(t); it is an array with m+1 columns.
Note that the statsmodels OLS regression model has no built-in constant term; its form is:
  y = B*X + e = β0*x0 + β1*x1 + e, x0 = [1,...,1]
The imported data (yTest, x1) does not contain x0, so a column with the intercept x0=[1,...,1] must be added to the left of x1, converting the variable matrix to X = (x0, x1). The function sm.add_constant() does exactly this.
The missing parameter is used for data checking, and hasconst for checking the constant term; neither is normally needed.
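A quick way to see what sm.add_constant() does (a minimal sketch; the shapes in the comments are what numpy reports for this input):

import numpy as np
import statsmodels.api as sm

x1 = np.linspace(0, 10, 5)   # [0, 2.5, 5, 7.5, 10]
X = sm.add_constant(x1)      # prepends a column of ones
print(x1.shape, X.shape)     # (5,) -> (5, 2)
print(X[:2])                 # first rows: [[1. 0.], [1. 2.5]]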

2.4 Fitting and output of statistical results

The output of the Statsmodels linear regression analysis is very rich; results.summary() returns a summary of the regression analysis.

print(results.summary())  # output a summary of the regression analysis

The summary returns a wealth of content. Let us start with the most important results, which appear in the middle section of the summary.

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================
  • coef: the regression coefficients, i.e., the estimates of the model parameters β0, β1, ....

  • std err: the standard error of each coefficient estimate, the arithmetic square root of its variance; it reflects the average deviation between the sample data and the regression model estimates. The larger the standard error, the less reliable the regression coefficient.

  • t: the t-statistic, equal to the regression coefficient divided by its standard error; it tests each regression coefficient individually, to check whether each independent variable has a significant effect on the dependent variable. If the influence of an independent variable xi is not significant, that variable can be excluded from the model.

  • P>|t|: the p-value of the t-test, reflecting the significance of the hypothesis that each independent variable xi is related to the dependent variable y. If p < 0.05, then at the 0.05 significance level the regression relationship between the variable xi and y is significant.

  • [0.025, 0.975]: the lower and upper limits of the confidence interval for a regression coefficient at the 95% confidence level. Note that this does not mean the sample data fall into this interval with probability 95%.

    In addition, there are some important indicators to focus on:

  • R-squared: the coefficient of determination, a measure of the combined influence of all independent variables on the dependent variable. It measures the goodness of fit of the regression equation; the closer it is to 1, the better the fit.

  • F-statistic: the F-statistic, used to test the significance of the overall regression equation, i.e., whether all independent variables together have a significant effect on the dependent variable.

Statsmodels also exposes the data needed for regression analysis as attributes of the results object, for example:

print("OLS model: Y = b0 + b1 * x") # b0: intercept of regression line, slope of regression line
Print ('Parameters:', results.params) #Output: Coefficient for fitting the model
yFit = results.fittedvalues #fitted model calculated y value
Ax. Plot (x1, yTest,'o', label='data') #raw data
Ax. Plot (x1, yFit,'r-', label='OLS') #Fit the data
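Beyond params and fittedvalues, the results object exposes the other summary quantities as attributes as well (a short sketch; these are standard statsmodels RegressionResults attributes):

print('R-squared:', results.rsquared)          # coefficient of determination
print('Adj. R-squared:', results.rsquared_adj) # adjusted coefficient of determination
print('F-statistic:', results.fvalue)          # overall significance test
print('Prob (F-statistic):', results.f_pvalue) # p-value of the F-test
print('std err:', results.bse)                 # standard errors of the coefficients
print('t:', results.tvalues)                   # t statistic of each coefficient
print('P>|t|:', results.pvalues)               # p-values of the t-tests
print('Conf. intervals:', results.conf_int())  # 95% confidence intervals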

3. Univariate linear regression

3.1 Univariate linear regression Python program:

# LinearRegression_v1.py
# Linear Regression with statsmodels (OLS: Ordinary Least Squares)
# v1.0: Call statsmodels for univariate linear regression
# Date: 2021-05-04

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# main program
def main():  # main program

    # Generate test data:
    nSample = 100
    x1 = np.linspace(0, 10, nSample)  # Start at 0 and end at 10, divided into nSample points
    e = np.random.normal(size=len(x1))  # Normal Distribution Random Numbers
    yTrue = 2.36 + 1.58 * x1  #  y = b0 + b1*x1
    yTest = yTrue + e  # Generate model data

    # Univariate linear regression: least squares (OLS)
    X = sm.add_constant(x1)  # Add intercept column to matrix X (x0=[1,...1])
    model = sm.OLS(yTest, X)  # Establish Least Squares Model (OLS)
    results = model.fit()  # Return model fitting results
    yFit = results.fittedvalues  # y-values of model fitting
    prstd, ivLow, ivUp = wls_prediction_std(results) # Returns the standard deviation and confidence interval

    # OLS model: Y = b0 + b1*X + e
    print(results.summary())  # Summary of Output Regression Analysis
    print("
OLS model: Y = b0 + b1 * x")  # b0: intercept of regression line, b1: slope of regression line
    print('Parameters: ', results.params)  # Output: Coefficient to fit the model

    # Drawing: original data points, fitting curves, confidence intervals
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # Raw data
    ax.plot(x1, yFit, 'r-', label="OLS")  # Fit data
    ax.plot(x1, ivUp, '--',color='orange',label="upConf")  # Upper limit of 95% confidence interval
    ax.plot(x1, ivLow, '--',color='orange',label="lowConf")  # Lower limit of 95% confidence interval
    ax.legend(loc='best')  # Show Legend
    plt.title('OLS linear regression ')
    plt.show()
    return

if __name__ == '__main__': #YouCans, XUPT
    main()

3.2 Univariate linear regression program results:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.961
Model:                            OLS   Adj. R-squared:                  0.961
Method:                 Least Squares   F-statistic:                     2431.
Date:                Wed, 05 May 2021   Prob (F-statistic):           5.50e-71
Time:                        16:24:22   Log-Likelihood:                -134.62
No. Observations:                 100   AIC:                             273.2
Df Residuals:                      98   BIC:                             278.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================
Omnibus:                        0.070   Durbin-Watson:                   2.016
Prob(Omnibus):                  0.966   Jarque-Bera (JB):                0.187
Skew:                           0.056   Prob(JB):                        0.911
Kurtosis:                       2.820   Cond. No.                         11.7
==============================================================================

OLS model: Y = b0 + b1 * x
Parameters:  [2.46688389 1.58832741]

4. Multivariate Linear Regression

4.1 Multivariate Linear Regression Python program:

# LinearRegression_v2.py
# Linear Regression with statsmodels (OLS: Ordinary Least Squares)
# v2.0: Call statsmodels for multivariate linear regression
# Date: 2021-05-04

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# main program
def main():  # main program

    # Generate test data:
    nSample = 100
    x0 = np.ones(nSample)  # Intercept column x0=[1,...1]
    x1 = np.linspace(0, 20, nSample)  # Start at 0 and end at 20, divided into nSample points
    x2 = np.sin(x1)
    x3 = (x1-5)**2
    X = np.column_stack((x0, x1, x2, x3))  # (nSample,4): [x0,x1,x2,...,xm]
    beta = [5., 0.5, 0.5, -0.02]  # beta = [b0,b1,b2,b3]
    yTrue = np.dot(X, beta)  # vector dot product y = b0*x0 + b1*x1 + ... + bm*xm
    yTest = yTrue + 0.5 * np.random.normal(size=nSample)  # Generate model data
    
    # Multivariate linear regression: least squares (OLS)
    model = sm.OLS(yTest, X)  # Establish OLS model: Y = b0 + b1*X1 + ... + bm*Xm + e
    results = model.fit()  # Return model fitting results
    yFit = results.fittedvalues  # y-values of model fitting
    print(results.summary())  # Summary of Output Regression Analysis
    print("
OLS model: Y = b0 + b1*X + ... + bm*Xm")
    print('Parameters: ', results.params)  # Output: Coefficient to fit the model    

    # Drawing: original data points, fitting curves, confidence intervals
    prstd, ivLow, ivUp = wls_prediction_std(results) # Returns the standard deviation and confidence interval
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # Experimental data (raw data + error)
    ax.plot(x1, yTrue, 'b-', label="True")  # Raw data
    ax.plot(x1, yFit, 'r-', label="OLS")  # Fit data
    ax.plot(x1, ivUp, '--', color='orange', label="ConfInt")  # upper limit of the confidence interval
    ax.plot(x1, ivLow, '--', color='orange')  # lower limit of the confidence interval
    ax.legend(loc='best')  # Show Legend
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
    return

if __name__ == '__main__':
    main()
    

4.2 Multivariate linear regression program results:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     440.0
Date:                Thu, 06 May 2021   Prob (F-statistic):           6.04e-56
Time:                        10:38:51   Log-Likelihood:                -68.709
No. Observations:                 100   AIC:                             145.4
Df Residuals:                      96   BIC:                             155.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0411      0.120     41.866      0.000       4.802       5.280
x1             0.4894      0.019     26.351      0.000       0.452       0.526
x2             0.5158      0.072      7.187      0.000       0.373       0.658
x3            -0.0195      0.002    -11.957      0.000      -0.023      -0.016
==============================================================================
Omnibus:                        1.472   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.479   Jarque-Bera (JB):                1.194
Skew:                           0.011   Prob(JB):                        0.551
Kurtosis:                       2.465   Cond. No.                         223.
==============================================================================

OLS model: Y = b0 + b1*X + ... + bm*Xm
Parameters:  [ 5.04111867  0.4893574   0.51579806 -0.01951219]
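Once fitted, the model can also be used for prediction on new observations through results.predict(); a minimal sketch continuing the multivariate example above (the new input values are illustrative):

# Predict y for new values of the independent variables
x1New = np.linspace(20, 22, 5)   # hypothetical new inputs beyond the training range
XNew = np.column_stack((np.ones(5), x1New, np.sin(x1New), (x1New - 5) ** 2))
yPred = results.predict(XNew)    # apply the fitted coefficients to the new rows
print(yPred)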

5. Appendix: Detailed explanation of regression results

    Dep. Variable: y, the dependent variable
    Model: OLS, the ordinary least squares model
    Method: Least Squares
    No. Observations: number of sample data points
    Df Residuals: degrees of freedom of the residuals
    Df Model: degrees of freedom of the model
    Covariance Type: nonrobust, robustness of the covariance matrix
    R-squared: coefficient of determination
    Adj. R-squared: adjusted coefficient of determination
    F-statistic: the F statistic of the significance test
    Prob (F-statistic): p-value of the F-test
    Log-Likelihood: log-likelihood of the fitted model

    coef: the estimated coefficients of the constant term and the independent variables, b0, b1, b2, ..., bm
    std err: standard error of the coefficient estimates
    t: the t statistic of the significance test
    P>|t|: p-value of the t-test
    [0.025, 0.975]: lower and upper limits of the 95% confidence interval of the estimated parameter
    Omnibus: test of data normality based on skewness and kurtosis
    Prob(Omnibus): probability for the normality test based on skewness and kurtosis
    Durbin-Watson: test for autocorrelation in the residuals
    Skewness: skewness, reflecting the degree of asymmetry of the data distribution
    Kurtosis: kurtosis, reflecting the steepness or flatness of the data distribution
    Jarque-Bera (JB): test of data normality based on skewness and kurtosis
    Prob(JB): p-value of the Jarque-Bera (JB) test
    Cond. No.: test for exact or high correlation between the variables (multicollinearity)
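Several of these diagnostics can also be computed directly from the residuals (a minimal sketch using functions from statsmodels.stats.stattools):

from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = results.resid                            # residuals of the fitted model
print('Durbin-Watson:', durbin_watson(resid))    # autocorrelation check (about 2 means none)
jb, jbPv, skew, kurtosis = jarque_bera(resid)    # normality check based on skewness/kurtosis
print('Jarque-Bera:', jb, 'p-value:', jbPv)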

Copyright Notes:
Original YouCans
Copyright 2021 YouCans, XUPT
Created: 2021-05-05
