1. Background knowledge
1.1 Interpolation, Fitting, Regression and Prediction
Interpolation, fitting, regression and prediction are concepts often mentioned in mathematical modeling, and they are easily confused.
- Interpolation constructs a continuous function from discrete data so that the continuous curve passes through all of the given data points. It is an important method for approximating a discrete function: the values of the function at a limited number of points are used to estimate its values at other points.
- Fitting finds a continuous function (curve) that approximates the given discrete data, so that the curve agrees with the data as closely as possible.
Interpolation and fitting are therefore both processes of finding an approximate curve that reflects the variation and characteristics of known data points. The difference is that interpolation requires the approximate curve to pass exactly through the given data points, while fitting only requires the curve to be as close to the data points as possible overall, reflecting the law of change and development trend of the data. Interpolation can be viewed as a special kind of fitting in which the error function is required to be 0. Since data points usually contain errors, an error of 0 often means over-fitting, and an over-fitted model generalizes poorly to data outside the training set. In practice, therefore, interpolation is mostly used in image processing, while fitting is mostly used for processing experimental data (a minimal sketch contrasting the two follows this list).
- Regression is a statistical analysis method for studying the relationship between one set of random variables and another. It involves establishing a mathematical model, estimating its parameters, and verifying the reliability of the model; it also covers using the established model and estimated parameters for prediction or control.
- Prediction is a very broad concept. In mathematical modeling, it means studying the obtained data and information quantitatively, establishing from them a mathematical model suitable for forecasting, and then predicting future development and change quantitatively. Interpolation and fitting are generally classed as prediction methods.
In short, regression is a data analysis method, while fitting is a specific data processing technique: fitting focuses on optimizing the curve parameters so that the curve conforms to the data, whereas regression focuses on the relationship between two or more variables.
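As a minimal sketch of the difference (assuming NumPy and SciPy are available; the small xData/yData arrays are made up for illustration):

import numpy as np
from scipy.interpolate import interp1d

xData = np.array([0., 1., 2., 3., 4.])
yData = np.array([1.2, 2.8, 5.1, 7.2, 8.9])
fInterp = interp1d(xData, yData, kind='cubic') # interpolation: the curve passes through every point
pFit = np.polyfit(xData, yData, deg=1) # fitting: the straight line closest to the points overall
print(fInterp(2.5)) # interpolated estimate between known points
print(np.polyval(pFit, 2.5)) # value of the fitted line at the same point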
1.2 Linear Regression
Regression analysis is a statistical method for studying the quantitative relationship between independent and dependent variables; it is often used for predictive analysis, time series models, and discovering causal relationships among variables. According to the type of relationship between the variables, regression analysis is divided into linear regression and non-linear regression.
Linear regression assumes that in a given dataset there is a linear relationship between the target (y) and the features (X), i.e. that they satisfy a multivariate linear equation. When the analysis involves only one independent variable and one dependent variable and their relationship can be approximated by a straight line, it is called univariate linear regression. When two or more independent variables are involved and the dependent variable depends linearly on them, it is called multivariate linear regression.
Based on the sample data, the least squares method can be used to estimate the parameters of the linear regression model by minimizing the sum of squared errors between the model values computed from the estimated parameters and the given sample data.
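The least squares idea can be sketched directly in NumPy, independent of any statistics package (the toy x, y arrays below are made up for illustration):

import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1]) # roughly y = 1 + 2*x plus noise
X = np.column_stack((np.ones_like(x), x)) # design matrix with an intercept column
beta, res, rank, sv = np.linalg.lstsq(X, y, rcond=None) # minimizes the sum of squared errors ||y - X*beta||^2
print('Estimated [b0, b1]:', beta)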
Furthermore, it is necessary to check whether linear regression can be applied to the sample data at all: whether the assumption of a linear relationship is reasonable and whether the linear model is stable. This requires significance testing by statistical analysis, to verify that the linear relationship between the dependent variable and the independent variables is significant and that a linear model describes the relationship properly.
2. Linear Regression by Statsmodels
This section introduces linear fitting and regression analysis using the Statsmodels statistical analysis package. The linear model can be expressed as follows:
Y = β0 + β1*x1 + β2*x2 + ... + βm*xm + e
2.1 Import Toolkit
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
2.2 Importing Sample Data
Sample data are usually stored in a data file, and the sample data are obtained by reading the data file. To make the program easy to read and test, this article generates the sample data with random numbers; importing data by reading a data file is described later.
#Generate sample data:
nSample = 100
x1 = np.linspace(0, 10, nSample) #Start at 0, end at 10, divided into nSample points
e = np.random.normal(size=len(x1)) #random numbers with normal distribution
yTrue = 2.36 + 1.58 * x1 # y = b0 + b1*x1
yTest = yTrue + e #Generate model data
This case is a univariate linear regression problem: (yTest, x1) is the imported sample data, and linear regression is used to obtain the quantitative relationship between the dependent variable y and the independent variable x. yTrue is the value given by the ideal model, while yTest simulates experimentally measured data by adding normally distributed random errors to the ideal model.
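For reference, importing sample data from a file instead might look like the following sketch (the file name data.csv and the column names x1 and y are hypothetical):

import pandas as pd

df = pd.read_csv('data.csv') # hypothetical data file with columns x1 and y
x1 = df['x1'].values # independent variable
yTest = df['y'].values # dependent variable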
2.3 Modeling and Fitting
The equation for the univariate linear regression model is:
y = β0 + β1 * x + e
First, sm.add_constant() adds an intercept column to the matrix X; then sm.OLS() establishes the ordinary least squares model; finally, model.fit() fits the linear regression model and returns a summary of the fitting and statistical analysis results.
X = sm.add_constant(x1) #Add intercept column x0=[1,...1] to the left of x1
model = sm.OLS(yTest, X) #Build least squares model (OLS)
results = model.fit()#Returns model fit results
sm.OLS, i.e. statsmodels.regression.linear_model.OLS, takes four parameters (endog, exog, missing, hasconst).
The first parameter, endog, is the dependent variable y(t) in the regression model; it is a 1-d array.
The second parameter, exog, holds the independent variables x0(t), x1(t), ..., xm(t); it is a 2-d array of shape (n, m+1).
Note that the statsmodels OLS regression model has no separate constant term; its form is:
y = B*X + e = β0*x0 + β1*x1 + e,  x0 = [1,...1]
The previously imported data (yTest, x1) do not contain x0, so a column with the intercept x0=[1,...1] must be added to the left of x1, turning the matrix of independent variables into X = (x0, x1). The function sm.add_constant() implements exactly this.
The parameter missing is used for data checking and hasconst for checking the constant term; normally neither needs to be set.
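A quick way to see what sm.add_constant() does is to inspect the shapes before and after (the shapes shown assume the 100-point x1 generated above):

X = sm.add_constant(x1)
print(x1.shape) # (100,): the original 1-d independent variable
print(X.shape) # (100, 2): the first column is x0=[1,...1], the second is x1
print(X[:2]) # first two rows, approximately [[1. 0.], [1. 0.101...]]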
2.4 Fitting and output of statistical results
The output of the Statsmodels linear regression analysis is very rich; results.summary() returns a summary of the regression analysis.
print(results.summary()) #print the summary of the regression analysis
The summary contains a wealth of content. Let us start with the most important results, which appear in the middle block of the summary.
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================
- coef: the regression coefficients, i.e. the estimates of the model parameters β0, β1, ....
- std err: the standard error of the coefficient estimate, the arithmetic square root of its variance, reflecting the average deviation between the sample data and the regression model estimates. The larger the standard error, the less reliable the regression coefficient.
- t: the t statistic, equal to the regression coefficient divided by its standard error. It tests each regression coefficient individually, i.e. whether each independent variable has a significant effect on the dependent variable. If the influence of an independent variable xi is not significant, that variable can be excluded from the model.
- P>|t|: the P value of the t-test (Prob(t-statistic)), reflecting the significance of the hypothesis that the independent variable xi is related to the dependent variable y. If P < 0.05, then at the 0.05 significance level the variable xi has a significant regression relationship with y.
- [0.025, 0.975]: the lower and upper limits of the 95% confidence interval for the regression coefficient. Note that this does not mean the sample data fall into this interval with probability 95%.
In addition, there are some important indicators to focus on:
- R-squared: the coefficient of determination, a measure of the combined influence of all independent variables on the dependent variable. It measures the goodness of fit of the regression equation; the closer it is to 1, the better the fit.
- F-statistic: the F statistic, used to test the significance of the regression equation as a whole, i.e. whether all independent variables together have a significant effect on the dependent variable.
Statsmodels can also provide the data needed for the regression analysis through attributes of the results object, for example:
print("OLS model: Y = b0 + b1 * x") # b0: intercept of regression line, slope of regression line
Print ('Parameters:', results.params) #Output: Coefficient for fitting the model
yFit = results.fittedvalues #fitted model calculated y value
Ax. Plot (x1, yTest,'o', label='data') #raw data
Ax. Plot (x1, yFit,'r-', label='OLS') #Fit the data
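The other quantities in the summary are also available as attributes or methods of the results object; the names below are part of the Statsmodels results API:

print(results.bse) # standard errors of the coefficients (std err column)
print(results.tvalues) # t statistics
print(results.pvalues) # P>|t| values of the t-tests
print(results.conf_int()) # 95% confidence intervals of the coefficients
print(results.rsquared) # coefficient of determination R-squared
print(results.fvalue) # F-statistic of the overall significance test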
3. Univariate linear regression
3.1 Univariate linear regression Python program:
# LinearRegression_v1.py
# Linear Regression with statsmodels (OLS: Ordinary Least Squares)
# v1.0: Call statsmodels for univariate linear regression
# Date: 2021-05-04

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

def main():  # main program
    # Generate test data:
    nSample = 100
    x1 = np.linspace(0, 10, nSample)  # start at 0, end at 10, divided into nSample points
    e = np.random.normal(size=len(x1))  # normally distributed random numbers
    yTrue = 2.36 + 1.58 * x1  # y = b0 + b1*x1
    yTest = yTrue + e  # generate model data

    # Univariate linear regression: least squares (OLS)
    X = sm.add_constant(x1)  # add intercept column x0=[1,...1] to matrix X
    model = sm.OLS(yTest, X)  # establish the least squares model (OLS)
    results = model.fit()  # return the model fitting results
    yFit = results.fittedvalues  # y values of the fitted model
    prstd, ivLow, ivUp = wls_prediction_std(results)  # returns the standard deviation and confidence interval

    # OLS model: Y = b0 + b1*X + e
    print(results.summary())  # summary of the regression analysis
    print("OLS model: Y = b0 + b1 * x")  # b0: intercept of regression line, b1: slope of regression line
    print('Parameters: ', results.params)  # coefficients of the fitted model

    # Plot: original data points, fitted curve, confidence interval
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # raw data
    ax.plot(x1, yFit, 'r-', label="OLS")  # fitted data
    ax.plot(x1, ivUp, '--', color='orange', label="upConf")  # upper limit of 95% confidence interval
    ax.plot(x1, ivLow, '--', color='orange', label="lowConf")  # lower limit of 95% confidence interval
    ax.legend(loc='best')  # show legend
    plt.title('OLS linear regression')
    plt.show()
    return

if __name__ == '__main__':  # YouCans, XUPT
    main()
3.2 Univariate linear regression program results:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.961
Model:                            OLS   Adj. R-squared:                  0.961
Method:                 Least Squares   F-statistic:                     2431.
Date:                Wed, 05 May 2021   Prob (F-statistic):           5.50e-71
Time:                        16:24:22   Log-Likelihood:                -134.62
No. Observations:                 100   AIC:                             273.2
Df Residuals:                      98   BIC:                             278.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================
Omnibus:                        0.070   Durbin-Watson:                   2.016
Prob(Omnibus):                  0.966   Jarque-Bera (JB):                0.187
Skew:                           0.056   Prob(JB):                        0.911
Kurtosis:                       2.820   Cond. No.                         11.7
==============================================================================
OLS model: Y = b0 + b1 * x
Parameters:  [2.46688389 1.58832741]
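With the fitted results, the model can also predict y at new x values through results.predict(); the new sample points xNew below are hypothetical:

xNew = np.array([10.5, 11.0, 12.0]) # hypothetical new observations outside the sample
XNew = sm.add_constant(xNew) # add the same intercept column as in fitting
yPred = results.predict(XNew) # y values estimated by the fitted model
print(yPred)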
4. Multivariate Linear Regression
4.1 Multivariate Linear Regression Python program:
# LinearRegression_v2.py
# Linear Regression with statsmodels (OLS: Ordinary Least Squares)
# v2.0: Call statsmodels for multivariate linear regression
# Date: 2021-05-04

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

def main():  # main program
    # Generate test data:
    nSample = 100
    x0 = np.ones(nSample)  # intercept column x0=[1,...1]
    x1 = np.linspace(0, 20, nSample)  # start at 0, end at 20, divided into nSample points
    x2 = np.sin(x1)
    x3 = (x1 - 5) ** 2
    X = np.column_stack((x0, x1, x2, x3))  # (nSample,4): [x0,x1,x2,x3]
    beta = [5., 0.5, 0.5, -0.02]  # beta = [b0,b1,b2,b3]
    yTrue = np.dot(X, beta)  # vector dot product y = b0*x0 + b1*x1 + ... + bm*xm
    yTest = yTrue + 0.5 * np.random.normal(size=nSample)  # generate model data

    # Multivariate linear regression: least squares (OLS)
    model = sm.OLS(yTest, X)  # establish OLS model: Y = b0 + b1*X1 + ... + bm*Xm + e
    results = model.fit()  # return the model fitting results
    yFit = results.fittedvalues  # y values of the fitted model
    print(results.summary())  # summary of the regression analysis
    print("OLS model: Y = b0 + b1*X + ... + bm*Xm")
    print('Parameters: ', results.params)  # coefficients of the fitted model

    # Plot: original data points, fitted curve, confidence interval
    prstd, ivLow, ivUp = wls_prediction_std(results)  # returns the standard deviation and confidence interval
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # experimental data (raw data + error)
    ax.plot(x1, yTrue, 'b-', label="True")  # raw data
    ax.plot(x1, yFit, 'r-', label="OLS")  # fitted data
    ax.plot(x1, ivUp, '--', color='orange', label="ConfInt")  # upper limit of confidence interval
    ax.plot(x1, ivLow, '--', color='orange')  # lower limit of confidence interval
    ax.legend(loc='best')  # show legend
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
    return

if __name__ == '__main__':
    main()
4.2 Multivariate linear regression program results:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     440.0
Date:                Thu, 06 May 2021   Prob (F-statistic):           6.04e-56
Time:                        10:38:51   Log-Likelihood:                -68.709
No. Observations:                 100   AIC:                             145.4
Df Residuals:                      96   BIC:                             155.8
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0411      0.120     41.866      0.000       4.802       5.280
x1             0.4894      0.019     26.351      0.000       0.452       0.526
x2             0.5158      0.072      7.187      0.000       0.373       0.658
x3            -0.0195      0.002    -11.957      0.000      -0.023      -0.016
==============================================================================
Omnibus:                        1.472   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.479   Jarque-Bera (JB):                1.194
Skew:                           0.011   Prob(JB):                        0.551
Kurtosis:                       2.465   Cond. No.                         223.
==============================================================================
OLS model: Y = b0 + b1*X + ... + bm*Xm
Parameters:  [ 5.04111867  0.4893574   0.51579806 -0.01951219]
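As a side note, the same multivariate regression can be written with the Statsmodels formula interface, which adds the intercept automatically; this is a sketch assuming the sample data are first collected in a pandas DataFrame:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'y': yTest, 'x1': x1, 'x2': x2, 'x3': x3})
resultsF = smf.ols('y ~ x1 + x2 + x3', data=df).fit() # intercept column added automatically
print(resultsF.params) # should be close to [b0, b1, b2, b3] above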
5. Appendix: Detailed explanation of regression results
Dep. Variable: y, the dependent variable
Model: OLS, the least squares model
Method: Least Squares
No. Observations: number of sample observations
Df Residuals: degrees of freedom of the residuals
Df Model: degrees of freedom of the model
Covariance Type: nonrobust, robustness of the covariance matrix
R-squared: coefficient of determination
Adj. R-squared: adjusted coefficient of determination
F-statistic: F statistic of the overall significance test
Prob (F-statistic): P value of the F-test
Log-Likelihood: log likelihood
coef: coefficients of the independent variables and the constant term, b0, b1, b2, ..., bm
std err: standard error of the coefficient estimates
t: t statistic of the significance test
P>|t|: P value of the t-test
[0.025, 0.975]: lower and upper limits of the 95% confidence interval of the estimated parameter
Omnibus: test of data normality based on kurtosis and skewness
Prob(Omnibus): probability of the Omnibus normality test
Durbin-Watson: test for autocorrelation in the residuals
Skewness: skewness, reflecting the asymmetry of the data distribution
Kurtosis: kurtosis, reflecting the steepness or flatness of the data distribution
Jarque-Bera (JB): test of data normality based on kurtosis and skewness
Prob(JB): P value of the Jarque-Bera test
Cond. No.: test for exact or high correlation between variables
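Several of these diagnostics can also be computed directly from the residuals; durbin_watson and jarque_bera below are functions in statsmodels.stats.stattools:

from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = results.resid # residuals of the fitted model
print(durbin_watson(resid)) # a value close to 2 suggests no autocorrelation
jb, jbPV, skew, kurt = jarque_bera(resid) # normality test based on skewness and kurtosis
print(jb, jbPV) # a large P value means normality of the residuals is not rejected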
Copyright Notes:
Original YouCans
Copyright 2021 YouCans, XUPT
Created: 2021-05-05