Linear regression prediction model

Univariate linear regression model:

E: Model Error Term, Equilibrium Equal Side Value

import seaborn as sns
income = pd.read_csv(r'Salary_Date.csv')
sns.lmplot(x='YearExperience',y='Salary',
			data=income,ci=None)
plt.show()

Linear fitting solution:
Minimum Error Term and Minimum Conversion to Minimum Square Error Term

At the minimum, the partial derivative is 0.

Using Basic Grammar to Solve

n = income.shape[0]
sum_x = income.YearsExperience.sum()
sum_y = income.Salary.sum()
sum_x2 = income.YearsExperience.pow(2).sum()
xy = income.YearsExperience * income.Salary
sum_xy = xy.sum()
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y.mean() - b * sum_x.mean()

(2) Using ols function in statsmodels
ols(formula,data,subset=None,drop_cols)
formula: 'y~x'
Subset: bool type, subset modeling

import statsmodels.api as sm
fit = sm.formula.ols('income.Salary ~ income.YearsExperience',data=income).fit()
fit.params

multiple linear regression

The data set of multivariate linear regression consists of n observations and p+1 variables (p independent variables and 1 dependent variable).

One-dimensional Vector of Beta:p1, Partial Regression Coefficient
One-Dimensional Vector of E:n1, Error Term

Constant, transposed to itself

Prodict_Profit.xlsx

State is a character discrete variable

from sklearn import model_selection
import statsmodels.api as sm
profit = pd.read_excel(r'Prodict_Profit.xlsx')
train,test = model_selection.train_test_split(profit,test_size=0.2,random_state=1234)
model = sm.formula.ols('Profit ~ RD_Spend + Administration + Marketing_Spend + C(State)',data=train).fit()
print(model.params)
test_X = test.drop(labels='Profit',axis=1)
pred = model.predict(exog=test.X)
print({'prediction':pred,'real':test.Porfit})

State is a discrete variable, dumb variable, nested in C(), expressed as category variable.
State: Cal, Flo, NY. Cal was used as the control group by default in the model.

NY as control group

dummies = pd.get_dummies(profit.State)
profit_new = pd.concat([profit,dummies],axis=1)
profit_new.drop(labels=['State','New York'],axis=1,inplace=True)
train,test = model_selection.train_test_split(profit_new,test_size=0.2,random_state=1234)
model2 = sm.formula.ols('Profit ~  RD_Spend + Administration + Marketing_Spend + Flo + Cal',data=train).fit()
print(model2.params)

Dummy variables evolved from state retain only Flo and Al in the regression coefficients, while NY serves as the reference group.
Take RD_Spend and Fla as examples: with other variables unchanged, R&D cost increases by 1 yuan and profit increases by 0.8 yuan; with NY as the benchmark, sales in Flo increases by 1440 yuan.

Hypothesis Test of Regression Model

The saliency test of the model is whether the linear combination of dependent variables is effective, that is, whether there is an independent variable in the model can really affect the fluctuation of dependent variables and the overall effect.
The significance test of regression coefficient is to show whether a single independent variable is valid in the model.

The significance of model and regression coefficient were tested by F test and t test respectively.

F test

(1) Putting forward the original hypothesis and alternative hypothesis of the question

Overthrow H0 by data and accept H1
(2) Constructive statistics

ESS: sum of squares of errors, varying with time
RSS: The sum of squares of regression deviations, varying with time
TSS: Total sum of squares of deviations, unchanged
TSS=ESS+RSS
The parameters of linear regression model are solved to minimize ESS and maximize RSS.

Constructing F-statistics

p: RSS degree of freedom
n-p-1:ESS degree of freedom

③

import numpy as np
ybar = train.Profit.mean()
p = model2.df_model
n = train.shape[0]
RSS = np.sum((model2.fittedvalues - ybar) **2)
ESS = np.sum(model2.resid **2)
F = (RSS / p) / (ESS / (n-p-1))
print(F)

#Verify that the manually calculated values are correct
model2.fvalue

(4) Comparing the value of F statistics with the value of theoretical F distribution

#Call to calculate the theoretical distribution value
from scipy.stats import f
F_throry = f.ppf(q=0.95,dfn=p,dfd=n-p-1)
print(F_theory)

The value of F statistic is much larger than the theoretical value of F distribution, so the original hypothesis is rejected and the multivariate linear regression model is considered to be significant, so the partial regression coefficient of the regression model is not all 0.

Significance test and t test of regression coefficient
(1) Presenting hypotheses

The original hypothesis is that the partial regression coefficient of the j variable is 0.

(2) Constructive statistics

(3) Computational statistics

model2.summary()

Part I:
The Decision Coefficient R-square of the Model: The Explanation Degree of Independent Variables to Dependent Variables
F-statistic of the model: testing the significance of the model
Model Information Criteria AIC, BIC: Comparing the Fitting Effect of the Model
Part II:
Estimates of regression coefficients Coef, t statistics, confidence intervals of regression coefficients
The third part:
Relevant information of model error term e, Durbin-Watson statistic to test the independence of error term, JB statistic to measure whether the error term obeys normal distribution, Skew of error term, Kurtosis of Kurtosis

(4) The second part contains t-statistics of each partial regression coefficient, which is derived from the quotient of estimated Coef and standard error Std err. Each t-statistics corresponds to the probability value P. It is a direct method to judge whether the statistics are significant or not. Generally, when the probability p is less than 0.05, it denies the original hypothesis and fails to indicate that the variables are not important factors.

After passing the test, the model was diagnosed.

Diagnosis of Regression Model

The hypothesis of the model requires normal distribution for residual items, and the essence requires that dependent variables obey normal distribution.
Normality test uses two methods:
Qualitative Graphic Method
Quantitative Nonparametric Method

(1) Histogram method

import scipy.stats as stats
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode-minus'] = False
plt.style.use('ggplot')
sns.distplot(a=profit_new.Profit,bins=10,fit=stats.norm,norm_hist=True,
				hist_kws=['color':'steelblue','edgecolor':'blue'],
				kde_kws=['color':'orange','linestyle':'-','label':True],
				fit_kws=['color':red,'linestyle':'--'])
plt.legend()
plt.show()

Draw the histogram of the variable Profit, the nuclear density curve, and the theoretical normal distribution. If the two curves are approximate or coincident, the normal distribution is obeyed.

PP chart, QQ chart

#Normal Test of Residual
import ststamodels.api as sm
pp_qq_plot = sm.ProbPlot(profit_new.Profit)
#PP diagram
pp_qq_plot.ppplot(line=45)
plt.title('PP chart')
pp_qq_plot.qqplot(line='q')
plt.title('QQ chart')
plt.show()

PP chart: Comparing the cumulative probability of normal distribution with the cumulative probability of actual distribution
QQ Diagram: Quantiles Contrasting Normal Distribution and Century Distribution
Criteria for judging whether variables obey normal distribution: scatters are fairly uniform on a straight line.

③Shapiro K-S
Assuming that the variables obey normal distribution, if the data is less than 5000, shapiro test is used.

import scipy.stats as stats
stats.shapiro(profit_new.Profit)

Returning to tuple, the first element is the statistical value of shapiro test and the second element is the corresponding probability P. Since the p value is greater than the confidence level of 0.05, it accepts that the profit distribution obeys the normal distribution.

#Generating Normal Distribution and Uniform Distribution
rnorm = np.random.normal(loc=5,scale=2,size=1000)
runif = np.random.unifrom(low=1,high=100,size=1000)
#Normality test
KS_test1 = stats.kstest(rvs=rnorm,args=(rnorm.mean(),rnorm.std()),cdf='norm')
KS_test2 = stats.kstest(rvs=runif,args=(runif.mean(),runif.std()),cdf='norm')

The test p value of the random number of normal random distribution is greater than 0.05 confidence level. Accept the original hypothesis.
p value of uniform distribution is less than 0.05, rejecting the original hypothesis
When using kstest function to test the normality of variables, args parameters must be specified, and the mean and standard deviation of the tested variables must be transmitted.
The results of variable test do not satisfy the normal distribution and use mathematical transformation.

Keywords: less

Added by monezz on Fri, 09 Aug 2019 10:48:05 +0300

Programming VIP

Linear regression prediction model

Univariate linear regression model:

multiple linear regression

Hypothesis Test of Regression Model

Diagnosis of Regression Model

Popular Keywords