Contents
Introduction to the linear regression model
Solving for the regression coefficients
Linear regression in R
Significance test of parameters -- t test
Verify various assumptions of the model
Correlation analysis
- First draw a scatter plot to observe the correlation
- Then compute a correlation coefficient, such as the Pearson correlation coefficient (see the sketch after this list)
- An absolute correlation above 0.8 indicates high correlation, 0.5 to 0.8 moderate correlation, 0.3 to 0.5 weak correlation, and below 0.3 almost no correlation
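As a minimal sketch of these two steps (assuming the data are in a data frame named profit with the columns R.D.Spend and Profit, as used later in this post):

# Scatter plot to eyeball the relationship
plot(profit$R.D.Spend, profit$Profit, xlab = 'R.D.Spend', ylab = 'Profit')
# Pearson correlation coefficient
cor(profit$R.D.Spend, profit$Profit, method = 'pearson')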
Regression analysis
- When the model contains only one dependent variable and one independent variable, it is called a simple linear regression model; otherwise it is a multiple linear regression model.
Introduction to the linear regression model
Solving for the regression coefficients
- Assume the error term follows a normal distribution with mean 0 and standard deviation $\sigma$
- Construct the likelihood function
- Take the logarithm and simplify
- Expand and differentiate
- Solve for the partial regression coefficients (worked out below)
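Spelled out, these steps give the standard least-squares result (a sketch of the usual maximum-likelihood derivation; the original post does not show the algebra):

$$L(\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\right)$$

$$\ln L(\beta) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i^\top \beta)^2$$

Maximizing the log-likelihood is therefore equivalent to minimizing the sum of squared errors; setting the derivative with respect to $\beta$ to zero gives, in matrix form,

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$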
Linear regression in R
lm(formula, data, subset, weights, na.action)
# formula: model formula, e.g. y ~ x1 + x2; y ~ . regresses y on all other variables
# data: the data frame containing the variables
# subset: an optional subset of observations to use
# weights: optional observation weights
# na.action: how missing values are handled
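The examples below fit a model on a training set train and later predict on a test set test; the original post does not show how these are created. A minimal sketch, assuming a 50-row startup data set in a file named 50_Startups.csv (hypothetical file name) and an 80/20 split, which is consistent with the degrees of freedom (5, 34) used below:

# Hypothetical: read the data and split into training and test sets
profit <- read.csv('50_Startups.csv')  # assumed file name
set.seed(1234)                         # arbitrary seed for reproducibility
idx <- sample(nrow(profit), size = 0.8 * nrow(profit))
train <- profit[idx, ]
test <- profit[-idx, ]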
Significance test of the model -- F test
For the F test, use the anova function to perform the analysis of variance.
The theoretical value of the F distribution at these degrees of freedom can be obtained with qf(0.95, 5, 34).
# Fit the model
model <- lm(Profit ~ ., data = train)
model

# Analysis of variance
result <- anova(model)
result

# Compute the F statistic by hand
RSS <- sum(result$`Sum Sq`[1:4])   # regression sum of squares (predictor rows)
df_RSS <- sum(result$Df[1:4])
ESS <- result$`Sum Sq`[5]          # error sum of squares (residual row)
df_ESS <- result$Df[5]
F <- (RSS / df_RSS) / (ESS / df_ESS)
F
qf(0.95, 5, 34)                    # critical value at the 5% significance level
Here RSS is the regression sum of squares (predicted value minus mean) and ESS is the error sum of squares (actual value minus predicted value).
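In formula form, with p the regression degrees of freedom and n - p - 1 the residual degrees of freedom (5 and 34 in the example above):

$$F = \frac{RSS / p}{ESS / (n - p - 1)} \sim F(p, \; n - p - 1)$$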
Significance test of parameters -- t test
The F test assesses the model as a whole, while the t test assesses an individual parameter.
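For a single coefficient the statistic is the estimate divided by its standard error (the standard form, stated here for completeness):

$$t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t(n - p - 1)$$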
The t-test values can be obtained directly from the summary function.
# Model overview
summary(model)

# Theoretical t value at the residual degrees of freedom
n <- nrow(train)
p <- ncol(train)
t <- qt(0.975, n - p - 1)
t
Stepwise regression
- Forward regression: add variables one at a time, adjusting at each step
- Backward regression: delete variables one at a time
- Two-way stepwise regression: deleted variables can be re-added and added variables can be deleted
Stepwise regression uses the step function
# Stepwise selection using AIC
model2 <- step(model)
# Overview of the final model
summary(model2)
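The search direction can also be set explicitly. A sketch of forward regression, starting from the intercept-only model and allowing all terms of model as candidates (model_fwd is a name introduced here for illustration):

# Forward regression: start from the intercept-only model and add variables
model_fwd <- step(lm(Profit ~ 1, data = train),
                  scope = formula(model), direction = 'forward')
summary(model_fwd)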
Verify various assumptions of the model
Multicollinearity test
Use the vif function from the car package.
VIF stands for variance inflation factor. When 0 < VIF < 10 there is no multicollinearity; from 10 to 100 multicollinearity exists; above 100 it is severe.
library(car)
vif(model2)
Normality test
# Draw a histogram of the response
hist(x = profit$Profit, freq = FALSE,
     main = 'Histogram of profit',
     ylab = 'Kernel density', xlab = NULL,
     col = 'steelblue')
# Add the kernel density curve
lines(density(profit$Profit), col = 'red', lty = 1, lwd = 2)
# Add the corresponding normal density curve
x <- profit$Profit[order(profit$Profit)]
lines(x, dnorm(x, mean(x), sd(x)), col = 'black', lty = 2, lwd = 2.5)
Use a P-P plot or a Q-Q plot.
In a P-P plot the x-axis is the theoretical cumulative probability and the y-axis is the actual cumulative probability; in a Q-Q plot the x-axis is the theoretical quantile and the y-axis is the actual quantile.
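A minimal Q-Q plot sketch in base R (this code is not in the original post):

# Sample quantiles against theoretical normal quantiles
qqnorm(profit$Profit)
qqline(profit$Profit, col = 'red')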
Shapiro test and K-S test
When the sample size is below 5000, use the Shapiro test; otherwise use the K-S (Kolmogorov-Smirnov) test.
If the p-value is greater than the significance level, the null hypothesis of normality is not rejected.
# Shapiro-Wilk normality test
shapiro.test(profit$Profit)
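For larger samples, a corresponding K-S sketch; the data are standardized first so the comparison is against the standard normal:

# Kolmogorov-Smirnov test against the standard normal distribution
ks.test(as.numeric(scale(profit$Profit)), 'pnorm')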
Mathematical transformation
When a variable does not follow a normal distribution, apply a mathematical transformation such as a square root, logarithm, or Box-Cox transformation.
# Box-Cox transformation: estimate the power parameter
powerTransform(model2)
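powerTransform (from the car package) only estimates the power parameter; applying it is a separate step. A sketch using car's bcPower, where the new column name Profit_bc is introduced here for illustration:

# Apply the estimated Box-Cox power to the response
pt <- powerTransform(model2)
lambda <- coef(pt)                          # estimated power parameter
train$Profit_bc <- bcPower(train$Profit, lambda)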
Independence test
Check whether the residuals are independent of one another (no autocorrelation).
# Durbin-Watson independence test
durbinWatsonTest(model2)
Variance homogeneity test
Homogeneity of variance requires that the variance of the model's residual term shows no trend as the independent variables change.
The BP (Breusch-Pagan) test checks for a linear relationship between the residual variance and the independent variables, and the White test checks for a nonlinear one; if either relationship exists, the homogeneity condition is not satisfied.
# Variance homogeneity test (score test for non-constant variance)
ncvTest(model2)
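ncvTest comes from the car package. A Breusch-Pagan test is also available, assuming the lmtest package is installed:

# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model2)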
Model prediction
Use the predict function.
# Model prediction on the test set
pred <- predict(model2, newdata = test[, c('R.D.Spend', 'Marketing.Spend')])
# Plot predicted against actual values
library(ggplot2)
ggplot(data = NULL, mapping = aes(pred, test$Profit)) +
  geom_point(color = 'red', shape = 19) +
  geom_abline(slope = 1, intercept = 0, size = 1) +
  labs(x = 'Predicted value', y = 'Actual value')
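As a quick numeric check of prediction quality (not in the original post), the correlation and root mean squared error between predicted and actual values:

# Correlation between predicted and actual values
cor(pred, test$Profit)
# Root mean squared error
sqrt(mean((pred - test$Profit)^2))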