Notes on Classification (I): tree, lazy, and probabilistic classifiers

5.2 Preparing the training and test data sets

Right at the start I hit a problem: the churn dataset could no longer be loaded from the C50 package. After some searching, I finally found it in another package, modeldata.

# install.packages("C50")
# library(C50)
# data('churn', package = 'C50')
# install.packages("modeldata")
# https://stackoverflow.com/questions/60506936/data-set-churn-not-found-in-package-c50
library(modeldata)
data(mlc_churn)
churnTrain <- mlc_churn   # keep the variable name used in the rest of these notes
# 7:3 split into training and test sets
set.seed(2)
ind <- sample(2, nrow(churnTrain), replace = TRUE,
              prob = c(0.7, 0.3))
trainset <- churnTrain[ind == 1, ]
testset  <- churnTrain[ind == 2, ]

This dataset is slightly different from the one in the book: it contains the same relationships but has more samples, so the analysis should not be affected. As an extension, a small split.data function can also handle the division into training and test sets:

split.data <- function(data, p = 0.7, s = 666){
  set.seed(s)
  index <- sample(1:nrow(data))            # shuffle the row indices
  n.train <- floor(nrow(data) * p)         # size of the training set
  train <- data[index[1:n.train], ]
  test  <- data[index[(n.train + 1):nrow(data)], ]
  return(list(train = train, test = test))
}

li <- split.data(churnTrain)
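
The returned list holds both pieces; for example (trainset2 and testset2 are just illustrative names of mine):

# Pull the two pieces out of the returned list and check their sizes
trainset2 <- li$train
testset2  <- li$test
dim(trainset2); dim(testset2)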

5.3 Building a classification model with a recursive partitioning tree

Recursion and partitioning are the two steps of this algorithm, and CP is the cost-complexity parameter that controls the size of the tree. The main weaknesses of decision trees are bias and overfitting: a conditional inference tree can reduce the bias, while overfitting can be addressed with random forests or by pruning the tree.

library(rpart)
churn.rp <- rpart(churn ~ ., data = trainset)   # fit the recursive partitioning tree
plotcp(churn.rp)    # cross-validated error against the complexity parameter
summary(churn.rp)
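
The cost-complexity table behind plotcp can also be printed directly (printcp is part of rpart):

printcp(churn.rp)   # CP values, number of splits, and cross-validated error for each subtree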

5.4 Visualizing the recursive partitioning tree

The plot and text functions draw the classification tree.

plot(churn.rp, margin = 0.1)              # margin leaves some whitespace around the tree
text(churn.rp, all = TRUE, use.n = TRUE)  # use.n shows the number of observations per class at each node
# Change the parameters to adjust the display
plot(churn.rp, uniform = TRUE, branch = 0.6, margin = 0.1)  # branch controls the shoulder shape of the branches
text(churn.rp, all = TRUE, use.n = TRUE)
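
As a side note (my own addition, not in the original), the rpart.plot package, assuming it is installed, gives a tidier rendering of the same tree:

# Optional: a cleaner plot of the same tree (rpart.plot package assumed installed)
# install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(churn.rp)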

5.5 Evaluating the prediction performance of the recursive partitioning tree

# predict on the test set
predictions <- predict(churn.rp, testset, type = "class")
table(testset$churn, predictions)
# #############
     predictions
       yes   no
  yes  133   81
  no    29 1278
# Generating confusion matrix
library(caret)
confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
# ##############
           
predictions  yes   no
        yes  133   29
        no    81 1278
                                          
               Accuracy : 0.9277          
                 95% CI : (0.9135, 0.9402)
    No Information Rate : 0.8593          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6671   

5.6 Pruning the recursive partitioning tree

Sometimes it is necessary to prune away rules with weak classification power in order to avoid overfitting and improve prediction accuracy. The cost-complexity method is used here.

min(churn.rp$cptable[,"xerror"])
[1] 0.4523327
which.min(churn.rp$cptable[,"xerror"])
5 
5 
# Minimum cost complexity parameter
churn.cp <- churn.rp$cptable[5,"CP"]
churn.cp 
[1] 0.01014199
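
Equivalently (a small addition of mine), the row index does not need to be hardcoded; which.min can select it directly and gives the same value:

# Same CP value, without hardcoding row 5
churn.cp <- churn.rp$cptable[which.min(churn.rp$cptable[, "xerror"]), "CP"]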
# Prune the tree at the chosen cp value
prune.tree <- prune(churn.rp, cp = churn.cp)
plot(prune.tree, margin = 0.1)
text(prune.tree, all = TRUE, use.n = TRUE)
# Evaluate the pruned tree; its accuracy is typically slightly lower than before pruning,
# but the simpler tree is less prone to overfitting
predictions.pruned <- predict(prune.tree, testset, type = "class")
confusionMatrix(table(predictions.pruned, testset$churn))
# ################

The difference from the unpruned tree does not seem to be large here.

5.7 Building a classification model with a conditional inference tree

Besides the traditional rpart decision tree algorithm, the conditional inference tree (ctree) is another commonly used tree-based classifier. It also partitions the data recursively with respect to the dependent variable, but it selects split variables based on significance tests rather than on an information-maximization criterion (rpart uses the Gini index which, despite the name, has nothing to do with the Gini coefficient of income inequality).

# Conditional inference tree
library(party)
ctree.model <- ctree(churn ~ ., data = trainset)
ctree.model

5.8 Visualizing the conditional inference tree

# visualization
plot(ctree.model)
# a tree built on a single variable gives a much simpler plot
daycharge.model <- ctree(churn ~ total_day_charge, data = trainset)
plot(daycharge.model)

The single-variable tree is much easier to read.

5.9 Evaluating the prediction performance of the conditional inference tree

# predict on the test set
ctree.predict <- predict(ctree.model, testset)
table(ctree.predict, testset$churn)
             
ctree.predict  yes   no
          yes  139    9
          no    75 1298
confusionMatrix(table(ctree.predict, testset$churn))
# #######################
Confusion Matrix and Statistics

             
ctree.predict  yes   no
          yes  139    9
          no    75 1298
                                          
               Accuracy : 0.9448          
                 95% CI : (0.9321, 0.9557)
    No Information Rate : 0.8593          
    P-Value [Acc > NIR] : < 2.2e-16  
# class probabilities for the first five test observations
treeresponse(ctree.model, newdata = testset[1:5, ])
[[1]]
[1] 0.02715356 0.97284644

[[2]]
[1] 0.06842105 0.93157895

[[3]]
[1] 0.06842105 0.93157895

[[4]]
[1] 0.06842105 0.93157895

[[5]]
[1] 0.02715356 0.97284644
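
Each list element holds the class probabilities for one test observation. As a small sketch (my addition), they can be stacked into a matrix; I assume the columns follow the factor level order of churn, i.e. yes then no, which is worth double-checking:

# Sketch: bind the probability vectors into a matrix and label the columns
# (assumption: columns follow levels(trainset$churn), i.e. "yes", "no")
prob <- do.call(rbind, treeresponse(ctree.model, newdata = testset[1:5, ]))
colnames(prob) <- levels(trainset$churn)
prob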

5.10 Using the k-nearest neighbor (kNN) classification algorithm

kNN is a nonparametric, lazy learning method: it makes no assumptions about the data distribution and has no explicit training phase.

install.packages("class")
library(class)
levels(trainset$international_plan) = list("0" = "no", "1" = "yes")
levels(trainset$voice_mail_plan) = list("0" = "no", "1" = "yes")
levels(testset$international_plan) = list("0" = "no", "1" = "yes")
levels(testset$voice_mail_plan) = list("0" = "no", "1" = "yes")

churn.knn <- knn(trainset[,!names(trainset) %in% c("churn", "area_code", "state" )],
                 testset[,!names(testset)  %in% c("churn", "area_code", "state" )], trainset$churn, k=3)
summary(churn.knn)
plot(churn.knn)
library(caret)
confusionMatrix(table(testset$churn,churn.knn))
# ########################
Confusion Matrix and Statistics

     churn.knn
       yes   no
  yes   76  138
  no    46 1261
                                         
               Accuracy : 0.879          
                 95% CI : (0.8616, 0.895)
    No Information Rate : 0.9198         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.3901  

The kNN algorithm classifies by similarity, measured with a distance such as the Euclidean or Manhattan distance; with k = 1 a sample is simply assigned to the class of its nearest neighbor. If k is too small the model tends to overfit, and if it is too large it underfits; a suitable value can be found by cross-validation. The advantages are that there is essentially no training cost, no distributional assumptions are needed, and any type of data can be handled (categorical variables must first be converted to numbers, as done above). The disadvantages are that the model is hard to interpret, prediction is expensive on large datasets, and high-dimensional data should be reduced in dimension first. With k = 3, each sample is assigned by a vote of its three nearest neighbors. The kknn package provides a weighted k-nearest neighbor algorithm, as well as regression and clustering.
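
As a rough illustration (my own addition, not from the book), the loop below compares a few values of k on the existing train/test split; a proper choice of k would use cross-validation, for example via the caret package.

# Sketch: holdout comparison of several k values (cross-validation would be better)
features <- !names(trainset) %in% c("churn", "area_code", "state")
for (k in c(1, 3, 5, 7, 9)) {
  pred.k <- knn(trainset[, features], testset[, features], trainset$churn, k = k)
  cat("k =", k, " accuracy =", round(mean(pred.k == testset$churn), 4), "\n")
}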

5.11 Using logistic regression

Logistic regression is a probability-based statistical method that models the response through the logit function; in R it is fitted with glm by specifying family = binomial.

# logistic regression
fit <- glm(churn ~ ., data = trainset, family = binomial)
summary(fit)
# Refit using only the significant variables
fit <- glm(churn ~ international_plan + voice_mail_plan + total_intl_calls +
             number_customer_service_calls, data = trainset, family = binomial)
summary(fit)
pred <- predict(fit, testset, type = "response")
Class <- pred > .5
summary(Class)
   Mode   FALSE    TRUE 
logical      44    1477 
# Recode the actual churn labels as 1 = yes, 0 = no
churn.mod <- ifelse(testset$churn == "yes", 1, 0)
pred_class <- churn.mod
pred_class[pred <= .5] <- 1 - pred_class[pred <= .5]
ctb <- table(churn.mod, pred_class)
ctb
# ###########
         pred_class
churn.mod    0    1
        0 1287   20
        1   24  190
 confusionMatrix(ctb)
Confusion Matrix and Statistics

         pred_class
churn.mod    0    1
        0 1287   20
        1   24  190
                                          
               Accuracy : 0.9711          
                 95% CI : (0.9614, 0.9789)
    No Information Rate : 0.8619          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8794 
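
As a cross-check (my own addition, not in the original note): glm with family = binomial models the probability of the second level of the churn factor, which I assume here is "no" (the levels appear to be ordered yes, no); under that assumption, a plain confusion matrix can be built directly:

# Sketch: direct confusion matrix, assuming pred is the probability of "no"
pred_label <- ifelse(pred > 0.5, "no", "yes")
table(actual = testset$churn, predicted = pred_label)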

Logistic regression is easy to understand, it directly outputs probabilities and confidence intervals, and the model can quickly be updated as new data arrive. Its disadvantage is that it cannot handle multicollinearity: the explanatory variables must be linearly independent.
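
A quick way to check for multicollinearity (my own addition; it assumes the car package is installed) is to look at variance inflation factors:

# Sketch: variance inflation factors for the reduced model (car package assumed installed)
# install.packages("car")
library(car)
vif(fit)   # values well above roughly 5-10 would suggest problematic collinearity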

5.12 Using the naive Bayes classification algorithm

Naive Bayes is also a probability-based classifier; it assumes that the features are conditionally independent of each other.

library(e1071)
# Train a naive Bayes classifier on all predictors except the response
classifier <- naiveBayes(trainset[, !names(trainset) %in% c("churn")], trainset$churn)
classifier
# ##############
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = trainset[, !names(trainset) %in% c("churn")], 
    y = trainset$churn)
# ###############
A-priori probabilities:
trainset$churn
      yes        no 
0.1417074 0.8582926 
bayes.table <- table(predict(classifier, testset[, !names(testset) %in% c("churn")]),
                     testset$churn)
bayes.table
# #############     
       yes   no
  yes  104   52
  no   110 1255
library(caret)
confusionMatrix(bayes.table)
Confusion Matrix and Statistics
# ############
     
       yes   no
  yes  104   52
  no   110 1255
                                          
               Accuracy : 0.8935          
                 95% CI : (0.8769, 0.9086)
    No Information Rate : 0.8593          
    P-Value [Acc > NIR] : 4.220e-05       
                                          
                  Kappa : 0.5032  

The evaluation follows the same routine as above. The naive Bayes algorithm assumes that the feature variables are conditionally independent; its advantages are that it is relatively simple, can be applied directly, and suits small training sets as well as data with missing values or noise. Its disadvantage is that conditional independence and equal importance of the features are hard to satisfy in the real world.
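
To summarize the chapter, a small sketch (my own addition; it assumes all the objects fitted above are still in the workspace) collects the test-set accuracy of the different classifiers:

# Collect test-set accuracy for the models above; the logistic regression table (ctb)
# was built differently, so it is not directly comparable and is left out.
accs <- c(
  rpart = unname(confusionMatrix(table(predictions, testset$churn))$overall["Accuracy"]),
  ctree = unname(confusionMatrix(table(ctree.predict, testset$churn))$overall["Accuracy"]),
  knn   = unname(confusionMatrix(table(testset$churn, churn.knn))$overall["Accuracy"]),
  bayes = unname(confusionMatrix(bayes.table)$overall["Accuracy"])
)
round(sort(accs, decreasing = TRUE), 4)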
