Note | statistical learning method: Decision Tree

Basic concepts of decision trees

A decision tree is, as the name suggests, a tree-structured model.

  • Each leaf node corresponds to a decision result, and every other node corresponds to an attribute test;
  • The sample set contained in each node is divided among its child nodes according to the outcome of the attribute test;
  • The root node contains the complete set of samples, and the path from the root node to each leaf node corresponds to a sequence of decision tests.
Example:
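The Python implementation at the end of this note represents a decision tree as a nested dictionary, where every key that maps to another dictionary is an attribute test and every string value is a leaf (a decision result). A minimal sketch of what such a structure looks like (the attribute names match the hair/voice toy data used later; the tree itself is only illustrative):

example_tree = {
    'voice': {                      #Root node: test the 'voice' attribute
        'fine': 'female',           #Leaf node: decision result
        'crude': {                  #Internal node: a further attribute test
            'Hair': {
                'long': 'female',
                'short': 'male'
            }
        }
    }
}

Reading from the root 'voice' down to any leaf gives exactly one decision test sequence, e.g. voice = crude, Hair = short, therefore male.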

The key to decision tree learning is how to select the optimal splitting attribute. For binary classification, the so-called optimal splitting attribute is the one whose split makes the resulting subsets contain samples of the same class as far as possible, that is, the attribute that yields the highest "purity". So how do we measure this purity? This is where "information entropy" comes in.

Empirical entropy

Let's first look at the definition of information entropy. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ ($k = 1, 2, \dots, K$), where $K$ is the total number of classes (for binary classification, $K = 2$) and $C_k$ is the subset of samples of $D$ belonging to class $k$, so that

$p_k = \frac{|C_k|}{|D|}$

Then the information entropy of the sample set is:

$H(D) = -\sum_{k=1}^K p_k \log_2 p_k$

$H(D) = -\sum_{k=1}^K \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$
The smaller the value of $H(D)$, the higher the purity of $D$.
(A note for later: information gain has a known drawback, namely that the more distinct values an attribute can take, the more the information gain is biased towards that attribute; the information gain ratio introduced below corrects for this.)
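As a quick, self-contained illustration of the entropy formula (separate from the full implementation at the end of this note; the function name entropy and the sample labels are chosen only for this sketch):

from collections import Counter
from math import log

def entropy(labels):
    #H(D) = -sum_k p_k * log2(p_k), where p_k is the proportion of class k
    n = len(labels)
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

#3 'male' and 5 'female' samples: H(D) = -(3/8)log2(3/8) - (5/8)log2(5/8) ≈ 0.954
print(entropy(['male'] * 3 + ['female'] * 5))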

Information gain algorithm

  • Input: training dataset $D$ and feature $A$;
  • Output: the information gain $g(D, A)$ of feature $A$ with respect to the training data $D$.

Information gain represents the degree to which the uncertainty about the class $Y$ is reduced once the value of feature $X$ is known.

Empirical conditional entropy

The empirical conditional entropy $H(D|A)$ of dataset $D$ given feature $A$ is computed as:

$H(D|A) = \sum_{i=1}^n \frac{|D_i|}{|D|} H(D_i)$

$= -\sum_{i=1}^n \frac{|D_i|}{|D|} \sum_{k=1}^K \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$

  • $D_{ik}$ is the subset of $D_i$ whose samples belong to class $k$ (so $|D_{ik}|$ counts the samples that take the $i$-th value of feature $A$ and belong to class $k$);
  • $D_i$ is the subset of samples in $D$ for which feature $A$ takes its $i$-th value, and $|D_i|$ is its size.

The information gain is:

Information gain = empirical entropy − empirical conditional entropy

$g(D,A) = H(D) - H(D|A)$

Generally speaking, the greater the information gain, the greater the "purity improvement" obtained by splitting the dataset on that feature. Information gain can therefore be used to select the splitting attribute of a decision tree; in practice we simply pick the attribute with the largest information gain. The ID3 algorithm uses information gain to choose its splitting attributes.
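As a concrete, hand-computed check, take the 8-sample hair/voice toy dataset used by the code at the end of this note (3 male, 5 female samples). The empirical entropy is

$H(D) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} \approx 0.954$

Splitting on 'voice' gives the subsets crude (3 male, 3 female) and fine (0 male, 2 female), so

$H(D \mid voice) = \frac{6}{8}\cdot 1 + \frac{2}{8}\cdot 0 = 0.75, \qquad g(D, voice) \approx 0.954 - 0.75 = 0.204$

Splitting on 'Hair' gives long (1 male, 3 female) and short (2 male, 2 female), so

$H(D \mid Hair) = \frac{4}{8}\cdot 0.811 + \frac{4}{8}\cdot 1 \approx 0.906, \qquad g(D, Hair) \approx 0.049$

Since $g(D, voice) > g(D, Hair)$, ID3 would pick 'voice' as the root attribute for this dataset.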

Information gain ratio:

The information gain ratio $g_R(D,A)$ is the ratio of the information gain $g(D,A)$ to the entropy $H_A(D)$ of the training dataset $D$ with respect to the values of feature $A$:

$g_R(D,A) = \frac{g(D,A)}{H_A(D)}$

where:

$H_A(D) = -\sum_{i=1}^n \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$

and $n$ is the number of distinct values of feature $A$.
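Continuing the hand-computed hair/voice example from above: for 'voice' the value entropy is $H_{voice}(D) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811$, so $g_R(D, voice) \approx 0.204 / 0.811 \approx 0.25$; for 'Hair', $H_{Hair}(D) = 1$ and $g_R(D, Hair) \approx 0.049$. The gain ratio is the criterion used by the C4.5 algorithm in place of raw information gain, and it counteracts the bias towards many-valued attributes noted earlier.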

ID3 algorithm decision tree generation

  • Input: dataset $D$, feature set $A$, threshold $\varepsilon$
  • Output: decision tree $T$
  1. Compute the information gain of every feature in the feature set and select the feature with the largest gain as the root node (if that largest gain is below the threshold $\varepsilon$, stop and return a single-node tree labelled with the majority class).
  2. Split the training dataset $D$ into subsets $D_i$ according to the values of the chosen feature (for example, whether or not a sample has that feature value).
  3. If a subset contains only samples of the same class, it becomes a leaf node.
  4. Otherwise, for that subset, compute the information gain $g(D_{k}, A_{remain})$ of the remaining features and select the best of them as the next splitting node.
  5. Repeat this recursively until all features have been used (or every subset is pure), which yields the decision tree $T$.

Python code implementation

from math import log
import operator
import re                                           #Used to pull the attribute words out of each line of the data file

def calcShannonent(dataSet):                        #Calculate the entropy of data
    numEntries = len(dataSet)                       #Number of data pieces
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]                  #The last word (category) of each row of data
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1              #Count the number of classes and the number of each class
    shannonEnt = 0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries  #Proportion of this class in the data set
        shannonEnt -= prob*log(prob, 2)             #Accumulate -p*log2(p) over all classes
    return shannonEnt

def createDataSet_temp():
    #Create instance data: two features ('Hair', 'voice') plus the class label in the last column
    labels = ['Hair', 'voice']
    value = ['long', 'short', 'crude', 'fine', 'male', 'female']
    dataSet = []
    try:
        with open('TreeGrowth_ID3.txt', 'r', encoding='UTF-8') as f:    #Read the samples from a text file if it exists
            for line in f:
                row = [word for word in re.findall(r'[A-Za-z]+', line) if word in value]    #Keep only the known attribute/class words on each line
                if row:                             #Skip header or blank lines
                    dataSet.append(row)
    except FileNotFoundError:                       #Fall back to the inline toy dataset when the file is missing
        dataSet = [
            ['long', 'crude', 'male'],
            ['short', 'crude', 'male'],
            ['short', 'crude', 'male'],
            ['long', 'fine', 'female'],
            ['short', 'fine', 'female'],
            ['short', 'crude', 'female'],
            ['long', 'crude', 'female'],
            ['long', 'crude', 'female']
        ]
    return dataSet, labels

def splitDataSet(dataSet, axis, value):             #Data classified by a feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:                  #axis is the column index of the attribute being split on; value is the attribute value to keep
            reducedFeatVec = featVec[:axis]         #Remove the attribute that has just been used from the matching record
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):              #Select the best classification feature
    numFeatures = len(dataSet[0])-1                 #Number of features
    baseEntropy = calcShannonent(dataSet)           #Original entropy
    bestInfoGain = 0
    bestFeature = -1
    for i in range(numFeatures):                    #Cycle each feature
        featList = [example[i] for example in dataSet]#Read each record, take out the value of the ith attribute, and create a new list
        uniqueVals = set(featList)                  #set() keeps only the distinct values of the ith attribute (removes duplicates)
        newEntropy = 0
        for value in uniqueVals:                    #For each value in the ith attribute value list
            subDataSet = splitDataSet(dataSet, i, value)    #subDataSet holds the records whose ith attribute equals value, with that column removed
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob*calcShannonent(subDataSet)   #Entropy after feature classification
        infoGain = baseEntropy - newEntropy         #Calculate information gain
        if(infoGain > bestInfoGain):                #If splitting on this feature reduces the entropy the most, it is the optimal classification feature
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):                         #Majority vote, e.g. if a final subset contains 2 males and 1 female, it is judged to be male
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    #The traversable key value pair array returned by items is arranged in descending order according to the second field (value) of the key value pair
    return sortedClassCount[0][0]                   #Return the class name with the largest count

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]#List of class labels (male/female): the last value of each record
    if classList.count(classList[0]) == len(classList):#Stop when every sample in the subset belongs to the same class
        return classList[0]
    if len(dataSet[0])==1:                          #Stop when all attributes have been used and only the class label remains
        return majorityCnt(classList)               #Return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)    #Find the label of the optimal feature
    if(bestFeat == -1):                             #No feature yields positive information gain; fall back to a majority vote
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]                #Find the optimal feature
    myTree = {bestFeatLabel:{}}                     #The classification results are saved in the form of a dictionary to get the current layer of the tree
    del(labels[bestFeat])                           #Delete the current optimal feature in labels
    featValues = [example[bestFeat] for example in dataSet]#Form a list of the values of the current optimal characteristics in the recordset
    uniqueVals = set(featValues)                    #De duplication of value set
    for value in uniqueVals:                        #For each value
        subLabels = labels[:]                       #subLabels is the feature set that removes the current optimal feature
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
        #Use recursion to create the next layer of the tree
    return myTree

if __name__=='__main__':
    dataSet, labels = createDataSet_temp()          #Create sample data
    print(createTree(dataSet, labels))              #Output decision tree model results
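Once the tree has been generated, it can be used to classify a new sample by walking the nested dictionary. Below is a minimal sketch of such a helper; the function classify is not part of the code above, and it assumes featLabels is the original, unmodified label list (note that createTree deletes entries from the labels list it is given, so pass createTree a copy):

def classify(inputTree, featLabels, testVec):       #Walk the nested dict to classify one sample
    firstFeat = next(iter(inputTree))               #Attribute tested at this node
    featIndex = featLabels.index(firstFeat)         #Position of that attribute in the test vector
    branch = inputTree[firstFeat][testVec[featIndex]]   #Branch matching the sample's attribute value
    if isinstance(branch, dict):                    #Internal node: keep descending
        return classify(branch, featLabels, testVec)
    return branch                                   #Leaf node: the decision result

#Example usage:
#   dataSet, labels = createDataSet_temp()
#   tree = createTree(dataSet, labels[:])           #Pass a copy so labels stays intact
#   print(classify(tree, labels, ['short', 'crude']))   #Prints a class label such as 'male'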
