# Note | statistical learning method: Decision Tree

### Basic concepts of decision trees

A decision tree is, as its name suggests, a tree-structured model:

• Each leaf node corresponds to a decision result; every other node corresponds to an attribute test.
• The sample set contained in a node is partitioned into its child nodes according to the outcomes of the attribute test.
• The root node contains the complete sample set, and the path from the root to each leaf corresponds to a sequence of decision tests.

The key to decision-tree learning is how to select the optimal splitting attribute. For binary classification, the optimal splitting attribute is the one whose split makes the resulting subsets contain, as far as possible, samples of the same class, that is, the attribute yielding the highest "purity". To measure purity, we use "information entropy".

### Empirical entropy

Let's first look at the definition of information entropy. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ $(k=1,2,\cdots,K)$, where $K$ is the total number of classes (for binary classification, $K=2$):

$$p_k=\frac{|C_k|}{|D|}$$

The information entropy of the sample set is then:

$$H(D)=-\sum_{k=1}^{K}p_k\log_2 p_k=-\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}$$

The smaller the value of $H(D)$, the higher the purity of $D$.
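A minimal sketch of this computation in Python (the function name `entropy` is illustrative, not part of the implementation later in this note):

```python
from math import log

def entropy(labels):
    """Shannon entropy H(D) of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for c in labels:                       # tally |C_k| for each class k
        counts[c] = counts.get(c, 0) + 1
    return sum(-(m / n) * log(m / n, 2) for m in counts.values())

# A 50/50 binary split is maximally impure: H = 1 bit
print(entropy(['male', 'female', 'male', 'female']))   # 1.0
# A pure set has zero entropy (highest purity)
print(entropy(['male', 'male', 'male']))               # 0.0
```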
(A caveat worth noting up front: information gain, defined below, has a known drawback: the more distinct values an attribute can take, the more the information gain is biased toward that attribute.)

### Information gain algorithm

• Input: training data set $D$ and feature $A$;
• Output: the information gain $g(D,A)$ of feature $A$ on training data set $D$.

Information gain measures the degree to which knowing the value of feature X reduces the uncertainty about the class Y.

#### Empirical conditional entropy

The empirical conditional entropy $H(D|A)$ of data set $D$ given feature $A$ is computed as:

$$H(D|A)=\sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i)=-\sum_{i=1}^{n}\frac{|D_i|}{|D|}\sum_{k=1}^{K}\frac{|D_{ik}|}{|D_i|}\log_2\frac{|D_{ik}|}{|D_i|}$$

• $D_i$ denotes the subset of samples on which feature $A$ takes its $i$-th value, and $|D_i|$ is its size;
• $D_{ik}$ denotes the samples of $D_i$ that belong to class $k$.
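A minimal sketch of the formula above, assuming each record is a list whose last element is the class label (the names `cond_entropy` and the toy rows are illustrative):

```python
from math import log

def cond_entropy(rows, col):
    """H(D|A): weighted entropy of D after partitioning on attribute `col`."""
    n = len(rows)
    total = 0.0
    for v in set(r[col] for r in rows):              # each value of A
        sub = [r[-1] for r in rows if r[col] == v]   # labels inside D_i
        probs = [sub.count(c) / len(sub) for c in set(sub)]
        total += len(sub) / n * sum(-q * log(q, 2) for q in probs)
    return total

# Toy data: attribute in column 0, class label in column 1 (illustrative):
rows = [['a', 'yes'], ['a', 'no'], ['b', 'yes'], ['b', 'yes']]
print(cond_entropy(rows, 0))  # 0.5: the 'a' half is fully mixed, 'b' is pure
```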

#### Information gain

Information gain = empirical entropy - empirical conditional entropy:

$$g(D,A)=H(D)-H(D|A)$$

Generally speaking, the greater the information gain, the greater the "purity improvement" obtained by splitting the data set on a feature. Information gain can therefore be used to select the splitting attribute of a decision tree: simply pick the attribute with the largest information gain. This is exactly how the ID3 algorithm chooses splitting attributes.
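As a sanity check, here is a small sketch that computes $g(D,A)$ on the hair/voice sample data listed later in this note (the function names are illustrative):

```python
from math import log

def entropy(labels):
    n = len(labels)
    return sum(-(labels.count(c) / n) * log(labels.count(c) / n, 2)
               for c in set(labels))

def info_gain(rows, col):
    """g(D, A) = H(D) - H(D|A), where A is the attribute at index `col`."""
    cond = 0.0
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        cond += len(sub) / len(rows) * entropy(sub)
    return entropy([r[-1] for r in rows]) - cond

# The hair/voice sample data that appears later in this note:
data = [['long', 'crude', 'male'],   ['short', 'crude', 'male'],
        ['short', 'crude', 'male'],  ['long', 'fine', 'female'],
        ['short', 'fine', 'female'], ['short', 'crude', 'female'],
        ['long', 'crude', 'female'], ['long', 'crude', 'female']]
print(round(info_gain(data, 0), 3))   # 'Hair'  -> 0.049
print(round(info_gain(data, 1), 3))   # 'voice' -> 0.204  (chosen first by ID3)
```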

### Information gain ratio

The information gain ratio $g_R(D,A)$ is the ratio of the information gain $g(D,A)$ to the entropy $H_A(D)$ of the training data set $D$ with respect to the values of feature $A$:

$$g_R(D,A)=\frac{g(D,A)}{H_A(D)}$$

where:

$$H_A(D)=-\sum_{i=1}^{n}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$

and $n$ is the number of distinct values of feature $A$.
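A sketch of the gain ratio on the same hair/voice sample data (names are illustrative; it assumes the attribute takes at least two distinct values, otherwise $H_A(D)=0$):

```python
from math import log

def _H(xs):
    """Shannon entropy (in bits) of any list of discrete values."""
    n = len(xs)
    return sum(-(xs.count(x) / n) * log(xs.count(x) / n, 2) for x in set(xs))

def gain_ratio(rows, col):
    """g_R(D, A) = g(D, A) / H_A(D); H_A(D) penalizes many-valued attributes."""
    labels = [r[-1] for r in rows]
    vals = [r[col] for r in rows]
    cond = sum(vals.count(v) / len(rows)
               * _H([r[-1] for r in rows if r[col] == v])
               for v in set(vals))
    return (_H(labels) - cond) / _H(vals)  # H_A(D): entropy of A's own values

data = [['long', 'crude', 'male'],   ['short', 'crude', 'male'],
        ['short', 'crude', 'male'],  ['long', 'fine', 'female'],
        ['short', 'fine', 'female'], ['short', 'crude', 'female'],
        ['long', 'crude', 'female'], ['long', 'crude', 'female']]
print(round(gain_ratio(data, 0), 3))   # 'Hair'  -> 0.049 (H_A(D) = 1 bit here)
print(round(gain_ratio(data, 1), 3))   # 'voice' -> 0.252
```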

### ID3 algorithm decision tree generation

• Input: data set $D$, feature set $A$, threshold $\varepsilon$;
• Output: decision tree $T$.
1. Compute the information gain of each feature in the feature set and select the feature with the largest gain as the root node.
2. Partition the training data set $D$ into subsets $D_i$ according to the values of that feature (for example, whether a sample possesses the feature or not).
3. If a subset contains only samples of a single class, it becomes a leaf node.
4. Otherwise, for that subset, select a new feature among the remaining ones by computing the information gain $g(D_i, A_{remain})$. (If the largest gain falls below the threshold $\varepsilon$, the node becomes a leaf labeled with the majority class.)
5. Repeat recursively until every feature has been used (or every subset is pure), yielding the decision tree $T$.

### Python code implementation

```python
from math import log
import operator


def calcShannonent(dataSet):                        # Calculate the entropy of the data
    numEntries = len(dataSet)                       # Number of records
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]                  # The last entry (class) of each record
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1              # Count the classes and the size of each class
    shannonEnt = 0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries # Probability of a single class
        shannonEnt -= prob * log(prob, 2)           # Accumulate the entropy contribution of each class
    return shannonEnt


def createDataSet_temp():
    # Create instance data
    labels = ['Hair', 'voice']
    value = ['long', 'short', 'crude', 'fine', 'male', 'female']
    dataSet = []
    with open('TreeGrowth_ID3.txt', 'r', encoding='UTF-8') as f:    # Read the data from a file
        for line in f:
            # Keep only the known attribute/class words on each line
            row = [w for w in (t.strip("[]',\"") for t in line.split()) if w in value]
            if row:
                dataSet.append(row)
    '''
    dataSet = [
        ['long', 'crude', 'male'],
        ['short', 'crude', 'male'],
        ['short', 'crude', 'male'],
        ['long', 'fine', 'female'],
        ['short', 'fine', 'female'],
        ['short', 'crude', 'female'],
        ['long', 'crude', 'female'],
        ['long', 'crude', 'female']
    ]
    '''
    return dataSet, labels


def splitDataSet(dataSet, axis, value):             # Extract the records whose feature `axis` equals `value`
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:                  # axis is the index of the attribute; value is its target value
            reducedFeatVec = featVec[:axis]         # The attribute just used is removed from each record
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):              # Select the best splitting feature
    numFeatures = len(dataSet[0]) - 1               # Number of features
    baseEntropy = calcShannonent(dataSet)           # Entropy before splitting
    bestInfoGain = 0
    bestFeature = -1
    for i in range(numFeatures):                    # Loop over each feature
        featList = [example[i] for example in dataSet]  # Values of the i-th attribute across all records
        uniqueVals = set(featList)                  # set() removes duplicate values
        newEntropy = 0
        for value in uniqueVals:                    # For each value of the i-th attribute
            subDataSet = splitDataSet(dataSet, i, value)    # Subset with the i-th attribute removed
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonent(subDataSet) # Conditional entropy after splitting
        infoGain = baseEntropy - newEntropy         # Information gain
        if infoGain > bestInfoGain:                 # The feature that reduces entropy the most is optimal
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


def majorityCnt(classList):                         # Majority vote, e.g. 2 males and 1 female -> 'male'
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Sort the (class, count) pairs in descending order of count
    return sortedClassCount[0][0]                   # Return the most frequent class name


def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]    # Class labels: last value of each record
    if classList.count(classList[0]) == len(classList): # All records belong to one class: leaf node
        return classList[0]
    if len(dataSet[0]) == 1:                        # Only the class label is left: all attributes are used up
        return majorityCnt(classList)               # Return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)    # Index of the optimal feature
    if bestFeat == -1:                              # No feature improves purity; fall back to the majority class
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]                # Name of the optimal feature
    myTree = {bestFeatLabel: {}}                    # The tree is stored as nested dictionaries
    del(labels[bestFeat])                           # Remove the chosen feature from labels
    featValues = [example[bestFeat] for example in dataSet] # Values of the optimal feature across records
    uniqueVals = set(featValues)                    # De-duplicate the values
    for value in uniqueVals:                        # For each value
        subLabels = labels[:]                       # Copy of the remaining feature names
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        # Recursively build the next level of the tree
    return myTree


if __name__ == '__main__':
    dataSet, labels = createDataSet_temp()          # Create sample data
    print(createTree(dataSet, labels))              # Print the resulting decision tree
```



Reference blog:

Added by catlover on Thu, 10 Feb 2022 12:46:58 +0200