# 1, Decision tree

## 1. Concept

Decision tree is a classification algorithm based on tree structure. We hope to learn a model (i.e. decision tree) from a given training data set and use the model to classify new samples. The decision tree can intuitively show the classification process and results. Once the model is built successfully, the classification efficiency of new samples is also quite high.

## 2. information entropy

```Assume current sample set D pass the civil examinations k The proportion of class samples is pk(k=1,,2,...,
|y|)，be D The information entropy of is defined as: Ent(D) = -∑k=1 pk·log2 pk appointment
if p=0，be log2 p=0)obviously, Ent(D)The smaller the value, D The higher the purity of. because
0<=pk<= 1,so log2 pk<=0，Ent(D)>=0. In extreme cases, consider D in
If the samples belong to the same category, the Ent(D)The value is 0 (the minimum value is taken). When D
When the samples belong to different categories, Ent(D)Get the maximum value log2 |y|.
```

## 3. information gain

```  Assumed discrete attribute a have V Possible values{a1,a2,...,aV}. If used a
For sample set D Classification will be generated V Branch nodes, note Dv For the first v Individual points
Included in the branch node D All properties in a The upper value is av A sample of. Different branching nodes
We give different weights to branch nodes when the number of point samples is different:|Dv|/|D|, Should
The weight gives greater influence to the branch nodes with more samples. Therefore, the attribute is used a yes
Sample set D The information gain obtained by dividing is defined as: Gain(D,a) = Ent(D)-∑v=1 |Dv|/|D|·Ent(Dv)
```

Where Ent(D) is the information entropy before data set D division, and Σ v=1 |Dv|/|D | · Ent(Dv) can be expressed as the information entropy after division. The "front back" results show the reduction of information entropy obtained by this division, that is, the improvement of purity. Obviously, the greater the Gain(D,a), the greater the purity improvement, and the better the effect of this division.

## 4. Gain ratio

```  The optimal attribute division principle based on information gain -- information gain criterion
Attributes with more data are preferred. C4.5 The algorithm uses gain rate to replace information gain
Select the optimal partition attribute, and the gain rate is defined as: Gain_ratio(D,a) = Gain(D,a)/IV(a
among IV(a) = -∑v=1 |Dv|/|D|·log2 |Dv|/|D|
```

It is called the intrinsic value of attribute a. The greater the number of possible values of attribute a (i.e. the greater V), the greater the value of IV(a). This eliminates the preference for attributes with more value data to a certain extent.
In fact, the gain rate criterion has a preference for attributes with a small number of values. C4.5 algorithm does not directly use the gain rate criterion, but first finds the attributes with higher information gain than the average level from the candidate partition attributes, and then selects the attribute with the highest gain rate.

Next, we will build a decision tree model for the watermelon samples in the following table. ## 5. Code implementation

1. Import module

#Import module

```import pandas as pd
import numpy as np
from collections import Counter
from math import log2
```

Read with pandas module_ Excel () function reads data text; Use numpy module to convert dataframe into list; Count with Counter; Use the log2 function of math module to calculate the logarithm. It will be reflected in the following code.
2. Data acquisition and processing function
#Data acquisition and processing

```def getData(filePath):
return data

dataList = np.array(data).tolist()
dataSet = [element[1:] for element in dataList]
return dataSet
```

getData() through read in the pandas module_ The EXCEL () function reads the sample data. I tried to save the data file in csv format, but it is not very good for Chinese processing, so I chose to use xls format file.

The dataDeal() function converts the dataframe to a list and removes the numbered column. The number column is not an attribute of watermelon. In fact, if it is regarded as an attribute, it will get the maximum information gain.

These two functions can be combined into the same function, but I want to use data (dataframe structure with attribute tag) and dataSet (list) data samples respectively, so I write two functions separately.

3. Get attribute name

#Get property name

```def getLabels(data):
labels = list(data.columns)[1:-1]
return labels
```

Very simple, get the attribute name: texture, color, root, knock, navel, touch.
4. Get category tag
#Get category tag

```def targetClass(dataSet):
classification = set([element[-1] for element in dataSet])
return classification
```

Get the mark of whether a sample is good (yes or no).

5. Leaf node marking

#Mark the branch node as the leaf node, and select the class with the largest number of samples as the class mark

```def majorityRule(dataSet):
mostKind = Counter([element[-1] for element in dataSet]).most_common(1)
majorityKind = mostKind
return majorityKind
```

6. Calculate information entropy

#Calculating information entropy

```def infoEntropy(dataSet):
classColumnCnt = Counter([element[-1] for element in dataSet])
Ent = 0
for symbol in classColumnCnt:
p_k = classColumnCnt[symbol]/len(dataSet)
Ent = Ent-p_k*log2(p_k)
return Ent
```

7. Sub dataset construction

#Sub dataset construction

```def makeAttributeData(dataSet,value,iColumn):
attributeData = []
for element in dataSet:
if element[iColumn]==value:
row = element[:iColumn]
row.extend(element[iColumn+1:])
attributeData.append(row)
return attributeData
```

Data under a certain attribute value, such as a dataset with clear texture.

8. Calculate information gain

#Calculate information gain

```def infoGain(dataSet,iColumn):
Ent = infoEntropy(dataSet)
tempGain = 0.0
attribute = set([element[iColumn] for element in dataSet])
for value in attribute:
attributeData = makeAttributeData(dataSet,value,iColumn)
tempGain = tempGain+len(attributeData)/len(dataSet)*infoEntropy(attributeData)
Gain = Ent-tempGain
return Gain
```

9. Select the optimal attribute

#Select optimal attribute

```def selectOptimalAttribute(dataSet,labels):
bestGain = 0
sequence = 0
for iColumn in range(0,len(labels)):#Ignore the last category column
Gain = infoGain(dataSet,iColumn)
if Gain>bestGain:
bestGain = Gain
sequence = iColumn
print(labels[iColumn],Gain)
return sequence
```

10. Establish decision tree

#Establish decision tree

```def createTree(dataSet,labels):
classification = targetClass(dataSet) #Get category type (collection de duplication)
if len(classification) == 1:
return list(classification)
if len(labels) == 1:
return majorityRule(dataSet)#Return categories with more sample types
sequence = selectOptimalAttribute(dataSet,labels)
print(labels)
optimalAttribute = labels[sequence]
del(labels[sequence])
myTree = {optimalAttribute:{}}
attribute = set([element[sequence] for element in dataSet])
for value in attribute:

print(myTree)
print(value)
subLabels = labels[:]
myTree[optimalAttribute][value] =  \
createTree(makeAttributeData(dataSet,value,sequence),subLabels)
return myTree
```

The tree itself is not complex and is implemented recursively.

11. Define main function

```def main():
filePath = 'watermelonData.xls'
data = getData(filePath)
labels = getLabels(data)
myTree = createTree(dataSet,labels)
return myTree
```

12. Spanning tree

```if __name__ == '__main__':
myTree = main()
```  ## 6. Draw decision tree

```#Drawing a decision tree using Matlotlib
import matplotlib.pyplot as plt

#Format text boxes and arrows
decisionNode = dict(boxstyle = "sawtooth", fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrow_args = dict(arrowstyle = "<-")
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'

#Draw node
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
createPlot.ax1.annotate(nodeTxt, xy = parentPt,\
xycoords = "axes fraction", xytext = centerPt, textcoords = 'axes fraction',\
va = "center", ha = "center", bbox = nodeType, arrowprops = arrow_args)

#Gets the number of leaf nodes of the decision tree
def getNumLeafs(myTree):
leafNumber = 0
firstStr = list(myTree.keys())
secondDict = myTree[firstStr]
for key in secondDict.keys():
if(type(secondDict[key]).__name__ == 'dict'):
leafNumber = leafNumber + getNumLeafs(secondDict[key])
else:
leafNumber += 1
return leafNumber

#Get the height of the decision tree (recursive)
def getTreeDepth(myTree):
maxDepth = 0
firstStr = list(myTree.keys())
secondDict = myTree[firstStr]
for key in secondDict.keys():
#test to see if the nodes are dictonaires, if not they are leaf nodes
if type(secondDict[key]).__name__=='dict':
thisDepth = 1 + getTreeDepth(secondDict[key])
else:   thisDepth = 1
if thisDepth > maxDepth: maxDepth = thisDepth
return maxDepth

def plotMidText(cntrPt, parentPt, txtString):
xMid = (parentPt-cntrPt)/2.0 + cntrPt
yMid = (parentPt-cntrPt)/2.0 + cntrPt
createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

#Painting tree
def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
depth = getTreeDepth(myTree)
firstStr = list(myTree.keys())     #the text label for this node should be this
cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
plotMidText(cntrPt, parentPt, nodeTxt)
plotNode(firstStr, cntrPt, parentPt, decisionNode)
secondDict = myTree[firstStr]
plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':
plotTree(secondDict[key],cntrPt,str(key))        #recursion
else:   #it's a leaf node print the leaf node
plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

#Canvas initialization
def createPlot(inTree):
fig = plt.figure(1, facecolor='white')
fig.clf()
axprops = dict(xticks=[], yticks=[])
createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
#createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses
plotTree.totalW = float(getNumLeafs(inTree))
plotTree.totalD = float(getTreeDepth(inTree))
plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
plotTree(inTree, (0.5,1.0), '')
plt.show()

```

Define main function

```def main():
#createPlot()
print(getTreeDepth(myTree)) #Depth of output number
print(getNumLeafs(myTree))  #Number of output leaves
createPlot(myTree)
```
```main()
``` # 2, The algorithm codes of ID3, C4.5 and CART are implemented for watermelon data set with SK learn library.

## 1. Establish decision tree based on information gain criterion (ID3 id3id3 or c4.5c4.5)

Import related libraries

```#Import related libraries
import pandas as pd
import graphviz
from sklearn.model_selection import train_test_split
from sklearn import tree
```

Import data

```f = open('watermalon.csv','r',encoding='utf-8')

x = data[["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]].copy()
y = data['Good melon'].copy()
print(data)

``` Data preprocessing
Numeric eigenvalues

```#Numeric eigenvalues
x = x.copy()
for i in ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]:
for j in range(len(x)):
if(x[i][j] == "dark green" or x[i][j] == "Curl up" or data[i][j] == "Turbid sound" \
or x[i][j] == "clear" or x[i][j] == "sunken" or x[i][j] == "Hard slip"):
x[i][j] = 1
elif(x[i][j] == "Black" or x[i][j] == "Slightly curled" or data[i][j] == "Dull" \
or x[i][j] == "Slightly paste" or x[i][j] == "Slightly concave" or x[i][j] == "Soft sticky"):
x[i][j] = 2
else:
x[i][j] = 3

y = y.copy()
for i in range(len(y)):
if(y[i] == "yes"):
y[i] = int(1)
else:
y[i] = int(-1)
```

Convert data to DataFrame data type

```#You need to convert the data x and y into a good format and the data frame dataframe, otherwise the format will report an error
x = pd.DataFrame(x).astype(int)
y = pd.DataFrame(y).astype(int)
print(x)
print(y)
``` Dividing training set and test set
80% of the data is used for training and 20% for testing

```x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print(x_train)

``` Modeling and training

```#Decision tree learning
clf = tree.DecisionTreeClassifier(criterion="entropy")                    #instantiation
clf = clf.fit(x_train, y_train)
score = clf.score(x_test, y_test)
print(score)
```

Visual decision tree

```feature_name = ["color and lustre","Root","stroke ","texture","Umbilicus","Tactile sensation"]
dot_data = tree.export_graphviz(clf
,feature_names= feature_name
,filled=True
,rounded=True
,out_file =None
)
graph = graphviz.Source(dot_data)
graph

```

## 2. Establish decision tree based on Gini index (cartcart)

According to the parameter interpretation of the DecisionTreeClassifier function, the decision tree can be established based on the gini index (C A R T CARTCART) by changing the criterion value to "gini".

```#Decision tree learning
clf = tree.DecisionTreeClassifier(criterion="gini")                    #instantiation
clf = clf.fit(x_train, y_train)
score = clf.score(x_test, y_test)
print(score)

```

Visual (C A R T CARTCART) decision tree # 3, Summary

The experiment is to realize the ID3 algorithm code for watermelon dataset under jupyter, and output the visual results.
The algorithm codes of ID3, C4.5 and CART are implemented for watermelon data set with SK learn library.