xgboost multi-class classification: demo and implementation principle

This post first walks through the multi-class classification demo that ships with xgboost and prints the tree structure, and then explains how xgboost actually implements multi-class classification. I find this order easier to follow.

xgboost multi-class classification demo

The demo comes from the xgboost source code, in demo/multiclass_classification/train.py. The data it uses (dermatology.data) can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data. The downloaded file has a .data suffix; rename it to .csv or .txt and it can be used directly. I renamed my copy to data.txt.
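If you'd rather script the download and the conversion, here is a minimal sketch (my own helper, not part of the xgboost demo; it assumes you want a tab-separated data.txt, which is what the loadtxt call below expects):

import urllib.request

URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       'dermatology/dermatology.data')

# download the comma-separated UCI file and rewrite it as tab-separated data.txt
raw = urllib.request.urlopen(URL).read().decode('utf-8')
with open('data.txt', 'w') as f:
    for line in raw.splitlines():
        if line.strip():  # skip any empty trailing lines
            f.write(line.replace(',', '\t') + '\n')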

Now let's look at the code in train.py ~

I put the code directly below: this data has 6 label classes, and I set the number of boosting rounds to 2.

import numpy as np
import xgboost as xgb

# labels need to be 0 to num_class - 1
# column 33 (age) can contain '?', so its converter just flags missing values
# (the age column is not used as a feature below anyway);
# column 34 is the class label 1-6, shifted here to 0-5
data = np.loadtxt('data.txt', delimiter='\t',
        converters={33: lambda x:int(x == '?'), 34: lambda x:int(x) - 1})
sz = data.shape

train = data[:int(sz[0] * 0.7), :]
test = data[int(sz[0] * 0.7):, :]

train_X = train[:, :33]
train_Y = train[:, 34]

test_X = test[:, :33]
test_Y = test[:, 34]

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)
# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate (shrinkage)
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = 6

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 2   # The number of rounds is set to 2
bst = xgb.train(param, xg_train, num_round, watchlist)
# get prediction
pred = bst.predict(xg_test)
error_rate = np.sum(pred != test_Y) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))
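A small variant worth knowing about: multi:softmax makes predict return the class label directly, while multi:softprob returns one probability per class. The sketch below follows the same pattern as the script above; only the objective changes, and we take the argmax over the probability matrix by hand:

param['objective'] = 'multi:softprob'
bst_prob = xgb.train(param, xg_train, num_round, watchlist)

# one softmax probability per class, reshaped to (n_test, 6)
prob = bst_prob.predict(xg_test).reshape(test_Y.shape[0], 6)
pred_prob = np.argmax(prob, axis=1)

error_rate = np.sum(pred_prob != test_Y) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))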

How xgboost implements multi-class classification

After training, the key step is to print out the trained trees. The following line saves the tree structure as text; I find the text form much easier to work with than a picture.

bst.dump_model('multiclass_model')
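As a side note, dump_model also accepts optional arguments; a quick sketch of two of them (the call above, with the defaults, produces the plain text dump shown below):

# include gain and cover statistics for every split
bst.dump_model('multiclass_model_stats', with_stats=True)

# or dump the same trees as JSON instead of plain text
bst.dump_model('multiclass_model.json', dump_format='json')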

Let's open this file. Each booster is one tree; there are 12 trees in this model, booster[0] through booster[11]. Each internal node shows its split, e.g. [f19<0.5] means "feature 19 < 0.5", followed by the ids of the child nodes taken when the condition is true (yes), false (no), or the feature value is missing; each leaf line shows the value the tree outputs for samples that fall into that leaf.

booster[0]:
0:[f19<0.5] yes=1,no=2,missing=1
	1:[f21<0.5] yes=3,no=4,missing=3
		3:leaf=-0.0587906
		4:leaf=0.0906977
	2:[f6<0.5] yes=5,no=6,missing=5
		5:leaf=0.285523
		6:leaf=0.0906977
booster[1]:
0:[f27<1.5] yes=1,no=2,missing=1
	1:[f12<0.5] yes=3,no=4,missing=3
		3:[f31<0.5] yes=7,no=8,missing=7
			7:leaf=-1.67638e-09
			8:leaf=-0.056044
		4:[f4<0.5] yes=9,no=10,missing=9
			9:leaf=0.132558
			10:leaf=-0.0315789
	2:[f4<0.5] yes=5,no=6,missing=5
		5:[f11<0.5] yes=11,no=12,missing=11
			11:[f10<0.5] yes=15,no=16,missing=15
				15:leaf=0.264427
				16:leaf=0.0631579
			12:leaf=-0.0428571
		6:[f15<1.5] yes=13,no=14,missing=13
			13:leaf=-0.00566038
			14:leaf=-0.0539326
booster[2]:
0:[f32<1.5] yes=1,no=2,missing=1
	1:leaf=-0.0589339
	2:[f9<0.5] yes=3,no=4,missing=3
		3:leaf=0.280919
		4:leaf=0.0631579
booster[3]:
0:[f4<0.5] yes=1,no=2,missing=1
	1:[f0<1.5] yes=3,no=4,missing=3
		3:[f3<0.5] yes=7,no=8,missing=7
			7:[f27<0.5] yes=13,no=14,missing=13
				13:leaf=-0.0375
				14:leaf=0.0631579
			8:leaf=-0.0515625
		4:leaf=-0.058371
	2:[f2<1.5] yes=5,no=6,missing=5
		5:[f32<0.5] yes=9,no=10,missing=9
			9:[f15<0.5] yes=15,no=16,missing=15
				15:leaf=-0.0348837
				16:leaf=0.230097
			10:leaf=-0.0428571
		6:[f3<0.5] yes=11,no=12,missing=11
			11:leaf=0.0622641
			12:[f16<1.5] yes=17,no=18,missing=17
				17:leaf=-1.67638e-09
				18:[f3<1.5] yes=19,no=20,missing=19
					19:leaf=-0.00566038
					20:leaf=-0.0554622
booster[4]:
0:[f14<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0590296
	2:leaf=0.255665
booster[5]:
0:[f30<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0591241
	2:leaf=0.213253
booster[6]:
0:[f19<0.5] yes=1,no=2,missing=1
	1:[f21<0.5] yes=3,no=4,missing=3
		3:leaf=-0.0580493
		4:leaf=0.0831786
	2:leaf=0.214441
booster[7]:
0:[f27<1.5] yes=1,no=2,missing=1
	1:[f12<0.5] yes=3,no=4,missing=3
		3:[f31<0.5] yes=7,no=8,missing=7
			7:leaf=0.000227226
			8:leaf=-0.0551713
		4:[f15<1.5] yes=9,no=10,missing=9
			9:leaf=-0.0314418
			10:leaf=0.121289
	2:[f4<0.5] yes=5,no=6,missing=5
		5:[f11<0.5] yes=11,no=12,missing=11
			11:[f10<0.5] yes=15,no=16,missing=15
				15:leaf=0.206326
				16:leaf=0.0587528
			12:leaf=-0.0420568
		6:[f15<1.5] yes=13,no=14,missing=13
			13:leaf=-0.00512865
			14:leaf=-0.0531389
booster[8]:
0:[f32<1.5] yes=1,no=2,missing=1
	1:leaf=-0.0581933
	2:[f11<0.5] yes=3,no=4,missing=3
		3:leaf=0.0549185
		4:leaf=0.218241
booster[9]:
0:[f4<0.5] yes=1,no=2,missing=1
	1:[f0<1.5] yes=3,no=4,missing=3
		3:[f3<0.5] yes=7,no=8,missing=7
			7:[f27<0.5] yes=13,no=14,missing=13
				13:leaf=-0.0367718
				14:leaf=0.0600201
			8:leaf=-0.0506891
		4:leaf=-0.0576147
	2:[f27<0.5] yes=5,no=6,missing=5
		5:[f3<0.5] yes=9,no=10,missing=9
			9:leaf=0.0238016
			10:leaf=-0.054874
		6:[f5<1] yes=11,no=12,missing=11
			11:leaf=0.200442
			12:leaf=-0.0508502
booster[10]:
0:[f14<0.5] yes=1,no=2,missing=1
	1:leaf=-0.058279
	2:leaf=0.201977
booster[11]:
0:[f30<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0583675
	2:leaf=0.178016

Don't forget that this is a 6-class problem and num_round is 2. The first round produces 6 trees, booster[0] to booster[5], and the second round produces another 6, booster[6] to booster[11]. So now you can see how xgboost trains a multi-class model: in every boosting round it builds one tree per class (6 here), and booster[r * num_class + k] is the tree built for class k in round r. The score for class k is the sum of the leaf values of all of class k's trees, and the softmax function, p_k = exp(F_k) / sum_j exp(F_j), turns the 6 scores into probabilities; the training loss is the corresponding softmax (multi-class cross-entropy) loss, and multi:softmax simply outputs the class with the highest probability. If you are not familiar with softmax, you can refer to "Derivation of softmax function and loss function for multi classification problems".
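To see this grouping concretely, you can ask predict for the raw margins and redo the softmax by hand. The sketch below is my own sanity check, not part of the original demo; it assumes bst and pred from the training script above, and that output_margin=True returns one untransformed score per class:

# raw per-class scores: for class k this is base_score plus the leaf values
# of booster[k] and booster[k + 6] (its tree from each of the 2 rounds)
margin = bst.predict(xg_test, output_margin=True).reshape(test_Y.shape[0], 6)

# softmax over the 6 scores gives the class probabilities
prob = np.exp(margin) / np.exp(margin).sum(axis=1, keepdims=True)

# the argmax of those probabilities reproduces the multi:softmax prediction
assert np.all(np.argmax(prob, axis=1) == pred)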

The formula derivation for xgboost multi-class classification is not covered in this post; I plan to write a separate article on the derivation ~

That's it. If there's something wrong, you're welcome to leave a message ~

 
