This post first walks through the multi-class classification demo that ships with xgboost, prints the resulting tree structure, and then uses that structure to explain how xgboost actually handles multi-class problems. I find this order easier to follow.
xgboost multi-class classification demo
This demo comes from the xgboost source code, in demo/multiclass_classification/train.py. The data used in that script (dermatology.data) can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data. The downloaded file has a .data suffix; rename it to .csv or .txt and it can be used directly. I renamed mine to data.txt.
Now let's look at the code in train.py ~
The full code is below. This dataset has 6 label classes, and I set the number of boosting rounds to 2.
import numpy as np
import xgboost as xgb

# labels need to be 0 to num_class - 1
data = np.loadtxt('data.txt', delimiter='\t',
                  converters={33: lambda x: int(x == '?'), 34: lambda x: int(x) - 1})
sz = data.shape

train = data[:int(sz[0] * 0.7), :]
test = data[int(sz[0] * 0.7):, :]

train_X = train[:, :33]
train_Y = train[:, 34]
test_X = test[:, :33]
test_Y = test[:, 34]

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# scale weight of positive examples
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = 6

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 2  # the number of boosting rounds is set to 2
bst = xgb.train(param, xg_train, num_round, watchlist)

# get prediction
pred = bst.predict(xg_test)
error_rate = np.sum(pred != test_Y) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))
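With 'multi:softmax', predict() returns a single class label per sample. As a small variation (a sketch reusing the param, xg_train, xg_test, watchlist and test_Y defined above), you can switch the objective to 'multi:softprob' to get the full per-class probability matrix instead:

# variation on the demo above: output per-class probabilities instead of labels
param['objective'] = 'multi:softprob'
bst_prob = xgb.train(param, xg_train, num_round, watchlist)

# reshape to (n_test_samples, num_class); each row is a softmax distribution
prob = bst_prob.predict(xg_test).reshape(test_Y.shape[0], 6)
pred_prob = np.argmax(prob, axis=1)
error_rate = np.sum(pred_prob != test_Y) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))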
How xgboost implements multi-class classification
After training, the key step is to dump the trained trees. The following line saves the tree structure as text, which I find much easier to read than a plot.
bst.dump_model('multiclass_model')
Let's open this file. Each booster is one tree; there are 12 trees in this model, booster[0] through booster[11].
booster[0]:
0:[f19<0.5] yes=1,no=2,missing=1
    1:[f21<0.5] yes=3,no=4,missing=3
        3:leaf=-0.0587906
        4:leaf=0.0906977
    2:[f6<0.5] yes=5,no=6,missing=5
        5:leaf=0.285523
        6:leaf=0.0906977
booster[1]:
0:[f27<1.5] yes=1,no=2,missing=1
    1:[f12<0.5] yes=3,no=4,missing=3
        3:[f31<0.5] yes=7,no=8,missing=7
            7:leaf=-1.67638e-09
            8:leaf=-0.056044
        4:[f4<0.5] yes=9,no=10,missing=9
            9:leaf=0.132558
            10:leaf=-0.0315789
    2:[f4<0.5] yes=5,no=6,missing=5
        5:[f11<0.5] yes=11,no=12,missing=11
            11:[f10<0.5] yes=15,no=16,missing=15
                15:leaf=0.264427
                16:leaf=0.0631579
            12:leaf=-0.0428571
        6:[f15<1.5] yes=13,no=14,missing=13
            13:leaf=-0.00566038
            14:leaf=-0.0539326
booster[2]:
0:[f32<1.5] yes=1,no=2,missing=1
    1:leaf=-0.0589339
    2:[f9<0.5] yes=3,no=4,missing=3
        3:leaf=0.280919
        4:leaf=0.0631579
booster[3]:
0:[f4<0.5] yes=1,no=2,missing=1
    1:[f0<1.5] yes=3,no=4,missing=3
        3:[f3<0.5] yes=7,no=8,missing=7
            7:[f27<0.5] yes=13,no=14,missing=13
                13:leaf=-0.0375
                14:leaf=0.0631579
            8:leaf=-0.0515625
        4:leaf=-0.058371
    2:[f2<1.5] yes=5,no=6,missing=5
        5:[f32<0.5] yes=9,no=10,missing=9
            9:[f15<0.5] yes=15,no=16,missing=15
                15:leaf=-0.0348837
                16:leaf=0.230097
            10:leaf=-0.0428571
        6:[f3<0.5] yes=11,no=12,missing=11
            11:leaf=0.0622641
            12:[f16<1.5] yes=17,no=18,missing=17
                17:leaf=-1.67638e-09
                18:[f3<1.5] yes=19,no=20,missing=19
                    19:leaf=-0.00566038
                    20:leaf=-0.0554622
booster[4]:
0:[f14<0.5] yes=1,no=2,missing=1
    1:leaf=-0.0590296
    2:leaf=0.255665
booster[5]:
0:[f30<0.5] yes=1,no=2,missing=1
    1:leaf=-0.0591241
    2:leaf=0.213253
booster[6]:
0:[f19<0.5] yes=1,no=2,missing=1
    1:[f21<0.5] yes=3,no=4,missing=3
        3:leaf=-0.0580493
        4:leaf=0.0831786
    2:leaf=0.214441
booster[7]:
0:[f27<1.5] yes=1,no=2,missing=1
    1:[f12<0.5] yes=3,no=4,missing=3
        3:[f31<0.5] yes=7,no=8,missing=7
            7:leaf=0.000227226
            8:leaf=-0.0551713
        4:[f15<1.5] yes=9,no=10,missing=9
            9:leaf=-0.0314418
            10:leaf=0.121289
    2:[f4<0.5] yes=5,no=6,missing=5
        5:[f11<0.5] yes=11,no=12,missing=11
            11:[f10<0.5] yes=15,no=16,missing=15
                15:leaf=0.206326
                16:leaf=0.0587528
            12:leaf=-0.0420568
        6:[f15<1.5] yes=13,no=14,missing=13
            13:leaf=-0.00512865
            14:leaf=-0.0531389
booster[8]:
0:[f32<1.5] yes=1,no=2,missing=1
    1:leaf=-0.0581933
    2:[f11<0.5] yes=3,no=4,missing=3
        3:leaf=0.0549185
        4:leaf=0.218241
booster[9]:
0:[f4<0.5] yes=1,no=2,missing=1
    1:[f0<1.5] yes=3,no=4,missing=3
        3:[f3<0.5] yes=7,no=8,missing=7
            7:[f27<0.5] yes=13,no=14,missing=13
                13:leaf=-0.0367718
                14:leaf=0.0600201
            8:leaf=-0.0506891
        4:leaf=-0.0576147
    2:[f27<0.5] yes=5,no=6,missing=5
        5:[f3<0.5] yes=9,no=10,missing=9
            9:leaf=0.0238016
            10:leaf=-0.054874
        6:[f5<1] yes=11,no=12,missing=11
            11:leaf=0.200442
            12:leaf=-0.0508502
booster[10]:
0:[f14<0.5] yes=1,no=2,missing=1
    1:leaf=-0.058279
    2:leaf=0.201977
booster[11]:
0:[f30<0.5] yes=1,no=2,missing=1
    1:leaf=-0.0583675
    2:leaf=0.178016
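You can also confirm this layout from code instead of reading the text file. A small sanity check of my own (assuming the bst object trained above):

# sanity check (not in the original demo): the model holds one text dump per tree
dumps = bst.get_dump()
print(len(dumps))  # 12 = num_round (2) * num_class (6)
# tree i belongs to class i % num_class, so booster[0] and booster[6] are the
# two trees for class 0, booster[1] and booster[7] for class 1, and so on.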
Don't forget that this is a 6-class problem and num_round is 2. The first round produces 6 trees, booster[0] to booster[5], and the second round produces another 6, booster[6] to booster[11]. So now you can see how xgboost trains a multi-class model: in each round it builds one tree per class (six here), and the softmax function turns the per-class scores into the loss. If you are not familiar with softmax, see "Derivation of softmax function and loss function for multi classification problems".
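To make the softmax step concrete, here is a small sketch of my own (not part of the original demo). It asks the model for its raw per-class margins via output_margin=True, applies softmax by hand, and checks that the argmax matches bst.predict. The margin of class k is simply the sum of the leaf values of class k's trees across rounds (booster[k] and booster[k+6] here), plus the global base_score, which shifts every class equally and so cancels in the softmax.

# illustration (my own, not from the original demo): rebuild the prediction
# from raw margins, i.e. the pre-softmax score of each class
margin = bst.predict(xg_test, output_margin=True).reshape(test_Y.shape[0], 6)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

prob = softmax(margin)
manual_pred = np.argmax(prob, axis=1)
print(np.all(manual_pred == pred))  # matches the labels from bst.predict(xg_test)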
I don't derive the formulas for the xgboost multi-class objective in this post; I plan to write a separate article on the derivation ~
That's it. If anything here is wrong, feel free to leave a message ~