xgboost: Higgs Boson Machine Learning Challenge

The original source of the code: https://github.com/dmlc/xgboost/tree/master/demo/kaggle-higgs 

1. Introduction of the Problem

Competition website: https://www.kaggle.com/c/higgs-boson/

The Higgs boson is an elementary particle in the Standard Model, named after physicist Peter Higgs.

On July 4, 2012, the European Organization for Nuclear Research (CERN) announced that the Compact Muon Solenoid (CMS) detector at the LHC had observed a new boson (4.9 standard deviations) with a mass of 125.3 ± 0.6 GeV, and the ATLAS detector had measured a new boson (5 standard deviations) with a mass of 126.5 GeV, consistent with the long-sought Higgs boson.

On March 14, 2013, CERN issued a press release officially announcing that the previously detected new particle was tentatively identified as a Higgs boson with zero spin and even parity, two basic properties of the Higgs boson. However, some experimental results did not match the theoretical predictions, and more data were still awaiting processing and analysis.
On October 8, 2013, François Englert and Peter Higgs were awarded the 2013 Nobel Prize in Physics for the theory of the mechanism that gives subatomic particles their mass, recently confirmed through the discovery of the predicted elementary particle by the ATLAS and CMS experiments at CERN's Large Hadron Collider. An important characteristic of a particle is how it decays into other particles. CERN uses ATLAS in physics experiments to search for new particles. A signal of the Higgs boson decaying into two tau particles was recently observed, but it is only a small signal buried in background noise.

The aim of the competition is to use machine learning methods to improve the discovery significance of the ATLAS experiment. The contest does not require background knowledge of particle physics (although such knowledge is still largely useful in solving the practical problem). The competition data are simulated according to the characteristics of events detected by ATLAS, and the task is to classify each event as "tau decay of a Higgs boson" (signal) or "background".

2. Introduction of the Data Set

Training data sets: http://download.csdn.net/download/u011630575/10268125

training.csv: the training set contains 250,000 events, each with an ID, 30 features, a weight, and a label.

test.csv: the test set contains 550,000 events, each with an ID and 30 features.

All features are floating point, except PRI_jet_num, which is an integer.
Features prefixed with PRI (PRImitives) are "raw" quantities about the bunch collision as measured by the detector.
Features prefixed with DER (DERived) are quantities computed from the primitive features, chosen by the ATLAS physicists.
Missing values are encoded as -999.0, which is outside the normal range of all features.

The specific semantics of the features, weights, and labels can be found in the CERN technical documentation (linked from the competition website).
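
As a quick sanity check of the layout described above (ID in column 0, 30 features in columns 1–30, weight in column 31, label in column 32, missing values as -999.0), a minimal sketch like the following can be used; it assumes the training CSV has been downloaded to ./data/higgsboson_training.csv, the same path used by the training script below.

import numpy as np

# load the training CSV: column 0 = EventId, columns 1-30 = features,
# column 31 = weight, column 32 = label ('s' mapped to 1, 'b' to 0)
dtrain = np.loadtxt('./data/higgsboson_training.csv', delimiter=',', skiprows=1,
                    converters={32: lambda x: int(x == 's'.encode('utf-8'))})
print('shape:', dtrain.shape)  # expected: (250000, 33)
features = dtrain[:, 1:31]
print('missing entries per feature:', (features == -999.0).sum(axis=0))  # -999.0 marks missing values
print('signal fraction:', dtrain[:, 32].mean())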

3. Code Description

1. Reading data

#coding:utf-8
# this is the example script to use xgboost to train
import numpy as np
import xgboost as xgb

test_size = 550000

# path to where the data lies
dpath = './data/'

# load in training data directly with numpy: delimiter=',' splits the CSV, skiprows=1 skips the header,
# and the converter maps the label column ('s'/'b') to 1/0
dtrain = np.loadtxt(dpath + '/higgsboson_training.csv', delimiter=',', skiprows=1,
                    converters={32: lambda x: int(x == 's'.encode('utf-8'))})
print ('finish loading from csv ')

2. Data Preprocessing

label  = dtrain[:,32]  # column 32: label (1 = signal, 0 = background)
data   = dtrain[:,1:31]  # columns 1-30: the 30 features
# rescale weight to make it same as test set
weight = dtrain[:,31] * float(test_size) / len(label)  # column 31 is the event weight; len(label) is 250,000

# The weight of positive and negative samples is the proportion of positive and negative samples in training set.
sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0  ) # total weight of positive (signal) samples
sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0  ) # total weight of negative (background) samples
# print weight statistics
print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos )) # weighted sums, not simple counts

3. Training environment preparation

Training parameter setting:
1. objective [default reg:linear]:
Defines the loss function to be minimized:
binary:logistic — logistic regression for binary classification; returns the predicted probability (not the class).
The Higgs Boson competition is a binary classification task, so binary logistic regression is used.
2. scale_pos_weight [default 1]:
Setting this parameter to a positive value can make the algorithm converge faster when the classes are very unbalanced.
In the Higgs Boson competition, the weight of each (positive/negative) sample is given in the training set. Adding up the weights of all positive and all negative samples gives the weighted proportion of positive to negative samples in the training set.
3. eta [default 0.3]
The learning rate: the shrinkage step used in each update to prevent over-fitting. After each boosting iteration the weights of the new features are obtained directly, and eta shrinks these weights to make the boosting process more conservative. Range: [0,1].
4. max_depth [default 6]
Defines the maximum depth of a tree, also used to avoid over-fitting. The larger max_depth is, the more complex the model and the more specific, localized patterns it will learn.
Typical values: 3-10
5. eval_metric [default depends on the value of the objective parameter]
The evaluation metric for validation data.
For regression problems the default is rmse; for classification problems the default is error.
Typical values are:
rmse: root mean square error
mae: mean absolute error
logloss: negative log-likelihood
error: binary classification error rate (threshold 0.5)
merror: multi-class classification error rate
mlogloss: multi-class log loss
auc: area under the ROC curve, which measures ranking quality across all thresholds
6. nthread [default is the maximum possible number of threads]
This parameter controls multithreading and should be set to the number of CPU cores available.

If you want to use all of the CPU's cores, do not set this parameter; the algorithm will detect the number automatically.

The training data are imported into a DMatrix so that the subsequent training is faster.

# construct xgboost.DMatrix from numpy array, treat -999.0 as missing value
dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight )

# setup parameters for xgboost
param = {}
# use logistic regression loss, use raw prediction before logistic transformation
# since we only need the rank
param['objective'] = 'binary:logitraw'
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
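
The remaining parameters described in the list above (scale_pos_weight, eval_metric, nthread) are not set at this point in the example. As a sketch, and purely for illustration, they could be added to the same dictionary like this, reusing the sum_wpos and sum_wneg computed during preprocessing (the choice of 'auc' and of 8 threads is ours, not part of the original script):

# illustrative additions, not part of the original script
param['scale_pos_weight'] = sum_wneg / sum_wpos  # total negative weight / total positive weight
param['eval_metric'] = 'auc'                     # metric reported on the evaluation sets
param['nthread'] = 8                             # CPU threads to use; omit to auto-detect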

4. xgboost model training and saving

The parameters of the train function: xgboost.train(params,dtrain,num_boost_round=10,evals=(),obj=None,feval=None,maximize=False,early_stopping_rounds=None, evals_result=None,verbose_eval=True,learning_rates=None,xgb_model=None)

params: a dictionary of parameter keywords and corresponding values, of the form params = {'booster': 'gbtree', 'eta': 0.1}.

dtrain: the training data (a DMatrix)

num_boost_round: the number of boosting rounds (trees) to train

evals: a list of (DMatrix, name) pairs that are evaluated during training, of the form evals = [(dtrain, 'train'), (dval, 'val')] or evals = [(dtrain, 'train')]

A watchlist of the first form lets us monitor performance on a validation set during training, as in the sketch below.
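
For illustration, a call with a watchlist might look like the following; dval here stands for a hypothetical held-out validation DMatrix, which this script does not actually construct.

# illustrative only: dval is a hypothetical validation DMatrix not built in this script
watchlist = [(dtrain, 'train'), (dval, 'val')]
bst = xgb.train(param, dtrain, num_boost_round=10, evals=watchlist,
                early_stopping_rounds=5)  # stop if the metric on 'val' does not improve for 5 rounds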

num_round = 10
# num_round = 1000

print ('running cross validation, with preprocessing function')
# define the preprocessing function
# used to return the preprocessed training, test data, and parameter
# we can use this to do weight rescale, etc.
# as an example, we try to set scale_pos_weight
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label==1)
    param['scale_pos_weight'] = ratio
    wtrain = dtrain.get_weight()  # obtain training fold weights
    wtest = dtest.get_weight()    # obtain validation fold weights
    sum_weight = sum(wtrain) + sum(wtest) 
    wtrain *= sum_weight / sum(wtrain) 
    wtest *= sum_weight / sum(wtest)
    dtrain.set_weight(wtrain)
    dtest.set_weight(wtest)
    return (dtrain, dtest, param)

# do cross validation, for each fold cross validation
# the dtrain, dtest, param will be passed into fpreproc
# then the return value of fpreproc will be used to generate
# results of that fold
cvresult = xgb.cv(param, dtrain, num_round, nfold=2,
       metrics={'ams@0.15', 'auc'}, early_stopping_rounds=2, seed = 0, fpreproc = fpreproc)

print ('finish cross validation')
print (cvresult)
n_estimators = cvresult.shape[0]  # number of boosting rounds actually run (10 here)

from matplotlib import pyplot
# plot
test_means = cvresult['test-ams@0.15-mean']
test_stds = cvresult['test-ams@0.15-std'] 
        
train_means = cvresult['train-ams@0.15-mean']
train_stds = cvresult['train-ams@0.15-std'] 

x_axis = range(0, n_estimators)
pyplot.errorbar(x_axis, test_means, yerr=test_stds ,label='Test')
pyplot.errorbar(x_axis, train_means, yerr=train_stds ,label='Train')
pyplot.title("HiggsBoson n_estimators vs ams@0.15")
pyplot.xlabel( 'n_estimators' )
pyplot.ylabel( 'ams@0.15' )
pyplot.savefig( 'HiggsBoson_estimators.png' )  # save the figure
#pyplot.show()  # showing the figure here blocks execution until the window is closed; the remaining code runs only after that
#Fit the algorithm on the data, the CV function has no refit step
#alg.fit(X_train, y_train, eval_metric='ams@0.15')
print ('train model using the best parameters by cv ... ')
bst = xgb.train( param, dtrain, n_estimators );

# save out model
bst.save_model('higgs_cv.model')   #Save the model

print ('train finished')

pyplot.show() 

5. Testing process

Read test data and trained models

#coding:utf-8
import numpy as np
import xgboost as xgb

# path to where the data lies
dpath = './data/'

modelfile = 'higgs_cv.model' # trained model file
outfile = 'higgs.pred.csv'   #output file
# make top 15% as positive
threshold_ratio = 0.15 

# load in test, directly use numpy
dtest = np.loadtxt( dpath+'/higgsboson_test.csv', delimiter=',', skiprows=1 )
data   = dtest[:,1:31]
idx = dtest[:,0]

print ('finish loading from csv ')

XGBoost test environment preparation

Import the test data into a DMatrix and load the trained model:

xgmat = xgb.DMatrix( data, missing = -999.0 )
bst = xgb.Booster({'nthread':8}, model_file = modelfile)

Run prediction:

ypred = bst.predict( xgmat )

Test results are collated and written to the result submission file

res  = [ ( int(idx[i]), ypred[i] ) for i in range(len(ypred)) ]

rorder = {}
for k, v in sorted( res, key = lambda x:-x[1] ):
    rorder[ k ] = len(rorder) + 1

# write out predictions
ntop = int( threshold_ratio * len(rorder) )  # number of top-ranked events to label as signal
fo = open(outfile, 'w')
nhit = 0
ntot = 0
fo.write('EventId,RankOrder,Class\n')  # write the header line
for k, v in res:
    if rorder[k] <= ntop:
        lb = 's'   #Class s
        nhit += 1
    else:
        lb = 'b'   #Class b
    # change output rank order to follow Kaggle convention
    fo.write('%s,%d,%s\n' % ( k,  len(rorder)+1-rorder[k], lb ) )
    ntot += 1
fo.close()

print ('finished writing into prediction file')
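
For reference, the ams@0.15 metric used in the cross validation above and the 15% threshold used here both refer to the competition's evaluation measure, the Approximate Median Significance (AMS). A minimal sketch of the formula, assuming the competition's published definition with regularization constant b_r = 10 (the helper name is ours):

import math

def ams(s, b, b_r=10.0):
    # s: total weight of true signal events selected as signal
    # b: total weight of background events selected as signal
    return math.sqrt(2.0 * ((s + b + b_r) * math.log(1.0 + s / (b + b_r)) - s))

# 'ams@0.15' evaluates AMS after labeling the top 15% of ranked predictions as signal,
# which is why the prediction script above keeps threshold_ratio = 0.15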

4. Complete Code

Model building code

#coding:utf-8
# this is the example script to use xgboost to train
import numpy as np
import xgboost as xgb

test_size = 550000

# path to where the data lies
dpath = './data/'

# load in training data, directly use numpy
dtrain = np.loadtxt(dpath + '/higgsboson_training.csv', delimiter=',', skiprows=1,
                    converters={32: lambda x: int(x == 's'.encode('utf-8'))})
print ('finish loading from csv ')

label  = dtrain[:,32]
data   = dtrain[:,1:31]
# rescale weight to make it same as test set
weight = dtrain[:,31] * float(test_size) / len(label)

# The weight of positive and negative samples is the proportion of positive and negative samples in training set.
sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0  )
sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0  )

# print weight statistics
print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos ))

# construct xgboost.DMatrix from numpy array, treat -999.0 as missing value
dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight )

# setup parameters for xgboost
param = {}
# use logistic regression loss, use raw prediction before logistic transformation
# since we only need the rank
param['objective'] = 'binary:logitraw'
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1

# boost 1000 trees
num_round = 10
# num_round = 1000

print ('running cross validation, with preprocessing function')
# define the preprocessing function
# used to return the preprocessed training, test data, and parameter
# we can use this to do weight rescale, etc.
# as an example, we try to set scale_pos_weight
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label==1)
    param['scale_pos_weight'] = ratio
    wtrain = dtrain.get_weight()
    wtest = dtest.get_weight()
    sum_weight = sum(wtrain) + sum(wtest)
    wtrain *= sum_weight / sum(wtrain)
    wtest *= sum_weight / sum(wtest)
    dtrain.set_weight(wtrain)
    dtest.set_weight(wtest)
    return (dtrain, dtest, param)

# do cross validation, for each fold
# the dtrain, dtest, param will be passed into fpreproc
# then the return value of fpreproc will be used to generate
# results of that fold
cvresult = xgb.cv(param, dtrain, num_round, nfold=2,
       metrics={'ams@0.15', 'auc'}, early_stopping_rounds=2, seed = 0, fpreproc = fpreproc)

print ('finish cross validation')
print (cvresult)

n_estimators = cvresult.shape[0]

from matplotlib import pyplot

# plot
test_means = cvresult['test-ams@0.15-mean']
test_stds = cvresult['test-ams@0.15-std']

train_means = cvresult['train-ams@0.15-mean']
train_stds = cvresult['train-ams@0.15-std']

x_axis = range(0, n_estimators)
pyplot.errorbar(x_axis, test_means, yerr=test_stds, label='Test')
pyplot.errorbar(x_axis, train_means, yerr=train_stds, label='Train')
pyplot.title("HiggsBoson n_estimators vs ams@0.15")
pyplot.xlabel('n_estimators')
pyplot.ylabel('ams@0.15')
pyplot.savefig('HiggsBoson_estimators.png')
# pyplot.show()

#Fit the algorithm on the data, the CV function has no refit step
#alg.fit(X_train, y_train, eval_metric='ams@0.15')
print ('train model using the best parameters by cv ... ')
bst = xgb.train( param, dtrain, n_estimators );

# save out model
bst.save_model('higgs_cv.model')

print ('train finished')

pyplot.show()

Test code

#coding:utf-8
import numpy as np
import xgboost as xgb

# path to where the data lies
dpath = './data/'

modelfile = 'higgs_cv.model'
outfile = 'higgs.pred.csv'
# make top 15% as positive
threshold_ratio = 0.15

# load in test, directly use numpy
dtest = np.loadtxt( dpath+'/higgsboson_test.csv', delimiter=',', skiprows=1 )
data   = dtest[:,1:31]
idx = dtest[:,0]

print ('finish loading from csv ')

xgmat = xgb.DMatrix( data, missing = -999.0 )
bst = xgb.Booster({'nthread':8}, model_file = modelfile)

ypred = bst.predict( xgmat )

res  = [ ( int(idx[i]), ypred[i] ) for i in range(len(ypred)) ]

rorder = {}
for k, v in sorted( res, key = lambda x:-x[1] ):
    rorder[ k ] = len(rorder) + 1

# write out predictions
ntop = int( threshold_ratio * len(rorder ) )
fo = open(outfile, 'w')
nhit = 0
ntot = 0
fo.write('EventId,RankOrder,Class\n')
for k, v in res:
    if rorder[k] <= ntop:
        lb = 's'
        nhit += 1
    else:
        lb = 'b'
    # change output rank order to follow Kaggle convention
    fo.write('%s,%d,%s\n' % ( k,  len(rorder)+1-rorder[k], lb ) )
    ntot += 1
fo.close()

print ('finished writing into prediction file')

 
