Reference link:
https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.20850282.J_3678908510.8.4bcd4d572bJoDp&postId=170951
1, Learning objectives
- Learn the machine learning models commonly used in the field of financial risk control.
- Learn the modeling process and the parameter tuning process of machine learning models.
2, Learning content
2.1 logistic regression model
Understand logistic regression model;
Application of logistic regression model;
Advantages and disadvantages of logistic regression.
2.2 tree model
Understand tree model;
Application of tree model;
Advantages and disadvantages of tree model.
2.3 ensemble models
Ensemble models based on the bagging idea
- Random forest model
Ensemble models based on the boosting idea
- XGBoost model
- LightGBM model
- CatBoost model
2.4 model comparison and performance evaluation
Regression model / tree model / ensemble model;
Model evaluation method;
Model evaluation results.
2.5 model parameter tuning
Greedy tuning;
Grid search tuning;
Bayesian tuning.
3, Model comparison and performance evaluation
3.1 logistic regression
Advantages
- Training is fast, and the amount of computation is related only to the number of features;
- It is simple and easy to understand, and the model is highly interpretable: from the feature weights we can see the impact of different features on the final result;
- It is suitable for binary classification problems and does not require scaling of the input features;
- The memory footprint is small, since only the feature values of each dimension need to be stored.
Disadvantages
- For logistic regression, missing values and outliers need to be handled in advance [refer to Task 3, Feature Engineering];
- Logistic regression cannot solve nonlinear problems, because its decision surface is linear;
- It is sensitive to multicollinearity and has difficulty handling imbalanced data;
- The accuracy is not very high, because the model form is simple and it is difficult to fit the true distribution of the data.
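To make the above concrete, here is a minimal sketch of a logistic regression baseline with scikit-learn; it assumes the features X_train and labels y_train prepared in Section 4.3, already cleaned of missing values and outliers.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Cross-validated AUC of a plain logistic regression baseline
lr = LogisticRegression(max_iter=1000)
print('logistic regression CV AUC:', cross_val_score(lr, X_train, y_train, cv=5, scoring='roc_auc').mean())

# The fitted weights show how each feature pushes the prediction, which is
# where the interpretability mentioned above comes from
lr.fit(X_train, y_train)
print(dict(zip(X_train.columns, lr.coef_[0])))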
3.2 decision tree model
Advantages
- Simple and intuitive; the resulting decision tree can be visualized
- The data needs no preprocessing, normalization, or missing-value handling
- It can handle both discrete and continuous values
Disadvantages
- Decision trees overfit very easily, which weakens generalization (this can be mitigated by proper pruning)
- A greedy algorithm is used, so it easily settles on a local optimum
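As a small illustration of the pruning point above, the sketch below limits tree depth and leaf size with scikit-learn's DecisionTreeClassifier; the candidate depths are illustrative, and X_train / y_train are assumed from Section 4.3.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# A fully grown tree (max_depth=None) tends to overfit; shallower, pruned
# trees usually generalize better on this kind of tabular data
for depth in [3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=50)
    auc = cross_val_score(tree, X_train, y_train, cv=5, scoring='roc_auc').mean()
    print('max_depth={}: CV AUC={:.4f}'.format(depth, auc))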
3.3 ensemble models and ensemble methods
An ensemble completes the learning task by combining multiple learners: several weak learners are combined into one strong classifier through an ensemble method. As a result, the generalization ability of ensemble learning is generally better than that of a single classifier.
The main ensemble methods are Bagging and Boosting. Both combine existing classification or regression algorithms in a certain way to form a more powerful classifier; both integrate several classifiers into one, but the way they are combined differs, and so do the final results. Common ensemble models based on the Bagging idea include random forest; ensemble models based on the Boosting idea include AdaBoost, GBDT, XGBoost, LightGBM, etc.
The differences between Bagging and Boosting are summarized as follows (a small sketch contrasting the two follows the list):
- Sample selection: in Bagging, each round's training set is sampled from the original set, so the training sets of different rounds are independent; in Boosting, the training set of each round stays the same, but the weight of each sample changes and is adjusted according to the classification results of the previous round
- Sample weights: Bagging uses uniform sampling, so every sample has equal weight; Boosting keeps adjusting the sample weights according to the error rate: the larger the error, the larger the weight
- Prediction functions: in Bagging, all prediction functions have equal weight; in Boosting, each weak classifier has its own weight, and classifiers with smaller classification error receive larger weights
- Parallel computation: in Bagging, the prediction functions can be generated in parallel; in Boosting, they can only be generated sequentially, because each model's parameters depend on the results of the previous round.
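The following sketch contrasts the two ideas with scikit-learn: a random forest (bagging, trees built independently) versus AdaBoost (boosting, learners built sequentially with sample re-weighting). The hyper-parameters are illustrative, and X_train / y_train are assumed from Section 4.3.

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

bagging_model = RandomForestClassifier(n_estimators=100, n_jobs=-1)   # independent trees, can be built in parallel
boosting_model = AdaBoostClassifier(n_estimators=100)                 # sequential learners, re-weights misclassified samples

for name, m in [('bagging (random forest)', bagging_model), ('boosting (AdaBoost)', boosting_model)]:
    auc = cross_val_score(m, X_train, y_train, cv=5, scoring='roc_auc').mean()
    print(name, 'CV AUC:', auc)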
3.4 model evaluation method
For the model, the error on the training set is called training error or empirical error, while the error on the test set is called test error.
We care more about the model's ability to generalize to new samples, that is, we hope to learn the general patterns of all potential samples from the existing ones. If the model fits the training samples too well, it may treat peculiarities of the training samples as general properties of all potential samples, and we then run into the problem of overfitting.
Therefore, we usually divide the existing data set into two parts: training set and test set. The training set is used to train the model, and the test set is used to evaluate the discrimination ability of the model for new samples.
For the division of data sets, we usually ensure that the following two conditions are met:
- The distribution of training set and test set should be consistent with the real distribution of samples, that is, both training set and test set should be sampled independently and identically from the real distribution of samples;
- The training set and test set should be mutually exclusive.
There are three methods for dividing the data set: the hold-out method, cross-validation, and the bootstrap method.
① Hold-out method
The hold-out method directly divides the data set D into two mutually exclusive sets, one used as the training set S and the other as the test set T. Note that the split should preserve the consistency of the data distribution as much as possible, i.e., avoid introducing extra bias during the split that would affect the final result. To keep the data distribution consistent, we usually use stratified sampling.
Tips: generally, about 2/3 to 4/5 of the samples in data set D are used as the training set, and the rest as the test set.
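A minimal sketch of the hold-out split with stratified sampling (the 1/4 test fraction is illustrative; X and y stand for the prepared features and labels):

from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          test_size=0.25,   # keep roughly 3/4 for training
                                          stratify=y,       # stratified sampling keeps class ratios consistent
                                          random_state=2020)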
② Cross-validation
K-fold cross-validation divides the data set D into k parts, of which k-1 parts are used as the training set and the remaining part as the test set. In this way we obtain k groups of training/test sets, carry out k rounds of training and testing, and finally return the mean of the k test results. The splits in cross-validation are still based on stratified sampling.
For cross-validation, the choice of k often determines the stability and fidelity of the evaluation. Usually k is set to 10.
When k equals the number of samples m, so that each fold contains a single sample, we call it the leave-one-out method.
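A short sketch of stratified k-fold cross-validation with k=10, returning the mean of the k test results (X and y are the prepared features and labels, and model is any scikit-learn style classifier; all three are assumptions here):

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=2020)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))
print('mean of the 10 test results:', np.mean(scores))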
③ Bootstrap method
Each time, we draw one sample from data set D and add it to the training set, then put the sample back; repeating this m times gives a training set of size m. Some samples appear repeatedly and some never appear, and we use the samples that never appear as the test set.
This sampling scheme works because about 36.8% of the samples in D (the limit of (1 - 1/m)^m as m grows) never appear in the training set. Both the hold-out method and cross-validation use stratified sampling to split the data, while the bootstrap method uses repeated sampling with replacement.
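The 36.8% figure is easy to check numerically; the sketch below draws m samples with replacement and counts how many of the original samples never appear:

import numpy as np

m = 100000
boot_idx = np.random.choice(m, size=m, replace=True)   # sampling with replacement
oob_ratio = 1 - len(np.unique(boot_idx)) / m           # fraction of samples never drawn
print('fraction never sampled: {:.3f} (theory: (1 - 1/m)^m -> 1/e = 0.368)'.format(oob_ratio))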
Data set partitioning summary
- When there is enough data, the hold-out method or k-fold cross-validation is usually used to split the training/test sets;
- When the data set is small and it is hard to split it effectively into training/test sets, the bootstrap method is used;
- When the data set is small but can be split effectively, it is best to use the leave-one-out method, because it is the most accurate.
3.5 model evaluation criteria
AUC is chosen as the evaluation metric for this task; similar metrics include KS, F1-score, etc.
So what is AUC?
In logistic regression, a threshold is usually set to define positive and negative examples: predictions above the threshold are positive, and those below it are negative. If we lower this threshold, more samples are identified as positive, which raises the recognition rate of the positive class, but at the same time more negatives are wrongly identified as positive. The ROC curve is introduced to express this trade-off intuitively.
The points in ROC space are computed from the classification results, and connecting them gives the ROC curve. The x-axis is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR). In general, this curve should lie above the line connecting (0,0) and (1,1).
Four points on the ROC curve:
Point (0,1): FPR=0, TPR=1, which means FN=0 and FP=0, i.e., all samples are classified correctly;
Point (1,0): FPR=1, TPR=0, the worst classifier, which avoids all correct predictions;
Point (0,0): FPR=TPR=0, so FP=TP=0, and the classifier predicts every instance as negative;
Point (1,1): the classifier predicts every instance as positive.
In short: the closer the ROC curve is to the upper-left corner, the better the classifier performs and the better its generalization. Generally speaking, if the ROC curve is smooth, there is probably not much overfitting.
But given two models, how do we judge which one generalizes better? There are mainly two approaches:
If the ROC curve of model A completely covers the ROC curve of model B, we conclude that model A is better than model B;
If the two curves intersect, we can compare the areas under the ROC curves (the area enclosed by each curve and the axes): the larger the area, the better the model. This area is called the AUC (Area Under the ROC Curve).
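To connect the threshold discussion with AUC, the sketch below computes TPR and FPR at a few thresholds and then the overall AUC; y_true and y_score stand for the true labels and a model's predicted probabilities (both are assumptions here, not variables defined earlier).

import numpy as np
from sklearn.metrics import roc_auc_score

for t in [0.7, 0.5, 0.3]:                                   # lowering the threshold ...
    pred = (y_score >= t).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1)); fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0)); tn = np.sum((pred == 0) & (y_true == 0))
    print('threshold={}: TPR={:.3f}, FPR={:.3f}'.format(t, tp / (tp + fn), fp / (fp + tn)))

print('AUC:', roc_auc_score(y_true, y_score))               # area under the ROC curve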
4, Code example
4.1 import related packages and settings
import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt
4.2 reading data
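The reduce_mem_usage helper called below comes from the baseline notebook; a minimal sketch of such a helper (an assumption, not the baseline's exact code) simply downcasts the numeric columns:

import numpy as np

def reduce_mem_usage(df):
    # Downcast numeric columns to save memory (simplified version)
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype(np.float32)
        elif df[col].dtype == 'int64':
            df[col] = df[col].astype(np.int32)
    return df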
data = pd.read_csv('dataset/data_for_model.csv')
data = reduce_mem_usage(data)
4.3 simple modeling
- Tips1: most real financial risk control projects involve credit scoring, so the model features need to be well interpretable; therefore most real projects still use logistic regression as the base model. In a competition, however, only the score matters and rigorous interpretability is not required, so most modeling is based on ensemble algorithms.
- Tips2: because of the characteristics of the logistic regression algorithm, outliers and missing values need to be handled in advance [refer to Task 3].
- Tips3: because of the characteristics of tree models, handling outliers and missing values can be skipped; however, readers who know the business well can also handle them themselves, and the result may be better than leaving it to the model.
Note: the source data for the modeling below has already gone through the feature engineering of the baseline; no additional handling of outliers or missing values has been done.
from sklearn.model_selection import KFold

# Separate the data sets to make cross validation convenient
X_train = data.loc[data['sample']=='train', :].drop(['id', 'issueDate', 'isDefault', 'sample'], axis=1)
X_test = data.loc[data['sample']=='test', :].drop(['id', 'issueDate', 'isDefault', 'sample'], axis=1)
y_train = data.loc[data['sample']=='train', 'isDefault']

# 5-fold cross validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

"""Split the training data into a training set and a validation set"""
from sklearn.model_selection import train_test_split
import lightgbm as lgb

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'learning_rate': 0.1,
    'metric': 'auc',
    'min_child_weight': 1e-3,
    'num_leaves': 31,
    'max_depth': -1,
    'reg_lambda': 0,
    'reg_alpha': 0,
    'feature_fraction': 1,
    'bagging_fraction': 1,
    'bagging_freq': 0,
    'seed': 2020,
    'nthread': 8,
    'silent': True,
    'verbose': -1,
}

"""Train the model on the training split"""
model = lgb.train(params, train_set=train_matrix, valid_sets=valid_matrix,
                  num_boost_round=20000, verbose_eval=1000, early_stopping_rounds=200)

from sklearn import metrics
from sklearn.metrics import roc_auc_score

"""Predict on the validation set and compute the ROC metrics"""
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the single LightGBM model on the validation set before tuning: {}'.format(roc_auc))

"""Plot the ROC curve"""
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, 'b', label='Val AUC = %0.4f' % roc_auc)
plt.ylim(0, 1)
plt.xlim(0, 1)
plt.legend(loc='best')
plt.title('Validation ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal reference line
plt.show()

"""5-fold cross validation with LightGBM for modeling and prediction"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]

    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    model = lgb.train(params, train_set=train_matrix, num_boost_round=20000,
                      valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)

    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
4.4 model parameter tuning
4.4.1 greedy tuning
First, tune the parameter that has the greatest impact on the model until the model is optimal under the current setting; then tune the parameter with the next-largest impact, and so on, until all parameters have been tuned.
Disadvantage: it may end up at a local rather than a global optimum, but it only optimizes one parameter at a time, so it is easy to understand.
Note the tuning order for tree models, i.e., how much each parameter affects the model. Commonly tuned parameters and a typical tuning order are:
①: max_depth, num_leaves
②: min_data_in_leaf, min_child_weight
③: bagging_fraction, feature_fraction, bagging_freq
④: reg_lambda, reg_alpha
⑤: min_split_gain
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

# Candidate values to search (example candidates; adjust to the task)
objective = ['binary']
num_leaves = [20, 31, 50, 80]
max_depth = [-1, 5, 10, 20]

# Tune objective
best_obj = dict()
for obj in objective:
    model = LGBMClassifier(objective=obj)
    # Cross-validated AUC for this candidate
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    best_obj[obj] = score

# Tune num_leaves (keep the best objective found so far; higher AUC is better)
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMClassifier(objective=max(best_obj.items(), key=lambda x: x[1])[0],
                           num_leaves=leaves)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    best_leaves[leaves] = score

# Tune max_depth (keep the best objective and num_leaves found so far)
best_depth = dict()
for depth in max_depth:
    model = LGBMClassifier(objective=max(best_obj.items(), key=lambda x: x[1])[0],
                           num_leaves=max(best_leaves.items(), key=lambda x: x[1])[0],
                           max_depth=depth)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    best_depth[depth] = score
The model parameters can be tuned in this way, and the score under each candidate value can be inspected directly.
4.4.2 grid search
sklearn provides GridSearchCV for grid search: you only need to pass in the model and the parameter grid, and it returns the optimal result and parameters. Compared with greedy tuning, grid search usually gives better results, but it is only suitable for small data sets; once the data grows, it is hard to get results.
Take the LightGBM algorithm as an example of grid search tuning:
"""The optimal parameters are determined by grid search""" from sklearn.model_selection import GridSearchCV def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None): # Set 50% off cross validation cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, ) model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate, n_estimators=n_estimators, num_leaves=num_leaves, max_depth=max_depth, bagging_fraction=bagging_fraction, feature_fraction=feature_fraction, bagging_freq=bagging_freq, min_data_in_leaf=min_data_in_leaf, min_child_weight=min_child_weight, min_split_gain=min_split_gain, reg_lambda=reg_lambda, reg_alpha=reg_alpha, n_jobs= 8 ) grid_search = GridSearchCV(estimator=model_lgb, cv=cv_fold, param_grid=param_grid, scoring='roc_auc' ) grid_search.fit(X_train, y_train) print('The current optimal parameters of the model are:{}'.format(grid_search.best_params_)) print('The current optimal score of the model is:{}'.format(grid_search.best_score_))
4.4.3 Bayesian tuning
The main idea of Bayesian tuning: given an objective function to optimize (a general function, where only the inputs and outputs are specified and the internal structure and mathematical properties are unknown), the posterior distribution of the objective function is updated by continuously adding sample points (a Gaussian process, until the posterior distribution essentially fits the true distribution). In short, it takes the information from previous evaluations into account when choosing the next parameters to try.
The steps of Bayesian tuning are as follows:
- Define the optimization function (rf_cv_lgb in the code below)
- Build the model
- Define the parameters to be optimized
- Obtain the optimization result and return the score metric to be optimized
from sklearn.model_selection import cross_val_score

"""Define the optimization function"""
def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq,
              min_data_in_leaf, min_child_weight, min_split_gain, reg_lambda, reg_alpha):
    # Build the model
    model_lgb = lgb.LGBMClassifier(boosting_type='gbdt',
                                   objective='binary',
                                   metric='auc',
                                   learning_rate=0.1,
                                   n_estimators=5000,
                                   num_leaves=int(num_leaves),
                                   max_depth=int(max_depth),
                                   bagging_fraction=round(bagging_fraction, 2),
                                   feature_fraction=round(feature_fraction, 2),
                                   bagging_freq=int(bagging_freq),
                                   min_data_in_leaf=int(min_data_in_leaf),
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs=8)
    val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring='roc_auc').mean()
    return val

from bayes_opt import BayesianOptimization

"""Define the parameter ranges to optimize"""
bayes_lgb = BayesianOptimization(
    rf_cv_lgb,
    {
        'num_leaves': (10, 200),
        'max_depth': (3, 20),
        'bagging_fraction': (0.5, 1.0),
        'feature_fraction': (0.5, 1.0),
        'bagging_freq': (0, 100),
        'min_data_in_leaf': (10, 100),
        'min_child_weight': (0, 10),
        'min_split_gain': (0.0, 1.0),
        'reg_alpha': (0.0, 10),
        'reg_lambda': (0.0, 10),
    }
)
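To actually run the search and read out the best parameters, something like the following can be used (the numbers of initial points and iterations are illustrative):

bayes_lgb.maximize(init_points=5, n_iter=10)   # run the Bayesian optimisation
print(bayes_lgb.max)                           # best AUC found and the corresponding parameters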
4.4.4 build the final model with the tuned parameters and evaluate it on the validation set
import lightgbm as lgb

"""5-fold cross validation with LightGBM for modeling and prediction"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split = X_train.iloc[train_index], y_train.iloc[train_index]
    X_val, y_val = X_train.iloc[valid_index], y_train.iloc[valid_index]

    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.01,
        'num_leaves': 14,
        'max_depth': 19,
        'min_data_in_leaf': 37,
        'min_child_weight': 1.6,
        'bagging_fraction': 0.98,
        'feature_fraction': 0.69,
        'bagging_freq': 96,
        'reg_lambda': 9,
        'reg_alpha': 7,
        'min_split_gain': 0.4,
        'nthread': 8,
        'seed': 2020,
        'silent': True,
    }

    model = lgb.train(params, train_set=train_matrix, num_boost_round=14269,
                      valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)

    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))

...
lgb_scotrainre_list:[0.7329726464187137, 0.7294292852806246, 0.7341505801564857, 0.7328331383185244, 0.7317405262608612]
lgb_score_mean:0.732225235287042
lgb_score_std:0.0015929470575114753

Through the 5-fold cross validation we find that the model stops at around 13000 iterations, so we set the maximum number of iterations directly when building the final model and use the validation set for prediction.

"""Final parameters after tuning"""
base_params_lgb = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 14,
    'max_depth': 19,
    'min_data_in_leaf': 37,
    'min_child_weight': 1.6,
    'bagging_fraction': 0.98,
    'feature_fraction': 0.69,
    'bagging_freq': 96,
    'reg_lambda': 9,
    'reg_alpha': 7,
    'min_split_gain': 0.4,
    'nthread': 8,
    'seed': 2020,
    'silent': True,
}

"""Train the final model on the training split"""
final_model_lgb = lgb.train(base_params_lgb, train_set=train_matrix, valid_sets=valid_matrix,
                            num_boost_round=13000, verbose_eval=1000, early_stopping_rounds=200)

"""Predict on the validation set and compute the ROC metrics"""
val_pre_lgb = final_model_lgb.predict(X_val)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the single LightGBM model on the validation set after tuning: {}'.format(roc_auc))

"""Plot the ROC curve"""
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, 'b', label='Val AUC = %0.4f' % roc_auc)
plt.ylim(0, 1)
plt.xlim(0, 1)
plt.legend(loc='best')
plt.title('Validation ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal reference line
plt.show()
4.4.5 summary of model parameter tuning
- The cv function built into the ensemble libraries can quickly tune a single parameter; it is generally used to determine the number of boosting iterations of a tree model (see the lgb.cv sketch after this list)
- When the amount of data is large (as in this project), grid search is particularly slow and is not recommended
- Some parameter names differ between the native ensemble libraries and their sklearn wrappers; please refer to the official APIs of XGBoost and LightGBM for details.
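As a sketch of the first point, LightGBM's built-in lgb.cv can be used to pick the number of boosting rounds; this follows the same older LightGBM API style as the training code above, and train_matrix and base_params_lgb are assumed from Section 4.4.4.

cv_result = lgb.cv(base_params_lgb, train_matrix,
                   num_boost_round=20000, nfold=5, stratified=True,
                   early_stopping_rounds=200, seed=2020)
print('best number of rounds:', len(cv_result['auc-mean']))
print('best CV AUC:', max(cv_result['auc-mean']))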
5, Summary
This section mainly covered modeling and parameter tuning. First, during modeling, the performance of the model was evaluated and validated by splitting the data set and using cross-validation, and the model's ROC curve was plotted.
Finally, the model parameters were tuned. Three tuning methods were introduced: greedy tuning, grid search, and Bayesian tuning. Bayesian tuning was mainly used for a simple optimization of this project; in practice you can follow these tuning ideas and need not stick to the specific examples in this tutorial.