GBDT+LR for CTR Estimation: A Kaggle Example [with Dataset]
I recently read an article on implementing a recommendation system with GBDT+LR and wanted to try it myself, but none of the articles on this method included a dataset. So I worked through the idea from scratch and used the dataset from a past Kaggle competition for the implementation.
1 Background
CTR estimation is a very common problem in industry. Click-Through Rate (CTR) measures whether an item pushed to a user will be clicked. Pushing the items that are most likely to be clicked will undoubtedly improve profitability.
In the early days of CTR prediction, the most widely used method was logistic regression (LR). LR applies the sigmoid transform to map a linear score into the interval (0, 1), and the mapped value is the estimated CTR.
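As a reminder (not spelled out in the original text), the LR estimate of the click probability for a feature vector $x$ has the familiar form:

$$\hat{p}(\text{click} \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

where $w$ and $b$ are the learned weights and bias.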
LR is a linear model. It is easy to parallelize and can easily handle hundreds of millions of samples, but its learning ability is limited, so a large amount of feature engineering is needed to strengthen the model. Extensive feature engineering, however, is time-consuming and labor-intensive, and does not necessarily improve results. How to automatically find effective features and feature combinations, make up for the limits of manual experience, and shorten the LR feature-experiment cycle is therefore an urgent problem.
Facebook's 2014 paper introduced how GBDT can solve LR's feature-combination problem, and a Kaggle competition later put this idea into practice. The combination of GBDT and LR then began to attract the industry's attention.
Before introducing this model, let's address two questions:
1) Why use an ensemble of decision trees instead of a single tree: a single tree has very weak expressive power and cannot represent multiple distinctive feature combinations; multiple trees are more expressive and can better discover effective features and feature combinations.
2) Why use GBDT instead of RF: RF is also an ensemble of trees, but in practice its effect has proven worse than GBDT's. The splits in the earlier trees of GBDT mainly capture features that are discriminative for the majority of samples, while the later trees mainly capture features for the minority of samples that still have large residuals after the first N trees. Prioritizing features that are discriminative overall, and only then picking up features that discriminate the remaining few samples, is the more reasonable order, and this should also be why GBDT is used.
2 GBDT + LR
The main integration process of the two methods is shown in the figure below
GBDT + LR
For an input sample with feature vector x, two trees are constructed by the GBDT method; the first tree has three leaf nodes and the second has two.
Suppose x falls into the first of the three leaves of the first tree, and into the second of the two leaves of the second tree. We now construct a new feature. The first tree has three leaves, so construct a vector U1 = [0, 0, 0]; since sample x falls into the first leaf, set the first position of U1 to 1 and leave the others at 0, i.e. U1 = [1, 0, 0]. The second tree has two leaves, so construct U2 = [0, 0]; since sample x falls into the second leaf, set the second position to 1, i.e. U2 = [0, 1]. Concatenating U1 and U2 gives [1, 0, 0, 0, 1]. This is the transformed feature vector.
Next, an LR model is trained on the transformed feature vectors.
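A minimal sketch of this transformation, using the made-up leaf indices from the example above (this is just an illustration, not the author's code; the real implementation appears in section 3.3):

```python
import numpy as np

# leaf counts of the two example trees, and the leaves sample x falls into
num_leaves = [3, 2]   # first tree has 3 leaves, second has 2
leaf_index = [0, 1]   # x falls into leaf 0 of tree 1 and leaf 1 of tree 2

# build one one-hot vector per tree and concatenate them
parts = []
for n, idx in zip(num_leaves, leaf_index):
    u = np.zeros(n, dtype=int)
    u[idx] = 1
    parts.append(u)

transformed = np.concatenate(parts)
print(transformed)  # [1 0 0 0 1]
```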
3 Practice
The lightgbm package is used here. It is easy to install: open the conda command line and run conda install lightgbm.
All packages to be imported are as follows
```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
```
3.1 Dataset
The dataset used here is from Kaggle: avazu-ctr-prediction. The downloaded training set is about 1.2 GB and expands to roughly 6 GB after decompression. Because my machine did not have enough memory, a MemoryError occurred during later training, so I cut out 10,000 rows (download link: small dataset, extraction code: 6gr9). For the implementation, the first 8,000 rows are used as the training set and the last 2,000 as the test set.
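If you prefer to start from the full train.csv yourself, one way to avoid the MemoryError (my own suggestion, not what the original author describes) is to let pandas read only the first rows:

```python
import pandas as pd

# read only the first 10,000 rows of the full Avazu training file
df_small = pd.read_csv(r"D:\Kaggle_CTR\train.csv", nrows=10000)
df_small.to_csv(r"D:\Kaggle_CTR\train_small.csv", index=False)
```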
Dataset fields:
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC
- C1: anonymized categorical variable
- banner_pos
- site_id, site_domain, site_category
- app_id, app_domain, app_category
- device_id, device_ip, device_model, device_type, device_conn_type
- C14-C21: anonymized categorical variables
Because this article is only a demo, we do not carry out feature engineering, but only select some attributes
cols = ['C1', 'banner_pos', 'site_domain', .....'app_id', 'C14','C15', 'C16']
- Read dataset
```python
df_train = pd.read_csv(r"D:\Kaggle_CTR\train.csv")
cols = ['C1', 'banner_pos', 'site_domain', 'site_id', 'site_category', 'app_id',
        'app_category', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16']
cols_all = ['id']
cols_all.extend(cols)
#print(df_train.head(10))
y = df_train['click']
y_train = y.iloc[:-2000]        # training labels
y_test = y.iloc[-2000:]         # testing labels
X = df_train[cols_all[1:]]      # feature columns (all selected columns except 'id')
```
- Columns such as 'site_id' are read by pandas as object dtype; they need to be converted to strings and then encoded as categories with LabelEncoder.
```python
# label encode
lbl = preprocessing.LabelEncoder()
# convert the columns that were flagged as having the wrong data type
X['site_domain'] = lbl.fit_transform(X['site_domain'].astype(str))
X['site_id'] = lbl.fit_transform(X['site_id'].astype(str))
X['site_category'] = lbl.fit_transform(X['site_category'].astype(str))
X['app_id'] = lbl.fit_transform(X['app_id'].astype(str))
X['app_category'] = lbl.fit_transform(X['app_category'].astype(str))

X_train = X.iloc[:-2000]   # training dataset
X_test = X.iloc[-2000:]    # testing dataset
```
3.2 Training LightGBM
To use LightGBM, first convert the data into LightGBM Dataset objects, then specify the parameters and train.
```python
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 64,
    'num_trees': 100,
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# number of leaves, will be used in feature transformation
num_leaf = 64

print('Start training...')
# train
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=lgb_train)
```
The result of 100 rounds of training is [100] training's binary_logloss: 0.408675.
3.3 Transforming feature vectors with LightGBM
- Use the trained LightGBM model to predict on the training set and observe which leaf node each sample falls into.
y_pred_train = gbm.predict(X_train, pred_leaf=True)
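The one-hot transformation further below also uses y_pred_test for the test set. The original post does not show how it is produced, but presumably it comes from the same call applied to X_test:

```python
# leaf indices for the test set (assumed; not shown in the original post)
y_pred_test = gbm.predict(X_test, pred_leaf=True)
```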
Observe the shape of y_pred_train:
In[181]: y_pred_train.shape
Out[181]: (8000, 100)
There are 8000 samples and 100 trees (num_trees = 100 in the parameters above). Observe the first 10 values of y_pred_train[0], i.e. the first sample:
In[182]: y_pred_train[0][:10]
Out[182]: array([31, 29, 29, 32, 38, 46, 35, 36, 36, 42])
The first number, 31, means that this sample falls into the 31st leaf node of the first tree, and 29 means it falls into the 29th leaf node of the second tree. Note that 31 and 29 are leaf indices, which run from 0 to 63.
- Convert the leaf node indices to a one-hot encoding. Each sample becomes a vector of length num_trees * num_leaf, and the flat position for tree i is i * num_leaf + leaf_index.
```python
# train data
transformed_training_matrix = np.zeros([len(y_pred_train), len(y_pred_train[0]) * num_leaf],
                                       dtype=np.int64)  # N * num_trees * num_leaf
for i in range(0, len(y_pred_train)):
    temp = np.arange(len(y_pred_train[0])) * num_leaf + np.array(y_pred_train[i])
    transformed_training_matrix[i][temp] += 1

# test data
transformed_testing_matrix = np.zeros([len(y_pred_test), len(y_pred_test[0]) * num_leaf],
                                      dtype=np.int64)
for i in range(0, len(y_pred_test)):
    temp = np.arange(len(y_pred_test[0])) * num_leaf + np.array(y_pred_test[i])
    transformed_testing_matrix[i][temp] += 1
```
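For example, for the first training sample shown above, leaf 31 in tree 0 maps to flat position 0 * 64 + 31 = 31, leaf 29 in tree 1 maps to 1 * 64 + 29 = 93, and so on, so exactly one position per tree is set to 1 in the 100 * 64 = 6400-dimensional vector.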
3.4 LR
```python
lm = LogisticRegression(penalty='l2', C=0.05)
lm.fit(transformed_training_matrix, y_train)
y_pred_lr_test = lm.predict_proba(transformed_testing_matrix)
```
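predict_proba returns an array of shape (2000, 2) whose columns follow lm.classes_, so with 0/1 labels the second column is the predicted click probability. That is why the evaluation below indexes y_pred_lr_test[:, 1]; pulling it into its own variable (my addition, not in the original) makes this explicit:

```python
# probability of the positive class (click = 1) for each test sample
p_click = y_pred_lr_test[:, 1]
```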
3.5 Result evaluation
The evaluation metric used here is NE (Normalized Cross Entropy).
NE formula (as implemented in the code below):

NE = -\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)

where $p_i$ is the predicted click probability of test sample $i$, $y_i$ is its label, and $N$ is the number of test samples.
```python
NE = (-1) / len(y_pred_lr_test) * sum(((1 + y_test) / 2 * np.log(y_pred_lr_test[:, 1])
                                       + (1 - y_test) / 2 * np.log(1 - y_pred_lr_test[:, 1])))
print("Normalized Cross Entropy " + str(NE))
```
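Two caveats about this implementation, based on my reading of the Facebook paper rather than the original author's code: the paper's formula assumes labels in {-1, +1}, so with the 0/1 labels used here the weights (1±y)/2 no longer select exactly one of the two log terms; and the paper also divides the average cross entropy by the entropy of the average empirical CTR, which is what makes the metric "normalized". A hedged sketch of that fuller version, under the assumption that y_test holds 0/1 labels:

```python
from sklearn.metrics import log_loss

# average cross entropy of the model on the test set
p = y_pred_lr_test[:, 1]
ce = log_loss(y_test, p)

# background entropy of the average empirical CTR
# (assumes both clicks and non-clicks are present in y_test)
avg_ctr = y_test.mean()
background_ce = -(avg_ctr * np.log(avg_ctr) + (1 - avg_ctr) * np.log(1 - avg_ctr))

NE_full = ce / background_ce
print("Full Normalized Cross Entropy:", NE_full)
```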
The NE I get here is 1.27958508822, which is actually a very poor result. The most important factor for good results in the competition is feature engineering, that is, combining the original features and deriving new ones. Feature engineering determines the upper bound of what the algorithm can achieve.
4 Summary
GBDT+LR is now mostly of historical interest and is not really suitable for most current business data. Today's business data is high-dimensional and sparse because of the large number of discrete features, and tree models do not handle such discrete features well, since they easily overfit.
Code download: link Extraction code: 1nh2
This article was written to sort out the ideas and the engineering implementation of GBDT+LR. The main work comes from Shi Xiaowen's learning diary; his blog has many articles on different implementations of recommendation systems.
Reference material: Recommendation System Meets Deep Learning (X) -- GBDT+LR Fusion Scheme in Practice
Author: passerby B yh
Link: https://www.jianshu.com/p/73adce3fd975
Source: Jianshu
The copyright belongs to the author. For commercial reprint, please contact the author for authorization, and for non-commercial reprint, please indicate the source.