# GBDT+LR CTR estimation - Kaggle example

Recently, I read an article on implementing the recommendation system with GBDT+LR and prepared to practice it, but all the articles on this method did not put the data set, so I sorted out my ideas from the beginning and found the data set of Kaggle's last competition for implementation.

# 1 background

CTR estimation is a very common problem in the industry. Click Through Rate refers to whether the goods pushed to a customer will be clicked. The goods pushed to the customer will have the greatest probability of being clicked will undoubtedly improve profitability.
In the early stage of the development of CTR prediction problem, the most used method is logistic regression (LR). LR uses Sigmoid transform to map the function value to the 0 ~ 1 interval, and the mapped function value is the estimated value of CTR.
LR is a linear model, which is easy to parallelize and can easily process hundreds of millions of data, but its learning ability is very limited, and a large number of feature engineering is needed to increase the learning ability of the model. However, a large number of feature projects are time-consuming and labor-consuming, and do not necessarily bring about an improvement in the effect. Therefore, how to automatically find effective features and feature combinations, make up for the lack of manual experience and shorten the LR feature experiment cycle is an urgent problem to be solved.

Facebook's 2014 article introduced how to solve the feature combination problem of LR through GBDT, and then the Kaggle competition also practiced this idea. The integration of GBDT and LR began to attract the attention of the industry.
Before introducing this model, let's introduce two problems:
1) Why use the integrated decision tree model instead of the single tree decision tree model: the expression ability of one tree is very weak, which is not enough to express multiple distinctive feature combinations, and the expression ability of multiple trees is stronger. It can better find effective features and feature combinations
2) Why GBDT is used instead of RF: RF is also a number of trees, but it has been proved that the effect is not as good as GBDT. The feature splitting of the tree in front of GBDT mainly reflects the distinguishing feature of most samples; The later trees mainly reflect a few samples with large residual after passing through the first N trees. It is more reasonable to give priority to the features with discrimination on the whole, and then select the features with discrimination for a few samples, which should also be the reason for using GBDT.

# 2 GBDT + LR

The main integration process of the two methods is shown in the figure below

GBDT + LR

For the input sample, its feature is vector x. two trees are constructed by GBDT method. The first tree has three leaf nodes and the second tree has two leaf nodes.
Suppose x enters the first tree and falls into the first leaf of the three leaf nodes, and enters the second tree and falls into the second leaf of the two leaf nodes. Here we construct a new feature. The first tree has three leaves. Construct a vector U1 = [0, 0, 0], and sample x falls to the first leaf. Change the first position of vector U1 to 1, and the other positions are still 0, that is, U1=[1, 0, 0]. The second tree has two leaves. Construct vector U2 = [0,0], sample x falls to the second leaf, and change the value of the second position to 1, that is, U2 = [1,0]. Next, U1 and U2 are connected in series to become [1, 0, 0, 1, 0]. In this way, the transformed eigenvector is obtained.
Next, the transformed eigenvectors are trained with LR.

# 3 practice

The lightgbm package is used here. It is also convenient to install. Open the conda command line and conda install lightgbm.
All packages to be imported are as follows

```import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
```

## 3.1 data sets

The data set used here is kaggle:avazu-ctr-prediction , the downloaded training set is about 1.2GB. After decompression, it is 6GB. Because the computer memory is not enough, a memoryrerror occurs during the later training, so I cut 1w pieces of data and download the link Small data set Extraction code: 6gr9 when implementing, select the first 8k as the training set and the last 2k as the test set.

Dataset fields:

```id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables
```

```cols = ['C1', 'banner_pos',  'site_domain',  .....'app_id', 'C14','C15', 'C16']
```

```df_train = pd.read_csv(r"D:\Kaggle_CTR\train.csv")
cols = ['C1','banner_pos', 'site_domain',  'site_id', 'site_category','app_id', 'app_category',  'device_type',  'device_conn_type', 'C14', 'C15','C16']

cols_all = ['id']
cols_all.extend(cols)

y = df['click']
y_train = y.iloc[:-2000] # training label
y_test = y.iloc[-2000:]  # testing label

X = df[cols_all[1:]]  # training dataset
```
1. 'site 'read by pandas_ ID 'and so on are object objects, which need to be converted into strings, and then converted into categories with LabelEncoder.

```# label encode
lbl = preprocessing.LabelEncoder()
X['site_domain'] = lbl.fit_transform(X['site_domain'].astype(str))#Convert the prompted column containing the wrong data type
X['site_id'] = lbl.fit_transform(X['site_id'].astype(str))
X['site_category'] = lbl.fit_transform(X['site_category'].astype(str))
X['app_id'] = lbl.fit_transform(X['app_id'].astype(str))
X['app_category'] = lbl.fit_transform(X['app_category'].astype(str))

X_train = X.iloc[:-2000]
X_test = X.iloc[-2000:]  # testing dataset
```

## 3.2 training LGB

When using lgb, first convert the data set into lgb data set, and then train after specifying parameters.

```# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': {'binary_logloss'},
'num_leaves': 64,
'num_trees': 100,
'learning_rate': 0.01,
'feature_fraction': 0.9,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'verbose': 0
}

# number of leaves,will be used in feature transformation
num_leaf = 64

print('Start training...')
# train
gbm = lgb.train(params,
lgb_train,
num_boost_round=100,
valid_sets=lgb_train)
```

The result of 100 rounds of training is [100] training's binary_logloss: 0.408675.

## 3.3 transforming eigenvectors with lgb

1. The trained lgb model is used to predict the training set and observe which leaf nodes it falls on.

```y_pred_train = gbm.predict(X_train, pred_leaf=True)
```

Observation y_ pred_ Shape of train

```In[181]:  y_pred_train.shape
Out[181]: (8000, 100)
```

There are 8000 samples and 100 trees (num_trees = 100 in the above parameters). Observe the first sample y_ pred_ The first 10 values of train [0]:

```In[182]:  y_pred_train[0][:10]
Out[182]: array([31, 29, 29, 32, 38, 46, 35, 36, 36, 42])
```

The first number 31 indicates that the sample has fallen to the 31 leaf node of the first tree, and 29 indicates that it has fallen to the 29 leaf node of the second tree. Note that 31 and 29 indicate the node number, starting from 0 to 63.

1. Convert leaf node number to OneHot code

```# train data
transformed_training_matrix = np.zeros([len(y_pred_train), len(y_pred_train[0]) * num_leaf],dtype=np.int64)  # N * num_tress * num_leafs
for i in range(0, len(y_pred_train)):
temp = np.arange(len(y_pred_train[0])) * num_leaf + np.array(y_pred_train[i])
transformed_training_matrix[i][temp] += 1

# test data
transformed_testing_matrix = np.zeros([len(y_pred_test), len(y_pred_test[0]) * num_leaf], dtype=np.int64)
for i in range(0, len(y_pred_test)):
temp = np.arange(len(y_pred_test[0])) * num_leaf + np.array(y_pred_test[i])
transformed_testing_matrix[i][temp] += 1
```

## 3.4 LR

```lm = LogisticRegression(penalty='l2',C=0.05)
lm.fit(transformed_training_matrix,y_train)
y_pred_lr_test = lm.predict_proba(transformed_testing_matrix)
```

## 3.5 result evaluation

The evaluation index specified in Kaggle is ne (normalized cross enterprise)

NE formula

```NE = (-1) / len(y_pred_lr_test) * sum(((1+y_test)/2 * np.log(y_pred_lr_test[:,1]) +  (1-y_test)/2 * np.log(1 - y_pred_lr_test[:,1])))
print("Normalized Cross Entropy " + str(NE))
```

The NE result I calculated here is 1.27958508822, which is actually a very poor result. The most important thing to achieve good results in the competition is feature engineering, that is, combining original features and calculating new features. Feature Engineering determines the upper bound of the algorithm result.

# 4 Summary

GBDT+LR is only a memory of history and is not really suitable for most current business data. Current business data is high-dimensional discrete data caused by a large number of discrete features. The tree model can not deal with such discrete features well, because it is easy to lead to over fitting.