Filter feature selection + random forest modeling + Kaggle Elo Merchant Category Recommendation

Data preprocessing

import pandas as pd
import numpy as np

Data loading

train = pd.read_csv("preprocess/train.csv")
test = pd.read_csv("preprocess/test.csv")

Random forest model prediction

Feature selection – Pearson correlation coefficient

(train.shape, test.shape)
((201917, 1700), (123623, 1699))
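
For reference, the filter criterion computed below is the absolute Pearson correlation between each feature x and the target y; features are ranked by |r|, where

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$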
# Extract feature names (excluding the id and the label)
features = train.columns.tolist()
features.remove('card_id')
features.remove("target")
featureSelect = features

# Absolute Pearson correlation between each feature and the target
corr = []
for fea in featureSelect:
    corr.append(abs(train[[fea,'target']].fillna(0).corr().values[0][1]))
    
# Keep the top 300 features by correlation for modeling; the exact number can be adjusted
se = pd.Series(corr, index=featureSelect).sort_values(ascending=False)
feature_select = ['card_id'] + se[:300].index.tolist()

# Output results
train = train[feature_select + ['target']]
test = test[feature_select]
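
If the reduced datasets are needed again later, they can be written back to disk at this point (the file names below are assumptions, purely for illustration):

# Hypothetical output paths, shown for illustration only
train.to_csv("preprocess/train_rf_top300.csv", index=False)
test.to_csv("preprocess/test_rf_top300.csv", index=False)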

Grid search parameters

from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
Name                      Description
criterion                 Split evaluation criterion (loss function); defaults to the Gini coefficient, with information entropy as an option
splitter                  How the tree grows: by default each split is chosen to reduce the loss function fastest; random splitting under certain conditions is also available
max_depth                 Maximum depth of the tree, similar in spirit to max_iter, i.e. how many rounds of splitting are allowed in total
min_samples_split         Minimum number of samples an internal node must contain before it can be split further
min_samples_leaf          Minimum number of samples required in a leaf node
min_weight_fraction_leaf  Minimum weighted fraction of the sample weights required at a leaf node
max_features              Number of features considered when searching for the best split
random_state              Random number seed
max_leaf_nodes            Maximum number of leaf nodes
min_impurity_decrease     Minimum decrease in impurity required for a node to be split further
min_impurity_split        Minimum impurity required for a node to be split; will be removed in version 0.25
class_weight              Weights assigned to the different classes of samples
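
To make the mapping from this table to code concrete, here is a minimal, untuned sketch of instantiating the regressor with several of these parameters (the values are arbitrary placeholders, not the ones selected for this competition):

from sklearn.ensemble import RandomForestRegressor

# Illustrative values only, not tuned for this competition
rf = RandomForestRegressor(
    n_estimators=100,       # number of trees in the forest
    criterion="mse",        # squared-error split criterion ("mse" in sklearn < 1.0)
    max_depth=10,           # maximum depth of each tree
    min_samples_split=2,    # minimum samples needed to split an internal node
    min_samples_leaf=30,    # minimum samples required in a leaf node
    max_features=80,        # number of features considered at each split
    random_state=22,        # random seed for reproducibility
    n_jobs=-1               # use all available CPU cores
)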

  Next comes the choice of grid search tool. As sklearn keeps evolving there are more and more grid search tools to choose from, but the choice ultimately comes down to a trade-off between efficiency and accuracy. Some tools enumerate the entire parameter space, so they run slowly but guarantee the accuracy of the result (for example GridSearchCV); if you are willing to trade some accuracy for speed there are also plenty of options, such as RandomizedSearchCV. The latest version of sklearn additionally offers a more efficient successive-halving strategy: candidates are evaluated in rounds on a growing budget of resources and the weaker ones are eliminated layer by layer, and it comes in two flavours, HalvingGridSearchCV and HalvingRandomSearchCV. Note that these classes are only available from sklearn 0.24 onward, and their introduction is one of the biggest changes in that release; they further reduce the computing resources required for a grid search and speed it up.
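
As an illustration of the halving strategy (a minimal sketch only; the parameter grid below is an assumption, not the grid used later, and in sklearn 0.24 the class is still experimental and has to be enabled explicitly):

from sklearn.experimental import enable_halving_search_cv  # required in sklearn 0.24
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative grid, not the one tuned for this competition
halving_grid = {"max_depth": [8, 10, 12], "min_samples_leaf": [20, 30, 40]}

halving_search = HalvingGridSearchCV(
    RandomForestRegressor(random_state=22, n_jobs=-1),
    halving_grid,
    factor=3,                 # keep roughly the best third of the candidates each round
    resource="n_samples",     # grow the amount of training data between rounds
    scoring="neg_mean_squared_error",
    cv=2,
    random_state=22
)
# halving_search.fit(train[features].values, train['target'].values)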

  For the data in this competition, in the actual grid search process it is recommended to first use RandomizedSearchCV to determine the approximate range of each parameter, and then use GridSearchCV to search for the specific values with higher precision within that range. Of course, if you are on the latest version of sklearn, you can also consider using the Halving methods for the search. Because of the limited time in the public lecture, we searched over a relatively small parameter space, on the premise that the optimal parameter range had already been roughly determined:
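A sketch of what that coarse first pass with RandomizedSearchCV could look like (the search ranges and n_iter below are illustrative assumptions):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

# Illustrative coarse ranges; in practice they come from experience or earlier experiments
coarse_space = {
    "n_estimators": randint(50, 200),
    "min_samples_leaf": randint(10, 60),
    "max_depth": randint(5, 15),
}

coarse_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=22, n_jobs=-1),
    coarse_space,
    n_iter=20,                          # number of random combinations to evaluate
    scoring="neg_mean_squared_error",
    cv=2,
    random_state=22
)
# coarse_search.fit(train[features].values, train['target'].values)
# coarse_search.best_params_ then narrows the region that GridSearchCV refines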

features = train.columns.tolist()
features.remove("card_id")
features.remove("target")


parameter_space = {
    "n_estimators": [79, 80, 81], 
    "min_samples_leaf": [29, 30, 31],
    "min_samples_split": [2, 3],
    "max_depth": [9, 10],
    "max_features": ["auto", 80]
}
clf = RandomForestRegressor(criterion="mse",
                           n_jobs=64,
                           random_state=22)
# Optional sanity check for missing values: train[features].isnull().sum()
grid = GridSearchCV(clf, parameter_space, cv=2, scoring="neg_mean_squared_error")
grid.fit(train[features].values, train['target'].values)
GridSearchCV(cv=2, estimator=RandomForestRegressor(n_jobs=64, random_state=22),
             param_grid={'max_depth': [9, 10], 'max_features': ['auto', 80],
                         'min_samples_leaf': [29, 30, 31],
                         'min_samples_split': [2, 3],
                         'n_estimators': [79, 80, 81]},
             scoring='neg_mean_squared_error')
grid.best_estimator_
RandomForestRegressor(max_depth=10, max_features=80, min_samples_leaf=31,
                      n_estimators=80, n_jobs=64, random_state=22)
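
With the tuned estimator in hand, the final step is to predict on the test set and assemble a submission file; a minimal sketch, assuming the usual Elo submission format of card_id plus target (the output path is an assumption):

# Predict with the best estimator found by the grid search
best_rf = grid.best_estimator_
test_pred = best_rf.predict(test[features].values)

# Kaggle-style submission: card_id and predicted target; the path is illustrative
submission = pd.DataFrame({
    "card_id": test["card_id"].values,
    "target": test_pred
})
submission.to_csv("result/submission_rf.csv", index=False)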
