[machine learning] How to use halving grid search to speed up grid search

Contents of this chapter:

  1. The principle and workflow of halving grid search (theory)
  2. Description of the HalvingGridSearchCV parameters in sklearn
  3. 🀷‍♀️ Case: halving grid search_house price dataset_python

Indexes

  • πŸ”£ Functions and parameters
  • πŸ—£ case
  • 🀷‍♀️ case
  • πŸ“– Extract

1. (Theory) Principle and workflow of halving grid search

πŸ“– What halving grid search mainly solves
Long running times caused by large datasets.

πŸ“– Principle of halving grid search
By drawing only a subset of the data in each round, the amount of data used in each modeling step is reduced, which cuts the amount of computation.
Note that, for the reduced data to reflect the full dataset faithfully, the distribution of each small sampled subset must be consistent with the distribution of the full dataset.

πŸ“– Half grid process

1. First, a small subset d0 is randomly sampled from the full dataset, and all parameter combinations are validated on d0. Based on the validation scores on d0, the worst-scoring half of the parameter combinations is eliminated. (Because the subset is small, validating every combination is fast.)

2. Next, a subset d1 twice the size of d0 is sampled from the full dataset, and the surviving half of the parameter combinations is validated on d1. Based on the scores on d1, the worst-scoring half of the remaining combinations is eliminated.

3. Then a subset d2 twice the size of d1 is sampled from the full dataset, and the surviving quarter of the parameter combinations is validated on d2. Based on the scores on d2, the worst-scoring half is again eliminated.

4. The loop stops when only one candidate parameter combination is left, or when the remaining data is insufficient for another round.
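The elimination loop above can be sketched in a few lines of Python. This is a toy schedule only: the helper name is ours, and sklearn's HalvingGridSearchCV handles this bookkeeping internally.

```python
def halving_search_sketch(n_samples, n_candidates, factor=2, min_resources=100):
    """Simulate the halving schedule: each round multiplies the sample
    budget by `factor` and keeps roughly the top 1/factor candidates."""
    schedule = []
    resources, candidates = min_resources, n_candidates
    while candidates > 1 and resources * factor <= n_samples:
        schedule.append((resources, candidates))
        resources *= factor                       # grow the sample budget
        candidates = max(1, candidates // factor)  # drop the worst half
    schedule.append((resources, candidates))
    return schedule

# e.g. 1000 samples, 16 candidate combinations, doubling each round
print(halving_search_sketch(1000, 16, factor=2, min_resources=125))
# -> [(125, 16), (250, 8), (500, 4), (1000, 2)]
```

Each tuple is (sample budget, surviving candidates) for one round; the loop stops once another doubling would exceed the available data.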

πŸ“– Limitations of halving grid search

  • The subset drawn at the very beginning of the halving search eliminates the most parameter combinations; if that initial subset differs markedly from the full dataset, the finally selected parameters may perform poorly.

  • To avoid this, the initial subset should not be too small, and the full dataset should have a large sample size.

  • If the full dataset is small, halving grid search may perform worse than an ordinary grid search.

2. Description of the HalvingGridSearchCV parameters in sklearn

πŸ”£ HalvingGridSearchCV parameter description

from sklearn.experimental import enable_halving_search_cv
# This module must be imported before HalvingGridSearchCV can be imported!
from sklearn.model_selection import HalvingGridSearchCV

HalvingGridSearchCV(
    estimator, # estimator to tune
    param_grid, # parameter space
    *,
    factor=3, # growth rate of the sample size at each iteration
    resource='n_samples', # type of resource added at each iteration
    max_resources='auto', # maximum resource allowed for any parameter combination
    min_resources='exhaust', # resource used to validate combinations at the first iteration
    aggressive_elimination=False, # whether to replay first-round resources when data runs out early
    cv=5, # number of cross-validation folds / splitter
    scoring=None, # evaluation metric
    refit=True, # whether to retrain on the full dataset with the best parameters
    error_score=nan, # value returned on failure; 'raise' raises the error and interrupts training
    return_train_score=True, # whether to report training-set scores
    random_state=None,
    n_jobs=None,
    verbose=0,
)
  • estimator: the object to tune; an estimator
  • param_grid: the parameter space; a dict or a list of dicts
  • factor: growth rate of the sample size at each iteration; also determines the fraction (1/factor) of parameter combinations kept after each iteration
  • resource: the type of validation resource added at each iteration
  • max_resources: the maximum amount of resource allowed for validating any single parameter combination in one iteration
  • min_resources: the amount of resource r0 used to validate parameter combinations at the first iteration
  • aggressive_elimination: whether to treat exhausting all resources as the stop criterion; if not, remedial measures are taken
  • cv: the number of cross-validation folds (or a splitter)
  • scoring: the evaluation metric(s); several metrics can be reported at once
  • refit: pick an evaluation metric and retrain on the full dataset with the best parameters
  • error_score: the value returned when the grid search hits an error; 'raise' re-raises the error and interrupts training, otherwise a warning is shown and training continues
  • return_train_score: whether to report training-set scores in cross-validation
  • random_state: controls the randomness of subsampling
  • n_jobs: the number of parallel jobs
  • verbose: verbosity of the log output
  • factor

    The growth rate of the sample size at each iteration, which is also the rate at which parameter combinations are cut after each iteration. For example, when factor=3, the next iteration uses three times as many samples as the previous one, and only 1/3 of the parameter combinations survive each round. factor=3 usually works well.

  • resource

    Sets the type of validation resource added at each iteration, as a string. The default is the sample size, "n_samples". It can also be any estimator parameter that accepts a positive integer, such as the number of weak learners "n_estimators" in an ensemble algorithm; the number of trees then grows by a factor of `factor` after each iteration.

  • min_resources

    The amount of resource r0 used to validate parameter combinations at the first iteration. Accepts a positive integer or one of the strings "smallest" and "exhaust".

    1. A positive integer n means n samples are used at the first iteration.

    2. "smallest" computes r0 by rule:

      β˜‘οΈ When the resource type is the sample size:
      for regression, r0 = n_splits * 2 (n_splits is the number of cross-validation folds);
      for classification, r0 = n_classes_ * n_splits * 2;
      when the resource type is not the sample size, r0 = 1.

    3. "exhaust" derives r0 backwards from the maximum resources available at the last iteration.
      For example, with factor=2, 1000 samples and 3 iterations in total, the last iteration can use at most 1000 samples, the second-to-last 500, and the first 250, so r0 = 250. "exhaust" mode is the most likely to give good results, but the computation is somewhat larger and the search somewhat slower.
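The "exhaust" arithmetic above can be reproduced with a tiny helper (the function name is ours; sklearn computes this internally):

```python
def exhaust_r0(max_resources, factor, n_iterations):
    """Work backwards from the last round: each earlier round
    used `factor` times fewer samples."""
    return max_resources // factor ** (n_iterations - 1)

# factor=2, 1000 samples, 3 iterations: 1000 -> 500 -> 250
print(exhaust_r0(1000, 2, 3))  # -> 250
```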

πŸ“– How is the stopping condition determined?

  1. min_resources * factor^i > n_samples: the smallest such i, call it i_1, is the number of iterations the data volume can support
  2. n_candidates // factor^i < 1: the smallest such i, call it i_2, is the number of iterations the number of parameter combinations can support

The smaller of the two determines which condition stops the iteration.

For example, if i_1 < i_2, the stopping condition is "all the data has been used";
in this case the iterations cannot work through the full set of parameter combinations, i.e. the search stops before the potentially optimal combination has had a chance to be tested.

πŸ”£ The aggressive_elimination parameter

  1. This parameter addresses the situation where the dataset is too small to support working through the full set of parameter combinations;
  2. When set to True, the first iteration's sample size is reused until the remaining data is enough to grow the sample size again, and elimination continues until only the last candidate combination is left.
  3. When set to False, using up all samples is the criterion that ends the search.

Limitation: the first-iteration subset differs from the full dataset the most, so the more often it is reused, the larger the possible bias in the results.
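A minimal runnable sketch of aggressive_elimination=True on synthetic data (the dataset, grid, and settings here are illustrative assumptions, not a real case):

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Small synthetic dataset and a 4 x 3 = 12-combination grid (made up for illustration)
X, y = make_regression(n_samples=240, n_features=5, random_state=7)
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 4]}

search = HalvingGridSearchCV(
    DecisionTreeRegressor(random_state=7),
    param_grid,
    factor=2,
    aggressive_elimination=True,  # replay the first-round sample size if data runs short
    cv=3,
    random_state=7,
).fit(X, y)

# One entry per elimination round: candidates left, and samples used
print(search.n_candidates_)
print(search.n_resources_)
```

The fitted search exposes `n_candidates_` and `n_resources_`, which show how many combinations survived each round and how many samples each round used.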

β˜‘οΈ Points to consider when weighing the number of iterations and the amount of calculation:

  1. min_ The value of resources cannot be too small, and we want to use as much data as possible before the end of the whole iteration process

  2. After the iteration, the combination of remaining verification parameters should not be too many. It is best below 10. If it cannot be achieved, it is also acceptable below 30

  3. The number of iterations should not be too many, otherwise the time may be too long

πŸ—£ How to choose the first sample size:
First, print the sample size and the number of parameter combinations for each iteration:

factor = 1.5
n_samples = X.shape[0]  # X is the feature matrix of the training data
min_resources = 500
space = 1440            # total number of parameter combinations

for i in range(100):
    # stop when the data or the parameter space can no longer support another round
    if (min_resources * factor**i > n_samples) or (space // factor**i < 1):
        break
    print(i + 1,
          "samples this iteration: {}".format(min_resources * factor**i),
          "parameter combinations validated this iteration: {}".format(space // factor**i + 1))
  1. i: the total number of iterations performed when the loop stops
  2. Iteration sample size: how many samples are used for computation; the closer to the full data size, the better
  3. Parameter combinations: how many combinations are left in the parameter space; the fewer, the better
  4. All else being equal, a larger factor and a larger min_resources are preferable

🀷‍♀️ Case 3: halving grid search_house price dataset_python

πŸ“– Halving grid search VS random grid search
Running time: halving search is faster
Model quality: random grid search is slightly better
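A minimal sketch of such a timing comparison on synthetic data (the dataset, grid, and settings are illustrative assumptions, not the house-price case; absolute timings vary by machine):

```python
import time
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=600, n_features=8, noise=10, random_state=7)
grid = {"max_depth": [2, 3, 4, 5, 6], "min_samples_leaf": [1, 2, 4, 8]}

def timed_fit(search):
    # Fit the search and return (wall-clock seconds, best CV score)
    start = time.perf_counter()
    search.fit(X, y)
    return time.perf_counter() - start, search.best_score_

halving_time, halving_score = timed_fit(
    HalvingGridSearchCV(DecisionTreeRegressor(random_state=7), grid,
                        factor=2, cv=3, random_state=7))
random_time, random_score = timed_fit(
    RandomizedSearchCV(DecisionTreeRegressor(random_state=7), grid,
                       n_iter=10, cv=3, random_state=7))

print(f"halving grid search: {halving_time:.2f}s, best CV score {halving_score:.3f}")
print(f"random search:       {random_time:.2f}s, best CV score {random_score:.3f}")
```

Which one wins on time depends on the grid size, n_iter, and min_resources; the point is simply that both searches expose the same best_score_/best_params_ interface, so they are easy to compare side by side.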

πŸ”£ Case: halving grid search_house price dataset_python

# Import dataset
import pandas as pd

data=pd.read_csv(r'C:\Users\EDZ\test\ML-2 courseware\Lesson 9.Stochastic forest model\datasets\House Price\big_train.csv',index_col=0)
X=data.iloc[:,:-1]
y=data.iloc[:,-1]
X.head()

# Set parameter space
param_grid_simple={'n_estimators':range(60,120,5)
                   ,'max_depth':range(10,25,2)
                   ,'max_features':['sqrt',16,32]
                   ,'min_impurity_decrease':range(5,30,5)}

# Calculate the size of the parameter space
def count_space(param_grid_simple):
    no_option=1
    for i in param_grid_simple:
        no_option*=len(param_grid_simple[i])
    return no_option
count_space(param_grid_simple) # 12*8*3*5 = 1440 combinations

from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import KFold
from sklearn.experimental import enable_halving_search_cv  # must precede the next import
from sklearn.model_selection import HalvingGridSearchCV

model=RFR(random_state=7)
cv=KFold(n_splits=5,shuffle=True,random_state=7)
search=HalvingGridSearchCV(estimator=model
                           ,param_grid=param_grid_simple
                           ,factor=1.5
                           ,min_resources=500
                           ,scoring='neg_mean_squared_error'
                           ,n_jobs=4
                           ,random_state=7
                           ,cv=cv
                           )
search.fit(X,y)

search.best_estimator_ # View the optimal parameter combination
abs(search.best_score_)**0.5 # View RMSE

# View the model effect of the optimal parameters

from sklearn.model_selection import cross_validate

def RMSE(cvresult,key):
    return (abs(cvresult[key])**0.5).mean()

def rebuild_on_best_param(ad_reg):
    cv=KFold(n_splits=5,shuffle=True,random_state=7)
    res_post_adjusted=cross_validate(ad_reg,X,y
                                     ,cv=cv
                                     ,scoring='neg_mean_squared_error'
                                     ,return_train_score=True
                                     ,n_jobs=-1)
    print('train RMSE: {:.3f}'.format(RMSE(res_post_adjusted,'train_score')))
    print('test RMSE: {:.3f}'.format(RMSE(res_post_adjusted,'test_score')))

rebuild_on_best_param(search.best_estimator_)
# Training RMSE: 460.432
# Test RMSE: 1107.865

Keywords: Machine Learning Data Analysis

Added by asaschool on Thu, 03 Mar 2022 16:38:46 +0200