Contents of this chapter:
- The principle and process of halving grid search (theory)
- The HalvingGridSearchCV parameters in sklearn
- Case: halving grid search_house price data set_python
1. (Theory) Principle and process of halving grid search
The main problem halving grid search solves
Long running times caused by grid search over a large amount of data.
Principle of halving grid search
By drawing a subset of the data at each round, the amount of data used in each model fit is reduced, which reduces the amount of computation.
Note that for the reduced sample to reflect the full data reliably, the distribution of each small sampled subset must be consistent with the distribution of the full data set.
The halving process
1. First, a small subset d0 is randomly sampled from the full data set, and every parameter combination is validated on d0. Based on the scores on d0, the half of the combinations with the lowest scores is eliminated. (Because d0 is very small, validating all combinations on it is fast.)
2. Then a subset d1 twice the size of d0 is sampled from the full data set, and the surviving half of the combinations is validated on d1. Based on the scores on d1, the lowest-scoring 1/2 of the combinations is eliminated.
3. Then a subset d2 twice the size of d1 is sampled from the full data set, and the surviving 1/4 of the combinations is validated on d2. Based on the scores on d2, the lowest-scoring half is eliminated again.
4. The loop stops when only one candidate combination is left, or when the remaining data is insufficient to keep growing the subset.
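The steps above can be sketched in plain Python. Everything here is illustrative: the candidate list, the scoring function and the doubling rule stand in for a real parameter grid and evaluator.

```python
import random

def successive_halving(candidates, data, score, r0):
    """Toy successive halving: double the sample, halve the candidates.

    `candidates` is a list of parameter dicts, `score(params, subset)` is a
    hypothetical evaluation function, `r0` is the first-round sample size.
    """
    rng = random.Random(0)
    r = r0
    while len(candidates) > 1 and r <= len(data):
        subset = rng.sample(data, r)                      # draw a subset of size r
        ranked = sorted(candidates, key=lambda p: score(p, subset), reverse=True)
        candidates = ranked[: max(1, len(ranked) // 2)]   # keep the better half
        r *= 2                                            # double the sample size
    return candidates[0]
```

For instance, with 8 candidates and r0 = 100, the loop runs 8 → 4 → 2 → 1 candidates while the subset grows 100 → 200 → 400 samples.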
Limitations of halving grid search
- Most parameter combinations are filtered out on the subset drawn at the very start of the search; if that initial subset differs substantially from the full data set, the finally selected parameters may perform poorly.
- To avoid this, the initial subset should not be too small, which in turn requires the overall sample size to be large.
- If the overall sample is small, halving grid search may perform worse than ordinary grid search.
2. The HalvingGridSearchCV parameters in sklearn
HalvingGridSearchCV parameter description
```python
from sklearn.experimental import enable_halving_search_cv  # must be imported before HalvingGridSearchCV!
from sklearn.model_selection import HalvingGridSearchCV

HalvingGridSearchCV(
    estimator,                     # evaluator
    param_grid,                    # parameter space
    *,
    factor=3,                      # growth ratio of the sample size at each iteration
    resource='n_samples',          # type of resource added at each iteration
    max_resources='auto',          # maximum resources per parameter combination in one iteration
    min_resources='exhaust',       # resources per parameter combination at the first iteration
    aggressive_elimination=False,  # whether the stopping condition is using all the data; not by default
    cv=5,                          # number/strategy of cross-validation folds
    scoring=None,                  # evaluation metric
    refit=True,                    # whether to retrain on the full data set
    error_score=nan,               # value returned on error; 'raise' aborts training directly
    return_train_score=True,       # report training-set scores
    random_state=None,
    n_jobs=None,
    verbose=0,
)
```
Name | Description |
---|---|
estimator | The evaluator whose parameters are being tuned |
param_grid | The parameter space; a dict or a list of dicts |
factor | Multiplier of the sample size at each iteration; also determines that only 1/factor of the combinations survive each iteration |
resource | The type of resource added at each iteration |
max_resources | The maximum amount of resource allowed for validating any one parameter combination in a single iteration |
min_resources | The amount of resource r0 used to validate each parameter combination at the first iteration |
aggressive_elimination | Whether to stop only once a single candidate remains, reusing the smallest sample when data runs short, rather than stopping when all samples are used |
cv | Number of cross-validation folds (or a CV splitter) |
scoring | Evaluation metric(s); several metrics can be reported at once |
refit | Whether to retrain on the full data set with the chosen metric and the best parameters |
error_score | Value returned when the grid search raises an error; with 'raise' the error propagates and training is interrupted, otherwise a warning is shown and training continues |
return_train_score | Whether to report training-set scores in cross-validation |
random_state | Controls the randomness of the subsampled data sets |
n_jobs | Number of parallel workers used in the computation |
verbose | Verbosity of the work log |
- factor
Each iteration multiplies the sample size by factor, and keeps only 1/factor of the candidate combinations. For example, with factor=3, the next iteration uses three times as many samples as the previous one, and 1/3 of the combinations survive each round. In practice factor=3 usually works well.
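To make the factor rule concrete, here is a small illustrative computation of the halving schedule (the 27 candidates, 100 starting samples and the total sample counts are made-up numbers):

```python
def schedule(n_candidates, r0, n_samples, factor=3):
    """List (candidates, sample size) for each iteration under the halving rule."""
    rounds = []
    c, r = n_candidates, r0
    while c >= 1 and r <= n_samples:
        rounds.append((c, r))
        if c == 1:
            break
        c = max(1, c // factor)  # only 1/factor of the candidates survive
        r *= factor              # the sample size grows by a factor of `factor`
    return rounds

print(schedule(27, 100, 1000))   # data runs out before one candidate remains
print(schedule(27, 100, 10000))  # enough data to reach a single candidate
```

Note how in the first call the schedule stops at 3 surviving candidates because the next round would need more samples than the data set contains.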
- resource
The type of resource added at each iteration, given as a string. The default is the sample size, "n_samples". It can also be any positive-integer parameter of an ensemble estimator, such as the number of weak learners "n_estimators"; in that case the number of trees grows by a factor of factor at each iteration.
- min_resources
The amount of resource r0 used to validate each parameter combination at the first iteration. Accepts a positive integer or one of the strings "smallest" and "exhaust".
  - A positive integer n means n samples are used in the first iteration.
  - "smallest" computes r0 by rule:
    when the resource type is the sample size,
    for regression, r0 = n_splits (number of CV folds) * 2;
    for classification, r0 = n_classes_ (number of classes) * n_splits * 2;
    when the resource type is not the sample size, r0 = 1.
  - "exhaust" back-calculates r0 from the maximum resources available in the last iteration. For example, with factor=2, a sample size of 1000 and 3 iterations in total, the last iteration can use at most 1000 samples, the second-to-last 500, and the first 250, so r0 = 250. "exhaust" mode is the most likely to find good parameters, but the amount of computation and the run time are slightly larger.
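Both rules can be checked with a few lines of arithmetic. The fold count, class count and factor below are arbitrary example values, and the helper names are mine, not sklearn's:

```python
def r0_smallest(n_splits, n_classes=None):
    """r0 under the "smallest" rule when the resource is the sample size."""
    if n_classes is None:            # regression
        return n_splits * 2
    return n_classes * n_splits * 2  # classification

def r0_exhaust(n_samples, factor, n_iterations):
    """r0 under the "exhaust" rule: divide the last-round budget back down."""
    return n_samples // factor ** (n_iterations - 1)

print(r0_smallest(5))          # regression with 5 folds
print(r0_smallest(5, 3))       # classification with 3 classes, 5 folds
print(r0_exhaust(1000, 2, 3))  # the example above: 1000 -> 500 -> 250
```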
How is the condition for stopping the iteration determined?
- min_resources * factor^i > n_samples: the i solving this is the number of iterations the data volume can support;
- n_combinations // factor^i < 1: the i solving this is the number of iterations the parameter grid can support.
Of these two values of i, the smaller one determines the stopping condition.
For example, if i1 < i2, the stopping condition is "all data has been used". In this case the number of iterations does not suffice to run through the full set of parameter combinations, i.e. the iteration stops before the potentially optimal combination has had a chance to be tested.
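The two values of i can be computed directly. This is a paraphrase of the rule stated above, not sklearn's internal code; the sample count 30000, grid size 1440 and min_resources 500 are illustration values:

```python
import math

def stopping_iterations(n_samples, n_combinations, min_resources, factor):
    """Return (i_data, i_grid): iterations supported by the data and by the grid."""
    # largest i with min_resources * factor**i <= n_samples
    i_data = int(math.floor(math.log(n_samples / min_resources, factor)))
    # largest i with n_combinations // factor**i >= 1
    i_grid = int(math.floor(math.log(n_combinations, factor)))
    return i_data, i_grid

i_data, i_grid = stopping_iterations(30000, 1440, 500, 3)
print(i_data, i_grid)        # the smaller of the two sets the stopping condition
print(min(i_data, i_grid))
```

Here the data supports fewer iterations than the grid needs, so the search would stop because the data is used up.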
The aggressive_elimination parameter
- This parameter addresses the case where the amount of data is too small to support running through the full set of parameter combinations.
- When set to True, the first-iteration sample size is reused for extra elimination rounds until the remaining data suffices to keep growing the sample size, continuing until only the last candidate combination is left.
- When set to False, using up all the samples is the signal that the search ends.
Limitation: the first-iteration sample differs most from the full data set, so the more often it is reused, the larger the possible bias in the results.
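A rough way to see what True changes (my approximation of the halving bookkeeping, not sklearn's exact implementation): count how many resource-growing rounds the data budget allows versus how many elimination rounds the grid needs; the shortfall is covered by extra rounds that reuse min_resources.

```python
import math

def halving_rounds(n_candidates, min_resources, max_resources, factor, aggressive):
    """Approximate (total elimination rounds, rounds reusing min_resources)."""
    # rounds the resource budget allows (resources multiplied by factor each round)
    possible = 1 + int(math.floor(math.log(max_resources / min_resources, factor)))
    # rounds needed to whittle the candidates down to one
    required = 1 + int(math.floor(math.log(n_candidates, factor)))
    if aggressive:
        extra = max(0, required - possible)  # extra rounds stuck at min_resources
        return required, extra
    return min(possible, required), 0

print(halving_rounds(20, 250, 1000, 2, False))  # stops when the data is used up
print(halving_rounds(20, 250, 1000, 2, True))   # runs all rounds, reusing r0
```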
Points to weigh between the number of iterations and the amount of computation:
- min_resources should not be too small: we want as much of the data as possible to be used before the iteration ends;
- the number of combinations still to be validated after the last iteration should not be too large: below 10 is best, and below 30 is still acceptable;
- the number of iterations should not be too large, or the run time may become too long.
How to determine the first sample size:
First, print the number of samples and the number of parameter combinations for each iteration:
```python
factor = 1.5
n_samples = X.shape[0]
min_resources = 500
space = 1440  # total number of parameter combinations

for i in range(100):
    if (min_resources * factor**i > n_samples) or (space // factor**i < 1):
        break
    print(i + 1,
          "samples this iteration: {}".format(min_resources * factor**i),
          "combinations validated this iteration: {}".format(space // factor**i + 1))
```
- i: the total number of iterations performed when the loop stops
- samples per iteration: how many samples are used in the computation; the closer the final value is to the full data size, the better
- parameter combinations: how many combinations remain in the parameter space; the fewer at the end, the better
- Within these constraints, the larger factor and min_resources are, the better.
Case 3: halving grid search_house price data set_python
Halving grid search VS random grid search
- Running time: the halving search is shorter
- Model performance: random grid search is slightly better
Case: halving grid search_house price data set_python
```python
import pandas as pd

# Import the data set
data = pd.read_csv(r'C:\Users\EDZ\test\ML-2 courseware\Lesson 9.Stochastic forest model\datasets\House Price\big_train.csv',
                   index_col=0)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
X.head()

# Define the parameter space
param_grid_simple = {'n_estimators': range(60, 120, 5),
                     'max_depth': range(10, 25, 2),
                     'max_features': ['sqrt', 16, 32],
                     'min_impurity_decrease': range(5, 30, 5)}

# Compute the size of the parameter space
def count_space(param_grid_simple):
    no_option = 1
    for i in param_grid_simple:
        no_option *= len(param_grid_simple[i])
    return no_option

count_space(param_grid_simple)
```
```python
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import KFold, cross_validate
from sklearn.experimental import enable_halving_search_cv  # required before the next import
from sklearn.model_selection import HalvingGridSearchCV

model = RFR(random_state=7)
cv = KFold(n_splits=5, shuffle=True, random_state=7)

search = HalvingGridSearchCV(estimator=model,
                             param_grid=param_grid_simple,
                             factor=1.5,
                             min_resources=500,
                             scoring='neg_mean_squared_error',
                             n_jobs=4,
                             random_state=7,
                             cv=cv)
search.fit(X, y)

search.best_estimator_          # view the optimal parameter combination
abs(search.best_score_) ** 0.5  # view the RMSE

# Evaluate the model built with the optimal parameters
def RMSE(cvresult, key):
    return (abs(cvresult[key]) ** 0.5).mean()

def rebuild_on_best_param(ad_reg):
    cv = KFold(n_splits=5, shuffle=True, random_state=7)
    res_post_adjusted = cross_validate(ad_reg, X, y,
                                       cv=cv,
                                       scoring='neg_mean_squared_error',
                                       return_train_score=True,
                                       n_jobs=-1)
    print('train RMSE: {:.3f}'.format(RMSE(res_post_adjusted, 'train_score')))
    print('test RMSE: {:.3f}'.format(RMSE(res_post_adjusted, 'test_score')))

rebuild_on_best_param(search.best_estimator_)
# train RMSE: 460.432
# test RMSE: 1107.865
```