Random grid search (RandomizedSearchCV) learning notes, including:
- Basic principle of random grid search
- sklearn application of random grid search (case: house price dataset, Python)
- Application of continuous distributions in random grid search (case: house price dataset, Python)
-
Indexes
🔣 Functions and parameters
🔑 formula
🗣 case
📌 Noun interpretation
📖 Extract
-
1. Basic principle of random grid search
📖 Factors affecting the speed of enumeration grid search
1. Size of the parameter space: the larger the parameter space, the more models must be fit
2. Size of the data: the larger the dataset, the more computing power and time each fit requires
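The two factors above multiply: every parameter combination is fit once per cross-validation fold, and each fit gets slower as the data grows. A minimal sketch of the first factor, assuming a hypothetical helper `count_fits` and an illustrative parameter space (neither comes from sklearn):

```python
from math import prod

def count_fits(param_grid, n_splits):
    """Number of model fits an enumeration grid search needs.

    Every combination in the grid is fit once per cross-validation fold.
    """
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations * n_splits

# Illustrative space: 6 values of n_estimators x 5 values of max_depth
param_grid = {"n_estimators": range(50, 350, 50), "max_depth": range(2, 7)}

print(count_fits(param_grid, n_splits=5))  # 6 * 5 combinations * 5 folds = 150
```

Adding one more parameter with 4 candidate values would multiply the count by 4, which is why enumeration quickly becomes expensive.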
🗣 Case: global parameter space VS partial parameter space (schematic diagram)
import pandas as pd
import matplotlib.pyplot as plt

n_e_list = range(50, 350, 50)
m_d_list = range(2, 7)
# Cartesian product of n_e_list and m_d_list
comb = pd.DataFrame([(n_estimators, max_depth)
                     for n_estimators in n_e_list
                     for max_depth in m_d_list])

fig, [ax1, ax2] = plt.subplots(1, 2, dpi=100)
# Left: grid search evaluates every point of the global space
ax1.scatter(comb.iloc[:, 0], comb.iloc[:, 1])
ax1.set_title('GridSearch')
# Right: random search evaluates only a sampled subset (red points)
ax2.scatter(comb.iloc[:, 0], comb.iloc[:, 1])
ax2.scatter([50, 250, 200, 200, 300, 100, 150, 150],
            [4, 2, 6, 3, 2, 3, 2, 5], c='red', s=50)
ax2.set_title('RandomSearch')
plt.show()
📌 Random grid search
A method that randomly samples a subspace of the global parameter space and searches only within that subspace. Advantages over enumeration grid search:
- Faster to run
- Covers a larger space
- Its minimum loss is close to the minimum loss of enumeration grid search
📖 Sampling characteristics of random grid search
Random grid search works by iteration: in each iteration, one set of parameters is drawn at random and a model is fit with it, and the next iteration draws another set. Because this sampling is without replacement, the same set of parameters is never drawn twice (in sklearn this holds when every parameter is given as a list of values; if any parameter is a distribution, sampling is with replacement).
The number of iterations controls the overall size of the sampled parameter subspace. This is often described as "giving random grid search a fixed computation budget: when the budget is used up, the search stops".
In practice, random grid search does not first draw the whole subspace and then search it; it draws and fits one parameter combination per iteration.
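The sampling behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not how sklearn implements it: the generator name `iter_random_params` is hypothetical, and the space must contain only lists or ranges.

```python
import random
from math import prod

def iter_random_params(space, n_iter, seed=0):
    """Yield up to n_iter distinct parameter combinations, one per iteration.

    Each draw picks a random index into the Cartesian product and decodes it
    into one combination, so the full subspace is never built in advance.
    """
    names = list(space)
    values = [list(space[name]) for name in names]
    total = prod(len(v) for v in values)
    rng = random.Random(seed)
    seen = set()
    budget = min(n_iter, total)  # the fixed amount of computation
    while len(seen) < budget:
        idx = rng.randrange(total)
        if idx in seen:
            continue  # already used: draw again (sampling without replacement)
        seen.add(idx)
        combo, rest = {}, idx
        for name, vals in zip(names, values):
            rest, pos = divmod(rest, len(vals))  # mixed-radix decoding
            combo[name] = vals[pos]
        yield combo

space = {"n_estimators": range(50, 350, 50), "max_depth": range(2, 7)}
for params in iter_random_params(space, n_iter=8, seed=1):
    print(params)  # a model would be fit with these parameters here
```

The loop stops once 8 distinct combinations out of the 30 possible ones have been produced, which corresponds to the "budget is used up" stopping rule.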
-
2. Implementation of random grid search
🔣 Random grid search in sklearn
from sklearn.model_selection import RandomizedSearchCV

RandomizedSearchCV(
    estimator,                 # Estimator to tune
    param_distributions,       # Global parameter space
    *,
    n_iter=10,                 # Number of iterations
    scoring=None,              # Evaluation metric
    n_jobs=None,
    refit=True,                # Whether to refit on the full dataset with the best parameters
    cv=None,                   # Cross-validation scheme
    verbose=0,
    pre_dispatch='2*n_jobs',   # Number of jobs dispatched during parallel execution
    random_state=None,
    error_score=nan,           # Score assigned when a fit fails.
                               # 'raise': raise the error and stop the search.
                               # Otherwise: show a warning and continue.
    return_train_score=False,  # Whether to also report training-set scores
)
| Name | Description |
| --- | --- |
| estimator | The object to tune: an estimator |
| param_distributions | The global parameter space; a dictionary or a list of dictionaries |
| n_iter | Number of iterations; the more iterations, the larger the sampled parameter subspace |
| scoring | Evaluation metric; supports outputting multiple metrics at once |
| n_jobs | Number of threads used in the computation |
| refit | Which evaluation metric and best parameters to use for retraining on the complete dataset |
| cv | Number of cross-validation folds |
| verbose | Format of the log output |
| pre_dispatch | Number of jobs dispatched during parallel execution |
| random_state | Random seed |
| error_score | Score returned when a fit reports an error. With 'raise', the error is raised directly and training is interrupted; otherwise training continues after a warning is displayed |
| return_train_score | Whether to also report training-set scores in cross-validation |

🗣 Case: random grid search on a random forest (house price dataset)
📖 With the same parameter space and model, random grid search runs faster than ordinary grid search.
Random search running time ≈ (n_iter / number of combinations in the global space) × grid search running time
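As a quick worked example of this estimate: the parameter space used in this case has 10 × 8 × 5 × 3 = 1200 combinations, and n_iter is 600. The helper name `estimate_random_search_time` and the 60-minute grid search time are assumptions made up for illustration, not measurements from these notes.

```python
def estimate_random_search_time(n_iter, n_combinations, grid_search_time):
    """Rough running time of random search, in the same unit as grid_search_time."""
    return n_iter / n_combinations * grid_search_time

n_combinations = 10 * 8 * 5 * 3   # sizes of the four parameter lists -> 1200
grid_search_minutes = 60.0        # hypothetical time for the full grid search

print(estimate_random_search_time(600, n_combinations, grid_search_minutes))  # 30.0
```

The estimate only scales the number of fits; it ignores fixed overhead such as the final refit, so treat it as an approximation.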
import numpy as np
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_validate

# X, y: features and target of the house price dataset, loaded beforehand

param_grid_simple = {
    'n_estimators': range(50, 150, 10),
    'max_depth': range(10, 25, 2),
    # 1.0 = all features; replaces "auto", which was removed in scikit-learn 1.3
    'max_features': ["sqrt", 16, 32, 64, 1.0],
    'min_impurity_decrease': np.arange(0, 5, 2),
}

# Calculate the size of the parameter space
def count_space(param):
    no_option = 1
    for i in param:
        no_option *= len(param[i])
    print(no_option)

count_space(param_grid_simple)  # 1200

# Train the model
model = RFR(random_state=7, verbose=True, n_jobs=4)
cv = KFold(n_splits=5, shuffle=True, random_state=7)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=param_grid_simple,
                            n_iter=600,  # Subspace is about half of the global space
                            scoring="neg_mean_squared_error",
                            verbose=True,
                            cv=cv,
                            random_state=1412,
                            n_jobs=-1)
search.fit(X, y)

search.best_estimator_  # View the best model's parameters
# RandomForestRegressor(max_depth=18, max_features=16, min_impurity_decrease=0,
#                       n_jobs=4, random_state=7, verbose=True)

abs(search.best_score_) ** 0.5  # View the model's RMSE
# 29160.978459432965

# Check the model effect of the optimal parameters
ad_reg = RFR(max_depth=18,
             max_features=16,
             min_impurity_decrease=0,
             random_state=7,
             n_jobs=-1)

def RMSE(cvresult, key):
    return (abs(cvresult[key]) ** 0.5).mean()

def rebuild_on_best_param(ad_reg):
    cv = KFold(n_splits=5, shuffle=True, random_state=7)
    result_post_adjusted = cross_validate(ad_reg, X, y,
                                          cv=cv,
                                          scoring="neg_mean_squared_error",
                                          return_train_score=True,
                                          verbose=True,
                                          n_jobs=-1)
    print("train RMSE:{:.3f}".format(RMSE(result_post_adjusted, "train_score")))
    print("test RMSE:{:.3f}".format(RMSE(result_post_adjusted, "test_score")))

rebuild_on_best_param(ad_reg)
# train RMSE:10760.565
# test RMSE:28265.808
-
3. Continuous parameter space
📖 A continuous parameter space may yield better values
Grid search: can only evaluate the points formed by combining the listed parameter values;
Random search: accepts distributions as input.
As shown in the figure above, for grid search, if the lowest point of the loss function lies between two sets of parameters, enumeration grid search cannot find that minimum. Random grid search, however, draws its parameter points at random from a distribution over an interval, so it is more likely to find better values in the same parameter space.
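A toy illustration of this point, using a made-up one-dimensional loss whose minimum sits between two grid points. The loss function, the grid values, and the sampling interval are all assumptions chosen for the example; they are not taken from the house price case.

```python
import random

def loss(x):
    """Hypothetical 1-D loss whose minimum (x = 3.3) lies between grid points."""
    return (x - 3.3) ** 2

# Grid search can only evaluate the listed values
grid_points = [0, 2, 4]
grid_best = min(grid_points, key=loss)

# Random search draws from a continuous uniform distribution on [0, 5]
rng = random.Random(0)
random_points = [rng.uniform(0, 5) for _ in range(50)]
random_best = min(random_points, key=loss)

print("grid best x:", grid_best, "loss:", round(loss(grid_best), 4))
print("random best x:", round(random_best, 4), "loss:", round(loss(random_best), 4))
```

However many times the grid is re-run, its best loss stays at the value of its nearest listed point (x = 4 here), while continuous sampling can land anywhere in the interval. The random search uses more draws than the grid has points in this sketch, so it shows what each method can reach rather than a like-for-like cost comparison.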
📖 When the parameter space contains a distribution, the size of the global parameter space cannot be estimated.
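A minimal sketch of why the size becomes uncountable, assuming a hypothetical `Uniform` class as a stand-in for a continuous distribution (it is not the scipy object) and hypothetical helpers `count_space` and `sample_params`:

```python
import random

class Uniform:
    """Hypothetical stand-in for a continuous uniform distribution on [low, high]."""
    def __init__(self, low, high):
        self.low, self.high = low, high

    def draw(self, rng):
        return rng.uniform(self.low, self.high)

def count_space(space):
    """Return the number of combinations, or None if any entry is continuous."""
    size = 1
    for spec in space.values():
        if isinstance(spec, Uniform):
            return None  # infinitely many possible values: nothing to count
        size *= len(spec)
    return size

def sample_params(space, rng):
    """Draw one parameter combination from a mixed discrete/continuous space."""
    combo = {}
    for name, spec in space.items():
        if isinstance(spec, Uniform):
            combo[name] = spec.draw(rng)
        else:
            combo[name] = rng.choice(list(spec))
    return combo

discrete = {"n_estimators": range(50, 150, 10), "max_depth": range(10, 25, 2)}
mixed = dict(discrete, min_impurity_decrease=Uniform(0, 50))

print(count_space(discrete))  # 80
print(count_space(mixed))     # None
print(sample_params(mixed, random.Random(7)))
```

With a continuous entry, n_iter can no longer be chosen as a fraction of the global space; it is simply a budget of draws.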
🗣 Case: min_impurity_decrease for continuous distribution search
📖 Effect of using a continuous distribution in random search
Compared with grid search on the same search space, it runs faster, and the RMSE from both the search and the rebuilt cross-validation is slightly better;
Compared with small space grid search, the running time is longer and the RMSE is slightly better;
Compared with large space grid search, the running time is longer and the RMSE is slightly worse (the model effect is not necessarily worse).
Effect: continuous random grid > large space random grid > random grid > grid search
Running time (longest first): grid search > continuous random grid > large space random grid > random grid
When the global parameter space used in enumeration grid search is large enough / dense enough, the optimal solution of enumeration grid search is the upper limit of random grid search. Therefore, in theory, random grid search will not get better results than enumeration grid search.
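The upper-limit argument can be checked on a toy example: when random search samples from the same discrete grid, its candidates are a subset of the grid, so its best loss can never be lower than the grid's best loss. The loss surface below is hypothetical and used only for illustration.

```python
import random
from itertools import product

def loss(n_estimators, max_depth):
    """Hypothetical loss surface, used only for illustration."""
    return (n_estimators - 120) ** 2 / 100 + (max_depth - 4) ** 2

# Enumeration grid search: evaluate every combination
grid = list(product(range(50, 350, 50), range(2, 7)))
grid_best = min(loss(*p) for p in grid)

# Random grid search: evaluate 8 combinations sampled without replacement
rng = random.Random(0)
subset = rng.sample(grid, 8)
random_best = min(loss(*p) for p in subset)

print(grid_best <= random_best)  # True
```

This bound only applies when both searches share the same discrete space; with a continuous distribution, random search can reach points the grid does not contain.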
import scipy.stats

# RFR, KFold, RandomizedSearchCV, X, y and rebuild_on_best_param come from the previous case
param_grid_simple = {'n_estimators': range(50, 150, 10),
                     'max_depth': range(10, 25, 2),
                     'max_features': range(10, 20, 2),
                     # Continuous uniform distribution on [0, 50] (loc=0, scale=50)
                     'min_impurity_decrease': scipy.stats.uniform(0, 50)}

model = RFR(random_state=7)
cv = KFold(n_splits=5, shuffle=True, random_state=7)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=param_grid_simple,
                            n_iter=600,
                            scoring='neg_mean_squared_error',
                            cv=cv,
                            random_state=7,
                            n_jobs=4)
search.fit(X, y)

search.best_estimator_
# RandomForestRegressor(max_depth=18, max_features=16,
#                       min_impurity_decrease=34.80143424780533, random_state=7)

abs(search.best_score_) ** 0.5
# 29155.5402993104

rebuild_on_best_param(search.best_estimator_)
# train RMSE:10733.842
# test RMSE:28285.986