# [Machine learning] How to use random grid search to shorten grid search time

Learning notes on random grid search (RandomizedSearchCV), covering:

1. Basic principle of random grid search
2. sklearn implementation of random grid search (case: house price dataset, Python)
3. Continuous distributions in random grid search (case: house price dataset, Python)
• Legend

🔣 Functions and parameters

🔑 Formula

🗣 Case

📌 Term definition

📖 Excerpt

• 1. Basic principle of random grid search

📖 Factors affecting the speed of exhaustive (enumeration) grid search

1. Size of the parameter space: the larger the parameter space, the more models have to be fitted.

2. Size of the data: the larger the dataset, the more computing power and time each fit requires.
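
As a rough illustration of how the two factors combine, the sketch below counts how many model fits an exhaustive search needs. The ranges are the ones plotted in the schematic case in this section; the 5-fold cross-validation is an assumption made only for this example.

```python
import math

# Candidate values per parameter (same ranges as the schematic case in this section)
param_space = {
    "n_estimators": range(50, 350, 50),  # 6 values
    "max_depth": range(2, 7),            # 5 values
}
n_folds = 5  # assumed number of cross-validation folds

# Exhaustive grid search fits every combination once per fold
n_combinations = math.prod(len(values) for values in param_space.values())
total_fits = n_combinations * n_folds

print(n_combinations)  # 30
print(total_fits)      # 150
```

Adding one more parameter with four candidate values would multiply both numbers by four, which is why the parameter space size dominates the cost.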

🗣 Case: global parameter space vs. partial parameter space (schematic diagram)

```python
import matplotlib.pyplot as plt
import pandas as pd

n_e_list = range(50, 350, 50)
m_d_list = range(2, 7)

# Cartesian product of n_e_list and m_d_list: the global parameter space
comb = pd.DataFrame([(n_estimators, max_depth)
                     for n_estimators in n_e_list
                     for max_depth in m_d_list])

fig, [ax1, ax2] = plt.subplots(1, 2, dpi=100)
# Left: grid search visits every point of the space
ax1.scatter(comb.iloc[:, 0], comb.iloc[:, 1])
ax1.set_title('GridSearch')

# Right: random search visits only a sampled subset (red points)
ax2.scatter(comb.iloc[:, 0], comb.iloc[:, 1])
ax2.scatter([50, 250, 200, 200, 300, 100, 150, 150], [4, 2, 6, 3, 2, 3, 2, 5], c='red', s=50)
ax2.set_title('RandomSearch')
plt.show()
```

📌 Random grid search
A method that randomly samples a parameter subspace from the global space and searches only within that subspace.

- Fast to run
- Covers a large space
- Its minimum loss is close to the minimum loss of exhaustive grid search

📖 Sampling characteristics of random grid search

Random grid search works by iterating in a loop.

In each iteration, one combination of parameters is randomly selected and a model is fitted with it; the next iteration selects another combination. When every parameter is given as a list of discrete values, this sampling is done without replacement, so the same combination is never sampled twice.

The number of iterations controls the overall size of the sampled parameter subspace. This is often described as "giving random grid search a fixed computation budget: when the budget is used up, the search stops".

Conceptually, random grid search does not first sample out the whole subspace and then search it; sampling and modeling alternate, one combination per iteration.
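
The sampling scheme described above can be sketched with the standard library alone. This is an illustration of the idea (a fixed budget of `n_iter` combinations drawn without replacement), not scikit-learn's implementation; `evaluate` is a hypothetical placeholder for fitting and scoring a model, and the seed is arbitrary.

```python
import itertools
import random

n_e_list = range(50, 350, 50)
m_d_list = range(2, 7)

# Global parameter space: every (n_estimators, max_depth) combination
global_space = list(itertools.product(n_e_list, m_d_list))

def evaluate(n_estimators, max_depth):
    """Hypothetical stand-in for fitting a model and returning its loss."""
    return abs(n_estimators - 200) / 50 + abs(max_depth - 4)

n_iter = 8  # fixed computation budget
rng = random.Random(1412)

# random.sample draws without replacement, so no combination repeats
sampled = rng.sample(global_space, n_iter)

# Keep the sampled combination with the lowest loss
best = min(sampled, key=lambda params: evaluate(*params))
print(len(global_space), len(sampled))
print(best)
```

Only `n_iter` of the 30 combinations are evaluated, so the cost is set by the budget rather than by the size of the global space.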
• 2. Implementation of random grid search

🔣 Random grid search in sklearn

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Constructor signature (for reference, not meant to be run as-is)
RandomizedSearchCV(
    estimator,                 # the estimator (model) to tune
    param_distributions,       # global parameter space
    *,
    n_iter=10,                 # number of iterations (sampled parameter combinations)
    scoring=None,              # evaluation metric
    n_jobs=None,               # number of parallel jobs
    refit=True,                # whether to refit the best parameters on the full dataset
    cv=None,                   # cross-validation strategy
    verbose=0,                 # verbosity of the log output
    random_state=None,         # random seed
    error_score=np.nan,        # score assigned when a fit fails; 'raise' raises the error and stops the search
    return_train_score=False,  # whether to also report training-set scores
)
```
| Name | Description |
|---|---|
| param_distributions | The global parameter space; a dictionary or a list of dictionaries |
| n_iter | The number of iterations. The more iterations, the larger the sampled parameter subspace |
| scoring | Evaluation metric; several metrics can be reported at the same time |
| n_jobs | Number of jobs to run in parallel |
| refit | Which evaluation metric to use for picking the best parameters, which are then refitted on the complete dataset |
| cv | Number of cross-validation folds (or a cross-validation splitter) |
| verbose | Verbosity of the log output |
| random_state | Random seed |
| error_score | The score assigned when a fit raises an error. With 'raise', the error is raised immediately and the search is interrupted. Otherwise a warning is shown and the search continues |
| return_train_score | Whether to also report the training-set scores in cross validation |
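
To show how these parameters fit together, here is a minimal sketch on synthetic data. The dataset from `make_regression`, the small parameter space, and the names `X_demo`, `y_demo`, and `demo_search` are assumptions made for this example only; they are not the house price case from these notes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in data (an assumption; the notes use a house price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Small global parameter space: 4 * 4 = 16 combinations
param_space = {
    "n_estimators": range(10, 50, 10),
    "max_depth": range(2, 6),
}

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_space,
    n_iter=5,                           # evaluate only 5 of the 16 combinations
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)                    # best sampled combination
print(abs(demo_search.best_score_) ** 0.5)         # cross-validated RMSE
print(len(demo_search.cv_results_["params"]))      # number of combinations tried
```

Because `scoring` is the negated mean squared error, `best_score_` is negative; taking the absolute value and the square root turns it into an RMSE.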

🔣 Case: random grid search on a random forest (house price dataset)

📖 With the same parameter space and model, random grid search runs faster than ordinary grid search.

Running time ≈ (n_iter / number of combinations in the global space) × running time of grid search
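
A worked instance of this estimate, using the parameter space and `n_iter` from the case in this section. The 60-minute grid search time is a made-up figure used only to show the arithmetic.

```python
# Number of candidate values per parameter in the case's parameter space
n_options = {
    "n_estimators": len(range(50, 150, 10)),       # 10
    "max_depth": len(range(10, 25, 2)),            # 8
    "max_features": 5,                             # "sqrt", 16, 32, 64, "auto"
    "min_impurity_decrease": len(range(0, 5, 2)),  # 3 (0, 2, 4)
}

# Size of the global space is the product of the candidate counts
n_combinations = 1
for n in n_options.values():
    n_combinations *= n

n_iter = 600
fraction = n_iter / n_combinations

grid_search_minutes = 60.0  # hypothetical running time of the full grid search
estimated_minutes = fraction * grid_search_minutes

print(n_combinations)     # 1200
print(fraction)           # 0.5
print(estimated_minutes)  # 30.0
```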

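```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_validate

# X, y: features and target of the house price dataset, assumed to be loaded beforehand

param_grid_simple = {'n_estimators': range(50, 150, 10)
                     , 'max_depth': range(10, 25, 2)
                     # "auto" is only accepted by older scikit-learn releases (removed in 1.3)
                     , "max_features": ["sqrt", 16, 32, 64, "auto"]
                     , "min_impurity_decrease": np.arange(0, 5, 2)
                     }

# Calculate the size of the parameter space
def count_space(param):
    no_option = 1
    for i in param:
        no_option *= len(param[i])
    print(no_option)

count_space(param_grid_simple)  # 10 * 8 * 5 * 3 = 1200

# Train the model
model = RFR(random_state=7, verbose=True, n_jobs=4)
cv = KFold(n_splits=5, shuffle=True, random_state=7)
search = RandomizedSearchCV(estimator=model
                            , param_distributions=param_grid_simple
                            , n_iter=600  # the subspace is about half of the global space
                            , scoring="neg_mean_squared_error"
                            , verbose=True
                            , cv=cv
                            , random_state=1412
                            , n_jobs=-1
                            )

search.fit(X, y)

search.best_estimator_  # View the best model and its parameters
# RandomForestRegressor(max_depth=18, max_features=16, min_impurity_decrease=0,
#                       n_jobs=4, random_state=7, verbose=True)

abs(search.best_score_)**0.5  # View the RMSE of the best model
# 29160.978459432965

# View the model effect of the optimal parameters.
# NOTE: the lines that created the model and called cross_validate were lost in the
# source. The helper below is a reconstruction from the surviving fragments; the names
# ad_reg and result are assumptions. n_estimators is left at its default because
# best_estimator_ above does not print it.
def RMSE(cvresult, key):
    return (abs(cvresult[key])**0.5).mean()

def rebuild_on_best_param(ad_reg):
    cv = KFold(n_splits=5, shuffle=True, random_state=7)
    result = cross_validate(ad_reg, X, y
                            , cv=cv
                            , scoring="neg_mean_squared_error"
                            , return_train_score=True
                            , verbose=True
                            , n_jobs=-1)
    print("Training RMSE:{:.3f}".format(RMSE(result, "train_score")))
    print("Test RMSE:{:.3f}".format(RMSE(result, "test_score")))

ad_reg = RFR(max_depth=18
             , max_features=16
             , min_impurity_decrease=0
             , random_state=7
             , n_jobs=-1)
rebuild_on_best_param(ad_reg)
# Training RMSE:10760.565
# Test RMSE:28265.808
```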
• 3. Continuous parameter space

📖 A continuous space may yield better values
Grid search: can only evaluate the discrete points formed by combining the listed parameter values;
Random search: accepts a distribution as input.

As shown in the figure above, if the lowest point of the loss function lies between two grid points, exhaustive grid search can never find that minimum. Random grid search, however, draws its parameter values at random from a distribution over the interval, so within the same parameter range it is more likely to land on a better value.
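
A toy one-dimensional sketch of this argument follows. The quadratic `loss`, its minimum at 2.3, the grid points, and the 200 random draws are all assumptions chosen for illustration; no model is fitted here.

```python
import random

def loss(x):
    """Toy loss whose minimum (x = 2.3) lies between two grid points."""
    return (x - 2.3) ** 2

# Grid search can only evaluate the listed points
grid_points = [0, 1, 2, 3, 4, 5]
best_grid = min(grid_points, key=loss)

# Random search draws candidates from a continuous uniform distribution on [0, 5]
rng = random.Random(0)
random_points = [rng.uniform(0, 5) for _ in range(200)]
best_random = min(random_points, key=loss)

print(best_grid, loss(best_grid))
print(best_random, loss(best_random))
```

The grid can get no closer than x = 2, while the random draws are free to fall anywhere in the interval, including closer to the true minimum.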

📖 When the parameter space contains a distribution, the size of the global parameter space cannot be computed.
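
A short sketch of what a distribution in the parameter space looks like in practice, assuming `scipy.stats.uniform` for the continuous parameter and scikit-learn's `ParameterSampler` to draw candidates. The two-parameter `space` below is a reduced example, not the full space of the case.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

space = {
    "max_depth": range(10, 25, 2),                        # discrete: 8 candidates
    "min_impurity_decrease": scipy.stats.uniform(0, 50),  # continuous on [0, 50]
}

# scipy.stats.uniform(loc, scale) covers the interval [loc, loc + scale]
draws = scipy.stats.uniform(0, 50).rvs(size=1000, random_state=0)
print(draws.min(), draws.max())

# With a distribution in the space there is no finite list of combinations
# to count; the sampler simply draws n_iter candidates
candidates = list(ParameterSampler(space, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

Note that the second argument of `uniform` is the width of the interval, not its upper bound; the two coincide here only because `loc` is 0.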

🗣 Case: searching min_impurity_decrease over a continuous distribution

📖 Effect of using a continuous distribution in random search
Compared with grid search on the same search space, it runs faster, and the RMSE from both the search and the rebuilt cross validation is slightly better;
Compared with small-space grid search, it takes longer to run and the RMSE is slightly better;
Compared with large-space grid search, it takes longer to run and the RMSE is slightly worse (the model effect is not necessarily worse).

Effect: continuous random grid > large-space random grid > random grid > grid search
Running time: grid search > continuous random grid > large-space random grid > random grid

When the global parameter space used by exhaustive grid search is large enough / dense enough, the optimum found by exhaustive grid search is the upper limit of what random grid search can find. In theory, therefore, random grid search on that space will not produce better results than exhaustive grid search.

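```python
import scipy.stats
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import KFold, RandomizedSearchCV

# X, y: features and target of the house price dataset, assumed to be loaded beforehand

param_grid_simple = {'n_estimators': range(50, 150, 10)
                     , 'max_depth': range(10, 25, 2)
                     , 'max_features': range(10, 20, 2)
                     # uniform(loc, scale): continuous uniform distribution on [0, 50]
                     , 'min_impurity_decrease': scipy.stats.uniform(0, 50)}

model = RFR(random_state=7)
cv = KFold(n_splits=5, shuffle=True, random_state=7)

search = RandomizedSearchCV(estimator=model
                            , param_distributions=param_grid_simple
                            , n_iter=600
                            , scoring='neg_mean_squared_error'
                            , cv=cv
                            , random_state=7
                            , n_jobs=4)

search.fit(X, y)

search.best_estimator_  # View the best model and its parameters
# RandomForestRegressor(max_depth=18, max_features=16,
#                       min_impurity_decrease=34.80143424780533, random_state=7)

abs(search.best_score_)**0.5  # View the RMSE of the best model
# 29155.5402993104

# rebuild_on_best_param cross-validates the given model and prints its training/test RMSE
rebuild_on_best_param(search.best_estimator_)
# Training RMSE:10733.842
# Test RMSE:28285.986
```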

Keywords: Python Machine Learning

Added by Fed51 on Tue, 01 Mar 2022 14:05:02 +0200