[machine learning] How to use random grid search to shorten grid search time?

Learning notes on random grid search (RandomizedSearchCV), covering:

  1. Basic principle of random grid search
  2. sklearn application of random grid search (case: house price dataset, Python)
  3. Application of continuous distributions in random grid search (case: house price dataset, Python)
  • Legend

    🔣 Functions and parameters

    🔑 Formula

    🗣 Case

    📌 Term definition

    📖 Notes

  • 1. Basic principle of random grid search

    📖 Factors affecting the speed of enumeration grid search

    1. Size of the parameter space: the larger the parameter space, the more models must be fitted

    2. Size of the data: the larger the data volume, the more computing power and time each model fit requires
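To make the first factor concrete, here is a minimal sketch that counts how many model fits an enumeration grid search would need. It uses the two parameter lists from the schematic case below; the 5-fold cross validation is an assumption made only for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the schematic case in this section
param_grid = {
    "n_estimators": list(range(50, 350, 50)),  # 6 candidate values
    "max_depth": list(range(2, 7)),            # 5 candidate values
}
n_folds = 5  # assumed number of cross-validation folds

# Every combination is fitted once per fold
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)  # 30 150
```

Adding one more parameter with k candidate values multiplies both numbers by k, which is why the search time grows so quickly with the size of the space.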

    🗣 Case: global parameter space VS partial parameter space (schematic diagram)

    import pandas as pd
    import matplotlib.pyplot as plt
    
    n_e_list=range(50,350,50)
    m_d_list=range(2,7)
    
    # Cartesian product of n_e_list and m_d_list: the global parameter space
    comb=pd.DataFrame([(n_estimators, max_depth)
                       for n_estimators in n_e_list
                       for max_depth in m_d_list])
    
    fig,[ax1,ax2]=plt.subplots(1,2,dpi=100)
    # Left: grid search evaluates every point
    ax1.scatter(comb.iloc[:,0],comb.iloc[:,1])
    ax1.set_title('GridSearch')
    
    # Right: random search evaluates only a sampled subset (red points)
    ax2.scatter(comb.iloc[:,0],comb.iloc[:,1])
    ax2.scatter([50,250,200,200,300,100,150,150],[4,2,6,3,2,3,2,5],c='red',s=50)
    ax2.set_title('RandomSearch')
    plt.show()
    

📌 Random grid search
 A method that randomly samples a subspace of the parameter space and searches only within that subspace.

Advantages over enumeration grid search:

- Runs faster
- Covers a larger space
- Its minimum loss is close to the minimum loss found by enumeration grid search



📖 Sampling characteristics of random grid search

Random grid search works by loop iteration.

In each iteration, one group of parameters is randomly drawn and used for modeling; the next iteration draws another group. In sklearn, when every parameter is given as a list, this sampling is done without replacement, so the same group of parameters is never drawn twice. (If any parameter is given as a distribution, sampling is done with replacement.)

The number of iterations controls the overall size of the sampled parameter subspace. This practice is often described as "giving random grid search a fixed computation budget: when the budget is used up, the search stops".

In practice, random grid search does not first sample the whole subspace and then search it; parameters are drawn and modeled one group at a time.
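The sampling behaviour can be inspected directly with sklearn's `ParameterSampler`. The sketch below is only an illustration: the parameter lists are borrowed from the schematic case, and `n_iter=8` and the seed are arbitrary choices. It shows which combinations a budget of 8 iterations would draw and that none of them repeats.

```python
from sklearn.model_selection import ParameterSampler

# Global parameter space: 6 x 5 = 30 combinations, all given as lists
param_space = {
    "n_estimators": list(range(50, 350, 50)),
    "max_depth": list(range(2, 7)),
}

# n_iter is the budget: how many parameter groups are drawn
sampler = ParameterSampler(param_space, n_iter=8, random_state=0)
sampled = [(p["n_estimators"], p["max_depth"]) for p in sampler]

for combo in sampled:
    print(combo)

# Because every parameter is a list, sampling is without replacement
print(len(sampled), len(set(sampled)))
```

Raising `n_iter` enlarges the sampled subspace; setting it to 30 here would cover the whole grid.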
  • 2. Implementation of random grid search

    🔣 Random grid search in sklearn

    from sklearn.model_selection import RandomizedSearchCV
    
    RandomizedSearchCV(
        estimator, # Estimator to tune
        param_distributions, # Global parameter space
        *,
        n_iter=10, # Number of iterations
        scoring=None, # Evaluation metric
        n_jobs=None, # Number of parallel jobs
        refit=True, # Whether to refit the best parameters on the complete data set
        cv=None, # Cross validation mode
        verbose=0, # Verbosity of the work log
        pre_dispatch='2*n_jobs', # Number of jobs dispatched during parallel execution
        random_state=None, # Random number seed
        error_score=np.nan, # Score assigned when a fit raises an error. With 'raise', the error is raised and the search is interrupted; with a numeric value, a warning is shown and the search continues
        return_train_score=False, # Whether to also report training-set scores
    )
    
    | Name | Description |
    |---|---|
    | estimator | The object being tuned, an estimator |
    | param_distributions | The global parameter space; can be a dictionary or a list of dictionaries |
    | n_iter | The number of iterations. The more iterations, the larger the sampled parameter subspace |
    | scoring | Evaluation metric; several metrics can be evaluated at once |
    | n_jobs | Number of jobs run in parallel during the search |
    | refit | Whether to retrain on the complete data set with the best parameters; with several metrics, names the metric used to choose them |
    | cv | Number of cross validation folds, or a cross validation splitter |
    | verbose | Verbosity of the work log |
    | pre_dispatch | Number of jobs dispatched during parallel execution |
    | random_state | Random number seed |
    | error_score | Score assigned when a fit raises an error. With 'raise', the error is raised and the search is interrupted; with a numeric value, a warning is shown and the search continues |
    | return_train_score | Whether to also report training-set scores from cross validation |
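As a quick illustration of how these parameters fit together, here is a minimal sketch on synthetic data. The data set from `make_regression`, the tiny parameter lists, and the `_demo` names are assumptions made only to keep the example small and self-contained; they are not the house price case.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data, used only for this illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [10, 20, 30],
                         "max_depth": [2, 4, 6]},   # 9 combinations in total
    n_iter=4,                                       # evaluate only 4 of them
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
    return_train_score=True,                        # adds train scores to cv_results_
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)                      # best sampled combination
print(len(search_demo.cv_results_["params"]))        # one entry per iteration
print("mean_train_score" in search_demo.cv_results_)
```

With `refit=True` (the default), `search_demo.best_estimator_` is already retrained on all of `X_demo`, so it can be used for prediction directly.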

    🗣 Case: random grid search on a random forest (house price data set)

    📖 With the same parameter space and model, random grid search is faster than ordinary grid search.

    Random search running time ≈ (n_iter / number of combinations in the global space) × grid search running time
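A worked instance of this estimate, using the parameter space and `n_iter` of the case that follows. The 60-minute grid search time is a made-up figure for illustration, and `estimate_random_search_time` is a hypothetical helper, not part of sklearn.

```python
def estimate_random_search_time(n_iter, n_combinations, grid_search_time):
    """Rough estimate: random search costs the sampled fraction of a full grid search."""
    return n_iter / n_combinations * grid_search_time

# Space of the case below: 10 n_estimators x 8 max_depth x 5 max_features x 3 min_impurity_decrease
space_size = 10 * 8 * 5 * 3
print(space_size)  # 1200

# Assumed: the full grid search would take 60 minutes
print(estimate_random_search_time(600, space_size, grid_search_time=60.0))  # 30.0
```

The estimate ignores that different parameter combinations take different amounts of time to fit, so it should be read as an order of magnitude only.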

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor as RFR
    from sklearn.model_selection import KFold
    
    param_grid_simple = {'n_estimators': range(50,150,10)
                         , 'max_depth': range(10,25,2)
                         # "auto" was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
                         , "max_features": ["sqrt",16,32,64,1.0]
                         , "min_impurity_decrease": np.arange(0,5,2)
                        }
    
    # Calculate the size of the parameter space
    def count_space(param):
        no_option = 1
        for i in param:
            no_option *= len(param[i])
        print(no_option)
        
    count_space(param_grid_simple) # 10 * 8 * 5 * 3 = 1200
    
    # Training model
    model = RFR(random_state=7,verbose=True,n_jobs=4)
    cv = KFold(n_splits=5,shuffle=True,random_state=7)
    search = RandomizedSearchCV(estimator=model
                                ,param_distributions=param_grid_simple
                                ,n_iter = 600 #The size of the subspace is about half of the global space
                                ,scoring = "neg_mean_squared_error"
                                ,verbose = True
                                ,cv = cv
                                ,random_state=1412
                                ,n_jobs=-1
                               )
    
    # X, y: features and target of the house price data set, assumed to be loaded beforehand
    search.fit(X,y)
    
    search.best_estimator_ # View model parameter results
    # RandomForestRegressor(max_depth=18, max_features=16, min_impurity_decrease=0,
    #                       n_jobs=4, random_state=7, verbose=True)
    
    abs(search.best_score_)**0.5 # View model RMSE score
    # 29160.978459432965
    
    # View the model effect of the optimal parameters
    from sklearn.model_selection import cross_validate
    ad_reg=RFR(max_depth=18
               , max_features=16
               , min_impurity_decrease=0
               , random_state=7
               , n_jobs=-1)
    
    def RMSE(cvresult,key):
        return (abs(cvresult[key])**0.5).mean()
    
    def rebuild_on_best_param(ad_reg):
        cv = KFold(n_splits=5,shuffle=True,random_state=7)
        result_post_adjusted = cross_validate(ad_reg,X,y
                                              ,cv=cv
                                              ,scoring="neg_mean_squared_error"
                                              ,return_train_score=True
                                              ,verbose=True
                                              ,n_jobs=-1)
        print("train RMSE:{:.3f}".format(RMSE(result_post_adjusted,"train_score")))
        print("test RMSE:{:.3f}".format(RMSE(result_post_adjusted,"test_score")))
    
    rebuild_on_best_param(ad_reg)
    # Training RMSE:10760.565
    # Test RMSE:28265.808
    
  • 3. Continuous parameter space

    📖 A continuous distribution may bring better values
    Grid search: can only evaluate the points formed by combining the listed parameter values;
    Random search: accepts a distribution as input.

    As shown in the figure above, if the lowest point of the loss function lies between two sets of grid parameters, enumeration grid search cannot find the minimum. Random grid search, however, draws its parameter points at random from a distribution, so it is more likely to land on better values in the same parameter space.
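A toy illustration of this point, under invented assumptions: the loss function, the parameter range [0, 5], and the budget of 36 evaluations are all made up for the sketch, and the loss deliberately depends on only one of the two parameters so that the effect is easy to see.

```python
import numpy as np

def loss(x, y):
    # Toy loss: only x matters, and the minimum (x = 2.5) lies between grid points
    return (x - 2.5) ** 2

# Grid search: 6 x 6 = 36 evaluations on integer points
grid_values = np.arange(0, 6)
grid_best = min(loss(x, y) for x in grid_values for y in grid_values)

# Random search: 36 evaluations drawn from a continuous uniform distribution
rng = np.random.default_rng(0)
xs = rng.uniform(0, 5, size=36)
ys = rng.uniform(0, 5, size=36)
random_best = loss(xs, ys).min()

print(grid_best)    # 0.25: the grid can never get closer than x = 2 or x = 3
print(random_best)  # best loss among the random draws
```

The grid is stuck at a loss of 0.25 however often it is rerun, while random draws can fall anywhere between the grid points. This is a sketch of the idea, not a guarantee: on a real loss surface a random search can also do worse than a well-placed grid.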

    📖 When the parameter space contains a distribution, the size of the global parameter space can no longer be counted, because a continuous distribution has infinitely many possible values.
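To see what drawing from a distribution looks like, the sketch below samples a few parameter groups from a space that mixes a list with `scipy.stats.uniform`. The two parameters are taken from the case in this section; `n_iter=5` and the seed are arbitrary. Note that `uniform(loc, scale)` covers the interval [loc, loc + scale], so `uniform(0, 50)` draws values between 0 and 50.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

space = {
    "max_depth": list(range(10, 25, 2)),            # discrete list
    "min_impurity_decrease": stats.uniform(0, 50),  # continuous: loc=0, scale=50
}

draws = list(ParameterSampler(space, n_iter=5, random_state=0))
values = [d["min_impurity_decrease"] for d in draws]

for d in draws:
    print(d)
```

Each draw yields a different real number for `min_impurity_decrease`, which is why the number of possible combinations cannot be listed the way it can for a pure grid.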

    🗣 Case: searching min_impurity_decrease over a continuous distribution

    📖 Effect of using a continuous distribution in random search
    Compared with grid search over the same search space, it runs faster, and the RMSE from both the search and the rebuilt cross validation is slightly better;
    Compared with small space grid search, the running time is longer and RMSE is slightly better;
    Compared with large space grid search, the running time is longer and RMSE is slightly worse (the model effect is not necessarily worse).

    Effect: continuous random grid > large space random grid > random grid > grid search
    Running time: grid search > continuous random grid > large space random grid > random grid

    When the global parameter space used in enumeration grid search is large enough and dense enough, the optimum found by enumeration grid search is the upper limit of what random grid search can find. Therefore, in theory, random grid search over that same space will not get better results than enumeration grid search.

    import scipy.stats
    
    param_grid_simple={'n_estimators':range(50,150,10)
                       ,'max_depth':range(10,25,2)
                       ,'max_features':range(10,20,2)
                       # uniform(loc, scale): continuous values drawn from [0, 50]
                       ,'min_impurity_decrease':scipy.stats.uniform(0,50)}
    
    model=RFR(random_state=7)
    cv=KFold(n_splits=5,shuffle=True,random_state=7)
    
    search=RandomizedSearchCV(estimator=model
                              ,param_distributions=param_grid_simple
                              ,n_iter=600
                              ,scoring='neg_mean_squared_error'
                              ,cv=cv
                              ,random_state=7
                              ,n_jobs=4)
    
    search.fit(X,y)
    
    search.best_estimator_
    # RandomForestRegressor(max_depth=18, max_features=16,
    #                       min_impurity_decrease=34.80143424780533, random_state=7)
    
    abs(search.best_score_)**0.5
    # 29155.5402993104
    
    rebuild_on_best_param(search.best_estimator_)
    # Training RMSE:10733.842
    # Test RMSE:28285.986
    

Keywords: Python Machine Learning

Added by Fed51 on Tue, 01 Mar 2022 14:05:02 +0200