Machine learning sklearn random forest

Contents

1 ensemble learning

2 random forest classifier

2.1 random forest classifier function and its parameters

2.2 construction of random forest

2.3 comparison of random forest and decision tree under cross validation

2.4 learning curve for n_estimators

3 random forest regressor

3.1 random forest regressor function and its parameters

3.2 filling missing values with random forest regression

4 basic idea of machine learning parameter adjustment

4.1 related concepts

4.2 examples

1 ensemble learning

Ensemble learning completes a learning task by building and combining multiple learners. It is not a single machine learning algorithm; instead it builds several models on the data and integrates their results. It can be used to simulate marketing campaigns, to analyze customer sources, retention and churn, and to predict disease risk and patient susceptibility.

An ensemble algorithm takes the modeling results of multiple estimators and summarizes them into one combined result, aiming for better regression or classification performance than a single model.

The model built from multiple models is called the ensemble estimator, and each model that makes it up is called a base estimator. Generally speaking, there are three kinds of ensemble algorithms: bagging, boosting and stacking.

The core idea of bagging is to build multiple independent estimators and then use averaging or majority voting to decide the result of the ensemble estimator. The representative bagging model is the random forest.

In boosting, the base estimators are related and are built one by one in order. The core idea is to combine the power of weak estimators, predicting the hard-to-evaluate samples again and again, to form a strong estimator. Representative boosting models are AdaBoost and gradient boosting trees.

For an ensemble to work well, the individual learners should be "good but different": each learner needs a certain accuracy, and the learners must also differ from one another.
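The contrast between bagging and boosting can be tried directly in sklearn. The sketch below is only an illustration, not part of the original walkthrough: it assumes the wine dataset used later in this article, and the choice of 25 estimators and 5-fold cross validation is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Bagging: 25 trees, each trained independently on its own bootstrap sample,
# then combined by averaging / voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)

# Boosting: 25 weak learners built one after another, each one paying more
# attention to the samples the previous learners got wrong.
boosting = AdaBoostClassifier(n_estimators=25, random_state=0)

bagging_score = cross_val_score(bagging, X, y, cv=5).mean()
boosting_score = cross_val_score(boosting, X, y, cv=5).mean()
print("Bagging :", bagging_score)
print("Boosting:", boosting_score)
```

Which of the two scores higher depends on the data and the settings, so the numbers here say nothing general about either family.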

2 random forest classifier

2.1 random forest classifier function and its parameters

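Many parameters of the random forest classifier are the same as those of a decision tree. The parameters that control the base estimators are the ones a decision tree also accepts, such as criterion, max_depth, min_samples_split, min_samples_leaf and max_features.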

Other relevant parameters:

n_estimators:

This is the number of trees in the forest, that is, the number of base estimators. Its influence on the accuracy of the random forest is monotonic: the larger n_estimators is, the better the model tends to perform. But every model has a ceiling. Once n_estimators is large enough, the accuracy of the forest usually stops rising or starts to fluctuate, while a larger n_estimators needs more computation and memory and a longer training time. For this parameter we want to strike a balance between training cost and model performance.

random_state:

A random forest is in essence a bagging ensemble: it averages the predictions of its base estimators, or takes a majority vote among them, to decide the result of the ensemble. Suppose we build 25 trees on the wine dataset. Under majority voting, the forest misjudges a sample if and only if 13 or more trees misjudge it. The accuracy of a single decision tree on the wine dataset hovers around 0.85, so assume that each tree misjudges with probability ε = 0.2 and that the trees err independently. The probability that 13 or more trees are wrong is then:

$$e_{\text{forest}} = \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.000369$$

Here i is the number of trees that misjudge, ε is the probability that one tree misjudges, and (1-ε) is the probability that a tree judges correctly, which happens 25-i times.

So the probability that the random forest misjudges is very small, which is why the random forest performs much better on the wine dataset than a single decision tree.
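The probability above can be checked numerically. This is a minimal sketch under the same assumptions as the text (25 trees, each wrong with probability 0.2, errors independent); the variable names are made up for the illustration.

```python
from math import comb

n_trees = 25
epsilon = 0.2                      # assumed error rate of a single tree
majority = n_trees // 2 + 1        # 13 trees are needed to outvote the rest

# Binomial tail: probability that at least `majority` of the trees are wrong.
p_forest_wrong = sum(
    comb(n_trees, i) * epsilon**i * (1 - epsilon) ** (n_trees - i)
    for i in range(majority, n_trees + 1)
)
print(f"{p_forest_wrong:.6f}")
```

Real trees in a forest are trained on overlapping data and are not fully independent, so this figure is an optimistic bound rather than a prediction of the forest's actual error rate.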

The classification tree DecisionTreeClassifier in sklearn has built-in randomness, so the trees in a random forest naturally differ. At each split, a decision tree picks its feature at random from among the most important candidates, so the tree generated differs from run to run. This behavior is controlled by the parameter random_state.

A random forest also has a random_state parameter, used much as in a classification tree. The difference is that in a classification tree random_state controls a single tree, whereas in a random forest it controls how the whole forest is generated, rather than making every tree in the forest the same.

When random_state is fixed, the forest generates a fixed set of trees, but the individual trees still differ from one another. That difference comes from choosing the features for branching at random. It can be shown that the greater this randomness, the better bagging generally performs: when using bagging, the base classifiers should be independent of and different from each other.

This approach has a strong limitation, though. When we need thousands of trees, the data may not have enough features to build that many different trees. So, besides random_state, we need other sources of randomness.
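To see that a fixed random_state fixes the whole forest while the trees inside it still differ, one can inspect the seed each fitted tree received. The sketch below assumes the wine dataset; the helper tree_seeds and the forest seed 2 are made up for the illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

def tree_seeds(forest_seed):
    """Fit a small forest and return the random_state assigned to each tree."""
    forest = RandomForestClassifier(n_estimators=5, random_state=forest_seed).fit(X, y)
    return [tree.random_state for tree in forest.estimators_]

seeds_a = tree_seeds(2)
seeds_b = tree_seeds(2)   # same forest seed, so the same per-tree seeds
print(seeds_a)
print(seeds_a == seeds_b)
```

The fitted trees are stored in the estimators_ attribute, and each one carries its own random_state derived from the forest's.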

bootstrap & oob_score:

To make the base classifiers as different as possible, an easy-to-understand approach is to train them on different training sets. Bagging forms these different training sets by random sampling with replacement, and bootstrap is the parameter that controls this sampling technique.

From an original training set of n samples we draw one sample at a time at random, putting each sample back before drawing the next, so the same sample may be drawn again. After n draws we have a bootstrap set of n samples, as large as the original training set. Because the sampling is random, each bootstrap set differs from the original dataset and from the other bootstrap sets. In this way we can create an inexhaustible supply of different bootstrap sets, and base classifiers trained on them will naturally differ.

The bootstrap parameter defaults to True, meaning this sampling with replacement is used. Usually this parameter is not set to False.

Sampling with replacement has its own problem, however. Some samples may appear several times in the same bootstrap set while others are left out. In general, a bootstrap set contains about 63% of the original data on average.

The probability that a given sample is drawn into a bootstrap set is (thinking in reverse: one minus the probability that it is never drawn in n draws):

$$1 - \left(1 - \frac{1}{n}\right)^{n}$$

When n is large enough, this probability converges to 1-(1/e), approximately 0.632. So about 37% of the training data is left out of a given tree's modeling. These samples are called out-of-bag data, abbreviated oob. Besides the test set we split off at the start, they can also serve as a test set for the ensemble. In other words, with a random forest we may skip the train/test split and test the model on out-of-bag data alone. This is not absolute, of course: when n and n_estimators are not large enough, it may happen that no data falls out of the bag, and then oob data cannot be used to test the model.
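The 63% figure is easy to reproduce by simulation. The sketch below draws one bootstrap set of indices with NumPy and compares the fraction of distinct samples with the formula; the sample size 10,000 and the seed are arbitrary choices for the illustration.

```python
import numpy as np

n = 10_000
rng = np.random.default_rng(0)

# One bootstrap set: n row indices drawn with replacement from 0..n-1.
sample_idx = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(sample_idx).size / n   # share of distinct samples drawn
theoretical = 1 - (1 - 1 / n) ** n                 # formula from the text

print("simulated :", in_bag_fraction)
print("formula   :", theoretical)
print("1 - 1/e   :", 1 - 1 / np.e)
```

The samples whose indices never appear in sample_idx are the out-of-bag data for that tree.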

To test with out-of-bag data, set oob_score=True when instantiating the forest. After training, another important attribute of the random forest, oob_score_, holds the result of the test on out-of-bag data.
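A minimal sketch of out-of-bag testing, assuming the wine dataset; the forest is trained on all of the data and no separate test set is split off. The values n_estimators=100 and random_state=0 are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the trees
# whose bootstrap set did not contain that sample.
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rfc.fit(X, y)

oob_accuracy = rfc.oob_score_
print(oob_accuracy)
```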

2.2 construction of random forest

A random forest cannot be visualized the way a single tree can. To get an intuitive sense of how well it works, let's compare a random forest with a single decision tree on the wine dataset.

On the wine dataset, the random forest performs better than the decision tree.

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

wine=load_wine()
Xtrain,Xtest,Ytrain,Ytest=train_test_split(wine.data,wine.target,test_size=0.3)

#instantiation 
clf=DecisionTreeClassifier()
rfc=RandomForestClassifier()
clf=clf.fit(Xtrain,Ytrain)
rfc.fit(Xtrain,Ytrain)
score_c=clf.score(Xtest,Ytest)
score_r=rfc.score(Xtest,Ytest)
print("Single Tree:{}".format(score_c))
print("Random Forest:{}".format(score_r))

2.3 comparison of random forest and decision tree under cross validation

The comparison shows that the random forest is indeed better than the decision tree. In only a few folds is the accuracy of the two the same; in the rest the random forest does better.

from matplotlib import colors
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

plt.rcParams['font.sans-serif']=['SimHei'] #Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus']=False #Used to display negative signs normally

wine=load_wine()
Xtrain,Xtest,Ytrain,Ytest=train_test_split(wine.data,wine.target,test_size=0.3)

#instantiation 
rfc=RandomForestClassifier(n_estimators=20) #Set number of base evaluators
clf=DecisionTreeClassifier()

#Perform cross validation
rfc_s=cross_val_score(rfc,wine.data,wine.target,cv=15)
clf_s=cross_val_score(clf,wine.data,wine.target,cv=15)

#Plot the results
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(1,16),clf_s,color='blue',label="DecisionTreeClassifier")
plt.plot(range(1,16),rfc_s,color='red',label="RandomForestClassifier")
plt.xlabel('Cross-validation fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.show()

2.4 learning curve for n_estimators

from matplotlib import colors
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

plt.rcParams['font.sans-serif']=['SimHei'] #Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus']=False #Used to display negative signs normally

wine=load_wine()
Xtrain,Xtest,Ytrain,Ytest=train_test_split(wine.data,wine.target,test_size=0.3)

score=[]
for i in range(100):
    rfc=RandomForestClassifier(n_estimators=i+1)
    rfc_s=cross_val_score(rfc,wine.data,wine.target,cv=10).mean()
    score.append(rfc_s)

print("Highest accuracy: {}  n_estimators: {}".format(max(score),score.index(max(score))+1)) #index 0 corresponds to n_estimators=1
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(1,101),score)
plt.xlabel('Number of base evaluators')
plt.ylabel('Accuracy')
plt.show()

3 random forest regressor

3.1 random forest regressor function and its parameters

criterion:

The metric a regression tree uses to measure the quality of a split. Three criteria are supported:

1) "mse": mean squared error (MSE). The difference in mean squared error between the parent node and its leaf nodes is used as the criterion for feature selection. This method minimizes the L2 loss by using the mean of each leaf node. (scikit-learn 1.0 renamed this value to "squared_error".)

2) "friedman_mse": Friedman mean squared error, which uses Friedman's improvement of the mean squared error for scoring potential splits.

3) "mae": mean absolute error (MAE). This criterion minimizes the L1 loss by using the median of each leaf node. (scikit-learn 1.0 renamed this value to "absolute_error".)
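Why the mean goes with L2 loss and the median with L1 loss can be checked with a small brute-force search. The target values below are made up for the illustration (one deliberate outlier), and the candidate grid is just a convenient way to scan possible leaf predictions.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # made-up leaf targets with one outlier
candidates = np.linspace(0, 100, 10001)     # candidate leaf predictions, step 0.01

# Total loss of every candidate prediction against all targets in the leaf.
l2_loss = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
l1_loss = np.abs(y[:, None] - candidates[None, :]).sum(axis=0)

best_l2 = candidates[l2_loss.argmin()]
best_l1 = candidates[l1_loss.argmin()]

print("L2 minimizer:", best_l2, " mean  :", y.mean())
print("L1 minimizer:", best_l1, " median:", np.median(y))
```

The outlier pulls the L2 minimizer far from most of the targets while the L1 minimizer stays put, which is the practical difference between the two criteria.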

3.2 filling missing values with random forest regression

In sklearn, we can use sklearn.impute.SimpleImputer to easily fill missing data with the mean, the median, or another commonly used value. In this example we fill the missing values with the mean, the median, 0, and random forest regression, then check how well a model fits under each of the four, to find the best way to fill missing values for the dataset at hand.

Any regression learns from the feature matrix and then solves for the continuous label y. This works because the regression algorithm assumes some relationship between the feature matrix and the label. In fact, labels and features are interchangeable. For example, in a problem that predicts "house price" from "region", "environment" and "number of nearby schools", we can use "region", "environment" and "number of nearby schools" to predict "house price", or, the other way round, use "environment", "number of nearby schools" and "house price" to predict "region". Filling missing values by regression relies on this idea.

For data with n features in which feature T has missing values, we treat feature T as the label, and the other n-1 features together with the original label form a new feature matrix. The part of T that is not missing is our Y_train: these rows have both label and features. The missing part has features but no label, and it is the part we need to predict.

X_train: the other n-1 features + the original label, for the rows where feature T is not missing

Y_train: the values of feature T that are not missing

X_test: the other n-1 features + the original label, for the rows where feature T is missing

Y_test: the missing values of feature T, which are unknown and which we need to predict

This approach works well when one feature has many missing values while the other features are complete.
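The four pieces listed above can be sketched on a toy table before moving to the full dataset. Everything in this example is hypothetical: the column names rooms and schools, the price label and all the values are invented to show the split, with rooms playing the role of feature T.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: "rooms" has two missing values, "schools" is complete.
df = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 5.0, np.nan, 2.0],
    "schools": [1.0, 2.0, 2.0, 3.0, 1.0, 0.0],
})
price = pd.Series([200.0, 260.0, 250.0, 330.0, 190.0, 150.0], name="price")

target = "rooms"                                            # feature T
features = pd.concat([df.drop(columns=target), price], axis=1)  # other features + original label
missing = df[target].isnull()

X_train, Y_train = features[~missing], df.loc[~missing, target]
X_test = features[missing]                                  # rows whose T must be predicted

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, Y_train)
filled = df.copy()
filled.loc[missing, target] = reg.predict(X_test)
print(filled)
```

With only four training rows the predictions are not meaningful; the point is the shape of X_train, Y_train and X_test.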

What if there are missing values for other features besides feature T in the data?

The answer is to go through all the features, starting with the one that has the fewest missing values (filling the feature with the fewest missing values needs the least accurate information).

When filling a feature, first replace the missing values of the other features with 0. Each time a regression prediction is finished, put the predicted values back into the original feature matrix and go on to the next feature. Every pass leaves one fewer feature with missing values, so fewer and fewer features need to be filled with 0. By the time we reach the last feature (which should be the one with the most missing values), no other feature needs filling with 0, and the regression has already filled in a large amount of useful information for the other features, which can now be used to fill the feature with the most missing values. Once every feature has been traversed, the data is complete and has no missing values.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.core.frame import DataFrame
#Note: load_boston was removed in scikit-learn 1.2, so this script needs an older version
from sklearn.datasets import load_boston
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

boston=load_boston()
# print(boston)
# print(boston.data.shape)

#After storing the complete data set, the missing data set will be processed and filled, and then compared with the original data set
x_full,y_full=boston.data,boston.target #Storing original data and characteristics
n_samples=x_full.shape[0]#Record the number of complete samples
n_features=x_full.shape[1]#Record the number of complete features

#Put missing values for full data
#Determine the proportion of missing values, here we assume 50%
rng=np.random.RandomState(0)#Determine a random pattern
missing_rate=0.5
n_missing_samples=int(np.floor(n_samples*n_features*missing_rate))#Round down, resulting in the total number of missing data 
# print(n_missing_samples)

#Scatter the missing values across the rows and columns of the data
miss_features=rng.randint(0,n_features,n_missing_samples) #Arguments: lower bound, upper bound, number of draws
miss_samples=rng.randint(0,n_samples,n_missing_samples)

#If fewer draws are needed than there are samples, choice can be used instead
#missing_samples = rng.choice(n_samples,n_missing_samples,replace=False) #Arguments: upper bound, number of draws; replace=False draws without repetition, which spreads the missing values more evenly

#Create missing data
#Operate after copying the original complete data
x_missing=x_full.copy()
y_missing=y_full.copy()
#Replace a part of the complete data with a missing value
x_missing[miss_samples,miss_features]=np.nan
#Convert to dataframe to facilitate subsequent operations
x_missing=DataFrame(x_missing)
# print(x_missing)

#Fill missing values with the mean
imp_mean=SimpleImputer(missing_values=np.nan,strategy='mean')#instantiation 
x_missing_mean = imp_mean.fit_transform(x_missing) #fit_transform = fit on the data, then return the filled data
#Check whether the filling is completed
# print(pd.DataFrame(x_missing_mean).isnull().sum())

#Fill missing values with the median
imp_median=SimpleImputer(missing_values=np.nan,strategy='median')#instantiation 
x_missing_median = imp_median.fit_transform(x_missing) #fit_transform = fit on the data, then return the filled data
#Check whether the filling is completed
# print(pd.DataFrame(x_missing_median).isnull().sum())

#Fill with 0
imp_0 = SimpleImputer(missing_values=np.nan, strategy="constant",fill_value=0)
x_missing_0 = imp_0.fit_transform(x_missing)
#Check whether the filling is completed
# print(pd.DataFrame(x_missing_0).isnull().sum())

#Fill missing values with random forest regression
x_missing_reg=x_missing.copy()#Copy the matrix whose missing values will be filled by regression
#Sort the features by number of missing values in ascending order, so that filling starts from the feature with the fewest missing values
sortindex=np.argsort(x_missing_reg.isnull().sum(axis=0)).values #argsort returns the sorted indices; .values extracts them as an array
#Traverse the features and fill the missing values
for i in sortindex:
    #Build a new feature matrix (features not selected to be filled + original labels) and a new label (features selected to be filled)
    df=x_missing_reg
    fillc=df.iloc[:,i] 
    #New characteristic matrix
    df=pd.concat([df.iloc[:,df.columns!=i],pd.DataFrame(y_full)],axis=1) #Left right connection
    #In the new characteristic matrix, the columns with missing values are filled with 0
    df_0 =SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0).fit_transform(df)
    #Find training set and test set
    Ytrain = fillc[fillc.notnull()] #Non empty in features selected to fill
    Ytest = fillc[fillc.isnull()]     #For those values that do not exist in the selected features to be filled, we need not the value of ytest, but its index
    Xtrain = df_0[Ytrain.index,:] #On the new characteristic matrix, the record corresponding to the selected non null value
    Xtest = df_0[Ytest.index,:]    #On the new feature matrix, the record corresponding to the selected feature null value

    #Filling missing values using random forests
    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(Xtrain, Ytrain)
    Ypredict = rfc.predict(Xtest)#Get the prediction results

    #Fill the filled features into the original feature matrix
    x_missing_reg.loc[x_missing_reg.iloc[:,i].isnull(),i]=Ypredict  #First, use iloc to find the row index of nan

# print(x_missing_reg.isnull().sum())

#All data were modeled and MSE results were obtained
X = [x_full,x_missing_0,x_missing_mean,x_missing_median,x_missing_reg]
mse = []
std = []
for x in X:
    estimator = RandomForestRegressor()
    scores = cross_val_score(estimator,x,y_full,scoring='neg_mean_squared_error', cv=5).mean()#Score with negative mean square error
    mse.append(scores * -1)#Turn the result to positive
print(*zip(['x_full','x_missing_0','x_missing_mean','x_missing_median','x_missing_reg'],mse))#The smaller the better

#Plot the results
x_labels = ['Full data','Zero Imputation','Mean Imputation','Median Imputation','Regressor Imputation']
colors = ['r', 'g', 'b','black', 'orange']
plt.figure(figsize=(20, 8),dpi=80)
ax = plt.subplot(111) #Add subgraphs to prepare for subsequent functionalization
for i in np.arange(len(mse)):
    ax.barh(i, mse[i],color=colors[i], alpha=0.6, align='center') #Draw the bar chart horizontally in the center
ax.set_title('Imputation Techniques with Boston Data')
ax.set_xlim(left=np.min(mse) * 0.9,right=np.max(mse) * 1.1)#The x-axis of the limited range is mse value
ax.set_yticks(np.arange(len(mse)))#Set y scale
ax.set_xlabel('MSE')
ax.set_yticklabels(x_labels)#Change y's scale
plt.show()

4 basic idea of machine learning parameter adjustment

4.1 related concepts

The first step in tuning a model is to find the right goal: what are we trying to do? Generally, the goal is to improve some evaluation metric of the model. For a random forest, what we want to improve is its accuracy on unknown data (measured by score or oob_score_). With that goal in mind, we need to ask what factors affect accuracy on unknown data. In machine learning, the metric used to measure a model's accuracy on unknown data is called the generalization error.

When a model performs poorly on unknown data (the test set or out-of-bag data), we say that it does not generalize well: its generalization error is large and the model is not effective. Generalization error is affected by the structure (complexity) of the model. Plotted against model complexity, generalization error forms a U-shaped curve. When the model is too complex, it overfits and generalizes poorly, so the generalization error is large. When the model is too simple, it underfits and lacks fitting ability, so the error is also large. Only when the complexity is just right do we reach the minimum generalization error.

For a tree model, the lusher the tree, that is, the deeper it is and the more branches and leaves it has, the more complex the model. A tree model therefore sits naturally toward the high-complexity (right-hand) end of the curve of generalization error against complexity, and because a random forest is built from tree models, it is naturally a complex model too. The parameters of a random forest all work toward one goal: reduce the complexity of the model, move it to the left along that curve, and prevent overfitting. Nothing in tuning is absolute, of course, and some random forests do sit on the left of the curve, so before tuning we must first judge which side the model is on.

Behind the generalization error is the "bias-variance dilemma". Bias: the difference between the model's predicted value and the true value. Variance: the spread between each output of the model and the model's average prediction. Bias measures whether the model predicts accurately: the smaller the bias, the more "accurate" the model. Variance measures whether the model's predictions stay close to one another: the smaller the variance, the more "stable" the model.

A good model must predict most unknown data both "accurately" and "stably". In other words, when both bias and variance are low, the generalization error of the model is small and its accuracy on unknown data is high.

Variance and bias trade off against each other: as one falls the other rises, so the two cannot reach their minimum at the same time.

When model complexity is high, variance is high and bias is low. Low bias requires the model to predict "accurately", so the model works harder to learn more information, and that information is specific to the training data. The model then performs well on some data and poorly on other data: it generalizes poorly and behaves unstably across datasets, so the variance is large. To learn the training set as fully as possible, the model must be built in finer detail, and its complexity must rise. Hence high complexity means high variance and a high total generalization error.

In contrast, when complexity is low, variance is low and bias is high. Low variance requires the model to predict "stably", which means stronger generalization. Such a model does not need to learn the data in much detail; a relatively simple model with fairly broad decision rules is enough. As a result, the model cannot reach high accuracy on any particular type or group of data, so the bias is large. Hence low complexity means high bias and, again, a high total generalization error.
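The two regimes described above show up when training accuracy and cross-validated accuracy are compared as complexity grows. The sketch below is an illustration only: it uses a single decision tree on the wine dataset and treats max_depth as the complexity knob, with an arbitrary depth range of 1 to 7.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
depths = np.arange(1, 8)   # model complexity grows with max_depth

# For each depth: accuracy on the training folds and on the held-out folds.
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)

for d, tr, cv in zip(depths, train_mean, cv_mean):
    print(f"max_depth={d}: train={tr:.3f}  cv={cv:.3f}")
```

A shallow tree scores low on both (underfitting, high bias); as depth grows the training score approaches 1 while the cross-validated score levels off, and the gap between the two is a rough indicator of variance.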

The goal of tuning is to strike the best balance between variance and bias. Although the two cannot be minimal at the same time, the generalization error they make up does have a lowest point, and that is what we are looking for. For a highly complex model we should reduce variance; for a relatively simple model we should reduce bias. The base estimators of a random forest have low bias and high variance, because a decision tree by itself already predicts fairly "accurately", and bagging itself requires the base classifiers to be more than 50% accurate. The training process of bagging methods such as random forest therefore aims to reduce variance, that is, to reduce model complexity, and the default parameter settings of a random forest assume that the model sits to the right of the lowest point of generalization error.

When we reduce complexity, we are in essence reducing the variance of the random forest, and all of its parameters work toward that goal.

1) If the model is too complex or too simple, the generalization error will be high. What we are after is the balance point in between.

2) If the model is too complex it overfits, and if it is too simple it underfits.

3) For tree models and tree ensembles, the deeper the tree and the more branches and leaves it has, the more complex the model.

4) The goal when tuning tree models and tree ensembles is to reduce the complexity of the model and move it to the left along the complexity axis.

So far we have tuned parameters by looking for the best value on a learning curve, one parameter after another, hoping to push the accuracy to a fairly high level. Now that we know the tuning direction for a random forest, which is to reduce complexity, we can pick out the parameters that affect complexity the most, study their monotonicity, and concentrate on those that reduce complexity the most. Parameters that are not monotonic, or that increase complexity, can be used as the situation requires, and most of the time we can even leave them alone.

4.2 examples

View the learning curve of n_estimators.

After finding a good range, refine the learning curve further.

Because each run partitions the training and test sets differently, the refined result may be worse than before, but in general the refined curve gives a more precise range for choosing the parameter. Once the number of base estimators is fixed, use grid search to find suitable values for the other parameters.

We will use grid search to tune the parameters one at a time. Why not tune several at once? There are two reasons: 1) tuning several parameters together runs very slowly; 2) tuning several parameters together hides how the winning combination came about, so if the grid search result is poor we do not know what to change. Here, so that we can apply the complexity versus generalization-error approach (the bias-variance approach), we tune the parameters one by one.
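Tuning a single parameter with grid search looks like the sketch below. It is a reduced illustration, assuming the breast cancer dataset used in the complete code of this section; the small forest (20 trees), the four candidate depths and cv=5 are chosen only to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Only max_depth is searched; every other parameter stays fixed.
rfc = RandomForestClassifier(n_estimators=20, random_state=0)
search = GridSearchCV(rfc, {"max_depth": [2, 4, 8, None]}, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Comparing best_score_ with the score obtained before the search tells us whether limiting this parameter helped, and so which way the next parameter should be moved.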

Adjust max_depth

After max_depth is tuned, the accuracy of the model increases, so the model is on the right side of the lowest point of generalization error.

Adjust max_features

max_features is the only parameter that can push the model either to the left (low variance, high bias) or to the right (high variance, low bias). Before tuning it, we need to decide which way to move max_features according to where the model sits (left or right of the lowest generalization error). The model is now on the right, so we need lower complexity and should move max_features toward smaller values: the fewer features available at each split, the simpler the model. The default value of max_features is sqrt(n_features), so we use twice this value as the upper end of the tuning range.
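The value that sqrt(n_features) resolves to can be read from a fitted tree. This sketch assumes the breast cancer dataset and passes max_features="sqrt" explicitly, so it does not depend on which default a given scikit-learn version uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rfc = RandomForestClassifier(n_estimators=5, max_features="sqrt", random_state=0).fit(X, y)

n_features = X.shape[1]
# max_features_ on a fitted tree is the number of features it considers per split.
per_split = rfc.estimators_[0].max_features_
print(n_features, per_split)
```

Twice this per-split value gives a reasonable upper end for a max_features search range on this dataset.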

Adjust min_samples_leaf

For min_samples_split and min_samples_leaf, the search range usually starts at their minimum value and goes up by 10 or 20.

With high-dimensional data and a large sample size, if you are unsure, you can go straight to +50. For very large data you may need a range of 200 to 300.

If the accuracy cannot be improved no matter how you tune, feel free to try a boldly large value to limit the complexity of the model sharply.

Here min_samples_split lowers the accuracy, so the parameter is not set.

Adjust random_state

The accuracy improves.

After tuning, summarize the best parameters of the model.

The complete code is as follows

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = load_breast_cancer()

#preliminary estimates

# scorel = []
# for i in range(0,200,10):
#     rfc=RandomForestClassifier(n_estimators=i+1)
#     score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
#     scorel.append(score)

# print(max(scorel),(scorel.index(max(scorel))*10)+1)
# plt.figure(figsize=[20,8],dpi=80)
# plt.plot(range(1,201,10),scorel)
# plt.show()

#Refined estimation

# scorel = []
# for i in range(25,36):
#     rfc=RandomForestClassifier(n_estimators=i)
#     score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
#     scorel.append(score)
# print(max(scorel),([*range(25,36)][scorel.index(max(scorel))])) #n_estimators (within 25-35) that gives the highest score
# plt.figure(figsize=[20,8],dpi=80)
# plt.plot(range(25,36),scorel)
# plt.show()

#Adjust max_depth
# param_grid = {'max_depth':np.arange(1, 20, 1)}
# Choose the trial range from the size of the data: the breast cancer data is small, so 1~10 or 1~20 is enough.
# For large data such as digit recognition, try depths of 30~50 (which may still not be enough)
# Better still, draw a learning curve to observe how depth affects the model

# rfc = RandomForestClassifier(n_estimators=32)
# GS = GridSearchCV(rfc,param_grid,cv=10)
# GS.fit(data.data,data.target)
# print(GS.best_params_,GS.best_score_)

#Adjust max_features
#The model is now on the right side of the generalization error curve, so we need lower complexity
#The default value of max_features is sqrt(n_features), so we use about twice this value as the upper end of the tuning range.
# param_grid = {'max_features':np.arange(1,10,1)}
# rfc = RandomForestClassifier(n_estimators=32,max_depth=11)
# GS = GridSearchCV(rfc,param_grid,cv=10)
# GS.fit(data.data,data.target)
# print(GS.best_params_,GS.best_score_)

#Adjust min_samples_leaf
#For min_samples_split and min_samples_leaf, the search range usually starts at their minimum value and goes up by 10 or 20
#With high-dimensional data and a large sample size, if you are unsure, you can go straight to +50. For very large data you may need a range of 200~300
#If the accuracy cannot be improved no matter how you tune, feel free to try a boldly large value to limit the complexity of the model sharply
# param_grid={'min_samples_leaf':np.arange(1, 1+10, 1)}
# rfc = RandomForestClassifier(n_estimators=32,max_depth=11,max_features=7)
# GS = GridSearchCV(rfc,param_grid,cv=10)
# GS.fit(data.data,data.target)
# print(GS.best_params_,GS.best_score_)

#Adjust random_state
# param_grid={'random_state':np.arange(20,150)}
# rfc = RandomForestClassifier(n_estimators=32,max_depth=11,max_features=7)
# GS = GridSearchCV(rfc,param_grid,cv=10)
# GS.fit(data.data,data.target)
# print(GS.best_params_,GS.best_score_)


#Determine final parameters
rfc = RandomForestClassifier(n_estimators=39,max_depth=11,max_features=7,random_state=66)
score = cross_val_score(rfc,data.data,data.target,cv=10)
print(score)

Keywords: Machine Learning sklearn

Added by xenooreo on Tue, 11 Jan 2022 14:22:59 +0200
