Preface
In supervised machine learning, our goal is to learn a single stable model that performs well in every respect, but in practice this ideal is rarely achieved. Often we can only obtain several imperfect models (weak learners), each of which performs well in some respects. As the Chinese proverb goes, three cobblers with their wits combined equal Zhuge Liang the mastermind; in other words, many heads are better than one. Ensemble learning combines multiple weak models in order to obtain a better and more comprehensive strong model.
1. Import data
import pandas as pd

def read_xlsx(csv_path):
    # Read the data set (despite the function name, this reads a CSV file)
    data = pd.read_csv(csv_path)
    print(data)
    return data
2. Divide the data into training and test sets
For convenience, sklearn is called directly here; for the underlying implementation, see my previous articles.
from sklearn.model_selection import train_test_split

x = data.iloc[:, :-1]
y = data.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y)
3. Ensemble learning
3.1 Maximum voting method
The maximum voting method is usually used for classification problems. In this technique, multiple models are used to predict each data point, and each model's prediction counts as one "vote". The prediction made by the majority of the models is taken as the final result.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

def voting(x_train, x_test, y_train):
    # Train three different base classifiers on the same training set
    model1 = DecisionTreeClassifier()
    model2 = KNeighborsClassifier()
    model3 = MultinomialNB()
    model1.fit(x_train, y_train)
    model2.fit(x_train, y_train)
    model3.fit(x_train, y_train)
    a = model1.predict(x_test)
    b = model2.predict(x_test)
    c = model3.predict(x_test)
    labels = []
    for i in range(len(x_test)):
        # Each model casts one vote; the most common label wins
        # (np.bincount assumes non-negative integer class labels)
        ypred = [a[i], b[i], c[i]]
        counts = np.bincount(ypred)
        labels.append(np.argmax(counts))
    print(labels)
    return labels
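As an aside, sklearn also ships a ready-made hard-voting ensemble, so the vote counting above does not have to be written by hand. A minimal sketch with the same three base classifiers:

from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# voting='hard' takes the majority class label, matching the function above
clf = VotingClassifier(estimators=[('dt', DecisionTreeClassifier()),
                                   ('knn', KNeighborsClassifier()),
                                   ('nb', MultinomialNB())],
                       voting='hard')
clf.fit(x_train, y_train)
labels = clf.predict(x_test)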
In addition to the maximum voting method, there are also the average method and the weighted average method.
The average method is similar to the maximum voting technique, except that the multiple predictions for each data point are averaged, and this average is taken as the final prediction. Averaging can be used for regression problems, or to combine predicted probabilities in classification problems.
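As a minimal sketch of the averaging idea for regression, assuming model1, model2 and model3 are three already-fitted regression models (hypothetical here, not the classifiers above):

# Average the predictions of several fitted regression models
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)
final_pred = (pred1 + pred2 + pred3) / 3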
The weighted average method is an extension of the average method: every model is assigned a weight that defines the importance of its prediction. For example, if two of your colleagues are experts in the field while the others have no experience, the answers of those two colleagues are given more weight than the others'.
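Continuing the sketch above, a weighted average assigns each model an importance weight; the values 0.4, 0.4, 0.2 below are illustrative and give the two trusted models more say:

# Weighted average: more trusted models get larger weights (weights sum to 1)
final_pred = 0.4 * pred1 + 0.4 * pred2 + 0.2 * pred3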
3.2 Bagging
In the bagging method, the bootstrap technique is used to sample with replacement from the full data set, producing n sub-data-sets, and one model is trained on each. The final prediction is obtained by combining the outputs of the n models: for classification the n models vote, and for regression their predictions are averaged.
import numpy as np
from pandas import DataFrame
from sklearn.tree import DecisionTreeClassifier

# Sampling with replacement (bootstrap)
def random_sampling(x, y, m):
    x = np.array(x)
    y = np.array(y)
    # Draw m indices with replacement so that each call yields a bootstrap sample
    a = np.random.choice(len(x), size=m, replace=True)
    return x[a], y[a]

# Train a model on each subset
def model(x_train, y_train, x_test):
    yclass = []
    for i in range(20):
        clf = DecisionTreeClassifier()
        subset, label = random_sampling(x_train, y_train, 150)
        clf.fit(subset, label)
        yclass.append(list(clf.predict(x_test)))
    data = DataFrame(yclass)
    ypred = []
    for col in data.columns:
        # Classification: take the majority vote of the 20 models for each
        # test point (for regression one would average instead)
        counts = np.bincount(data[col])
        ypred.append(np.argmax(counts))
    print(ypred)
    return ypred
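For comparison, sklearn also provides a ready-made bagging implementation; a minimal sketch mirroring the setup above (20 trees, each on a bootstrap sample):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 20 decision trees, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20)
bag.fit(x_train, y_train)
ypred = bag.predict(x_test)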
4. Calculate accuracy
def accuracy(ypred, y_test):
    correct = 0
    y_test = list(y_test)
    for x in range(len(y_test)):
        # Count predictions that match the true label
        if ypred[x] == y_test[x]:
            correct += 1
    accuracy = (correct / float(len(y_test))) * 100.0
    print("Accuracy:", accuracy, "%")
    return accuracy
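Putting the pieces together, an illustrative end-to-end run; the file name data.csv is a placeholder, and the last column is assumed to hold integer class labels:

data = read_xlsx("data.csv")  # placeholder path
x = data.iloc[:, :-1]
y = data.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y)

labels = voting(x_train, x_test, y_train)  # maximum voting
accuracy(labels, y_test)

ypred = model(x_train, y_train, x_test)    # bagging
accuracy(ypred, y_test)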
Summary
Ensemble learning combines multiple weak classifiers to obtain a stronger, more accurate and more robust model.