2.1 Introduction to the machine learning process
2.1.1 Overall process of machine learning
Next, the implementation process of "supervised learning", the most widely used of the three types of machine learning methods, is described. The implementation process of supervised learning can be summarized into the following steps:
- Collecting data
- Cleaning the data (removing duplicate or missing records to improve data quality)
- Using a machine learning algorithm to learn from the data
- Evaluating the performance of the model on a test data set
- Deploying the machine learning model into an application environment, such as a web page
Of the five steps above, only step 3 uses machine learning technology. In practice, data collection and data cleaning are the most important steps, and they take the most time.
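As a rough illustration of the cleaning step, the following sketch uses pandas to drop duplicate and missing rows; the file name data.csv is a hypothetical placeholder, not part of any particular project:

```python
# A minimal data-cleaning sketch with pandas.
# "data.csv" is a hypothetical placeholder file.
import pandas as pd

df = pd.read_csv('data.csv')   # step 1: load the collected raw data
df = df.drop_duplicates()      # step 2: remove duplicate rows
df = df.dropna()               # step 2: remove rows with missing values
print(df.shape)                # confirm how much data survived cleaning
```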
2.1.2 Learning from the data
In the learning step, various machine learning algorithms can be used, such as:
- Support vector machines (SVM)
- Random forests
- k-nearest neighbors (KNN)
- Decision trees
- BP neural networks, and so on
We use these algorithms to find the features and patterns contained in the data, and then classify and predict new data.
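As a minimal sketch of this step (our own example, using one of the algorithms listed above), a decision tree can be fitted to the iris data like this:

```python
# A minimal sketch of the learning step: fit a decision tree to iris.
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
clf = DecisionTreeClassifier()
clf.fit(iris.data, iris.target)     # learn the patterns contained in the data
print(clf.predict(iris.data[:5]))   # predict the classes of the first 5 rows
```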
2.2 Use of learning data
2.2.1 Training data and test data
In supervised learning, we need to divide the data into "training data" and "test data". Instead of training on all of the data at once, we set aside some data to test the trained model. The reason is that the accuracy of a trained model is not necessarily good; if we do not hold out part of the data for testing, we are likely to draw the wrong conclusions about it.
In most cases, we choose 20% of the overall data as the test data.
2.2.2 Theory and practice of the hold-out method
Next, two methods of dividing the data are introduced: the hold-out method and k-fold cross validation. First, the hold-out method. As the name suggests, the hold-out method simply divides the given data set into a training set and a test set. When using machine learning algorithms in Python, we usually rely on the third-party library scikit-learn. To practice the hold-out method with scikit-learn, we use the train_test_split() function, as follows:
```python
############ Import the modules to be used ############
from sklearn import datasets
from sklearn.model_selection import train_test_split

############ Read the data set named iris ############
iris = datasets.load_iris()
x = iris.data
y = iris.target

############ Use the hold-out method to split the data into x_train, x_test, y_train, y_test ############
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

############ Confirm the size of the training data and test data ############
print('x_train:', x_train.shape)
print('x_test:', x_test.shape)
print('y_train:', y_train.shape)
print('y_test:', y_test.shape)
```
| Parameter | Meaning |
|---|---|
| x_train | X (independent variables) in the training set |
| x_test | X (independent variables) in the test set |
| y_train | Labels in the training set |
| y_test | Labels in the test set |
| test_size | Proportion of the test set, usually 0.2 |
| random_state | When a fixed number such as 0 is specified, the split is reproducible and will not change between runs; it is often set to 0 |
2.2.3 Theory of k-fold cross validation
K-fold cross validation is a method of model evaluation and validation. It uses sampling without replacement (sampled data is not returned to the original data set): the training set is divided into k subsets, k-1 of the subsets are used as learning data, and the remaining subset is used to test the model. In this way, we obtain k models and k corresponding performance scores. The learning and evaluation are repeated k times, and the average of the k performance scores is taken as the final result for the model.
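To make the splitting concrete, here is a minimal sketch (our own toy example, not part of the evaluation workflow above) that uses scikit-learn's KFold to show which rows go into the learning and test subsets when k = 5:

```python
# A toy illustration of how k-fold cross validation splits the data.
import numpy as np
from sklearn.model_selection import KFold

x = np.arange(10)        # 10 toy samples, represented by their indices
kf = KFold(n_splits=5)   # k = 5 subsets
for train_index, test_index in kf.split(x):
    # each round uses k-1 subsets for learning and 1 subset for testing
    print('train:', train_index, 'test:', test_index)
```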
Leave-one-out cross validation is a special case of k-fold cross validation. It sets the number of subsets equal to the number of samples in the data set: if you have 20 rows of data, the leave-one-out method divides them into 20 subsets, so each row of data is a subset. During training, one row of data is used for testing and the remaining 19 rows are used for training. This method is suitable for very small data sets (about 50 to 100 rows).
2.2.4 Practice of k-fold cross validation
A Python implementation of classic k-fold cross validation:
```python
'''By adjusting the value of k in cv=k, the scores will differ'''
from sklearn import svm, datasets                     # import the SVM algorithm package and the data sets
from sklearn.model_selection import cross_val_score   # import the k-fold cross validation function

iris = datasets.load_iris()   # load the iris data set
x = iris.data                 # get the independent variables of the data
y = iris.target               # get the labels of the data

svc = svm.SVC(C=1, kernel='rbf', gamma=0.001)   # use the SVM algorithm
scores = cross_val_score(svc, x, y, cv=5)       # compute the cross validation scores; internally x and y
                                                # are split into x_train, x_test, y_train, y_test
print(scores)                                   # output the score of each validation round
print('Average score:', scores.mean())          # output the average score
'''
[0.86666667 0.96666667 0.83333333 0.96666667 0.93333333]
Average score: 0.9133333333333334
'''
```
A Python implementation of the leave-one-out method:
```python
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score   # import the cross validation function
from sklearn import svm

svc = svm.SVC(C=1, kernel='rbf', gamma=0.001)
iris = datasets.load_iris()
x = iris.data[0:70]      # select the first 70 rows of the data set
y = iris.target[0:70]    # ditto
loo = LeaveOneOut()      # use the leave-one-out method
scores = cross_val_score(svc, x, y, cv=loo)
print(scores.mean())     # output the average score
'''
The average score is 0.7142857142857143
'''
```
The final average score of the leave-one-out method is not very satisfactory, which is related to the choice of model and the amount of data.
2.3 Overfitting
2.3.1 What is overfitting
Let us use a diagram to explain what overfitting is:
As shown in the figure, we have two classes of data points, blue and orange, and one blue data point clearly deviates from where it would normally lie. Such a point is usually erroneous data, i.e. data with severe deviation. If we ask the computer to classify the data, it will very "obediently" take this point into account and end up with an "overfit boundary", whereas the correct boundary is the green line. We can see that the classification boundary is pulled by this single point and cannot be drawn correctly. This state, caused by the computer over-learning the data, is called "overfitting".
2.3.2 How to avoid overfitting
- In deep learning, the dropout method is often used to prevent overfitting (a toy sketch of the idea follows this list)
- In other machine learning algorithms, the data is often normalized before learning
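To illustrate the first point, here is a hand-rolled numpy sketch of the idea behind dropout (real deep learning frameworks provide this as a ready-made layer; this is only an illustration of the mechanism):

```python
# A toy illustration of dropout: randomly zero a fraction of activations
# during training so the network cannot rely on any single neuron.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    # keep each activation with probability (1 - p), rescale to preserve the mean
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1 - p)

a = np.ones(10)
print(dropout(a, p=0.5))   # roughly half the values are zeroed, the rest doubled
```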
For the second point, let's talk about what normalization is.
Normalization processing
Normalization, as the name suggests, maps data values to the $[0,1]$ interval. The simplest normalization method is:
$$data_{new}=\frac{data_{old}-data_{min}}{data_{max}-data_{min}}$$
In this way, we have made the following mapping:

$$data \to [0,1]$$
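As a minimal sketch of this formula (our own example), the mapping can be applied column-wise to the iris features with numpy; scikit-learn's MinMaxScaler implements the same idea as a ready-made transformer:

```python
# A minimal sketch of min-max normalization applied to the iris features.
import numpy as np
from sklearn import datasets

x = datasets.load_iris().data
x_new = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(x_new.min(axis=0))   # every column now starts at 0
print(x_new.max(axis=0))   # and ends at 1
```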
2.4 Ensemble learning
Ensemble learning is a method that improves generalization by combining multiple trained models. There are two common approaches:
- The bagging algorithm lets multiple models learn in parallel and enhances the generalization of the prediction by averaging their results.
- The boosting algorithm improves generalization by building each new model based on the prediction results of the previous models.
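As a minimal sketch contrasting the two ideas (our own example; the parameter values are arbitrary), scikit-learn's BaggingClassifier and GradientBoostingClassifier can be compared on the iris data:

```python
# A minimal sketch contrasting bagging and boosting on the iris data.
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
x, y = iris.data, iris.target

bagging = BaggingClassifier(n_estimators=10)            # 10 models trained on resampled data, results combined
boosting = GradientBoostingClassifier(n_estimators=10)  # 10 models built sequentially from previous errors
print('bagging:', cross_val_score(bagging, x, y, cv=5).mean())
print('boosting:', cross_val_score(boosting, x, y, cv=5).mean())
```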