A beginner-friendly introductory tutorial to scikit-learn
Scikit-learn is a well-known Python machine learning library, widely used in data science for statistical analysis and machine learning modeling.
- Comprehensive modeling: scikit-learn lets users build all kinds of supervised and unsupervised learning models
- Versatile utilities: sklearn also handles data preprocessing, feature engineering, dataset splitting, model evaluation, and more
- Built-in datasets: a range of classic datasets, such as the iris and Boston house price data, ship with the library, so finding data is no longer a worry
This article gives a concise introduction to using scikit-learn; for more details, please refer to the official website. It covers:
- Using the built-in datasets
- Dataset splitting
- Data normalization and standardization
- Label encoding
- Modeling
The scikit-learn algorithm cheat sheet
The official website provides the following map. Starting from the sample size, it summarizes the use of scikit-learn in four areas: regression, classification, clustering, and dimensionality reduction:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Installation
To install scikit-learn, using Anaconda is recommended, since it saves you from worrying about configuration and environment issues. Of course, you can also install it directly with pip:
pip install scikit-learn
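If you are working in an Anaconda environment, the corresponding conda command is usually:

conda install scikit-learn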
Dataset generation
sklearn has some classic datasets built in, such as the iris data and the Boston house price data.
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets  # Import the built-in datasets module
Classification data: the iris dataset
# iris data
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch
What does the iris data look like? Each built-in dataset carries a lot of information.
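For example, you can peek at what the returned Bunch object contains (the exact set of keys varies a little between scikit-learn versions):

print(iris.keys())         # e.g. dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename', ...])
print(iris.feature_names)  # The four measurement columns
print(iris.target_names)   # The three species names: setosa, versicolor, virginica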
The data above can also be turned into a DataFrame for easier viewing, with the dependent variable added as a column.
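The code for this step did not survive in the original; a minimal sketch using the iris Bunch loaded above (the column name "target" is our own choice):

# Build a DataFrame from the feature matrix
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Append the dependent variable (species label) as a new column
df["target"] = iris.target
df.head()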
Regression data - Boston house prices
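The screenshot in the original post did not survive; loading the data looks roughly like this. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so it requires an older version.

boston = datasets.load_boston()  # Load the Boston house price data (scikit-learn < 1.2)
type(boston)                     # sklearn.utils.Bunch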
The attributes we focus on:
- data
- target, target_names
- feature_names
- filename
A DataFrame can be generated here as well.
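The original code is missing here too; under the same load_boston assumption, a sketch could look like this (the target column name "PRICE" is our own choice):

df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston["PRICE"] = boston.target  # Median house value, the regression target
df_boston.head()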
Three ways to load the data
Method 1
# Import the loader
from sklearn.datasets import load_iris
data = load_iris()
# Extract the feature data and labels
data_X = data.data
data_y = data.target
Method 2
from sklearn import datasets
loaded_data = datasets.load_iris()  # Load the dataset object
# Feature data
data_X = loaded_data.data
# Labels
data_y = loaded_data.target
Method 3
# Return the features and labels directly
data_X, data_y = load_iris(return_X_y=True)
Dataset usage summary
from sklearn import datasets  # Import the library

boston = datasets.load_boston()                # Load the Boston house price data
print(boston.keys())                           # View the keys (attributes): ['data', 'target', 'feature_names', 'DESCR', 'filename']
print(boston.data.shape, boston.target.shape)  # Shapes of the data and target
print(boston.feature_names)                    # Feature names
print(boston.DESCR)                            # Dataset description
print(boston.filename)                         # File path
Dataset splitting
# Import the splitter
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data_X,
    data_y,
    test_size=0.2,
    random_state=111
)
len(X_train)  # 150 * 0.8 = 120
Data standardization and normalization
from sklearn.preprocessing import StandardScaler  # Standardization
from sklearn.preprocessing import MinMaxScaler    # Normalization

# Standardization
ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)  # Pass in the data to be standardized

# Normalization
mm = MinMaxScaler()
X_scaled = mm.fit_transform(X_train)
Label encoding
Examples from the official website: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Encoding numbers
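The code for this part is missing from the original; the snippet below follows the numeric example from the LabelEncoder documentation page linked above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([1, 2, 2, 6])
le.classes_                         # array([1, 2, 6])
le.transform([1, 1, 2, 6])          # array([0, 0, 1, 2])
le.inverse_transform([0, 0, 1, 2])  # array([1, 1, 2, 6])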
Encoding strings
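Likewise, the string example from the same documentation page:

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
list(le.classes_)                          # ['amsterdam', 'paris', 'tokyo']
le.transform(["tokyo", "tokyo", "paris"])  # array([2, 2, 1])
list(le.inverse_transform([2, 2, 1]))      # ['tokyo', 'tokyo', 'paris']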
Modeling case
Import module
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis  # Models
from sklearn.datasets import load_iris                # Data
from sklearn.model_selection import train_test_split  # Dataset splitting
from sklearn.model_selection import GridSearchCV      # Grid search
from sklearn.pipeline import Pipeline                 # Pipelines
from sklearn.metrics import accuracy_score            # Accuracy scoring
Model instantiation
# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=5)
Model training
knn.fit(X_train, y_train)
KNeighborsClassifier()
Test set prediction
y_pred = knn.predict(X_test)  # Predict with the trained model
y_pred
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 1, 2, 1, 1, 2, 0, 0, 2, 0, 2])
Score verification
There are two ways to verify the model score:
knn.score(X_test, y_test)
0.9333333333333333
accuracy_score(y_pred, y_test)
0.9333333333333333
Grid search
How to search for parameters
from sklearn.model_selection import GridSearchCV

# Parameters to search over
knn_paras = {"n_neighbors": [1, 3, 5, 7]}
# Base model
knn_grid = KNeighborsClassifier()
# Instantiate the grid search object
grid_search = GridSearchCV(
    knn_grid,
    knn_paras,
    cv=10  # 10-fold cross-validation
)
grid_search.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=KNeighborsClassifier(), param_grid={'n_neighbors': [1, 3, 5, 7]})
# Best estimator found by the search
grid_search.best_estimator_
KNeighborsClassifier(n_neighbors=7)
grid_search.best_params_
{'n_neighbors': 7}
grid_search.best_score_
0.975
Modeling with the parameters found by the search
knn1 = KNeighborsClassifier(n_neighbors=7)
knn1.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)
The results below show that the model built with the parameters found by grid search performs better than the one trained without grid search:
y_pred_1 = knn1.predict(X_test)
knn1.score(X_test, y_test)
1.0
accuracy_score(y_pred_1,y_test)
1.0