scikit-learn, the machine learning workhorse: a step-by-step introductory tutorial

Official account: Special House
Author: Peter
Editor: Peter

Hello, I'm Peter~

scikit-learn is a well-known Python machine learning library that is widely used in data science for statistical analysis and machine learning modeling.

  • Strong modeling: users can build a wide range of supervised and unsupervised learning models with scikit-learn
  • Broad functionality: sklearn also covers data preprocessing, feature engineering, data set splitting, model evaluation and more
  • Rich data: data sets such as iris are built in, and others such as Titanic can be downloaded with fetch_openml, so finding practice data is no longer a worry

This article gives a concise introduction to scikit-learn, covering the topics below. For more details, please refer to the official website.

  1. Built-in dataset usage
  2. Data set splitting
  3. Data normalization and standardization
  4. Category encoding
  5. Modeling

The scikit-learn algorithm map

The official website provides the map linked below. Starting from the sample size, it summarizes how to choose a scikit-learn estimator in four areas: regression, classification, clustering and dimensionality reduction:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Installation

We recommend installing scikit-learn through Anaconda, which avoids most configuration and environment problems. You can also install it directly with pip:

pip install scikit-learn

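Built-in datasets

sklearn ships with several well-known data sets, such as the iris data and (in versions before 1.2) the Boston house price data. Other data sets, such as the Titanic data, are not bundled but can be downloaded with fetch_openml.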

import pandas as pd
import numpy as np

import sklearn 
from sklearn import datasets  # Import dataset

Classification data: iris

# iris data
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

What does the iris data look like? Each built-in data set is returned as a Bunch object that carries the data together with descriptive information.
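
The screenshots that originally showed the contents of the Bunch are not available, so here is a minimal sketch of how it could be inspected. It uses only attributes documented for load_iris. The exact list of keys may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary whose keys are also attributes
print(list(iris.keys()))

print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.target[:5])     # integer class labels of the first five samples
```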

The data above can be turned into a DataFrame, with the dependent variable (target) added as an extra column.
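
A minimal sketch of that conversion follows. The variable name iris_df and the column name "target" are chosen here for illustration and do not come from the original screenshots.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Features become the values; feature_names supplies the column labels
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Append the dependent variable as an extra column
iris_df["target"] = iris.target

print(iris_df.head())
```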

Regression data - Boston house prices

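Attributes we focus on:

  • data
  • target, target_names (target_names exists only for classification data sets)
  • feature_names
  • filename

A DataFrame can also be built from these attributes, with data as the values and feature_names as the column labels.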
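
Because load_boston is no longer available in scikit-learn 1.2 and later, the sketch below uses load_diabetes as a stand-in regression data set. That substitution is an assumption of this example, not the original author's choice. It also uses the as_frame=True option, which asks the loader to build the DataFrame itself.

```python
from sklearn.datasets import load_diabetes

# Stand-in regression data set (assumption: used here in place of load_boston)
diabetes = load_diabetes(as_frame=True)

# With as_frame=True, .frame holds the feature columns plus a "target" column
diabetes_df = diabetes.frame

print(diabetes_df.shape)
print(diabetes_df.head())
```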

Three ways to load the data

Method 1

# Import the loader function
from sklearn.datasets import load_iris
data = load_iris()

# Extract the samples and the labels
data_X = data.data
data_y = data.target

Method 2

from sklearn import datasets
loaded_data = datasets.load_iris()  # Bunch object holding the data set and its attributes

# Extract the sample data
data_X = loaded_data.data
# Extract the labels
data_y = loaded_data.target

Method 3

from sklearn.datasets import load_iris
# Return the samples and the labels directly
data_X, data_y = load_iris(return_X_y=True)

Dataset usage summary

from sklearn import datasets  # Import the library

# Note: load_boston was removed in scikit-learn 1.2, so this block needs an older version
boston = datasets.load_boston()  # Import the Boston house price data
print(boston.keys())  # View the keys (attributes): ['data', 'target', 'feature_names', 'DESCR', 'filename']
print(boston.data.shape, boston.target.shape)  # View the shape of the data
print(boston.feature_names)  # See the feature names
print(boston.DESCR)  # Description of the data set
print(boston.filename)  # File path
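
On scikit-learn 1.2 or later, where load_boston is gone, the same inspection steps can be tried on another bundled regression data set. The sketch below uses load_diabetes, which is an assumption made for this example. It needs no download.

```python
from sklearn import datasets

# Bundled regression data set used as a stand-in for the Boston data
diabetes = datasets.load_diabetes()

print(diabetes.keys())                             # available attributes
print(diabetes.data.shape, diabetes.target.shape)  # (442, 10) (442,)
print(diabetes.feature_names)                      # names of the ten features
print(diabetes.DESCR[:300])                        # start of the description text
```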

Data set splitting

# Import the module
from sklearn.model_selection import train_test_split
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
  data_X, 
  data_y, 
  test_size=0.2,
  random_state=111
)

# 150 * 0.8 = 120 training samples
len(X_train)

Data standardization and normalization

from sklearn.preprocessing import StandardScaler  # Standardization
from sklearn.preprocessing import MinMaxScaler  # Normalization

# Standardization: rescale each feature to zero mean and unit variance
ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)  # Pass in the data to be standardized

# Normalization: rescale each feature to the range [0, 1]
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)
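
Standardization computes (x - mean) / std for each feature, and normalization computes (x - min) / (max - min). In both cases the statistics are normally learned from the training set only and then reused on the test set with transform. The sketch below illustrates this under that assumption. The variable names are chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)

# Standardization: learn mean and standard deviation from the training set
ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)
X_test_std = ss.transform(X_test)  # reuse the training statistics

print(X_train_std.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_train_std.std(axis=0).round(6))   # approximately 1 for every feature

# Normalization: learn min and max from the training set
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)
X_test_mm = mm.transform(X_test)  # test values may fall slightly outside [0, 1]

print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

Calling fit_transform on the test set instead would let test-set statistics leak into preprocessing, which is why only transform is used there.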

Category encoding

The examples below come from the official website: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Encoding numbers
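
The original screenshot is missing. The sketch below follows the numeric example in the LabelEncoder documentation: each distinct value is mapped to an integer index according to its sorted order.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([1, 2, 2, 6])  # learn the distinct classes

print(le.classes_)                          # [1 2 6]
print(le.transform([1, 1, 2, 6]))           # [0 0 1 2]
print(le.inverse_transform([0, 0, 1, 2]))   # [1 1 2 6]
```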

Encoding strings
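
The original screenshot is missing here as well. The sketch below follows the string example in the LabelEncoder documentation: the strings are sorted alphabetically and replaced by their index.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

print(le.classes_)                               # ['amsterdam' 'paris' 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))  # [2 2 1]
print(le.inverse_transform([2, 2, 1]))           # ['tokyo' 'tokyo' 'paris']
```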

Modeling case

Import modules

from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis  # Models
from sklearn.datasets import load_iris  # Load the data
from sklearn.model_selection import train_test_split  # Split the data
from sklearn.model_selection import GridSearchCV  # Grid search
from sklearn.pipeline import Pipeline  # Pipelines

from sklearn.metrics import accuracy_score  # Accuracy metric

Model instantiation

# Model instantiation
knn = KNeighborsClassifier(n_neighbors=5)

Training the model

knn.fit(X_train, y_train)
KNeighborsClassifier()

Test set prediction

y_pred = knn.predict(X_test)
y_pred  # Model based prediction
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 1, 2,
       1, 1, 2, 0, 0, 2, 0, 2])

Model evaluation

Two ways to compute the model's accuracy on the test set:

knn.score(X_test, y_test)
0.9333333333333333
accuracy_score(y_test, y_pred)  # arguments are (y_true, y_pred)
0.9333333333333333
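
Both calls report the same quantity: the fraction of test samples whose predicted label equals the true label. The self-contained sketch below repeats the steps of this case and checks that against a manual computation. The variable names acc_score, acc_metric and acc_manual are chosen for this example, and the exact accuracy may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)

# Instantiate, train and predict as in the steps above
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc_score = knn.score(X_test, y_test)        # estimator method
acc_metric = accuracy_score(y_test, y_pred)  # metric function
acc_manual = np.mean(y_pred == y_test)       # fraction of correct predictions

print(acc_score, acc_metric, acc_manual)
```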

Grid search

Grid search tries every candidate parameter value with cross-validation and keeps the one with the best average score:

from sklearn.model_selection import GridSearchCV

# Search parameters
knn_paras = {"n_neighbors":[1,3,5,7]}
# Default model
knn_grid = KNeighborsClassifier()

# Instantiate the grid search object
grid_search = GridSearchCV(
	knn_grid, 
	knn_paras, 
	cv=10  # 10 fold cross validation
)
grid_search.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 5, 7]})
# Best parameter value found by search
grid_search.best_estimator_ 
KNeighborsClassifier(n_neighbors=7)
grid_search.best_params_

{'n_neighbors': 7}
grid_search.best_score_
0.975
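
To see how each candidate performed rather than only the winner, the cv_results_ attribute can be loaded into a DataFrame. The sketch below rebuilds the search so that it runs on its own. The selection of columns is a choice made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)

grid_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7]},
    cv=10,  # 10-fold cross-validation
)
grid_search.fit(X_train, y_train)

# One row per candidate value of n_neighbors
results = pd.DataFrame(grid_search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)

# best_score_ is the highest mean cross-validated accuracy in the table
print(grid_search.best_params_, grid_search.best_score_)
```

Note that best_score_ is an average over the validation folds of the training data, not a score on the held-out test set.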

Modeling with the search result

knn1 = KNeighborsClassifier(n_neighbors=7)

knn1.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)

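The results below show that the model tuned by grid search scores higher on the test set than the earlier model with n_neighbors=5 (1.0 versus 0.933). With only 30 test samples, that gap amounts to two predictions, so treat it as indicative rather than conclusive.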

y_pred_1 = knn1.predict(X_test)

knn1.score(X_test,y_test)
1.0
accuracy_score(y_pred_1,y_test)
1.0
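
The imports above include Pipeline, although the case itself does not use it. As a hedged sketch of how it could fit in, the example below chains a StandardScaler and the KNN classifier and runs the same grid search over the whole pipeline. The step names "scaler" and "knn" are hypothetical labels chosen here. The parameter key "knn__n_neighbors" follows scikit-learn's step-name, double-underscore, parameter-name convention.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)

# Scaling and classification bundled into one estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"knn__n_neighbors": [1, 3, 5, 7]}

pipe_search = GridSearchCV(pipe, param_grid, cv=10)
pipe_search.fit(X_train, y_train)

print(pipe_search.best_params_)
test_accuracy = pipe_search.score(X_test, y_test)
print(test_accuracy)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every cross-validation fold, so no validation data leaks into the scaling step.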

Keywords: Python Machine Learning scikit-learn

Added by sadaf on Wed, 12 Jan 2022 19:57:53 +0200