Feature Engineering 02 - Normalization, Standardization, Dimensionality Reduction

Feature scaling 1: Normalization

from sklearn.preprocessing import MinMaxScaler
def mm():
    """
    normalization
    :return:
    """
    mm = MinMaxScaler(feature_range=(2,3))
    data = mm.fit_transform([[90,2,10,40],[60,4,15,45],[75,3,13,46]])
    print(data)

[[3. 2. 2. 2. ]
[2. 3. 3. 2.83333333]
[2.5 2.5 2.6 3. ]]

The data is scaled into the range [2, 3].
The range used for normalization depends on the scenario. More importantly, the maximum and minimum are highly sensitive to outliers, so normalization is not robust and only suits traditional small, clean data sets. The alternative is standardization.
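To make this sensitivity concrete, here is a minimal sketch (the toy values are made up for illustration) showing how a single extreme value pins the min-max range and compresses the remaining points:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

clean = np.array([[9.0], [10.0], [11.0], [10.5]])
with_outlier = np.vstack([clean, [[1000.0]]])  # add one extreme value

# Without the outlier the points spread across the whole [0, 1] range...
print(MinMaxScaler().fit_transform(clean).ravel())
# ...with it, the outlier becomes the maximum and the normal points collapse near 0
print(MinMaxScaler().fit_transform(with_outlier).ravel())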

from sklearn.preprocessing import StandardScaler
def std():
    """
    Standardization
    :return:
    """
    std = StandardScaler()
    data = std.fit_transform([[90,2,10,40],[60,4,15,45],[75,3,13,46]])
    print(data)
    print(std.mean_)
    print(std.var_)
[[ 1.22474487 -1.22474487 -1.29777137 -1.3970014 ]
 [-1.22474487  1.22474487  1.13554995  0.50800051]
 [ 0.          0.          0.16222142  0.88900089]]
[75.          3.         12.66666667 43.66666667]
[150.           0.66666667   4.22222222   6.88888889]
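As a quick sanity check (a sketch on the same toy matrix), the standardized output should have zero mean and unit variance in every column:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]], dtype=float)
Z = StandardScaler().fit_transform(X)

print(Z.mean(axis=0))  # approximately [0. 0. 0. 0.]
print(Z.var(axis=0))   # approximately [1. 1. 1. 1.]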

Missing value processing

Filling (imputation)

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# missing values are marked with '?'; convert them to np.nan first
data = pd.DataFrame([[1, 2, 3], [2, '?', 4], [3, 5, 6]])
data = data.replace(to_replace='?', value=np.nan)

# SimpleImputer fills NaN with the column mean (it has no axis parameter)
im = SimpleImputer(missing_values=np.nan, strategy='mean')
data = im.fit_transform(data)
print(data)
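A common pattern in practice (a hedged sketch, not part of the original example) is to fit the imputer on the training data only and reuse the learned column means on new data:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
X_test = np.array([[np.nan, 10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_train)              # learns the column means from the training set
print(imputer.transform(X_test))  # the NaN is filled with the training mean, 3.0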

Feature processing

Feature processing converts the data, through specific statistical (mathematical) methods, into the form required by the algorithm.
Dimensionality reduction means reducing the number of features. There are two approaches: 1. feature selection and 2. principal component analysis (PCA).

  • Feature selection
    Feature selection simply means choosing a subset of the extracted features as the training-set features. The values of the features may or may not change during selection, but the dimensionality after selection must be lower than before. Main methods: Filter (VarianceThreshold), Embedded (regularization, decision trees), Wrapper.

The filter (variance) method: look at the variance of each feature across all samples; features whose variance falls below a threshold are removed.

from sklearn.feature_selection import VarianceThreshold
def var():
    """
    Remove features whose variance is below the threshold
    :return:
    """
    var = VarianceThreshold(threshold=0.0)
    data = var.fit_transform([[0,2,0,3],[0,1,4,3],[0,1,1,3]])
    print(data)

[[2 0]
[1 4]
[1 1]]

The first and fourth features are removed because their variance is 0. Features are selected by changing the threshold parameter; a suitable value depends on the actual data.
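A quick check of the per-column variances (a small sketch on the same toy matrix) shows why those two columns are dropped:

import numpy as np

X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
print(X.var(axis=0))  # [0.         0.22222222 2.88888889 0.        ] -> columns 1 and 4 have zero variance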

  • Principal component analysis (PCA)
    Essence: a technique for analyzing and simplifying data sets.
    Purpose: compress the data dimensions, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.
    Use: it can reduce the number of features in regression analysis or cluster analysis.
    When the number of features reaches the hundreds, PCA is worth considering to simplify the data; the feature values change and the number of features is reduced.
from sklearn.decomposition import PCA
def pca():
    """
    Principal component analysis feature dimensionality reduction
    :return:
    """
    pca = PCA(n_components=0.9)
    data = pca.fit_transform([[2,8,4,5],[6,3,0,8],[5,4,9,1]])
    print(data)

[[ 1.28620952e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]

The result keeps two principal components while retaining at least 90% of the information in the original features.
n_components can be specified in two ways:
Float (0 to 1): the fraction of feature information (variance) to retain.
Integer: the exact number of components to reduce to; this is used less often.
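To see how much information each retained component actually carries, you can inspect explained_variance_ratio_ (a small sketch on the same toy matrix as above):

from sklearn.decomposition import PCA

X = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
pca = PCA(n_components=0.9)
data = pca.fit_transform(X)

print(data.shape)                           # (3, 2): two components kept
print(pca.explained_variance_ratio_)        # variance fraction carried by each component
print(pca.explained_variance_ratio_.sum())  # should be >= 0.9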

Case: segmenting users by their preferences for item categories (Instacart)

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

1. Obtain data

order_product = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

2. Basic data processing

2.1 Merge tables
# merge the four tables on their shared keys
table1 = pd.merge(order_product, products, on="product_id")
table2 = pd.merge(table1, orders, on="order_id")
table = pd.merge(table2, aisles, on="aisle_id")
2.2 Cross-tabulation (user_id vs. aisle counts)
table = pd.crosstab(table["user_id"], table["aisle"])
2.3 Keep only the first 1000 users
table = table[:1000]

3. Feature engineering (PCA)

transfer = PCA(n_components=0.9)
data = transfer.fit_transform(table)

4. Machine learning (k-means)

estimator = KMeans(n_clusters=8, random_state=22)
y_predict = estimator.fit_predict(data)

5. Model evaluation

silhouette_score(data, y_predict)
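The silhouette score lies in [-1, 1]; larger values indicate better-separated clusters. As a hedged extension of the case (not in the original), one could compare several cluster counts and keep the best-scoring k:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# assumes `data` is the PCA-reduced matrix from step 3 above
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=22).fit_predict(data)
    print(k, silhouette_score(data, labels))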

