Feature Engineering
- feature extraction
- feature preprocessing
- feature selection
Why Feature Engineering
- The features in the sample data may contain missing values, duplicates, outliers and other noise, so we need to clean this noise out of the features. The goal is to obtain a purer sample set so that the model trained on it has better predictive ability.
-
What is feature engineering?
- Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive model, so as to improve the accuracy of predictions on unknown data.
- For example, if AlphaGo's training data included not only Go game records but also recipes and song lyrics, such irrelevant data would certainly interfere with AlphaGo's learning.
-
Significance of feature engineering:
- Feature engineering will directly affect the results of model prediction
-
How to implement Feature Engineering:
- Tool: sklearn
-
Introduction to sklearn
- A machine learning toolkit for Python that implements many machine learning algorithms;
- Functions:
- classification models
- regression models
- clustering models
- feature engineering
feature extraction
-
Purpose:
- The feature data of the samples we collect is often strings or other non-numerical types. A computer can only work with numerical (binary) data, so if the data is not numerical, machine learning algorithms cannot use it directly.
-
Effect demonstration:
- Convert string to number
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['life is short, i love python', 'I love Python'])
print(res.toarray())
[[1 1 1 1 1]
 [0 0 1 1 0]]
-
Conclusion after demonstration:
- Feature extraction converts text and other data into numerical features, so that the machine can better understand the data.
-
Dictionary feature extraction
- Converts dictionary data into numerical features
- API: from sklearn.feature_extraction import DictVectorizer
- fit_transform(X): X is a dictionary or an iterator of dictionaries; the return value is a sparse matrix
- inverse_transform(X): X is a sparse matrix or an array; the return value is the data in its format before conversion
- transform(X): convert new data according to the categories learned by fit_transform (a short sketch of transform and inverse_transform follows the examples below)
- get_feature_names(): returns the names of the converted categories
from sklearn.feature_extraction import DictVectorizer

alist = [
    {'city': 'Beijing', 'temp': 12},
    {'city': 'Chengdu', 'temp': 11},
    {'city': 'Chongqin', 'temp': 34}
]
# Instantiate a tool class object
d = DictVectorizer()
# fit_transform returns a sparse matrix that stores the featurized results
feature = d.fit_transform(alist)
print(feature)
  (0, 0)    1.0
  (0, 3)    12.0
  (1, 1)    1.0
  (1, 3)    11.0
  (2, 2)    1.0
  (2, 3)    34.0
- What is a sparse matrix and how should it be understood?
- A sparse matrix is a compressed representation of an array or list; its purpose is to save memory.
- Compared with the one-hot codes below, (0, 0) 1.0 means the value at row 0, column 0 is 1.0; (1, 1) 1.0 means the value at row 1, column 1 is 1.0.
- If you pass sparse=False to the DictVectorizer constructor, an array is returned instead of a sparse matrix;
- get_feature_names(): returns the category names
from sklearn.feature_extraction import DictVectorizer

alist = [
    {'city': 'Beijing', 'temp': 12},
    {'city': 'Chengdu', 'temp': 11},
    {'city': 'Chongqin', 'temp': 34}
]
# Instantiate a tool class object with sparse=False
d = DictVectorizer(sparse=False)
# Now fit_transform returns a plain array (still storing the featurized results) instead of a sparse matrix
feature = d.fit_transform(alist)
print(d.get_feature_names())
print(feature)
['city=Beijing', 'city=Chengdu', 'city=Chongqin', 'temp']
[[ 1.  0.  0. 12.]
 [ 0.  1.  0. 11.]
 [ 0.  0.  1. 34.]]
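The transform and inverse_transform methods listed above are not shown in the demo. Below is a minimal sketch, reusing the d and feature objects from the example above; new_sample is a made-up example:

# inverse_transform: turn the featurized array back into the dictionary-style format
print(d.inverse_transform(feature))  # e.g. {'city=Beijing': 1.0, 'temp': 12.0} for the first row

# transform: featurize new data using the categories already learned by fit_transform
new_sample = [{'city': 'Beijing', 'temp': 20}]  # made-up example
print(d.transform(new_sample))       # [[ 1.  0.  0. 20.]]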
-
One hot coding
- The 0s and 1s in the matrix above are one-hot codes
-
Why do you need onehot coding?
- The main purpose of feature extraction is to convert non-numerical data into numbers! If we manually encode the categories in the table below, Alien becomes 4 and Human becomes 1. Do 1 and 4 then imply a priority or weight?
Sample | Category | Numerical |
---|---|---|
1 | Human | 1 |
2 | Human | 1 |
3 | Penguin | 2 |
4 | Octopus | 3 |
5 | Alien | 4 |
- One hot coding is required:
Human | Penguin | Octopus | Alien |
---|---|---|---|
1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
- One hot coding based on pandas:
- pd.get_dummies(df['col'])
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 20, 'class1'],
    ['red', 'F', 23, 'class2'],
    ['blue', 'M', 21, 'class3']
])
df.columns = ['name', 'gender', 'age', 'class']
df
 | name | gender | age | class |
---|---|---|---|---|
0 | green | M | 20 | class1 |
1 | red | F | 23 | class2 |
2 | blue | M | 21 | class3 |
pd.get_dummies(df['name'])
 | blue | green | red |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 0 | 1 |
2 | 1 | 0 | 0 |
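As a follow-up, get_dummies can also be applied to the whole DataFrame so that the encoded columns replace the original column; a minimal sketch reusing the df defined above:

# One-hot encode only the 'name' column; the other columns are kept as they are
pd.get_dummies(df, columns=['name'])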
Text feature extraction
- Function: convert text into numerical features
- API: from sklearn.feature_extraction.text import CountVectorizer
- fit_transform(X): X is a text string or an iterable of texts; the return value is a sparse matrix;
- inverse_transform(X): X is an array or sparse matrix; returns the data in its format before conversion (a short sketch follows the example below)
- get_feature_names(): returns the extracted feature (word) names
- toarray(): converts the sparse matrix into an array
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['Python is good', 'But Go is better, and I love Go'])
print(res)                        # sparse matrix
print(vector.get_feature_names())
print(res.toarray())              # convert the sparse matrix to an array
# Note: single letters are not counted (a single letter does not represent an actual meaning);
# each number represents the number of times the word appears
  (0, 7)    1
  (0, 5)    1
  (0, 4)    1
  (1, 5)    1
  (1, 2)    1
  (1, 3)    2
  (1, 1)    1
  (1, 0)    1
  (1, 6)    1
['and', 'better', 'but', 'go', 'good', 'is', 'love', 'python']
[[0 0 0 0 1 1 0 1]
 [1 1 1 2 0 1 1 0]]
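The inverse_transform method listed above is not demonstrated; a minimal sketch, reusing vector and res from the block above:

# inverse_transform: map the count matrix back to the words present in each document
print(vector.inverse_transform(res))
# the first document maps back to the words 'good', 'is' and 'python' (word order is not preserved)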
- Chinese text feature extraction
- Feature processing of Chinese text with punctuation marks
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['Life is short, I use Python', 'Life is short, I use Go'])
print(res)
print(vector.get_feature_names())
print(res.toarray())
# Note: individual Chinese characters are not counted
  (0, 0)    1
  (0, 2)    1
  (1, 0)    1
  (1, 1)    1
['Life is short', 'I use go', 'I use python']
[[1 0 1]
 [1 1 0]]
- Feature processing of Chinese text with punctuation and separator
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['Life is short, I use Python', 'Life is short. I prefer to use Go'])
print(res)
print(vector.get_feature_names())
print(res.toarray())
  (0, 1)    1
  (0, 6)    1
  (0, 5)    1
  (1, 2)    1
  (1, 4)    1
  (1, 3)    1
  (1, 0)    1
['go', 'life', 'Life is short', 'Like to use', 'I am more', 'use python', 'Bitter short']
[[0 1 0 0 0 1 1]
 [1 0 1 1 1 0 0]]
- At present, CountVectorizer can only extract features from text that already contains punctuation marks and separators, which obviously cannot meet everyday needs:
- In natural language processing we need to extract the relevant words, idioms, adjectives, etc. from a Chinese text, so a word segmentation tool is required.
-
jieba word segmentation
- Performs word segmentation on Chinese text
-
Basic use of jieba word segmentation
# Basic use of jieba: word segmentation of text
import jieba

jb = jieba.cut("I'll be back")
content = list(jb)
ct = ' '.join(content)
print(ct)  # returns space-delimited words
I'll be back
jb1 = jieba.cut('Life is short, I use Go!')
jb2 = jieba.cut('Life is long, I use Python!')
ct1 = ' '.join(list(jb1))
ct2 = ' '.join(list(jb2))
print(ct1, ct2)
Life is short, I use Go ! Life is long, I use Python !
# Chinese text feature extraction
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform([ct1, ct2])
print(res)
print(vector.get_feature_names())
print(res.toarray())
# Note: individual Chinese characters are not counted
  (0, 2)    1
  (0, 5)    1
  (0, 3)    1
  (0, 0)    1
  (1, 2)    1
  (1, 3)    1
  (1, 4)    1
  (1, 1)    1
['go', 'python', 'life', 'I use', 'very long', 'Bitter short']
[[1 0 1 1 0 1]
 [0 1 1 1 1 0]]
Feature preprocessing: process numerical data
Dimensionless:
-
In the practice of machine learning algorithms, we often need to convert data of different scales to the same scale, or convert data of different distributions to a specific distribution. This need is collectively referred to as making the data "dimensionless". For example, in algorithms such as logistic regression, support vector machines and neural networks, dimensionless data can speed up convergence; in k-nearest neighbors and K-Means clustering, dimensionless data can improve the accuracy of the model and prevent a feature with a particularly large value range from dominating the distance calculation. (In decision tree algorithms, however, we do not need this step, because decision trees handle data of any scale well.)
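A small illustrative sketch of the distance argument above (the numbers are made up): a feature with a large value range dominates the Euclidean distance until the data is scaled.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two features per sample: height in metres and income in yuan; income has a far larger range
X = np.array([[1.70, 5000.0],
              [1.60, 8000.0],
              [1.75, 5200.0]])

# the raw distance between sample 0 and sample 1 is dominated almost entirely by income
print(np.linalg.norm(X[0] - X[1]))   # ~3000

# after mapping every feature to [0, 1], both features contribute comparably
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))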
-
Preprocessing is the way to make the data dimensionless
-
Meaning: after feature extraction, we can obtain the corresponding numerical sample data, and then we can process the data;
-
Concept: convert the data into the data required by the algorithm through specific statistical methods (mathematical methods)
-
Methods:
- Normalization
- Standardization
Realization of normalization:
- Features: transform the original data so that it is mapped into a given interval (the default is [0, 1])
- Formula:
$X\prime = \frac{x - min}{max - min}$

$X\prime\prime = X\prime \times (mx - mi) + mi$
Note: for each column, max is the maximum of that column and min the minimum. $X\prime\prime$ is the final result; mx and mi specify the target interval [mi, mx], which defaults to mx = 1 and mi = 0. (A manual numpy check of this formula follows the MinMaxScaler example below.)
- After normalization, the data falls within the specified interval (by default [0, 1])
- API: from sklearn.preprocessing import MinMaxScaler
- Parameter: feature_range specifies the scaling range; (0, 1) is generally used
- Function: ensures that no single feature has an outsized impact on the final result
# Example
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler(feature_range=(0, 1))  # scaling range for each feature
data = [[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]]
data = mm.fit_transform(data)  # normalization
print(data)
[[1.         0.         1.         1.        ]
 [0.         1.         0.         0.36046512]
 [0.21428571 0.94117647 1.         0.        ]]
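As a sanity check of the normalization formula, the same result can be reproduced by hand with numpy; a minimal sketch using the same data as above:

import numpy as np

data = np.array([[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]], dtype=float)
# X' = (x - min) / (max - min), computed column by column (with mx = 1 and mi = 0, X'' = X')
col_min = data.min(axis=0)
col_max = data.max(axis=0)
print((data - col_min) / (col_max - col_min))  # matches the MinMaxScaler output above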
-
Question: if there are many outliers in the data, what impact will it have on the results?
- Outliers have a large influence on the maximum and minimum of the original feature values, and therefore on the normalized values. This is a drawback of normalization: it cannot handle outliers well;
-
Normalization summary:
- The maximum and minimum values change with the data and are easily affected by outliers, so normalization has clear limitations. For this reason a more robust method is introduced: standardization.
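A small illustrative sketch of this limitation (the numbers are made up): a single outlier stretches the [min, max] range and squeezes all the normal values together.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# one feature column; the last value is an outlier
col = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [500.0]])

print(MinMaxScaler().fit_transform(col).ravel())
# the outlier becomes 1.0 and the normal values are all squeezed into roughly [0, 0.016],
# so the differences between the normal samples are almost erased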
Standardization
- Center the data on the mean, then scale it by the standard deviation; the result has mean 0 and standard deviation 1 (the standard normal N(0, 1) form). This process is called data standardization.
- Formula:
$X\prime = \frac{x - mean}{\sigma}$
Note: for each column, mean is the mean of that column and $\sigma$ is its standard deviation.
- It can be seen from the formula that a small number of outliers has only a limited effect on the mean and standard deviation
- API: from sklearn.preprocessing import StandardScaler
- fit_transform(X): standardize X
- mean_: the mean of each column
- var_: the variance of each column
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
data = [[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]]
data = ss.fit_transform(data)  # standardization
print(data)
# ss.mean_
# ss.var_
[[ 1.38462194 -1.41226963  0.70710678  1.3216298 ]
 [-0.94154292  0.77032889 -1.41421356 -0.22495826]
 [-0.44307902  0.64194074  0.70710678 -1.09667153]]
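The result can again be checked by hand against the formula; a minimal sketch using the same data:

import numpy as np

data = np.array([[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]], dtype=float)
# X' = (x - mean) / sigma, computed column by column
mean = data.mean(axis=0)
sigma = data.std(axis=0)  # population standard deviation, which is what StandardScaler uses
print((data - mean) / sigma)  # matches the StandardScaler output above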
feature selection
Feature selection: from all the features, select the meaningful ones that are helpful to the model and use them as the final machine learning input data!
-
Reasons for feature selection:
- Redundancy: some features are highly correlated with each other, which wastes computing resources;
- Noise: some features have a negative (biasing) effect on the prediction results
-
Implementation of feature selection:
- Manually discard features that are obviously irrelevant
- Starting from the existing features and the corresponding prediction results, use tools to filter out useless or low-weight features
- Tools:
- Filter (filtering)
- Embedded: the model selects the important features itself, e.g. a decision tree picks out its own important features (a small sketch follows this list)
- PCA dimensionality reduction
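The Embedded idea can be sketched with sklearn's SelectFromModel wrapper (not covered further in these notes); a minimal sketch with made-up data that keeps only the features whose importance in a fitted decision tree is above the default threshold (the mean importance):

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# made-up data: 8 samples, 4 features, binary target that depends on only part of the features
X = np.array([[1, 0, 3, 7], [2, 1, 3, 6], [3, 0, 2, 7], [4, 1, 2, 6],
              [5, 0, 1, 7], [6, 1, 1, 6], [7, 0, 0, 7], [8, 1, 0, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# SelectFromModel fits the tree, then keeps the features the tree actually found important
selector = SelectFromModel(DecisionTreeClassifier(random_state=0))
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # far fewer columns than the original X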
-
Filter (variance filtering):
- Principle: filter features by their own variance. If a feature's variance is very small, the samples barely differ on that feature; most of the values may be identical, or even the whole column may be a single value, so the feature has almost no discriminating power. Feature engineering should therefore first remove features whose variance is 0 or very low.
-
API: from sklearn.feature_selection import VarianceThreshold
-
VarianceThreshold(threshold=x): delete all features whose variance is lower than x. The default is 0, meaning only features with zero variance are removed and all others are kept;
-
fit_transform(X): X is the feature data
from sklearn.feature_selection import VarianceThreshold

# threshold: features with variance lower than this value are deleted
v = VarianceThreshold(threshold=3)
v.fit_transform([[0, 1, 2, 3],
                 [0, 3, 5, 3],
                 [0, 9, 3, 15]])
array([[ 1,  3],
       [ 3,  3],
       [ 9, 15]])
- If removing the features with zero or very low variance still leaves many features and the model has not improved significantly, we can pass the median of all feature variances as the threshold, so that only half of the features are kept;
- VarianceThreshold(threshold=np.median(X.var().values)).fit_transform(X)
- X is the feature columns of the sample data
import numpy as np

feature = np.random.randint(0, 100, size=(5, 10))
feature
array([[98, 10, 38,  6, 48, 38, 36, 22, 99, 29],
       [80, 81, 20, 56,  5, 22, 76, 34, 90, 80],
       [67,  1, 64, 86,  6, 97, 76,  2, 79, 70],
       [98, 94,  7,  4, 78, 36, 66, 19, 84, 76],
       [91, 39, 33, 24, 96,  1, 72, 30, 38, 61]])
med = np.median(feature.var(axis=0))  # median of the per-feature (column) variances
med
405.12
v = VarianceThreshold(threshold=med)
v.fit_transform(feature)
array([[10,  6, 48, 38, 99],
       [81, 56,  5, 22, 90],
       [ 1, 86,  6, 97, 79],
       [94,  4, 78, 36, 84],
       [39, 24, 96,  1, 38]])
- Variance filtering effect
Improve the efficiency and accuracy of algorithm model training.
General principle of PCA
- General principle of PCA (principal component analysis): project the data into a lower-dimensional space while retaining as much of the original information as possible
- API: from sklearn.decomposition import PCA
- Parameters: pca= PCA(n_components= None)
- n_components can be a float (the fraction of information/variance to retain) or an integer (the number of dimensions to reduce to)
- pca.fit_transform(X)
from sklearn.decomposition import PCA

# Project the data into a lower-dimensional space
# n_components can be a float or an integer
pca = PCA(n_components=3)
pca.fit_transform([[1, 2, 3, 4],
                   [4, 1, 4, 5],
                   [5, 4, 2, 1]])
array([[-1.73205081e+00,  1.73205081e+00,  2.22044605e-16],
       [-1.73205081e+00, -1.73205081e+00,  2.22044605e-16],
       [ 3.46410162e+00, -5.43895982e-16,  2.22044605e-16]])
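The float form of n_components mentioned above is not shown in the demo; a minimal sketch that asks PCA to keep enough components to retain 95% of the variance (information):

from sklearn.decomposition import PCA

# keep as many principal components as are needed to retain 95% of the variance
pca = PCA(n_components=0.95)
res = pca.fit_transform([[1, 2, 3, 4],
                         [4, 1, 4, 5],
                         [5, 4, 2, 1]])
print(res.shape)                      # how many components survive depends on the data
print(pca.explained_variance_ratio_)  # fraction of variance carried by each retained component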