Feature Engineering
- feature extraction
- feature preprocessing
- feature selection
Why Feature Engineering
- The features in the sample data may contain missing values, duplicates, outliers and other noise, so we need to clean this noise out of the features. The goal is to obtain a purer sample set so that the model trained on it has better predictive ability.
-
What is feature engineering?
- Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive model, so as to improve the accuracy of predictions on unknown data.
- For example, if AlphaGo's training data included not only Go game records but also recipes and song lyrics, such irrelevant data would certainly interfere with AlphaGo's learning.
-
Significance of feature engineering:
- Feature engineering will directly affect the results of model prediction
-
How to implement Feature Engineering:
- Tool: sklearn
-
Introduction to sklearn
- A machine learning toolkit for Python that implements many machine learning algorithms;
- Functions:
- classification models
- regression models
- clustering models
- feature engineering
feature extraction
-
Purpose:
- The feature data of the samples we collect is often strings or other non-numerical types. A computer can only work with numerical (binary) data, so if the data is not numerical, machine learning algorithms cannot use it directly.
-
Effect demonstration:
- Convert string to number
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['life is short, i love python', 'I love Python'])
print(res.toarray())
[[1 1 1 1 1]
 [0 0 1 1 0]]
-
Conclusion after demonstration:
- Feature extraction converts text and other data into numerical features, so that the machine can better understand the data.
-
Dictionary feature extraction
- Converts dictionary data into numerical features
- API: from sklearn.feature_extraction import DictVectorizer
- fit_transform(X): X is a dictionary or an iterator of dictionaries; the return value is a sparse matrix
- inverse_transform(X): X is a sparse matrix or an array; the return value is the data in its format before conversion
- transform(X): convert new data according to the categories learned by fit_transform (a short sketch of transform and inverse_transform follows the examples below)
- get_feature_names(): returns the names of the converted categories
from sklearn.feature_extraction import DictVectorizer

alist = [
    {'city': 'Beijing', 'temp': 12},
    {'city': 'Chengdu', 'temp': 11},
    {'city': 'Chongqin', 'temp': 34}
]
# Instantiate a tool class object
d = DictVectorizer()
# fit_transform returns a sparse matrix that stores the featurized results
feature = d.fit_transform(alist)
print(feature)
  (0, 0)    1.0
  (0, 3)    12.0
  (1, 1)    1.0
  (1, 3)    11.0
  (2, 2)    1.0
  (2, 3)    34.0
- What is a sparse matrix and how should it be understood?
- A sparse matrix is a compressed representation of an array or list; its purpose is to save memory.
- Compared with the one-hot codes below, (0, 0) 1.0 means the value at row 0, column 0 is 1.0; (1, 1) 1.0 means the value at row 1, column 1 is 1.0.
- If you pass sparse=False to the DictVectorizer constructor, an array is returned instead of a sparse matrix;
- get_feature_names(): returns the category names
from sklearn.feature_extraction import DictVectorizer

alist = [
    {'city': 'Beijing', 'temp': 12},
    {'city': 'Chengdu', 'temp': 11},
    {'city': 'Chongqin', 'temp': 34}
]
# Instantiate a tool class object with sparse=False
d = DictVectorizer(sparse=False)
# Now fit_transform returns a plain array (still storing the featurized results) instead of a sparse matrix
feature = d.fit_transform(alist)
print(d.get_feature_names())
print(feature)
['city=Beijing', 'city=Chengdu', 'city=Chongqin', 'temp']
[[ 1.  0.  0. 12.]
 [ 0.  1.  0. 11.]
 [ 0.  0.  1. 34.]]
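The transform and inverse_transform methods listed above are not shown in the demo. Below is a minimal sketch, reusing the d and feature objects from the example above; new_sample is a made-up example:

# inverse_transform: turn the featurized array back into the dictionary-style format
print(d.inverse_transform(feature))  # e.g. {'city=Beijing': 1.0, 'temp': 12.0} for the first row

# transform: featurize new data using the categories already learned by fit_transform
new_sample = [{'city': 'Beijing', 'temp': 20}]  # made-up example
print(d.transform(new_sample))       # [[ 1.  0.  0. 20.]]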
-
One hot coding
- The 0s and 1s in the matrix above are one-hot codes
-
Why do you need onehot coding?
- The main purpose of feature extraction is to convert non-numerical data into numbers! If we manually encode the categories in the table below, Alien becomes 4 and Human becomes 1. Do 1 and 4 then imply a priority or weight?
Sample | Category | Numerical |
---|---|---|
1 | Human | 1 |
2 | Human | 1 |
3 | Penguin | 2 |
4 | Octopus | 3 |
5 | Alien | 4 |
- One hot coding is required:
Human | Penguin | Octopus | Alien |
---|---|---|---|
1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
- One hot coding based on pandas:
- pd.get_dummies(df['col'])
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 20, 'class1'],
    ['red', 'F', 23, 'class2'],
    ['blue', 'M', 21, 'class3']
])
df.columns = ['name', 'gender', 'age', 'class']
df
 | name | gender | age | class |
---|---|---|---|---|
0 | green | M | 20 | class1 |
1 | red | F | 23 | class2 |
2 | blue | M | 21 | class3 |
pd.get_dummies(df['name'])
 | blue | green | red |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 0 | 1 |
2 | 1 | 0 | 0 |
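As a follow-up, get_dummies can also be applied to the whole DataFrame so that the encoded columns replace the original column; a minimal sketch reusing the df defined above:

# One-hot encode only the 'name' column; the other columns are kept as they are
pd.get_dummies(df, columns=['name'])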
Text feature extraction
- Function: convert text into numerical features
- API: from sklearn.feature_extraction.text import CountVectorizer
- fit_transform(X): X is a text string or an iterable of texts; the return value is a sparse matrix;
- inverse_transform(X): X is an array or sparse matrix; returns the data in its format before conversion (a short sketch follows the example below)
- get_feature_names(): returns the extracted feature (word) names
- toarray(): converts the sparse matrix into an array
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['Python is good', 'But Go is better, and I love Go'])
print(res)                        # sparse matrix
print(vector.get_feature_names())
print(res.toarray())              # convert the sparse matrix to an array
# Note: single letters are not counted (a single letter does not represent an actual meaning);
# each number represents the number of times the word appears
  (0, 7)    1
  (0, 5)    1
  (0, 4)    1
  (1, 5)    1
  (1, 2)    1
  (1, 3)    2
  (1, 1)    1
  (1, 0)    1
  (1, 6)    1
['and', 'better', 'but', 'go', 'good', 'is', 'love', 'python']
[[0 0 0 0 1 1 0 1]
 [1 1 1 2 0 1 1 0]]
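The inverse_transform method listed above is not demonstrated; a minimal sketch, reusing vector and res from the block above:

# inverse_transform: map the count matrix back to the words present in each document
print(vector.inverse_transform(res))
# the first document maps back to the words 'good', 'is' and 'python' (word order is not preserved)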
- Chinese text feature extraction
- Feature processing of Chinese text with punctuation marks
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['Life is short, I use Python', 'Life is short, I use Go'])
print(res)
print(vector.get_feature_names())
print(res.toarray())
# Note: individual Chinese characters are not counted
  (0, 0)    1
  (0, 2)    1
  (1, 0)    1
  (1, 1)    1
['Life is short', 'I use go', 'I use python']
[[1 0 1]
 [1 1 0]]
- Feature processing of Chinese text with punctuation and separator
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['Life is short, I use Python', 'Life is short. I prefer to use Go'])
print(res)
print(vector.get_feature_names())
print(res.toarray())
  (0, 1)    1
  (0, 6)    1
  (0, 5)    1
  (1, 2)    1
  (1, 4)    1
  (1, 3)    1
  (1, 0)    1
['go', 'life', 'Life is short', 'Like to use', 'I am more', 'use python', 'Bitter short']
[[0 1 0 0 0 1 1]
 [1 0 1 1 1 0 0]]
- At present, CountVectorizer can only extract features from text that already contains punctuation marks and separators, which obviously cannot meet everyday needs:
- In natural language processing we need to extract the relevant words, idioms, adjectives, etc. from a Chinese text, so a word segmentation tool is required.
-
jieba word segmentation
- Performs word segmentation on Chinese text
-
Basic use of jieba word segmentation
# Basic use of jieba: word segmentation of text
import jieba

jb = jieba.cut("I'll be back")
content = list(jb)
ct = ' '.join(content)
print(ct)  # returns space-delimited words
I'll be back
jb1 = jieba.cut('Life is short, I use Go!')
jb2 = jieba.cut('Life is long, I use Python!')
ct1 = ' '.join(list(jb1))
ct2 = ' '.join(list(jb2))
print(ct1, ct2)
Life is short, I use Go ! Life is long, I use Python !
# Chinese text feature extraction
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform([ct1, ct2])
print(res)
print(vector.get_feature_names())
print(res.toarray())
# Note: individual Chinese characters are not counted
  (0, 2)    1
  (0, 5)    1
  (0, 3)    1
  (0, 0)    1
  (1, 2)    1
  (1, 3)    1
  (1, 4)    1
  (1, 1)    1
['go', 'python', 'life', 'I use', 'very long', 'Bitter short']
[[1 0 1 1 0 1]
 [0 1 1 1 1 0]]
Feature preprocessing: process numerical data
Dimensionless:
-
In the practice of machine learning algorithms, we often need to convert data of different scales to the same scale, or convert data of different distributions to a specific distribution. This need is collectively referred to as making the data "dimensionless". For example, in algorithms such as logistic regression, support vector machines and neural networks, dimensionless data can speed up convergence; in k-nearest neighbors and K-Means clustering, dimensionless data can improve the accuracy of the model and prevent a feature with a particularly large value range from dominating the distance calculation. (In decision tree algorithms, however, we do not need this step, because decision trees handle data of any scale well.)
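A small illustrative sketch of the distance argument above (the numbers are made up): a feature with a large value range dominates the Euclidean distance until the data is scaled.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two features per sample: height in metres and income in yuan; income has a far larger range
X = np.array([[1.70, 5000.0],
              [1.60, 8000.0],
              [1.75, 5200.0]])

# the raw distance between sample 0 and sample 1 is dominated almost entirely by income
print(np.linalg.norm(X[0] - X[1]))   # ~3000

# after mapping every feature to [0, 1], both features contribute comparably
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))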
-
Preprocessing is the way to make the data dimensionless
-
Meaning: after feature extraction, we can obtain the corresponding numerical sample data, and then we can process the data;
-
Concept: convert the data into the data required by the algorithm through specific statistical methods (mathematical methods)
-
Methods:
- Normalization
- Standardization
Realization of normalization:
- Features: transform the original data so that it is mapped into a given interval (the default is [0, 1])
- Formula:
$X\prime = \frac{x - min}{max - min}$

$X\prime\prime = X\prime \times (mx - mi) + mi$
Note: for each column, max is the maximum of that column and min the minimum. $X\prime\prime$ is the final result; mx and mi specify the target interval [mi, mx], which defaults to mx = 1 and mi = 0. (A manual numpy check of this formula follows the MinMaxScaler example below.)
- After normalization, the data falls within the specified interval (by default [0, 1])
- API: from sklearn.preprocessing import MinMaxScaler
- Parameter: feature_range specifies the scaling range; (0, 1) is generally used
- Function: ensures that no single feature has an outsized impact on the final result
# Example
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler(feature_range=(0, 1))  # scaling range for each feature
data = [[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]]
data = mm.fit_transform(data)  # normalization
print(data)
[[1.         0.         1.         1.        ]
 [0.         1.         0.         0.36046512]
 [0.21428571 0.94117647 1.         0.        ]]
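As a sanity check of the normalization formula, the same result can be reproduced by hand with numpy; a minimal sketch using the same data as above:

import numpy as np

data = np.array([[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]], dtype=float)
# X' = (x - min) / (max - min), computed column by column (with mx = 1 and mi = 0, X'' = X')
col_min = data.min(axis=0)
col_max = data.max(axis=0)
print((data - col_min) / (col_max - col_min))  # matches the MinMaxScaler output above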
-
Question: if there are many outliers in the data, what impact will it have on the results?
- Outliers have a large influence on the maximum and minimum of the original feature values, and therefore on the normalized values. This is a drawback of normalization: it cannot handle outliers well;
-
Normalization summary:
- The maximum and minimum values change with the data and are easily affected by outliers, so normalization has clear limitations. For this reason a more robust method is introduced: standardization.
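A small illustrative sketch of this limitation (the numbers are made up): a single outlier stretches the [min, max] range and squeezes all the normal values together.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# one feature column; the last value is an outlier
col = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [500.0]])

print(MinMaxScaler().fit_transform(col).ravel())
# the outlier becomes 1.0 and the normal values are all squeezed into roughly [0, 0.016],
# so the differences between the normal samples are almost erased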
Standardization
- Center the data on the mean, then scale it by the standard deviation; the result has mean 0 and standard deviation 1 (the standard normal N(0, 1) form). This process is called data standardization.
- Formula:
$X\prime = \frac{x - mean}{\sigma}$
Note: for each column, mean is the mean of that column and $\sigma$ is its standard deviation.
- It can be seen from the formula that a small number of outliers has only a limited effect on the mean and standard deviation
- API: from sklearn.preprocessing import StandardScaler
- fit_transform(X): standardize X
- mean_: the mean of each column
- var_: the variance of each column
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
data = [[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]]
data = ss.fit_transform(data)  # standardization
print(data)
# ss.mean_
# ss.var_
[[ 1.38462194 -1.41226963  0.70710678  1.3216298 ]
 [-0.94154292  0.77032889 -1.41421356 -0.22495826]
 [-0.44307902  0.64194074  0.70710678 -1.09667153]]
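The result can again be checked by hand against the formula; a minimal sketch using the same data:

import numpy as np

data = np.array([[34, 6, 76, 98], [6, 57, 5, 43], [12, 54, 76, 12]], dtype=float)
# X' = (x - mean) / sigma, computed column by column
mean = data.mean(axis=0)
sigma = data.std(axis=0)  # population standard deviation, which is what StandardScaler uses
print((data - mean) / sigma)  # matches the StandardScaler output above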
feature selection
Feature selection: from all the features, select the meaningful ones that are helpful to the model and use them as the final machine learning input data!
-
Reasons for feature selection:
- Redundancy: some features are highly correlated with each other, which wastes computing resources;
- Noise: some features have a negative (biasing) effect on the prediction results
-
Implementation of feature selection:
- Manually discard features that are obviously irrelevant
- Starting from the existing features and the corresponding prediction results, use tools to filter out useless or low-weight features
- Tools:
- Filter (filtering)
- Embedded: the model selects the important features itself, e.g. a decision tree picks out its own important features (a small sketch follows this list)
- PCA dimensionality reduction
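The Embedded idea can be sketched with sklearn's SelectFromModel wrapper (not covered further in these notes); a minimal sketch with made-up data that keeps only the features whose importance in a fitted decision tree is above the default threshold (the mean importance):

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# made-up data: 8 samples, 4 features, binary target that depends on only part of the features
X = np.array([[1, 0, 3, 7], [2, 1, 3, 6], [3, 0, 2, 7], [4, 1, 2, 6],
              [5, 0, 1, 7], [6, 1, 1, 6], [7, 0, 0, 7], [8, 1, 0, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# SelectFromModel fits the tree, then keeps the features the tree actually found important
selector = SelectFromModel(DecisionTreeClassifier(random_state=0))
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # far fewer columns than the original X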
-
Filter (variance filtering):
- Principle: filter features by their own variance. If a feature's variance is very small, the samples barely differ on that feature; most of the values may be identical, or even the whole column may be a single value, so the feature has almost no discriminating power. Feature engineering should therefore first remove features whose variance is 0 or very low.
-
API: from sklearn.feature_selection import VarianceThreshold
-
VarianceThreshold(threshold=x): delete all features whose variance is lower than x. The default is 0, meaning only features with zero variance are removed and all others are kept;
-
fit_transform(X): X is the feature data
from sklearn.feature_selection import VarianceThreshold

# threshold: features with variance lower than this value are deleted
v = VarianceThreshold(threshold=3)
v.fit_transform([[0, 1, 2, 3],
                 [0, 3, 5, 3],
                 [0, 9, 3, 15]])
array([[ 1,  3],
       [ 3,  3],
       [ 9, 15]])
- If removing the features with zero or very low variance still leaves many features and the model has not improved significantly, we can pass the median of all feature variances as the threshold, so that only half of the features are kept;
- VarianceThreshold(threshold=np.median(X.var().values)).fit_transform(X)
- X is the feature columns of the sample data
import numpy as np

feature = np.random.randint(0, 100, size=(5, 10))
feature
array([[98, 10, 38,  6, 48, 38, 36, 22, 99, 29],
       [80, 81, 20, 56,  5, 22, 76, 34, 90, 80],
       [67,  1, 64, 86,  6, 97, 76,  2, 79, 70],
       [98, 94,  7,  4, 78, 36, 66, 19, 84, 76],
       [91, 39, 33, 24, 96,  1, 72, 30, 38, 61]])
med = np.median(feature.var(axis=0))  # median of the per-feature (column) variances
med
405.12
v = VarianceThreshold(threshold=med)
v.fit_transform(feature)
array([[10,  6, 48, 38, 99],
       [81, 56,  5, 22, 90],
       [ 1, 86,  6, 97, 79],
       [94,  4, 78, 36, 84],
       [39, 24, 96,  1, 38]])
- Variance filtering effect
Improve the efficiency and accuracy of algorithm model training.
General principle of PCA
- General principle of PCA (principal component analysis): project the data into a lower-dimensional space while retaining as much of the original information as possible
- API: from sklearn.decomposition import PCA
- Parameters: pca= PCA(n_components= None)
- n_components can be a float (the fraction of information/variance to retain) or an integer (the number of dimensions to reduce to)
- pca.fit_transform(X)
from sklearn.decomposition import PCA

# Project the data into a lower-dimensional space
# n_components can be a float or an integer
pca = PCA(n_components=3)
pca.fit_transform([[1, 2, 3, 4],
                   [4, 1, 4, 5],
                   [5, 4, 2, 1]])
array([[-1.73205081e+00,  1.73205081e+00,  2.22044605e-16],
       [-1.73205081e+00, -1.73205081e+00,  2.22044605e-16],
       [ 3.46410162e+00, -5.43895982e-16,  2.22044605e-16]])
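The float form of n_components mentioned above is not shown in the demo; a minimal sketch that asks PCA to keep enough components to retain 95% of the variance (information):

from sklearn.decomposition import PCA

# keep as many principal components as are needed to retain 95% of the variance
pca = PCA(n_components=0.95)
res = pca.fit_transform([[1, 2, 3, 4],
                         [4, 1, 4, 5],
                         [5, 4, 2, 1]])
print(res.shape)                      # how many components survive depends on the data
print(pca.explained_variance_ratio_)  # fraction of variance carried by each retained component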