What is feature engineering?
Feature engineering is the process of applying domain knowledge and skills to transform raw data so that the resulting features work better with machine learning algorithms.
- Significance: it directly affects how well a machine learning model performs
Contents of feature engineering:
- Feature extraction
- Feature preprocessing
- Feature dimensionality reduction
Feature extraction
Machine learning algorithms are statistical methods built on mathematical formulas, so their input must be numeric:
- Text -> numeric values
- Categorical type -> numeric values
Dictionary feature extraction
sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
- vector: a one-dimensional array
- matrix: a two-dimensional array
- Parent class: the transformer (converter) class
- Return value: a sparse matrix by default
- Application scenarios:
  1. The dataset contains many categorical features: convert the dataset's features to dictionary type, then apply the DictVectorizer transformation
  2. The data obtained is itself of dictionary type
Supplement
A sparse matrix represents only the non-zero values by their positions, which saves memory and improves loading efficiency.
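A minimal sketch of this representation, using scipy.sparse (the package behind the sparse matrices sklearn returns; the sample values are made up):

```python
# Minimal sketch: a sparse matrix stores only "(row, col)  value" triples.
from scipy.sparse import csr_matrix
import numpy as np

dense = np.array([[0, 0, 100],
                  [0, 60, 0]])
sparse = csr_matrix(dense)
print(sparse)            # (0, 2) 100 and (1, 1) 60 -- non-zero values by position
print(sparse.toarray())  # converts back to the dense two-dimensional array
```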
```python
from sklearn.feature_extraction import DictVectorizer

def dict_demo():
    """Dictionary feature-value extraction."""
    data = [{'city': 'Beijing', 'temperature': 100},
            {'city': 'Shanghai', 'temperature': 60},
            {'city': 'Shenzhen', 'temperature': 30}]
    # 1. Instantiate a transformer class
    transfer1 = DictVectorizer(sparse=True)
    transfer2 = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    data_new1 = transfer1.fit_transform(data)
    data_new2 = transfer2.fit_transform(data)
    print("data_new1:\n", data_new1)
    print("data_new2:\n", data_new2)

if __name__ == "__main__":
    # Use of the sklearn datasets
    # datasets_demo()
    # Dictionary feature-value extraction
    dict_demo()
```
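For the three samples above, get_feature_names() yields ['city=Beijing', 'city=Shanghai', 'city=Shenzhen', 'temperature'], so the dense (sparse=False) result is the 2D array [[1, 0, 0, 100], [0, 1, 0, 60], [0, 0, 1, 30]]; the sparse result lists the same non-zero values by position.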
Text feature extraction
Words are used as features (possible units: sentences, phrases, words, letters...)
- Method 1: CountVectorizer
Counts the number of occurrences of each feature word in every sample. Notes (a stop_words sketch follows this list):
- Parameter stop_words: stop words, i.e. words excluded from the feature words
- A Chinese sentence written without spaces is treated as one single feature word
- Single letters and single Chinese characters are not counted as features
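A minimal sketch of the stop_words parameter (the sample sentences are the same ones used in count_demo below):

```python
# Minimal sketch: words listed in stop_words are dropped from the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

data = ["life is short,i like like python",
        "life is too long,i dislike python"]
transfer = CountVectorizer(stop_words=["is", "too"])
result = transfer.fit_transform(data)
print(transfer.get_feature_names())  # "is" and "too" no longer appear
print(result.toarray())
```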
Key words: words that appear many times in articles of one category but rarely in articles of other categories.
- Method 2: TfidfVectorizer (a hand-rolled sketch follows this list)
*TF-IDF - the importance of a word to a document
*TF - term frequency
*IDF - inverse document frequency
*Return value: sparse matrix
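A hand-rolled sketch of the classic formulation (TfidfVectorizer itself applies a smoothed IDF, so its exact numbers differ slightly; the toy documents are made up):

```python
# Classic TF-IDF: tf-idf(t, d) = tf(t, d) * idf(t), with idf(t) = log(n / df(t)),
# where n is the number of documents and df(t) counts documents containing t.
import math

docs = [["python", "short", "like", "like"],
        ["python", "long", "dislike"]]

def tf_idf(term, doc, all_docs):
    tf = doc.count(term) / len(doc)             # term frequency in this document
    df = sum(1 for d in all_docs if term in d)  # documents containing the term
    idf = math.log(len(all_docs) / df)          # inverse document frequency
    return tf * idf

print(tf_idf("like", docs[0], docs))    # ~0.347: frequent here, absent elsewhere
print(tf_idf("python", docs[0], docs))  # 0.0: appears in every document
```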
Method 1: CountVectorizer
```python
from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    """Text feature-value extraction."""
    data = ["life is short,i like like python",
            "life is too long,i dislike python"]
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    # Word dictionary
    # (in scikit-learn >= 1.2 use get_feature_names_out() instead)
    print("\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    # Text feature-value extraction
    count_demo()
    # Note: single letters are not supported
```
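With the two sentences above, the vocabulary is ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too'] and the count matrix is [[0, 1, 1, 2, 0, 1, 1, 0], [1, 1, 1, 0, 1, 1, 0, 1]]; note that the standalone "i" is missing, since single letters are not counted.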
```python
from sklearn.feature_extraction.text import CountVectorizer

def count_cn_demo():
    """Chinese text feature-value extraction."""
    # In the original Chinese demo, data1 held unsegmented Chinese sentences,
    # so each whole sentence became a single feature word.
    data1 = ["I Love Beijing Tiananmen ",
             "The sun rises on Tiananmen Square"]
    data2 = ["I love Beijing Tiananmen Square",
             "The sun rises on Tiananmen Square"]
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new1 = transfer.fit_transform(data1)
    print("data_new1:\n", data_new1.toarray())
    # Word dictionary
    print(transfer.get_feature_names())
    print("~" * 40)
    data_new2 = transfer.fit_transform(data2)
    print("data_new2:\n", data_new2.toarray())
    # Word dictionary
    print(transfer.get_feature_names())
    return None

if __name__ == "__main__":
    # Chinese text feature-value extraction
    count_cn_demo()
    # Note: single Chinese characters are not supported
```
Advanced
Automatic Chinese word segmentation
```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer

def cut_word(text):
    """Segment Chinese text into space-separated words,
    e.g. "我爱北京天安门" ----> "我 爱 北京 天安门"."""
    return " ".join(list(jieba.cut(text)))

def count_cn_demo2():
    """Chinese text feature extraction with automatic word segmentation."""
    data = ["One or another, today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "The light we see from distant galaxies was emitted millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of understanding the true meaning of things depends on how to connect them with what we know."]
    data_new = []
    for sentence in data:
        data_new.append(cut_word(sentence))
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    # Test
    # data = "我爱北京天安门"
    # cut_word(data)
    # Chinese text feature extraction with automatic word segmentation
    count_cn_demo2()
```
Method 2: TfidfVectorizer
```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_demo():
    """Text feature extraction using the TF-IDF method."""
    data = ["One or another, today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "The light we see from distant galaxies was emitted millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of understanding the true meaning of things depends on how to connect them with what we know."]
    data_new = []
    for sentence in data:
        # cut_word() is the jieba-based helper defined above
        data_new.append(cut_word(sentence))
    # 1. Instantiate a transformer class
    transfer = TfidfVectorizer()
    # 2. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names())
    return None

if __name__ == "__main__":
    # Text feature extraction using the TF-IDF method
    tfidf_demo()
```
Feature preprocessing
The process of transforming feature data, through conversion functions, into feature data better suited to the algorithm model.
Contents:
- Making numerical data dimensionless (unitless scaling):
- Normalization
- Standardization
- Feature preprocessing API
sklearn.preprocessing
Normalization
Maps each column to a target interval (default [0, 1]) using the column minimum and maximum:
X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi
where min and max are the column minimum and maximum, and (mi, mx) is the target interval.
Drawback: the result depends only on the maximum and minimum, which are exactly the values an outlier is most likely to be; when that happens, the result is obviously abnormal, so normalization is not robust and suits only small, clean datasets.
For example, with a column whose minimum is 60 and maximum is 90, the value x = 90 maps to:
x' = (90 - 60) / (90 - 60) = 1, x'' = 1 * 1 + 0 = 1
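A quick numpy check of this example (the three-value column is made up; MinMaxScaler below performs the same per-column computation):

```python
# Verify the min-max formula by hand on an assumed column [60, 75, 90].
import numpy as np

col = np.array([60.0, 75.0, 90.0])
x_prime = (col - col.min()) / (col.max() - col.min())
mi, mx = 0.0, 1.0                    # default target interval (0, 1)
x_double = x_prime * (mx - mi) + mi
print(x_double)                      # [0.  0.5 1. ] -- 90 maps to 1, as above
```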
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    """Normalization."""
    # 1. Obtain the data
    data = pd.read_csv("dating.txt")
    # Keep only the first three columns
    data = data.iloc[:, :3]
    # 2. Instantiate a transformer class (default feature_range is (0, 1))
    transfer = MinMaxScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

if __name__ == "__main__":
    # Normalization
    minmax_demo()
```
Standardization
Transforms the data so that each column has mean 0 and standard deviation 1: X' = (x - mean) / σ. Because the mean and standard deviation are computed from all samples, a few outliers barely move them, so standardization is more robust than normalization and suits large, noisy datasets.
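A minimal numpy sketch of the per-column z-score transform that StandardScaler applies (sample values made up):

```python
# Hand-rolled z-score: subtract the column mean, divide by the column std.
import numpy as np

col = np.array([60.0, 75.0, 90.0])
z = (col - col.mean()) / col.std()
print(z)                  # [-1.2247  0.  1.2247] -- mean 0, std 1 afterwards
```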
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def stand_demo():
    """Standardization."""
    # 1. Obtain the data
    data = pd.read_csv("dating.txt")
    # Keep only the first three columns
    data = data.iloc[:, :3]
    # 2. Instantiate a transformer class (scales each column to mean 0, std 1)
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

if __name__ == "__main__":
    # Standardization
    stand_demo()
```
Feature dimensionality reduction
Object of dimensionality reduction: two-dimensional arrays
Dimensionality reduction: reducing the number of features
Goal: features that are not correlated with one another
Two methods of dimensionality reduction
- Feature selection
- Principal component analysis (PCA)
Feature selection
The data contains redundant or correlated variables (also called features, attributes, or indicators); feature selection aims to find the main features among the original ones.
- Filter methods: examine the characteristics of the features themselves, the correlation between features, and the correlation between features and the target value (see the sketch after this list).
*Variance selection method: low-variance feature filtering
*Correlation coefficient: the degree of correlation between features
- Embedded methods: the algorithm selects features automatically, based on the association between features and the target value.
*Decision trees: information entropy and information gain
*Regularization
*Deep learning
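A minimal sketch of the two Filter methods named above (the toy data is made up for illustration):

```python
# Low-variance filtering with VarianceThreshold, then a Pearson correlation
# check between the surviving features.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold

data = np.array([[0, 2.0, 4.1],
                 [0, 1.0, 2.0],
                 [0, 3.0, 6.2]])

# Drop features whose variance is <= 0: the constant first column is removed
transfer = VarianceThreshold(threshold=0.0)
print(transfer.fit_transform(data))

# Pearson correlation of the remaining two features: close to 1, so one of
# them is redundant and could itself be dropped
r, p_value = pearsonr(data[:, 1], data[:, 2])
print("pearson r:", r)
```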